Message ID | 20200102155208.8977-1-longman@redhat.com (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
Series | xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim | expand |
> On Jan 2, 2020, at 10:52 AM, Waiman Long <longman@redhat.com> wrote: > > Depending on the workloads, the following circular locking dependency > warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo > lock) may show up: > > ====================================================== > WARNING: possible circular locking dependency detected > 5.0.0-rc1+ #60 Tainted: G W > ------------------------------------------------------ > fsfreeze/4346 is trying to acquire lock: > 0000000026f1d784 (fs_reclaim){+.+.}, at: > fs_reclaim_acquire.part.19+0x5/0x30 > > but task is already holding lock: > 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650 > > which lock already depends on the new lock. > : > Possible unsafe locking scenario: > > CPU0 CPU1 > ---- ---- > lock(sb_internal); > lock(fs_reclaim); > lock(sb_internal); > lock(fs_reclaim); > > *** DEADLOCK *** > > 4 locks held by fsfreeze/4346: > #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650 > #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290 > #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650 > #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650 > > stack backtrace: > Call Trace: > dump_stack+0xe0/0x19a > print_circular_bug.isra.10.cold.34+0x2f4/0x435 > check_prev_add.constprop.19+0xca1/0x15f0 > validate_chain.isra.14+0x11af/0x3b50 > __lock_acquire+0x728/0x1200 > lock_acquire+0x269/0x5a0 > fs_reclaim_acquire.part.19+0x29/0x30 > fs_reclaim_acquire+0x19/0x20 > kmem_cache_alloc+0x3e/0x3f0 > kmem_zone_alloc+0x79/0x150 > xfs_trans_alloc+0xfa/0x9d0 > xfs_sync_sb+0x86/0x170 > xfs_log_sbcount+0x10f/0x140 > xfs_quiesce_attr+0x134/0x270 > xfs_fs_freeze+0x4a/0x70 > freeze_super+0x1af/0x290 > do_vfs_ioctl+0xedc/0x16c0 > ksys_ioctl+0x41/0x80 > __x64_sys_ioctl+0x73/0xa9 > do_syscall_64+0x18f/0xd23 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > > According to Dave Chinner: > > Freezing the filesystem, after all the data has been cleaned. IOWs > memory reclaim will never run the above writeback path when > the freeze process is trying to allocate a transaction here because > there are no dirty data pages in the filesystem at this point. > > Indeed, this xfs_sync_sb() path sets XFS_TRANS_NO_WRITECOUNT so that > it /doesn't deadlock/ by taking freeze references for the > transaction. We've just drained all the transactions > in progress and written back all the dirty metadata, too, and so the > filesystem is completely clean and only needs the superblock to be > updated to complete the freeze process. And to do that, it does not > take a freeze reference because calling sb_start_intwrite() here > would deadlock. > > IOWs, this is a false positive, caused by the fact that > xfs_trans_alloc() is called from both above and below memory reclaim > as well as within /every level/ of freeze processing. Lockdep is > unable to describe the staged flush logic in the freeze process that > prevents deadlocks from occurring, and hence we will pretty much > always see false positives in the freeze path.... > > Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock > may fix the issue. However, that will greatly complicate the logic and > may not be worth it. > > Another way to fix it is to disable the taking of the fs_reclaim > pseudo lock when in the freezing code path as a reclaim on the freezed > filesystem is not possible as stated above. This patch takes this > approach by setting the __GFP_NOLOCKDEP flag in the slab memory > allocation calls when the filesystem has been freezed. > > Without this patch, the command sequence below will show that the lock > dependency chain sb_internal -> fs_reclaim exists. > > # fsfreeze -f /home > # fsfreeze --unfreeze /home > # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal > > After applying the patch, such sb_internal -> fs_reclaim lock dependency > chain can no longer be found. Because of that, the locking dependency > warning will not be shown. There was an attempt to fix this in the past, but Dave rejected right away for any workaround in xfs and insisted to make lockdep smarter instead. No sure your approach will make any difference this time. Good luck.
On Thu, Jan 02, 2020 at 11:19:51AM -0500, Qian Cai wrote: > > > > On Jan 2, 2020, at 10:52 AM, Waiman Long <longman@redhat.com> wrote: > > > > Depending on the workloads, the following circular locking dependency > > warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo > > lock) may show up: > > > > ====================================================== > > WARNING: possible circular locking dependency detected > > 5.0.0-rc1+ #60 Tainted: G W > > ------------------------------------------------------ > > fsfreeze/4346 is trying to acquire lock: > > 0000000026f1d784 (fs_reclaim){+.+.}, at: > > fs_reclaim_acquire.part.19+0x5/0x30 > > > > but task is already holding lock: > > 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650 > > > > which lock already depends on the new lock. > > : > > Possible unsafe locking scenario: > > > > CPU0 CPU1 > > ---- ---- > > lock(sb_internal); > > lock(fs_reclaim); > > lock(sb_internal); > > lock(fs_reclaim); > > > > *** DEADLOCK *** > > > > 4 locks held by fsfreeze/4346: > > #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650 > > #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290 > > #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650 > > #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650 > > > > stack backtrace: > > Call Trace: > > dump_stack+0xe0/0x19a > > print_circular_bug.isra.10.cold.34+0x2f4/0x435 > > check_prev_add.constprop.19+0xca1/0x15f0 > > validate_chain.isra.14+0x11af/0x3b50 > > __lock_acquire+0x728/0x1200 > > lock_acquire+0x269/0x5a0 > > fs_reclaim_acquire.part.19+0x29/0x30 > > fs_reclaim_acquire+0x19/0x20 > > kmem_cache_alloc+0x3e/0x3f0 > > kmem_zone_alloc+0x79/0x150 > > xfs_trans_alloc+0xfa/0x9d0 > > xfs_sync_sb+0x86/0x170 > > xfs_log_sbcount+0x10f/0x140 > > xfs_quiesce_attr+0x134/0x270 > > xfs_fs_freeze+0x4a/0x70 > > freeze_super+0x1af/0x290 > > do_vfs_ioctl+0xedc/0x16c0 > > ksys_ioctl+0x41/0x80 > > __x64_sys_ioctl+0x73/0xa9 > > do_syscall_64+0x18f/0xd23 > > entry_SYSCALL_64_after_hwframe+0x49/0xbe > > > > According to Dave Chinner: > > > > Freezing the filesystem, after all the data has been cleaned. IOWs > > memory reclaim will never run the above writeback path when > > the freeze process is trying to allocate a transaction here because > > there are no dirty data pages in the filesystem at this point. > > > > Indeed, this xfs_sync_sb() path sets XFS_TRANS_NO_WRITECOUNT so that > > it /doesn't deadlock/ by taking freeze references for the > > transaction. We've just drained all the transactions > > in progress and written back all the dirty metadata, too, and so the > > filesystem is completely clean and only needs the superblock to be > > updated to complete the freeze process. And to do that, it does not > > take a freeze reference because calling sb_start_intwrite() here > > would deadlock. > > > > IOWs, this is a false positive, caused by the fact that > > xfs_trans_alloc() is called from both above and below memory reclaim > > as well as within /every level/ of freeze processing. Lockdep is > > unable to describe the staged flush logic in the freeze process that > > prevents deadlocks from occurring, and hence we will pretty much > > always see false positives in the freeze path.... > > > > Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock > > may fix the issue. However, that will greatly complicate the logic and > > may not be worth it. > > > > Another way to fix it is to disable the taking of the fs_reclaim > > pseudo lock when in the freezing code path as a reclaim on the freezed > > filesystem is not possible as stated above. This patch takes this > > approach by setting the __GFP_NOLOCKDEP flag in the slab memory > > allocation calls when the filesystem has been freezed. > > > > Without this patch, the command sequence below will show that the lock > > dependency chain sb_internal -> fs_reclaim exists. > > > > # fsfreeze -f /home > > # fsfreeze --unfreeze /home > > # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal > > > > After applying the patch, such sb_internal -> fs_reclaim lock dependency > > chain can no longer be found. Because of that, the locking dependency > > warning will not be shown. > > There was an attempt to fix this in the past, but Dave rejected right > away for any workaround in xfs and insisted to make lockdep smarter > instead. No sure your approach will make any difference this time. > Good luck. /me wonders if you can fix this by having the freeze path call memalloc_nofs_save() since we probably don't want to be recursing into the fs for reclaim while freezing it? Probably not, because that's a bigger hammer than we really need here. We can certainly steal memory from other filesystems that aren't frozen. It doesn't solve the key issue that lockdep isn't smart enough to know that we can't recurse into the fs that's being frozen and therefore there's no chance of deadlock. --D
On 1/2/20 1:24 PM, Darrick J. Wong wrote: > On Thu, Jan 02, 2020 at 11:19:51AM -0500, Qian Cai wrote: >> >>> On Jan 2, 2020, at 10:52 AM, Waiman Long <longman@redhat.com> wrote: >>> >>> Depending on the workloads, the following circular locking dependency >>> warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo >>> lock) may show up: >>> >>> ====================================================== >>> WARNING: possible circular locking dependency detected >>> 5.0.0-rc1+ #60 Tainted: G W >>> ------------------------------------------------------ >>> fsfreeze/4346 is trying to acquire lock: >>> 0000000026f1d784 (fs_reclaim){+.+.}, at: >>> fs_reclaim_acquire.part.19+0x5/0x30 >>> >>> but task is already holding lock: >>> 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650 >>> >>> which lock already depends on the new lock. >>> : >>> Possible unsafe locking scenario: >>> >>> CPU0 CPU1 >>> ---- ---- >>> lock(sb_internal); >>> lock(fs_reclaim); >>> lock(sb_internal); >>> lock(fs_reclaim); >>> >>> *** DEADLOCK *** >>> >>> 4 locks held by fsfreeze/4346: >>> #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650 >>> #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290 >>> #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650 >>> #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650 >>> >>> stack backtrace: >>> Call Trace: >>> dump_stack+0xe0/0x19a >>> print_circular_bug.isra.10.cold.34+0x2f4/0x435 >>> check_prev_add.constprop.19+0xca1/0x15f0 >>> validate_chain.isra.14+0x11af/0x3b50 >>> __lock_acquire+0x728/0x1200 >>> lock_acquire+0x269/0x5a0 >>> fs_reclaim_acquire.part.19+0x29/0x30 >>> fs_reclaim_acquire+0x19/0x20 >>> kmem_cache_alloc+0x3e/0x3f0 >>> kmem_zone_alloc+0x79/0x150 >>> xfs_trans_alloc+0xfa/0x9d0 >>> xfs_sync_sb+0x86/0x170 >>> xfs_log_sbcount+0x10f/0x140 >>> xfs_quiesce_attr+0x134/0x270 >>> xfs_fs_freeze+0x4a/0x70 >>> freeze_super+0x1af/0x290 >>> do_vfs_ioctl+0xedc/0x16c0 >>> ksys_ioctl+0x41/0x80 >>> __x64_sys_ioctl+0x73/0xa9 >>> do_syscall_64+0x18f/0xd23 >>> entry_SYSCALL_64_after_hwframe+0x49/0xbe >>> >>> According to Dave Chinner: >>> >>> Freezing the filesystem, after all the data has been cleaned. IOWs >>> memory reclaim will never run the above writeback path when >>> the freeze process is trying to allocate a transaction here because >>> there are no dirty data pages in the filesystem at this point. >>> >>> Indeed, this xfs_sync_sb() path sets XFS_TRANS_NO_WRITECOUNT so that >>> it /doesn't deadlock/ by taking freeze references for the >>> transaction. We've just drained all the transactions >>> in progress and written back all the dirty metadata, too, and so the >>> filesystem is completely clean and only needs the superblock to be >>> updated to complete the freeze process. And to do that, it does not >>> take a freeze reference because calling sb_start_intwrite() here >>> would deadlock. >>> >>> IOWs, this is a false positive, caused by the fact that >>> xfs_trans_alloc() is called from both above and below memory reclaim >>> as well as within /every level/ of freeze processing. Lockdep is >>> unable to describe the staged flush logic in the freeze process that >>> prevents deadlocks from occurring, and hence we will pretty much >>> always see false positives in the freeze path.... >>> >>> Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock >>> may fix the issue. However, that will greatly complicate the logic and >>> may not be worth it. >>> >>> Another way to fix it is to disable the taking of the fs_reclaim >>> pseudo lock when in the freezing code path as a reclaim on the freezed >>> filesystem is not possible as stated above. This patch takes this >>> approach by setting the __GFP_NOLOCKDEP flag in the slab memory >>> allocation calls when the filesystem has been freezed. >>> >>> Without this patch, the command sequence below will show that the lock >>> dependency chain sb_internal -> fs_reclaim exists. >>> >>> # fsfreeze -f /home >>> # fsfreeze --unfreeze /home >>> # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal >>> >>> After applying the patch, such sb_internal -> fs_reclaim lock dependency >>> chain can no longer be found. Because of that, the locking dependency >>> warning will not be shown. >> There was an attempt to fix this in the past, but Dave rejected right >> away for any workaround in xfs and insisted to make lockdep smarter >> instead. No sure your approach will make any difference this time. >> Good luck. > /me wonders if you can fix this by having the freeze path call > memalloc_nofs_save() since we probably don't want to be recursing into > the fs for reclaim while freezing it? Probably not, because that's a > bigger hammer than we really need here. We can certainly steal memory > from other filesystems that aren't frozen. > > It doesn't solve the key issue that lockdep isn't smart enough to know > that we can't recurse into the fs that's being frozen and therefore > there's no chance of deadlock. Lockdep only looks at all the possible locking chains to see if a circular deadlock is possible. It doesn't have the smart to understand filesystem internals. The problem here is caused by the fact that fs_reclaim is a global pseudo lock that is acquired whenever there is a chance that FS reclaim can happen. As I said in the commit log, it may be possible to fix that by breaking up fs_reclaim into a set of per-filesystem pseudo locks, but that will add quite a bit of complexity to the code. That is why I don't want to go this route. This patch is the least invasive that I can think of to address the problem without inhibiting other valid lockdep checking. Cheers, Longman
On Thu, Jan 02, 2020 at 10:52:08AM -0500, Waiman Long wrote: > Depending on the workloads, the following circular locking dependency > warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo > lock) may show up: > > ====================================================== > WARNING: possible circular locking dependency detected > 5.0.0-rc1+ #60 Tainted: G W > ------------------------------------------------------ > fsfreeze/4346 is trying to acquire lock: > 0000000026f1d784 (fs_reclaim){+.+.}, at: > fs_reclaim_acquire.part.19+0x5/0x30 > > but task is already holding lock: > 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650 > > which lock already depends on the new lock. > : > Possible unsafe locking scenario: > > CPU0 CPU1 > ---- ---- > lock(sb_internal); > lock(fs_reclaim); > lock(sb_internal); > lock(fs_reclaim); > > *** DEADLOCK *** > > 4 locks held by fsfreeze/4346: > #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650 > #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290 > #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650 > #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650 > > stack backtrace: > Call Trace: > dump_stack+0xe0/0x19a > print_circular_bug.isra.10.cold.34+0x2f4/0x435 > check_prev_add.constprop.19+0xca1/0x15f0 > validate_chain.isra.14+0x11af/0x3b50 > __lock_acquire+0x728/0x1200 > lock_acquire+0x269/0x5a0 > fs_reclaim_acquire.part.19+0x29/0x30 > fs_reclaim_acquire+0x19/0x20 > kmem_cache_alloc+0x3e/0x3f0 > kmem_zone_alloc+0x79/0x150 > xfs_trans_alloc+0xfa/0x9d0 > xfs_sync_sb+0x86/0x170 > xfs_log_sbcount+0x10f/0x140 > xfs_quiesce_attr+0x134/0x270 > xfs_fs_freeze+0x4a/0x70 > freeze_super+0x1af/0x290 > do_vfs_ioctl+0xedc/0x16c0 > ksys_ioctl+0x41/0x80 > __x64_sys_ioctl+0x73/0xa9 > do_syscall_64+0x18f/0xd23 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > > According to Dave Chinner: > > Freezing the filesystem, after all the data has been cleaned. IOWs > memory reclaim will never run the above writeback path when > the freeze process is trying to allocate a transaction here because > there are no dirty data pages in the filesystem at this point. > > Indeed, this xfs_sync_sb() path sets XFS_TRANS_NO_WRITECOUNT so that > it /doesn't deadlock/ by taking freeze references for the > transaction. We've just drained all the transactions > in progress and written back all the dirty metadata, too, and so the > filesystem is completely clean and only needs the superblock to be > updated to complete the freeze process. And to do that, it does not > take a freeze reference because calling sb_start_intwrite() here > would deadlock. > > IOWs, this is a false positive, caused by the fact that > xfs_trans_alloc() is called from both above and below memory reclaim > as well as within /every level/ of freeze processing. Lockdep is > unable to describe the staged flush logic in the freeze process that > prevents deadlocks from occurring, and hence we will pretty much > always see false positives in the freeze path.... > > Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock > may fix the issue. However, that will greatly complicate the logic and > may not be worth it. ANd it won't work, because now we'll just get lockedp warnings on the per-fs reclaim lockdep context. > Another way to fix it is to disable the taking of the fs_reclaim > pseudo lock when in the freezing code path as a reclaim on the freezed > filesystem is not possible as stated above. This patch takes this > approach by setting the __GFP_NOLOCKDEP flag in the slab memory > allocation calls when the filesystem has been freezed. IOWs, "fix" it by stating that "lockdep can't track freeze dependencies correctly"? In the past we have just used KM_NOFS for that, because __GFP_NOLOCKDEP didn't exist. But that has just been a nasty hack because lockdep isn't capable of understanding allocation context constraints because allocation contexts are much more complex than a "lock".... > --- a/fs/xfs/kmem.h > +++ b/fs/xfs/kmem.h > @@ -20,6 +20,12 @@ typedef unsigned __bitwise xfs_km_flags_t; > #define KM_MAYFAIL ((__force xfs_km_flags_t)0x0008u) > #define KM_ZERO ((__force xfs_km_flags_t)0x0010u) > > +#ifdef CONFIG_LOCKDEP > +#define KM_NOLOCKDEP ((__force xfs_km_flags_t)0x0020u) > +#else > +#define KM_NOLOCKDEP ((__force xfs_km_flags_t)0) > +#endif Nope. We are getting rid of kmem_alloc wrappers and all the associated flags, not adding new ones. Part of that process is identifying all the places we currently use KM_NOFS to "shut up lockdep" and converting them to explicit __GFP_NOLOCKDEP flags. So right now, this change needs to be queued up behind the API changes that are currently in progress... > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > index f6006d94a581..b1997649ecd8 100644 > --- a/fs/xfs/xfs_log.c > +++ b/fs/xfs/xfs_log.c > @@ -454,7 +454,8 @@ xfs_log_reserve( > XFS_STATS_INC(mp, xs_try_logspace); > > ASSERT(*ticp == NULL); > - tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent, 0); > + tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent, > + mp->m_super->s_writers.frozen ? KM_NOLOCKDEP : 0); This is pretty nasty. Having to spew conditional code like this across every allocation that could be done in freeze conditions is a non-starter. We already have a flag to tell us we are doing a transaction in a freeze state, so use that to turn off lockdep. That is: > *ticp = tic; > > xlog_grant_push_ail(log, tic->t_cnt ? tic->t_unit_res * tic->t_cnt > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c > index 3b208f9a865c..c0e42e4f5b77 100644 > --- a/fs/xfs/xfs_trans.c > +++ b/fs/xfs/xfs_trans.c > @@ -262,8 +262,14 @@ xfs_trans_alloc( > * Allocate the handle before we do our freeze accounting and setting up > * GFP_NOFS allocation context so that we avoid lockdep false positives > * by doing GFP_KERNEL allocations inside sb_start_intwrite(). > + * > + * To prevent false positive lockdep warning of circular locking > + * dependency between sb_internal and fs_reclaim, disable the > + * acquisition of the fs_reclaim pseudo-lock when the superblock > + * has been frozen or in the process of being frozen. > */ > - tp = kmem_zone_zalloc(xfs_trans_zone, 0); > + tp = kmem_zone_zalloc(xfs_trans_zone, > + mp->m_super->s_writers.frozen ? KM_NOLOCKDEP : 0); > if (!(flags & XFS_TRANS_NO_WRITECOUNT)) > sb_start_intwrite(mp->m_super); This code here should be setting PF_GFP_NOLOCKDEP state to turn off lockdep for all allocations in this context, similar to the way we use memalloc_nofs_save/restore so that all nested allocations inherit GFP_NOFS state... -Dave.
On 1/3/20 9:36 PM, Dave Chinner wrote: > On Thu, Jan 02, 2020 at 10:52:08AM -0500, Waiman Long wrote: >> Depending on the workloads, the following circular locking dependency >> warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo >> lock) may show up: >> >> ====================================================== >> WARNING: possible circular locking dependency detected >> 5.0.0-rc1+ #60 Tainted: G W >> ------------------------------------------------------ >> fsfreeze/4346 is trying to acquire lock: >> 0000000026f1d784 (fs_reclaim){+.+.}, at: >> fs_reclaim_acquire.part.19+0x5/0x30 >> >> but task is already holding lock: >> 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650 >> >> which lock already depends on the new lock. >> : >> Possible unsafe locking scenario: >> >> CPU0 CPU1 >> ---- ---- >> lock(sb_internal); >> lock(fs_reclaim); >> lock(sb_internal); >> lock(fs_reclaim); >> >> *** DEADLOCK *** >> >> 4 locks held by fsfreeze/4346: >> #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650 >> #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290 >> #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650 >> #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650 >> >> stack backtrace: >> Call Trace: >> dump_stack+0xe0/0x19a >> print_circular_bug.isra.10.cold.34+0x2f4/0x435 >> check_prev_add.constprop.19+0xca1/0x15f0 >> validate_chain.isra.14+0x11af/0x3b50 >> __lock_acquire+0x728/0x1200 >> lock_acquire+0x269/0x5a0 >> fs_reclaim_acquire.part.19+0x29/0x30 >> fs_reclaim_acquire+0x19/0x20 >> kmem_cache_alloc+0x3e/0x3f0 >> kmem_zone_alloc+0x79/0x150 >> xfs_trans_alloc+0xfa/0x9d0 >> xfs_sync_sb+0x86/0x170 >> xfs_log_sbcount+0x10f/0x140 >> xfs_quiesce_attr+0x134/0x270 >> xfs_fs_freeze+0x4a/0x70 >> freeze_super+0x1af/0x290 >> do_vfs_ioctl+0xedc/0x16c0 >> ksys_ioctl+0x41/0x80 >> __x64_sys_ioctl+0x73/0xa9 >> do_syscall_64+0x18f/0xd23 >> entry_SYSCALL_64_after_hwframe+0x49/0xbe >> >> According to Dave Chinner: >> >> Freezing the filesystem, after all the data has been cleaned. IOWs >> memory reclaim will never run the above writeback path when >> the freeze process is trying to allocate a transaction here because >> there are no dirty data pages in the filesystem at this point. >> >> Indeed, this xfs_sync_sb() path sets XFS_TRANS_NO_WRITECOUNT so that >> it /doesn't deadlock/ by taking freeze references for the >> transaction. We've just drained all the transactions >> in progress and written back all the dirty metadata, too, and so the >> filesystem is completely clean and only needs the superblock to be >> updated to complete the freeze process. And to do that, it does not >> take a freeze reference because calling sb_start_intwrite() here >> would deadlock. >> >> IOWs, this is a false positive, caused by the fact that >> xfs_trans_alloc() is called from both above and below memory reclaim >> as well as within /every level/ of freeze processing. Lockdep is >> unable to describe the staged flush logic in the freeze process that >> prevents deadlocks from occurring, and hence we will pretty much >> always see false positives in the freeze path.... >> >> Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock >> may fix the issue. However, that will greatly complicate the logic and >> may not be worth it. > ANd it won't work, because now we'll just get lockedp warnings on > the per-fs reclaim lockdep context. It may or may not work depending on how we are going to break it up. I haven't thought through that alternative yet as I am expecting that it will be a bigger change if we decide to go this route. > >> Another way to fix it is to disable the taking of the fs_reclaim >> pseudo lock when in the freezing code path as a reclaim on the freezed >> filesystem is not possible as stated above. This patch takes this >> approach by setting the __GFP_NOLOCKDEP flag in the slab memory >> allocation calls when the filesystem has been freezed. > IOWs, "fix" it by stating that "lockdep can't track freeze > dependencies correctly"? The lockdep code has a singular focus on spotting possible deadlock scenarios from a locking point of view. The freeze dependency has to be properly translated into appropriate locking sequences to make lockdep work correctly. I would say that the use of a global fs_reclaim pseudo lock is not a perfect translation and so it has exception cases we need to handle. > > In the past we have just used KM_NOFS for that, because > __GFP_NOLOCKDEP didn't exist. But that has just been a nasty hack > because lockdep isn't capable of understanding allocation context > constraints because allocation contexts are much more complex than a > "lock".... > > >> --- a/fs/xfs/kmem.h >> +++ b/fs/xfs/kmem.h >> @@ -20,6 +20,12 @@ typedef unsigned __bitwise xfs_km_flags_t; >> #define KM_MAYFAIL ((__force xfs_km_flags_t)0x0008u) >> #define KM_ZERO ((__force xfs_km_flags_t)0x0010u) >> >> +#ifdef CONFIG_LOCKDEP >> +#define KM_NOLOCKDEP ((__force xfs_km_flags_t)0x0020u) >> +#else >> +#define KM_NOLOCKDEP ((__force xfs_km_flags_t)0) >> +#endif > Nope. We are getting rid of kmem_alloc wrappers and all the > associated flags, not adding new ones. Part of that process is > identifying all the places we currently use KM_NOFS to "shut up > lockdep" and converting them to explicit __GFP_NOLOCKDEP flags. > > So right now, this change needs to be queued up behind the API > changes that are currently in progress... That is fine. I can wait and post a revised patch after that. Who is going to make these changes? >> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c >> index f6006d94a581..b1997649ecd8 100644 >> --- a/fs/xfs/xfs_log.c >> +++ b/fs/xfs/xfs_log.c >> @@ -454,7 +454,8 @@ xfs_log_reserve( >> XFS_STATS_INC(mp, xs_try_logspace); >> >> ASSERT(*ticp == NULL); >> - tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent, 0); >> + tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent, >> + mp->m_super->s_writers.frozen ? KM_NOLOCKDEP : 0); > This is pretty nasty. Having to spew conditional code like this > across every allocation that could be done in freeze conditions is > a non-starter. > > We already have a flag to tell us we are doing a transaction in a > freeze state, so use that to turn off lockdep. That is: OK. >> *ticp = tic; >> >> xlog_grant_push_ail(log, tic->t_cnt ? tic->t_unit_res * tic->t_cnt >> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c >> index 3b208f9a865c..c0e42e4f5b77 100644 >> --- a/fs/xfs/xfs_trans.c >> +++ b/fs/xfs/xfs_trans.c >> @@ -262,8 +262,14 @@ xfs_trans_alloc( >> * Allocate the handle before we do our freeze accounting and setting up >> * GFP_NOFS allocation context so that we avoid lockdep false positives >> * by doing GFP_KERNEL allocations inside sb_start_intwrite(). >> + * >> + * To prevent false positive lockdep warning of circular locking >> + * dependency between sb_internal and fs_reclaim, disable the >> + * acquisition of the fs_reclaim pseudo-lock when the superblock >> + * has been frozen or in the process of being frozen. >> */ >> - tp = kmem_zone_zalloc(xfs_trans_zone, 0); >> + tp = kmem_zone_zalloc(xfs_trans_zone, >> + mp->m_super->s_writers.frozen ? KM_NOLOCKDEP : 0); >> if (!(flags & XFS_TRANS_NO_WRITECOUNT)) >> sb_start_intwrite(mp->m_super); > This code here should be setting PF_GFP_NOLOCKDEP state to turn off > lockdep for all allocations in this context, similar to the way we > use memalloc_nofs_save/restore so that all nested allocations > inherit GFP_NOFS state... Sure. Thanks for the comments. Cheers, Longman
On Mon, Jan 06, 2020 at 11:12:32AM -0500, Waiman Long wrote: > On 1/3/20 9:36 PM, Dave Chinner wrote: > > On Thu, Jan 02, 2020 at 10:52:08AM -0500, Waiman Long wrote: > >> Depending on the workloads, the following circular locking dependency > >> warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo > >> lock) may show up: .... > >> IOWs, this is a false positive, caused by the fact that > >> xfs_trans_alloc() is called from both above and below memory reclaim > >> as well as within /every level/ of freeze processing. Lockdep is > >> unable to describe the staged flush logic in the freeze process that > >> prevents deadlocks from occurring, and hence we will pretty much > >> always see false positives in the freeze path.... > >> > >> Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock > >> may fix the issue. However, that will greatly complicate the logic and > >> may not be worth it. > > ANd it won't work, because now we'll just get lockedp warnings on > > the per-fs reclaim lockdep context. > > It may or may not work depending on how we are going to break it up. I > haven't thought through that alternative yet as I am expecting that it > will be a bigger change if we decide to go this route. A per-filesystem lock will not work because a single XFS filesystem can trigger this "locking" inversion -both contexts that lockdep warns about occur in the normal operation of that single filesystem. The only way to avoid this is to have multiple context specific reclaim lockdep contexts per filesystem, and that becomes a mess really quickly. The biggest problem with this is that the "lock context" is no longer validated consistently across the entire filesystem - we con only internally validate the order or each lock context against itself, and not against operations in the other lock contexts. Hence there's no global validation of lock orders - lockdep allows different lock orders in different contexts and so that defeats the purpose of using lockdep for this validation. Indeed, we've been down this path before with lockdep for XFS inode locking vs inode reclaim(*), and we removed it years ago because multiple lock contexts for different operations and/or life-cycle stages just hasn't been reliable or maintainable. We still get false positives because we haven't solved the "lockdep can't express the dependencies fully" problem, yet we've reduced the lock order validation scope of lockdep considerably.... > >> Another way to fix it is to disable the taking of the fs_reclaim > >> pseudo lock when in the freezing code path as a reclaim on the freezed > >> filesystem is not possible as stated above. This patch takes this > >> approach by setting the __GFP_NOLOCKDEP flag in the slab memory > >> allocation calls when the filesystem has been freezed. > > IOWs, "fix" it by stating that "lockdep can't track freeze > > dependencies correctly"? > The lockdep code has a singular focus on spotting possible deadlock > scenarios from a locking point of view. IMO, lockdep only works for very simplistic locking strategies. Anything complex requires more work to get lockdep annotations "correct enough" to prevent false positives than it does to actually read the code and very the locking is correct. Have you noticed we do runs of nested trylock-and-back-out-on- failure because we lock objects in an order that might deadlock because of cross-object state dependencies that can't be covered by lockdep? e.g. xfs_lock_inodes() which nests up to 5 inodes deep, can nest 3 separate locks per inode and has to take into account journal flushing depenedencies when locking multiple inodes? Lockdep is very simplisitic and the complex annotations we need to handle situations like the above are very difficult to design, use and maintainr. It's so much simpler just to use big hammers like GFP_NOFS to shut up all the different types of false positives lockdep throws up for reclaim context false positives because after all these years there still isn't a viable mechanism to easily express this sort of complex dependency chain. > The freeze dependency has to be > properly translated into appropriate locking sequences to make lockdep > work correctly.i This is part of the problem - freeze context is not actually a lock but it's close enough that freezing on simple filesystems can be validated with lockdep annotations. i.e. same basic concept as the lockdep reclaim context. However, it just doesn't work reliably for more complex subsystems where there are much more subtle and complex behavioural interactions and dependencies that a single lock context cannot express or be annotated to express. That's where lockdep falls down.... > I would say that the use of a global fs_reclaim pseudo > lock is not a perfect translation and so it has exception cases we need > to handle. Exactly the problem. > > Nope. We are getting rid of kmem_alloc wrappers and all the > > associated flags, not adding new ones. Part of that process is > > identifying all the places we currently use KM_NOFS to "shut up > > lockdep" and converting them to explicit __GFP_NOLOCKDEP flags. > > > > So right now, this change needs to be queued up behind the API > > changes that are currently in progress... > > That is fine. I can wait and post a revised patch after that. Who is > going to make these changes? https://lore.kernel.org/linux-xfs/20191120104425.407213-1-cmaiolino@redhat.com/ Cheers, Dave.
On 1/6/20 4:01 PM, Dave Chinner wrote: > On Mon, Jan 06, 2020 at 11:12:32AM -0500, Waiman Long wrote: >> On 1/3/20 9:36 PM, Dave Chinner wrote: >>> On Thu, Jan 02, 2020 at 10:52:08AM -0500, Waiman Long wrote: >>>> Depending on the workloads, the following circular locking dependency >>>> warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo >>>> lock) may show up: > .... >>>> IOWs, this is a false positive, caused by the fact that >>>> xfs_trans_alloc() is called from both above and below memory reclaim >>>> as well as within /every level/ of freeze processing. Lockdep is >>>> unable to describe the staged flush logic in the freeze process that >>>> prevents deadlocks from occurring, and hence we will pretty much >>>> always see false positives in the freeze path.... >>>> >>>> Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock >>>> may fix the issue. However, that will greatly complicate the logic and >>>> may not be worth it. >>> ANd it won't work, because now we'll just get lockedp warnings on >>> the per-fs reclaim lockdep context. >> It may or may not work depending on how we are going to break it up. I >> haven't thought through that alternative yet as I am expecting that it >> will be a bigger change if we decide to go this route. > A per-filesystem lock will not work because a single XFS filesystem > can trigger this "locking" inversion -both contexts that lockdep > warns about occur in the normal operation of that single filesystem. > > The only way to avoid this is to have multiple context specific > reclaim lockdep contexts per filesystem, and that becomes a mess > really quickly. The biggest problem with this is that the "lock > context" is no longer validated consistently across the entire > filesystem - we con only internally validate the order or each lock > context against itself, and not against operations in the other lock > contexts. Hence there's no global validation of lock orders - > lockdep allows different lock orders in different contexts and so > that defeats the purpose of using lockdep for this validation. > > Indeed, we've been down this path before with lockdep for XFS inode > locking vs inode reclaim(*), and we removed it years ago because > multiple lock contexts for different operations and/or life-cycle > stages just hasn't been reliable or maintainable. We still get false > positives because we haven't solved the "lockdep can't express the > dependencies fully" problem, yet we've reduced the lock order > validation scope of lockdep considerably.... > That is true. You can make lockdep to check these kind of complicated dependency accurately. >>>> Another way to fix it is to disable the taking of the fs_reclaim >>>> pseudo lock when in the freezing code path as a reclaim on the freezed >>>> filesystem is not possible as stated above. This patch takes this >>>> approach by setting the __GFP_NOLOCKDEP flag in the slab memory >>>> allocation calls when the filesystem has been freezed. >>> IOWs, "fix" it by stating that "lockdep can't track freeze >>> dependencies correctly"? >> The lockdep code has a singular focus on spotting possible deadlock >> scenarios from a locking point of view. > IMO, lockdep only works for very simplistic locking strategies. > Anything complex requires more work to get lockdep annotations > "correct enough" to prevent false positives than it does to actually > read the code and very the locking is correct. > > Have you noticed we do runs of nested trylock-and-back-out-on- > failure because we lock objects in an order that might deadlock > because of cross-object state dependencies that can't be covered by > lockdep? e.g. xfs_lock_inodes() which nests up to 5 inodes deep, > can nest 3 separate locks per inode and has to take into account > journal flushing depenedencies when locking multiple inodes? > > Lockdep is very simplisitic and the complex annotations we need to > handle situations like the above are very difficult to design, > use and maintainr. It's so much simpler just to use big hammers like > GFP_NOFS to shut up all the different types of false positives > lockdep throws up for reclaim context false positives because after > all these years there still isn't a viable mechanism to easily > express this sort of complex dependency chain. Regarding the trylock-and-back-out-on_failure code, do you think adding new APIs with timeout for mutex and rwsem and may be spin count for spinlock will help to at least reduce the number of failures that can happen in the code. RT mutex does have a rt_mutex_timed_lock(), but there is no equivalent for mutex and rwsem. >> The freeze dependency has to be >> properly translated into appropriate locking sequences to make lockdep >> work correctly.i > This is part of the problem - freeze context is not actually a lock > but it's close enough that freezing on simple filesystems can be > validated with lockdep annotations. i.e. same basic concept as the > lockdep reclaim context. However, it just doesn't work reliably for > more complex subsystems where there are much more subtle and complex > behavioural interactions and dependencies that a single lock context > cannot express or be annotated to express. That's where lockdep > falls down.... > >> I would say that the use of a global fs_reclaim pseudo >> lock is not a perfect translation and so it has exception cases we need >> to handle. > Exactly the problem. > >>> Nope. We are getting rid of kmem_alloc wrappers and all the >>> associated flags, not adding new ones. Part of that process is >>> identifying all the places we currently use KM_NOFS to "shut up >>> lockdep" and converting them to explicit __GFP_NOLOCKDEP flags. >>> >>> So right now, this change needs to be queued up behind the API >>> changes that are currently in progress... >> That is fine. I can wait and post a revised patch after that. Who is >> going to make these changes? > https://lore.kernel.org/linux-xfs/20191120104425.407213-1-cmaiolino@redhat.com/ Thanks for the pointer. I will rebase my patch on top of that. Cheers, Longman
On Tue, Jan 07, 2020 at 10:40:12AM -0500, Waiman Long wrote: > On 1/6/20 4:01 PM, Dave Chinner wrote: > > On Mon, Jan 06, 2020 at 11:12:32AM -0500, Waiman Long wrote: > >> On 1/3/20 9:36 PM, Dave Chinner wrote: > >>> On Thu, Jan 02, 2020 at 10:52:08AM -0500, Waiman Long wrote: > >>> IOWs, "fix" it by stating that "lockdep can't track freeze > >>> dependencies correctly"? > >> The lockdep code has a singular focus on spotting possible deadlock > >> scenarios from a locking point of view. > > IMO, lockdep only works for very simplistic locking strategies. > > Anything complex requires more work to get lockdep annotations > > "correct enough" to prevent false positives than it does to actually > > read the code and very the locking is correct. > > > > Have you noticed we do runs of nested trylock-and-back-out-on- > > failure because we lock objects in an order that might deadlock > > because of cross-object state dependencies that can't be covered by > > lockdep? e.g. xfs_lock_inodes() which nests up to 5 inodes deep, > > can nest 3 separate locks per inode and has to take into account > > journal flushing depenedencies when locking multiple inodes? > > > > Lockdep is very simplisitic and the complex annotations we need to > > handle situations like the above are very difficult to design, > > use and maintainr. It's so much simpler just to use big hammers like > > GFP_NOFS to shut up all the different types of false positives > > lockdep throws up for reclaim context false positives because after > > all these years there still isn't a viable mechanism to easily > > express this sort of complex dependency chain. > > Regarding the trylock-and-back-out-on_failure code, do you think adding > new APIs with timeout for mutex and rwsem and may be spin count for Timeouts have no place in generic locking APIs. Indeed, in these cases, timeouts do nothing to avoid the issues that require us to use trylock-and-back-out algorithms, so timeouts do nothing but hold off racing inode locks for unnecessarily long time periods while we wait for something to happen somewhere else that we have no direct control over.... > spinlock will help to at least reduce the number of failures that can > happen in the code. RT mutex does have a rt_mutex_timed_lock(), but > there is no equivalent for mutex and rwsem. Realtime has all sorts of problems with blocking where normal kernels don't (e.g. by turning spinlocks into mutexes) and so it forces rt code to jump through all sorts of crazy hoops to prevent priority inversions and deadlocks. If lock timeouts are necessary to avoid deadlocks and/or priority inversions in your code, then that indicates that there are locking algorithm problems that need fixing. Cheers, Dave.
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h index 6143117770e9..901189dcc474 100644 --- a/fs/xfs/kmem.h +++ b/fs/xfs/kmem.h @@ -20,6 +20,12 @@ typedef unsigned __bitwise xfs_km_flags_t; #define KM_MAYFAIL ((__force xfs_km_flags_t)0x0008u) #define KM_ZERO ((__force xfs_km_flags_t)0x0010u) +#ifdef CONFIG_LOCKDEP +#define KM_NOLOCKDEP ((__force xfs_km_flags_t)0x0020u) +#else +#define KM_NOLOCKDEP ((__force xfs_km_flags_t)0) +#endif + /* * We use a special process flag to avoid recursive callbacks into * the filesystem during transactions. We will also issue our own @@ -30,7 +36,7 @@ kmem_flags_convert(xfs_km_flags_t flags) { gfp_t lflags; - BUG_ON(flags & ~(KM_NOFS|KM_MAYFAIL|KM_ZERO)); + BUG_ON(flags & ~(KM_NOFS|KM_MAYFAIL|KM_ZERO|KM_NOLOCKDEP)); lflags = GFP_KERNEL | __GFP_NOWARN; if (flags & KM_NOFS) @@ -49,6 +55,9 @@ kmem_flags_convert(xfs_km_flags_t flags) if (flags & KM_ZERO) lflags |= __GFP_ZERO; + if (flags & KM_NOLOCKDEP) + lflags |= __GFP_NOLOCKDEP; + return lflags; } diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index f6006d94a581..b1997649ecd8 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -454,7 +454,8 @@ xfs_log_reserve( XFS_STATS_INC(mp, xs_try_logspace); ASSERT(*ticp == NULL); - tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent, 0); + tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent, + mp->m_super->s_writers.frozen ? KM_NOLOCKDEP : 0); *ticp = tic; xlog_grant_push_ail(log, tic->t_cnt ? tic->t_unit_res * tic->t_cnt diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index 3b208f9a865c..c0e42e4f5b77 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -262,8 +262,14 @@ xfs_trans_alloc( * Allocate the handle before we do our freeze accounting and setting up * GFP_NOFS allocation context so that we avoid lockdep false positives * by doing GFP_KERNEL allocations inside sb_start_intwrite(). + * + * To prevent false positive lockdep warning of circular locking + * dependency between sb_internal and fs_reclaim, disable the + * acquisition of the fs_reclaim pseudo-lock when the superblock + * has been frozen or in the process of being frozen. */ - tp = kmem_zone_zalloc(xfs_trans_zone, 0); + tp = kmem_zone_zalloc(xfs_trans_zone, + mp->m_super->s_writers.frozen ? KM_NOLOCKDEP : 0); if (!(flags & XFS_TRANS_NO_WRITECOUNT)) sb_start_intwrite(mp->m_super);
Depending on the workloads, the following circular locking dependency warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo lock) may show up: ====================================================== WARNING: possible circular locking dependency detected 5.0.0-rc1+ #60 Tainted: G W ------------------------------------------------------ fsfreeze/4346 is trying to acquire lock: 0000000026f1d784 (fs_reclaim){+.+.}, at: fs_reclaim_acquire.part.19+0x5/0x30 but task is already holding lock: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650 which lock already depends on the new lock. : Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(sb_internal); lock(fs_reclaim); lock(sb_internal); lock(fs_reclaim); *** DEADLOCK *** 4 locks held by fsfreeze/4346: #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650 #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290 #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650 #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650 stack backtrace: Call Trace: dump_stack+0xe0/0x19a print_circular_bug.isra.10.cold.34+0x2f4/0x435 check_prev_add.constprop.19+0xca1/0x15f0 validate_chain.isra.14+0x11af/0x3b50 __lock_acquire+0x728/0x1200 lock_acquire+0x269/0x5a0 fs_reclaim_acquire.part.19+0x29/0x30 fs_reclaim_acquire+0x19/0x20 kmem_cache_alloc+0x3e/0x3f0 kmem_zone_alloc+0x79/0x150 xfs_trans_alloc+0xfa/0x9d0 xfs_sync_sb+0x86/0x170 xfs_log_sbcount+0x10f/0x140 xfs_quiesce_attr+0x134/0x270 xfs_fs_freeze+0x4a/0x70 freeze_super+0x1af/0x290 do_vfs_ioctl+0xedc/0x16c0 ksys_ioctl+0x41/0x80 __x64_sys_ioctl+0x73/0xa9 do_syscall_64+0x18f/0xd23 entry_SYSCALL_64_after_hwframe+0x49/0xbe According to Dave Chinner: Freezing the filesystem, after all the data has been cleaned. IOWs memory reclaim will never run the above writeback path when the freeze process is trying to allocate a transaction here because there are no dirty data pages in the filesystem at this point. Indeed, this xfs_sync_sb() path sets XFS_TRANS_NO_WRITECOUNT so that it /doesn't deadlock/ by taking freeze references for the transaction. We've just drained all the transactions in progress and written back all the dirty metadata, too, and so the filesystem is completely clean and only needs the superblock to be updated to complete the freeze process. And to do that, it does not take a freeze reference because calling sb_start_intwrite() here would deadlock. IOWs, this is a false positive, caused by the fact that xfs_trans_alloc() is called from both above and below memory reclaim as well as within /every level/ of freeze processing. Lockdep is unable to describe the staged flush logic in the freeze process that prevents deadlocks from occurring, and hence we will pretty much always see false positives in the freeze path.... Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock may fix the issue. However, that will greatly complicate the logic and may not be worth it. Another way to fix it is to disable the taking of the fs_reclaim pseudo lock when in the freezing code path as a reclaim on the freezed filesystem is not possible as stated above. This patch takes this approach by setting the __GFP_NOLOCKDEP flag in the slab memory allocation calls when the filesystem has been freezed. Without this patch, the command sequence below will show that the lock dependency chain sb_internal -> fs_reclaim exists. # fsfreeze -f /home # fsfreeze --unfreeze /home # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal After applying the patch, such sb_internal -> fs_reclaim lock dependency chain can no longer be found. Because of that, the locking dependency warning will not be shown. Signed-off-by: Waiman Long <longman@redhat.com> --- fs/xfs/kmem.h | 11 ++++++++++- fs/xfs/xfs_log.c | 3 ++- fs/xfs/xfs_trans.c | 8 +++++++- 3 files changed, 19 insertions(+), 3 deletions(-)