| Message ID | 20161006171758.GA21707@redhat.com (mailing list archive) |
|---|---|
| State | New, archived |
On Thu, Oct 06, 2016 at 07:17:58PM +0200, Oleg Nesterov wrote:
> Probably a false positive? Although when I look at the comment above xfs_sync_sb()
> I think that maybe something like the change below makes sense, but I know
> absolutely nothing about fs/ and XFS in particular.
>
> Oleg.
>
>
> --- x/fs/xfs/xfs_trans.c
> +++ x/fs/xfs/xfs_trans.c
> @@ -245,7 +245,8 @@ xfs_trans_alloc(
> 	atomic_inc(&mp->m_active_trans);
>
> 	tp = kmem_zone_zalloc(xfs_trans_zone,
> -		(flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP);
> +		(flags & (XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT))
> +			? KM_NOFS : KM_SLEEP);
> 	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
> 	tp->t_flags = flags;
> 	tp->t_mountp = mp;

Brief examination says the caller should set XFS_TRANS_NOFS, not change the
implementation to make the XFS_TRANS_NO_WRITECOUNT flag also mean
XFS_TRANS_NOFS.

Cheers,

Dave.
On 10/07, Dave Chinner wrote:
>
> On Thu, Oct 06, 2016 at 07:17:58PM +0200, Oleg Nesterov wrote:
> > Probably a false positive? Although when I look at the comment above xfs_sync_sb()
> > I think that maybe something like the change below makes sense, but I know
> > absolutely nothing about fs/ and XFS in particular.
> >
> > Oleg.
> >
> >
> > --- x/fs/xfs/xfs_trans.c
> > +++ x/fs/xfs/xfs_trans.c
> > @@ -245,7 +245,8 @@ xfs_trans_alloc(
> > 	atomic_inc(&mp->m_active_trans);
> >
> > 	tp = kmem_zone_zalloc(xfs_trans_zone,
> > -		(flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP);
> > +		(flags & (XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT))
> > +			? KM_NOFS : KM_SLEEP);
> > 	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
> > 	tp->t_flags = flags;
> > 	tp->t_mountp = mp;
>
> Brief examination says the caller should set XFS_TRANS_NOFS, not change the
> implementation to make the XFS_TRANS_NO_WRITECOUNT flag also mean
> XFS_TRANS_NOFS.

I didn't mean that the change above can fix the problem, and I don't really
understand your suggestion. Obviously any GFP_FS allocation in the
xfs_fs_freeze() paths will trigger the same warning.

I added this hack

--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1333,10 +1333,15 @@ xfs_fs_freeze(
 	struct super_block	*sb)
 {
 	struct xfs_mount	*mp = XFS_M(sb);
+	int			ret;
 
+	current->flags |= PF_FSTRANS;	// tell kmem_flags_convert() to remove GFP_FS
 	xfs_save_resvblks(mp);
 	xfs_quiesce_attr(mp);
-	return xfs_sync_sb(mp, true);
+	ret = xfs_sync_sb(mp, true);
+	current->flags &= ~PF_FSTRANS;
+
+	return ret;
 }

just for testing purposes, and after that I got another warning below. I haven't
read it carefully yet, but _at first glance_ it looks like the lock inversion
uncovered by 2/2, although I can easily be wrong. Can cancel_delayed_work_sync(l_work)
under sb_internal hang if xfs_log_worker() waits for this rwsem?

Oleg.

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
4.8.0+ #10 Tainted: G        W
---------------------------------------------------------
kswapd0/32 just changed the state of lock:
 (sb_internal){++++.?}, at: [<ffffffffb92925c7>] __sb_start_write+0xb7/0xf0
but this lock took another, RECLAIM_FS-unsafe lock in the past:
 ((&(&log->l_work)->work)){+.+.+.}

and interrupts could create inverse lock ordering between them.

other info that might help us debug this:
 Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock((&(&log->l_work)->work));
                               local_irq_disable();
                               lock(sb_internal);
                               lock((&(&log->l_work)->work));
  <Interrupt>
    lock(sb_internal);

 *** DEADLOCK ***

no locks held by kswapd0/32.
the shortest dependencies between 2nd lock and 1st lock: -> ((&(&log->l_work)->work)){+.+.+.} ops: 803 { HARDIRQ-ON-W at: [<ffffffffb9107ef1>] __lock_acquire+0x611/0x1870 [<ffffffffb91097dd>] lock_acquire+0x10d/0x200 [<ffffffffb90cad49>] process_one_work+0x1c9/0x690 [<ffffffffb90cb25e>] worker_thread+0x4e/0x480 [<ffffffffb90d1d13>] kthread+0xf3/0x110 [<ffffffffb9a2146f>] ret_from_fork+0x1f/0x40 SOFTIRQ-ON-W at: [<ffffffffb9107f23>] __lock_acquire+0x643/0x1870 [<ffffffffb91097dd>] lock_acquire+0x10d/0x200 [<ffffffffb90cad49>] process_one_work+0x1c9/0x690 [<ffffffffb90cb25e>] worker_thread+0x4e/0x480 [<ffffffffb90d1d13>] kthread+0xf3/0x110 [<ffffffffb9a2146f>] ret_from_fork+0x1f/0x40 RECLAIM_FS-ON-W at: [<ffffffffb910735f>] mark_held_locks+0x6f/0xa0 [<ffffffffb910a5f3>] lockdep_trace_alloc+0xd3/0x120 [<ffffffffb92603bf>] kmem_cache_alloc+0x2f/0x280 [<ffffffffb941f5e1>] kmem_zone_alloc+0x81/0x120 [<ffffffffb941ec4c>] xfs_trans_alloc+0x6c/0x130 [<ffffffffb93ef649>] xfs_sync_sb+0x39/0x80 [<ffffffffb9423914>] xfs_log_worker+0xd4/0xf0 [<ffffffffb90cad70>] process_one_work+0x1f0/0x690 [<ffffffffb90cb25e>] worker_thread+0x4e/0x480 [<ffffffffb90d1d13>] kthread+0xf3/0x110 [<ffffffffb9a2146f>] ret_from_fork+0x1f/0x40 INITIAL USE at: [<ffffffffb9107b2b>] __lock_acquire+0x24b/0x1870 [<ffffffffb91097dd>] lock_acquire+0x10d/0x200 [<ffffffffb90cad49>] process_one_work+0x1c9/0x690 [<ffffffffb90cb25e>] worker_thread+0x4e/0x480 [<ffffffffb90d1d13>] kthread+0xf3/0x110 [<ffffffffb9a2146f>] ret_from_fork+0x1f/0x40 } ... key at: [<ffffffffbb133758>] __key.130101+0x0/0x8 ... acquired at: [<ffffffffb91097dd>] lock_acquire+0x10d/0x200 [<ffffffffb90c981c>] flush_work+0x4c/0x2c0 [<ffffffffb90cbe01>] __cancel_work_timer+0x131/0x1f0 [<ffffffffb90cbef3>] cancel_delayed_work_sync+0x13/0x20 [<ffffffffb94245a4>] xfs_log_quiesce+0x34/0x3b0 [<ffffffffb9419c6d>] xfs_quiesce_attr+0x6d/0xb0 [<ffffffffb9419ce3>] xfs_fs_freeze+0x33/0x50 [<ffffffffb9291eff>] freeze_super+0xcf/0x190 [<ffffffffb92a528f>] do_vfs_ioctl+0x55f/0x6c0 [<ffffffffb92a5469>] SyS_ioctl+0x79/0x90 [<ffffffffb9a2123c>] entry_SYSCALL_64_fastpath+0x1f/0xbd -> (sb_internal){++++.?} ops: 28351355 { HARDIRQ-ON-W at: [<ffffffffb9107ef1>] __lock_acquire+0x611/0x1870 [<ffffffffb91097dd>] lock_acquire+0x10d/0x200 [<ffffffffb9a1e3a6>] down_write+0x36/0x70 [<ffffffffb9102fb3>] percpu_down_write+0x33/0x100 [<ffffffffb9291eed>] freeze_super+0xbd/0x190 [<ffffffffb92a528f>] do_vfs_ioctl+0x55f/0x6c0 [<ffffffffb92a5469>] SyS_ioctl+0x79/0x90 [<ffffffffb9a2123c>] entry_SYSCALL_64_fastpath+0x1f/0xbd HARDIRQ-ON-R at: [<ffffffffb9107bed>] __lock_acquire+0x30d/0x1870 [<ffffffffb91097dd>] lock_acquire+0x10d/0x200 [<ffffffffb91030c8>] percpu_down_read_trylock+0x48/0xa0 [<ffffffffb929258a>] __sb_start_write+0x7a/0xf0 [<ffffffffb941ecc3>] xfs_trans_alloc+0xe3/0x130 [<ffffffffb940dcba>] xfs_vn_update_time+0x7a/0x210 [<ffffffffb92b1038>] touch_atime+0xa8/0xd0 [<ffffffffb91ed573>] generic_file_read_iter+0x703/0x8d0 [<ffffffffb9401513>] xfs_file_buffered_aio_read+0x63/0x170 [<ffffffffb9401688>] xfs_file_read_iter+0x68/0xc0 [<ffffffffb928de90>] __vfs_read+0xe0/0x150 [<ffffffffb928efb5>] vfs_read+0x95/0x140 [<ffffffffb92904e8>] SyS_read+0x58/0xc0 [<ffffffffb9003edc>] do_syscall_64+0x6c/0x1e0 [<ffffffffb9a212ff>] return_from_SYSCALL_64+0x0/0x7a SOFTIRQ-ON-W at: [<ffffffffb9107f23>] __lock_acquire+0x643/0x1870 [<ffffffffb91097dd>] lock_acquire+0x10d/0x200 [<ffffffffb9a1e3a6>] down_write+0x36/0x70 [<ffffffffb9102fb3>] percpu_down_write+0x33/0x100 [<ffffffffb9291eed>] freeze_super+0xbd/0x190 
[<ffffffffb92a528f>] do_vfs_ioctl+0x55f/0x6c0 [<ffffffffb92a5469>] SyS_ioctl+0x79/0x90 [<ffffffffb9a2123c>] entry_SYSCALL_64_fastpath+0x1f/0xbd SOFTIRQ-ON-R at: [<ffffffffb9107f23>] __lock_acquire+0x643/0x1870 [<ffffffffb91097dd>] lock_acquire+0x10d/0x200 [<ffffffffb91030c8>] percpu_down_read_trylock+0x48/0xa0 [<ffffffffb929258a>] __sb_start_write+0x7a/0xf0 [<ffffffffb941ecc3>] xfs_trans_alloc+0xe3/0x130 [<ffffffffb940dcba>] xfs_vn_update_time+0x7a/0x210 [<ffffffffb92b1038>] touch_atime+0xa8/0xd0 [<ffffffffb91ed573>] generic_file_read_iter+0x703/0x8d0 [<ffffffffb9401513>] xfs_file_buffered_aio_read+0x63/0x170 [<ffffffffb9401688>] xfs_file_read_iter+0x68/0xc0 [<ffffffffb928de90>] __vfs_read+0xe0/0x150 [<ffffffffb928efb5>] vfs_read+0x95/0x140 [<ffffffffb92904e8>] SyS_read+0x58/0xc0 [<ffffffffb9003edc>] do_syscall_64+0x6c/0x1e0 [<ffffffffb9a212ff>] return_from_SYSCALL_64+0x0/0x7a IN-RECLAIM_FS-R at: [<ffffffffb9107c4d>] __lock_acquire+0x36d/0x1870 [<ffffffffb91097dd>] lock_acquire+0x10d/0x200 [<ffffffffb9102ecc>] percpu_down_read+0x3c/0x90 [<ffffffffb92925c7>] __sb_start_write+0xb7/0xf0 [<ffffffffb941ecc3>] xfs_trans_alloc+0xe3/0x130 [<ffffffffb940d157>] xfs_iomap_write_allocate+0x1f7/0x380 [<ffffffffb93f17fa>] xfs_map_blocks+0x22a/0x380 [<ffffffffb93f2f78>] xfs_do_writepage+0x188/0x6c0 [<ffffffffb93f34eb>] xfs_vm_writepage+0x3b/0x70 [<ffffffffb9204a20>] pageout.isra.46+0x190/0x380 [<ffffffffb9207d1b>] shrink_page_list+0x9ab/0xa70 [<ffffffffb9208602>] shrink_inactive_list+0x252/0x5d0 [<ffffffffb920928f>] shrink_node_memcg+0x5af/0x790 [<ffffffffb9209551>] shrink_node+0xe1/0x320 [<ffffffffb920aa47>] kswapd+0x387/0x8b0 [<ffffffffb90d1d13>] kthread+0xf3/0x110 [<ffffffffb9a2146f>] ret_from_fork+0x1f/0x40 RECLAIM_FS-ON-R at: [<ffffffffb910735f>] mark_held_locks+0x6f/0xa0 [<ffffffffb910a5f3>] lockdep_trace_alloc+0xd3/0x120 [<ffffffffb92603bf>] kmem_cache_alloc+0x2f/0x280 [<ffffffffb941f5e1>] kmem_zone_alloc+0x81/0x120 [<ffffffffb941ec4c>] xfs_trans_alloc+0x6c/0x130 [<ffffffffb940dcba>] xfs_vn_update_time+0x7a/0x210 [<ffffffffb92b1038>] touch_atime+0xa8/0xd0 [<ffffffffb91ed573>] generic_file_read_iter+0x703/0x8d0 [<ffffffffb9401513>] xfs_file_buffered_aio_read+0x63/0x170 [<ffffffffb9401688>] xfs_file_read_iter+0x68/0xc0 [<ffffffffb928de90>] __vfs_read+0xe0/0x150 [<ffffffffb928efb5>] vfs_read+0x95/0x140 [<ffffffffb92904e8>] SyS_read+0x58/0xc0 [<ffffffffb9003edc>] do_syscall_64+0x6c/0x1e0 [<ffffffffb9a212ff>] return_from_SYSCALL_64+0x0/0x7a INITIAL USE at: [<ffffffffb9107b2b>] __lock_acquire+0x24b/0x1870 [<ffffffffb91097dd>] lock_acquire+0x10d/0x200 [<ffffffffb91030c8>] percpu_down_read_trylock+0x48/0xa0 [<ffffffffb929258a>] __sb_start_write+0x7a/0xf0 [<ffffffffb941ecc3>] xfs_trans_alloc+0xe3/0x130 [<ffffffffb940dcba>] xfs_vn_update_time+0x7a/0x210 [<ffffffffb92b1038>] touch_atime+0xa8/0xd0 [<ffffffffb91ed573>] generic_file_read_iter+0x703/0x8d0 [<ffffffffb9401513>] xfs_file_buffered_aio_read+0x63/0x170 [<ffffffffb9401688>] xfs_file_read_iter+0x68/0xc0 [<ffffffffb928de90>] __vfs_read+0xe0/0x150 [<ffffffffb928efb5>] vfs_read+0x95/0x140 [<ffffffffb92904e8>] SyS_read+0x58/0xc0 [<ffffffffb9003edc>] do_syscall_64+0x6c/0x1e0 [<ffffffffb9a212ff>] return_from_SYSCALL_64+0x0/0x7a } ... key at: [<ffffffffba0b1a00>] xfs_fs_type+0x60/0x80 ... 
acquired at: [<ffffffffb91061d1>] check_usage_forwards+0x151/0x160 [<ffffffffb9106fea>] mark_lock+0x34a/0x650 [<ffffffffb9107c4d>] __lock_acquire+0x36d/0x1870 [<ffffffffb91097dd>] lock_acquire+0x10d/0x200 [<ffffffffb9102ecc>] percpu_down_read+0x3c/0x90 [<ffffffffb92925c7>] __sb_start_write+0xb7/0xf0 [<ffffffffb941ecc3>] xfs_trans_alloc+0xe3/0x130 [<ffffffffb940d157>] xfs_iomap_write_allocate+0x1f7/0x380 [<ffffffffb93f17fa>] xfs_map_blocks+0x22a/0x380 [<ffffffffb93f2f78>] xfs_do_writepage+0x188/0x6c0 [<ffffffffb93f34eb>] xfs_vm_writepage+0x3b/0x70 [<ffffffffb9204a20>] pageout.isra.46+0x190/0x380 [<ffffffffb9207d1b>] shrink_page_list+0x9ab/0xa70 [<ffffffffb9208602>] shrink_inactive_list+0x252/0x5d0 [<ffffffffb920928f>] shrink_node_memcg+0x5af/0x790 [<ffffffffb9209551>] shrink_node+0xe1/0x320 [<ffffffffb920aa47>] kswapd+0x387/0x8b0 [<ffffffffb90d1d13>] kthread+0xf3/0x110 [<ffffffffb9a2146f>] ret_from_fork+0x1f/0x40 stack backtrace: CPU: 0 PID: 32 Comm: kswapd0 Tainted: G W 4.8.0+ #10 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 0000000000000086 0000000020cdca74 ffff880136c5b490 ffffffffb95b7a73 ffffffffbae7ed80 ffff880136c5b4f0 ffff880136c5b4d0 ffffffffb91e7c5a ffff880136c3aee8 ffff880136c3aee8 ffff880136c3a680 ffffffffb9e8bbdd Call Trace: [<ffffffffb95b7a73>] dump_stack+0x85/0xc2 [<ffffffffb91e7c5a>] print_irq_inversion_bug.part.36+0x1a4/0x1b0 [<ffffffffb91061d1>] check_usage_forwards+0x151/0x160 [<ffffffffb9106fea>] mark_lock+0x34a/0x650 [<ffffffffb9106080>] ? print_shortest_lock_dependencies+0x1a0/0x1a0 [<ffffffffb9107c4d>] __lock_acquire+0x36d/0x1870 [<ffffffffb9107d55>] ? __lock_acquire+0x475/0x1870 [<ffffffffb91097dd>] lock_acquire+0x10d/0x200 [<ffffffffb92925c7>] ? __sb_start_write+0xb7/0xf0 [<ffffffffb9102ecc>] percpu_down_read+0x3c/0x90 [<ffffffffb92925c7>] ? __sb_start_write+0xb7/0xf0 [<ffffffffb92925c7>] __sb_start_write+0xb7/0xf0 [<ffffffffb941ecc3>] xfs_trans_alloc+0xe3/0x130 [<ffffffffb940d157>] xfs_iomap_write_allocate+0x1f7/0x380 [<ffffffffb93f16b3>] ? xfs_map_blocks+0xe3/0x380 [<ffffffffb91268b8>] ? rcu_read_lock_sched_held+0x58/0x60 [<ffffffffb93f17fa>] xfs_map_blocks+0x22a/0x380 [<ffffffffb93f2f78>] xfs_do_writepage+0x188/0x6c0 [<ffffffffb93f34eb>] xfs_vm_writepage+0x3b/0x70 [<ffffffffb9204a20>] pageout.isra.46+0x190/0x380 [<ffffffffb9207d1b>] shrink_page_list+0x9ab/0xa70 [<ffffffffb9208602>] shrink_inactive_list+0x252/0x5d0 [<ffffffffb920928f>] shrink_node_memcg+0x5af/0x790 [<ffffffffb9209551>] shrink_node+0xe1/0x320 [<ffffffffb920aa47>] kswapd+0x387/0x8b0 [<ffffffffb920a6c0>] ? mem_cgroup_shrink_node+0x2e0/0x2e0 [<ffffffffb90d1d13>] kthread+0xf3/0x110 [<ffffffffb9a2146f>] ret_from_fork+0x1f/0x40 [<ffffffffb90d1c20>] ? kthread_create_on_node+0x230/0x230 -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Oct 07, 2016 at 07:15:18PM +0200, Oleg Nesterov wrote:
> On 10/07, Dave Chinner wrote:
> >
> > On Thu, Oct 06, 2016 at 07:17:58PM +0200, Oleg Nesterov wrote:
> > > Probably a false positive? Although when I look at the comment above xfs_sync_sb()
> > > I think that maybe something like the change below makes sense, but I know
> > > absolutely nothing about fs/ and XFS in particular.
> > >
> > > Oleg.
> > >
> > >
> > > --- x/fs/xfs/xfs_trans.c
> > > +++ x/fs/xfs/xfs_trans.c
> > > @@ -245,7 +245,8 @@ xfs_trans_alloc(
> > > 	atomic_inc(&mp->m_active_trans);
> > >
> > > 	tp = kmem_zone_zalloc(xfs_trans_zone,
> > > -		(flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP);
> > > +		(flags & (XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT))
> > > +			? KM_NOFS : KM_SLEEP);
> > > 	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
> > > 	tp->t_flags = flags;
> > > 	tp->t_mountp = mp;
> >
> > Brief examination says the caller should set XFS_TRANS_NOFS, not change the
> > implementation to make the XFS_TRANS_NO_WRITECOUNT flag also mean
> > XFS_TRANS_NOFS.
>
> I didn't mean that the change above can fix the problem, and I don't really
> understand your suggestion.

xfs_sync_sb() does:

	tp = xfs_trans_alloc(... , XFS_TRANS_NO_WRITECOUNT, ....);

but it's running in a GFP_NOFS context when a freeze is being finalised.
So, rather than changing what XFS_TRANS_NO_WRITECOUNT does in
xfs_trans_alloc(), we should tell it to do a GFP_NOFS allocation, i.e.

	tp = xfs_trans_alloc(... , XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT, ....);

and nothing inside xfs_trans_alloc() changes at all.

> Obviously any GFP_FS allocation in the xfs_fs_freeze() paths will trigger
> the same warning.

Of which there should be none except for that xfs_trans_alloc() call.

> I added this hack
>
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1333,10 +1333,15 @@ xfs_fs_freeze(
>  	struct super_block	*sb)
>  {
>  	struct xfs_mount	*mp = XFS_M(sb);
> +	int			ret;
>
> +	current->flags |= PF_FSTRANS;	// tell kmem_flags_convert() to remove GFP_FS
>  	xfs_save_resvblks(mp);
>  	xfs_quiesce_attr(mp);
> -	return xfs_sync_sb(mp, true);
> +	ret = xfs_sync_sb(mp, true);
> +	current->flags &= ~PF_FSTRANS;
> +
> +	return ret;
>  }

/me shudders

> just for testing purposes, and after that I got another warning below. I haven't
> read it carefully yet, but _at first glance_ it looks like the lock inversion
> uncovered by 2/2, although I can easily be wrong. Can cancel_delayed_work_sync(l_work)
> under sb_internal hang if xfs_log_worker() waits for this rwsem?

Actually: I *can't read it*. I've got no fucking clue what lockdep is trying
to say here. This /looks/ like lockdep getting confused between memory reclaim
contexts (which /aren't locks/ but overload interrupt levels), freeze contexts
(which /aren't locks/), and workqueue locks which /aren't nested/ inside
transactions or freeze contexts.

But, really, I can't follow it because I have to guess at which "lock contexts"
are not locks but are something else.

Cheers,

Dave.
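A minimal sketch of the caller-side change Dave describes, assuming the 4.8-era
xfs_sync_sb() in fs/xfs/libxfs/xfs_sb.c passes XFS_TRANS_NO_WRITECOUNT straight
through to xfs_trans_alloc(); the context lines are abbreviated and this is not
a tested patch from the thread:

--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ xfs_sync_sb(
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_sb, 0, 0,
-			XFS_TRANS_NO_WRITECOUNT, &tp);
+	/* run the superblock sync transaction with GFP_NOFS allocations */
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_sb, 0, 0,
+			XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT, &tp);

With this, kmem_zone_zalloc() in xfs_trans_alloc() sees XFS_TRANS_NOFS and uses
KM_NOFS, and the xfs_trans_alloc() implementation itself stays unchanged.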
On 10/08, Dave Chinner wrote:
>
> On Fri, Oct 07, 2016 at 07:15:18PM +0200, Oleg Nesterov wrote:
> > > >
> > > > --- x/fs/xfs/xfs_trans.c
> > > > +++ x/fs/xfs/xfs_trans.c
> > > > @@ -245,7 +245,8 @@ xfs_trans_alloc(
> > > > 	atomic_inc(&mp->m_active_trans);
> > > >
> > > > 	tp = kmem_zone_zalloc(xfs_trans_zone,
> > > > -		(flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP);
> > > > +		(flags & (XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT))
> > > > +			? KM_NOFS : KM_SLEEP);
> > > > 	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
> > > > 	tp->t_flags = flags;
> > > > 	tp->t_mountp = mp;
> > >
> > > Brief examination says the caller should set XFS_TRANS_NOFS, not change the
> > > implementation to make the XFS_TRANS_NO_WRITECOUNT flag also mean
> > > XFS_TRANS_NOFS.
> >
> > I didn't mean that the change above can fix the problem, and I don't really
> > understand your suggestion.
>
> xfs_sync_sb() does:
>
> 	tp = xfs_trans_alloc(... , XFS_TRANS_NO_WRITECOUNT, ....);
>
> but it's running in a GFP_NOFS context when a freeze is being finalised.
> So, rather than changing what XFS_TRANS_NO_WRITECOUNT does in
> xfs_trans_alloc(), we should tell it to do a GFP_NOFS allocation, i.e.
>
> 	tp = xfs_trans_alloc(... , XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT, ....);

Ah. This is clear, but I am not sure it is enough.

> > Obviously any GFP_FS allocation in the xfs_fs_freeze() paths will trigger
> > the same warning.
>
> Of which there should be none except for that xfs_trans_alloc() call.

Really? Again, I can easily be wrong, but when I look at the xfs_fs_freeze()
paths I can see at least

	xfs_fs_freeze()->xfs_quiesce_attr()->xfs_log_quiesce()->xfs_log_unmount_write()
		->xfs_log_reserve()->xlog_ticket_alloc(KM_SLEEP)

But I can test the change above; perhaps this call chain is not possible...

> > I added this hack
> >
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -1333,10 +1333,15 @@ xfs_fs_freeze(
> >  	struct super_block	*sb)
> >  {
> >  	struct xfs_mount	*mp = XFS_M(sb);
> > +	int			ret;
> >
> > +	current->flags |= PF_FSTRANS;	// tell kmem_flags_convert() to remove GFP_FS
> >  	xfs_save_resvblks(mp);
> >  	xfs_quiesce_attr(mp);
> > -	return xfs_sync_sb(mp, true);
> > +	ret = xfs_sync_sb(mp, true);
> > +	current->flags &= ~PF_FSTRANS;
> > +
> > +	return ret;
> >  }
>
> /me shudders

Don't worry, this debugging change won't escape my testing machine!

> > just for testing purposes, and after that I got another warning below. I haven't
> > read it carefully yet, but _at first glance_ it looks like the lock inversion
> > uncovered by 2/2, although I can easily be wrong. Can cancel_delayed_work_sync(l_work)
> > under sb_internal hang if xfs_log_worker() waits for this rwsem?
>
> Actually: I *can't read it*. I've got no fucking clue what lockdep is trying
> to say here. This /looks/ like lockdep getting confused

I can almost never understand what lockdep tells me, it is too clever for me.
But this time I think it is right.

Suppose that freeze_super() races with the xfs_log_worker() callback.

freeze_super() takes sb_internal, and xfs_log_quiesce() calls
cancel_delayed_work_sync(l_work). This will sleep until xfs_log_worker()
finishes.

xfs_log_worker() does a __GFP_FS allocation, triggers reclaim, and blocks on
the same sb_internal lock, say in the xfs_do_writepage()->xfs_trans_alloc()
path.

Deadlock. The worker thread can't take sb_internal held by freeze_super(), and
cancel_delayed_work_sync() will sleep forever because xfs_log_worker() can't
finish.

So xfs_log_worker() should run in a GFP_NOFS context too.
And perhaps the change above in xfs_trans_alloc(), or in xfs_sync_sb(), can
help, if it doesn't do other allocations. I dunno.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
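A purely illustrative sketch of Oleg's "xfs_log_worker() should run in a
GFP_NOFS context" conclusion, reusing the PF_FSTRANS trick from his
xfs_fs_freeze() test hack (kmem_flags_convert() strips __GFP_FS when
PF_FSTRANS is set). The wrapper name is hypothetical and this is not a patch
anyone in the thread proposed:

/*
 * Hypothetical wrapper, not from the thread: run the existing log worker
 * with PF_FSTRANS set so kmem_flags_convert() strips __GFP_FS from any
 * allocation it makes, keeping reclaim from re-entering the filesystem
 * while freeze_super() holds sb_internal and waits for the worker.
 */
static void
xfs_log_worker_nofs(
	struct work_struct	*work)
{
	bool	was_set = current->flags & PF_FSTRANS;

	current->flags |= PF_FSTRANS;
	xfs_log_worker(work);		/* the existing worker body */
	if (!was_set)
		current->flags &= ~PF_FSTRANS;
}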
On Sun, Oct 09, 2016 at 06:14:57PM +0200, Oleg Nesterov wrote:
> On 10/08, Dave Chinner wrote:
> >
> > On Fri, Oct 07, 2016 at 07:15:18PM +0200, Oleg Nesterov wrote:
> > > > >
> > > > > --- x/fs/xfs/xfs_trans.c
> > > > > +++ x/fs/xfs/xfs_trans.c
> > > > > @@ -245,7 +245,8 @@ xfs_trans_alloc(
> > > > > 	atomic_inc(&mp->m_active_trans);
> > > > >
> > > > > 	tp = kmem_zone_zalloc(xfs_trans_zone,
> > > > > -		(flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP);
> > > > > +		(flags & (XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT))
> > > > > +			? KM_NOFS : KM_SLEEP);
> > > > > 	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
> > > > > 	tp->t_flags = flags;
> > > > > 	tp->t_mountp = mp;
> > > >
> > > > Brief examination says the caller should set XFS_TRANS_NOFS, not change the
> > > > implementation to make the XFS_TRANS_NO_WRITECOUNT flag also mean
> > > > XFS_TRANS_NOFS.
> > >
> > > I didn't mean that the change above can fix the problem, and I don't really
> > > understand your suggestion.
> >
> > xfs_sync_sb() does:
> >
> > 	tp = xfs_trans_alloc(... , XFS_TRANS_NO_WRITECOUNT, ....);
> >
> > but it's running in a GFP_NOFS context when a freeze is being finalised.
> > So, rather than changing what XFS_TRANS_NO_WRITECOUNT does in
> > xfs_trans_alloc(), we should tell it to do a GFP_NOFS allocation, i.e.
> >
> > 	tp = xfs_trans_alloc(... , XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT, ....);
>
> Ah. This is clear, but I am not sure it is enough.
>
> > > Obviously any GFP_FS allocation in the xfs_fs_freeze() paths will trigger
> > > the same warning.
> >
> > Of which there should be none except for that xfs_trans_alloc() call.
>
> Really?

/Should/ is an indication of *intent*. Reality is that there may be *bugs*.
We all know that testing can't prove the absence of bugs, so even after years
and years of exercising the code without producing any evidence, there may
still be problems.

So, it's time to waste more time explaining why lockdep is telling us about
something that *isn't a bug*.

> Again, I can easily be wrong, but when I look at the xfs_fs_freeze()
> paths I can see at least
>
> 	xfs_fs_freeze()->xfs_quiesce_attr()->xfs_log_quiesce()->xfs_log_unmount_write()
> 		->xfs_log_reserve()->xlog_ticket_alloc(KM_SLEEP)

So, the problem being indicated here is that memory reclaim might either
a) try to write back dirty pages (which require allocation transactions),
b) run a shrinker that requires running a transaction, or c) run periodic
background inode reclaim.

For the case of a), this /can't happen/ because we've already run the part of
a freeze that stops pages being dirtied, written them all back, and made them
clean. So we won't run transactions from page cache reclaim, and so we can't
deadlock.

For the case of b), well, that's even easier - the only shrinker path we care
about here is inode reclaim through super_cache_scan(). Before that shrinker
runs anything it calls:

	if (!trylock_super(sb))
		return SHRINK_STOP;

Now, we're running the memory allocation from the freeze context, which means
we are holding the sb->s_umount semaphore in write mode. That means the
shrinker is going to /fail to lock the superblock/ and therefore not run any
reclaim on that superblock. IOWs, while we hold the s_umount lock in write
mode across a memory allocation, the shrinkers run in GFP_NOFS mode
automatically. So we can't run transactions from memory reclaim, and so we
can't deadlock.
For the case of c), xfs_quiesce_attr() does this:

	/* force the log to unpin objects from the now complete transactions */
	xfs_log_force(mp, XFS_LOG_SYNC);

	/* reclaim inodes to do any IO before the freeze completes */
	xfs_reclaim_inodes(mp, 0);
	xfs_reclaim_inodes(mp, SYNC_WAIT);

We basically unpin, clean, and reclaim all the unused inodes in the XFS inode
cache. With the shrinker not reclaiming any inodes, and there being no cached,
dirty, unreclaimed inodes remaining in the cache, there can be no background
memory allocations, transactions or IO-triggered memory reclaim in this
filesystem. At this point, memory reclaim /should never block/ trying to
reclaim objects from this filesystem that require transactions to free.

From this, it should be obvious that we don't even need to change the code in
xfs_sync_sb() - in the freeze context in which the allocation is run we've got
a clean filesystem where memory reclaim won't block on the filesystem being
frozen, so the code is safe as it stands.

> > > just for testing purposes, and after that I got another warning below. I haven't
> > > read it carefully yet, but _at first glance_ it looks like the lock inversion
> > > uncovered by 2/2, although I can easily be wrong. Can cancel_delayed_work_sync(l_work)
> > > under sb_internal hang if xfs_log_worker() waits for this rwsem?
> >
> > Actually: I *can't read it*. I've got no fucking clue what lockdep is trying
> > to say here. This /looks/ like lockdep getting confused
>
> I can almost never understand what lockdep tells me, it is too clever for me.
> But this time I think it is right.
>
> Suppose that freeze_super() races with the xfs_log_worker() callback.
>
> freeze_super() takes sb_internal, and xfs_log_quiesce() calls
> cancel_delayed_work_sync(l_work). This will sleep until xfs_log_worker()
> finishes.
>
> xfs_log_worker() does a __GFP_FS allocation, triggers reclaim, and blocks on
> the same sb_internal lock, say in the xfs_do_writepage()->xfs_trans_alloc()
> path.

See above - xfs_log_worker() will not block on memory reclaim on this
filesystem because a) there are no dirty pages, and b) the superblock shrinker
will not get the sb->s_umount lock and hence operates for all contexts as
though they are doing GFP_NOFS allocations.

Basically, what we are seeing here is yet another case of "lockdep is just
smart enough to be really dumb", because we cannot fully express or cleanly
annotate the contexts in which it is being asked to validate. Unless we do
something we shouldn't be doing (i.e. marking GFP_KERNEL allocations that are
safe as GFP_NOFS to shut up lockdep), all we've just done is introduce another
vector for lockdep false positives...

Cheers,

Dave.
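For reference, the shrinker early-exit Dave describes in case b) lives in the
generic superblock shrinker. A heavily abridged sketch of super_cache_scan()
from fs/super.c of that era (paraphrased from memory, not copied verbatim;
exact details may differ):

static unsigned long
super_cache_scan(
	struct shrinker		*shrink,
	struct shrink_control	*sc)
{
	struct super_block	*sb;

	sb = container_of(shrink, struct super_block, s_shrink);

	/* never recurse into the fs from a context that cannot enter it */
	if (!(sc->gfp_mask & __GFP_FS))
		return SHRINK_STOP;

	/*
	 * freeze_super() holds sb->s_umount for write, so this trylock
	 * fails during a freeze and the shrinker backs off - effectively
	 * GFP_NOFS behaviour for the superblock being frozen.
	 */
	if (!trylock_super(sb))
		return SHRINK_STOP;

	/* ... prune the dentry and inode caches, then drop s_umount ... */
	up_read(&sb->s_umount);
	return 0;	/* abridged: the real code returns the freed count */
}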
Dave, I am sorry for the delay.

On 10/10, Dave Chinner wrote:
>
> So, it's time to waste more time explaining why lockdep is telling us about
> something that *isn't a bug*.
> [... snip ...]

OK, thanks. I am not surprised, although I have to admit I wasn't sure.

> Basically, what we are seeing here is yet another case of "lockdep is just
> smart enough to be really dumb", because we cannot fully express or cleanly
> annotate the contexts in which it is being asked to validate.

Yes... perhaps we can add new lockdep helpers to avoid false positives like
this one, but so far it is not clear to me what we can do. Somehow we need to
tell it to skip check_prev_add() because we know that the work function won't
take sb_internal, but at the same time we should complain if it actually does.

Let's ignore this patch for now.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
--- x/fs/xfs/xfs_trans.c
+++ x/fs/xfs/xfs_trans.c
@@ -245,7 +245,8 @@ xfs_trans_alloc(
 	atomic_inc(&mp->m_active_trans);
 
 	tp = kmem_zone_zalloc(xfs_trans_zone,
-		(flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP);
+		(flags & (XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT))
+			? KM_NOFS : KM_SLEEP);
 	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
 	tp->t_flags = flags;
 	tp->t_mountp = mp;