
[V2,2/2] fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths

Message ID 20161006171758.GA21707@redhat.com (mailing list archive)
State New, archived

Commit Message

Oleg Nesterov Oct. 6, 2016, 5:17 p.m. UTC
On 10/05, Oleg Nesterov wrote:
>
> On 10/05, Dave Chinner wrote:
> >
> > On Tue, Oct 04, 2016 at 01:43:43PM +0200, Oleg Nesterov wrote:
> >
> > > plus the following warnings:
> > >
> > > 	[ 1894.500040] run fstests generic/070 at 2016-10-04 05:03:39
> > > 	[ 1895.076655] =================================
> > > 	[ 1895.077136] [ INFO: inconsistent lock state ]
> > > 	[ 1895.077574] 4.8.0 #1 Not tainted
> > > 	[ 1895.077900] ---------------------------------
> > > 	[ 1895.078330] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
> > > 	[ 1895.078993] fsstress/18239 [HC0[0]:SC0[0]:HE1:SE1] takes:
> > > 	[ 1895.079522]  (&xfs_nondir_ilock_class){++++?-}, at: [<ffffffffc049ad45>] xfs_ilock+0x165/0x210 [xfs]
> > > 	[ 1895.080529] {IN-RECLAIM_FS-W} state was registered at:
> >
> > And that is a bug in the lockdep annotations for memory allocation because it
> > fails to take into account the current task flags that are set via
> > memalloc_noio_save() to prevent vmalloc from doing GFP_KERNEL allocations. i.e.
> > in _xfs_buf_map_pages():
>
> OK, I see...
>
> I'll re-test with the following change:
>
> 	--- a/kernel/locking/lockdep.c
> 	+++ b/kernel/locking/lockdep.c
> 	@@ -2867,7 +2867,7 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
> 			return;
>
> 		/* We're only interested __GFP_FS allocations for now */
> 	-       if (!(gfp_mask & __GFP_FS))
> 	+       if ((curr->flags & PF_MEMALLOC_NOIO) || !(gfp_mask & __GFP_FS))
> 			return;
>

and with the change above "./check -b auto" finishes without lockdep warnings,
so I should probably send that change to the lockdep maintainers.
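(For context: memalloc_noio_save() works purely through a task flag; the __GFP_IO/__GFP_FS
bits are only stripped later, inside the allocator paths via memalloc_noio_flags(), so a
lockdep check that only looks at the caller's gfp_mask never sees the constraint. Roughly,
from include/linux/sched.h of this era - a sketch from memory, details may differ:)

	static inline gfp_t memalloc_noio_flags(gfp_t flags)
	{
		/* applied by the allocator itself, not at the allocation call site */
		if (unlikely(current->flags & PF_MEMALLOC_NOIO))
			flags &= ~(__GFP_IO | __GFP_FS);
		return flags;
	}

	static inline unsigned int memalloc_noio_save(void)
	{
		unsigned int flags = current->flags & PF_MEMALLOC_NOIO;
		current->flags |= PF_MEMALLOC_NOIO;
		return flags;
	}

	static inline void memalloc_noio_restore(unsigned int flags)
	{
		current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
	}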

Now, with 2/2 applied I got the following:

	[ INFO: inconsistent lock state ]
	4.8.0+ #4 Tainted: G        W
	---------------------------------
	inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
	kswapd0/32 [HC0[0]:SC0[0]:HE1:SE1] takes:
	 (sb_internal){+++++?}, at: [<ffffffff91292557>] __sb_start_write+0xb7/0xf0
	{RECLAIM_FS-ON-W} state was registered at:
	  [<ffffffff9110735f>] mark_held_locks+0x6f/0xa0
	  [<ffffffff9110a5f3>] lockdep_trace_alloc+0xd3/0x120
	  [<ffffffff9126034f>] kmem_cache_alloc+0x2f/0x280
	  [<ffffffffc023a251>] kmem_zone_alloc+0x81/0x120 [xfs]
	  [<ffffffffc02398bc>] xfs_trans_alloc+0x6c/0x130 [xfs]
	  [<ffffffffc020a2c9>] xfs_sync_sb+0x39/0x80 [xfs]
	  [<ffffffffc02332fd>] xfs_log_sbcount+0x4d/0x50 [xfs]
	  [<ffffffffc02348d7>] xfs_quiesce_attr+0x57/0xb0 [xfs]
	  [<ffffffffc0234951>] xfs_fs_freeze+0x21/0x40 [xfs]
	  [<ffffffff91291e8f>] freeze_super+0xcf/0x190
	  [<ffffffff912a521f>] do_vfs_ioctl+0x55f/0x6c0
	  [<ffffffff912a53f9>] SyS_ioctl+0x79/0x90
	  [<ffffffff918af23c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
	irq event stamp: 36471805
	hardirqs last  enabled at (36471805): [<ffffffff911f9c8d>] clear_page_dirty_for_io+0x1ed/0x2e0
	hardirqs last disabled at (36471804): [<ffffffff911f9c5d>] clear_page_dirty_for_io+0x1bd/0x2e0
	softirqs last  enabled at (36468590): [<ffffffff918b24ea>] __do_softirq+0x37a/0x44d
	softirqs last disabled at (36468579): [<ffffffff910b2f15>] irq_exit+0xe5/0xf0

	other info that might help us debug this:
	 Possible unsafe locking scenario:
	       CPU0
	       ----
	  lock(sb_internal);
	  <Interrupt>
	    lock(sb_internal);

	 *** DEADLOCK ***
	no locks held by kswapd0/32.

	stack backtrace:
	CPU: 0 PID: 32 Comm: kswapd0 Tainted: G        W       4.8.0+ #4
	Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
	 0000000000000086 00000000028a3434 ffff880139b2b520 ffffffff91449193
	 ffff880139b1a680 ffffffff928c1e70 ffff880139b2b570 ffffffff91106c75
	 0000000000000000 0000000000000001 ffff880100000001 000000000000000a
	Call Trace:
	 [<ffffffff91449193>] dump_stack+0x85/0xc2
	 [<ffffffff91106c75>] print_usage_bug+0x215/0x240
	 [<ffffffff9110722b>] mark_lock+0x58b/0x650
	 [<ffffffff91106080>] ? print_shortest_lock_dependencies+0x1a0/0x1a0
	 [<ffffffff91107c4d>] __lock_acquire+0x36d/0x1870
	 [<ffffffff911097dd>] lock_acquire+0x10d/0x200
	 [<ffffffff91292557>] ? __sb_start_write+0xb7/0xf0
	 [<ffffffff91102ecc>] percpu_down_read+0x3c/0x90
	 [<ffffffff91292557>] ? __sb_start_write+0xb7/0xf0
	 [<ffffffff91292557>] __sb_start_write+0xb7/0xf0
	 [<ffffffffc0239933>] xfs_trans_alloc+0xe3/0x130 [xfs]
	 [<ffffffffc0227dd7>] xfs_iomap_write_allocate+0x1f7/0x380 [xfs]
	 [<ffffffffc020c333>] ? xfs_map_blocks+0xe3/0x380 [xfs]
	 [<ffffffff911268b8>] ? rcu_read_lock_sched_held+0x58/0x60
	 [<ffffffffc020c47a>] xfs_map_blocks+0x22a/0x380 [xfs]
	 [<ffffffffc020dbf8>] xfs_do_writepage+0x188/0x6c0 [xfs]
	 [<ffffffffc020e16b>] xfs_vm_writepage+0x3b/0x70 [xfs]
	 [<ffffffff912049b0>] pageout.isra.46+0x190/0x380
	 [<ffffffff91207cab>] shrink_page_list+0x9ab/0xa70
	 [<ffffffff91208592>] shrink_inactive_list+0x252/0x5d0
	 [<ffffffff9120921f>] shrink_node_memcg+0x5af/0x790
	 [<ffffffff912094e1>] shrink_node+0xe1/0x320
	 [<ffffffff9120a9d7>] kswapd+0x387/0x8b0

Probably false positive? Although when I look at the comment above xfs_sync_sb()
I think that maybe something like the change below makes sense, but I know absolutely nothing
about fs/ and XFS in particular.

Oleg.




Comments

Dave Chinner Oct. 6, 2016, 9:59 p.m. UTC | #1
On Thu, Oct 06, 2016 at 07:17:58PM +0200, Oleg Nesterov wrote:
> Probably false positive? Although when I look at the comment above xfs_sync_sb()
> I think that maybe something like the change below makes sense, but I know absolutely nothing
> about fs/ and XFS in particular.
> 
> Oleg.
> 
> 
> --- x/fs/xfs/xfs_trans.c
> +++ x/fs/xfs/xfs_trans.c
> @@ -245,7 +245,8 @@ xfs_trans_alloc(
>  	atomic_inc(&mp->m_active_trans);
>  
>  	tp = kmem_zone_zalloc(xfs_trans_zone,
> -		(flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP);
> +		(flags & (XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT))
> +			? KM_NOFS : KM_SLEEP);
>  	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
>  	tp->t_flags = flags;
>  	tp->t_mountp = mp;

Brief examination says caller should set XFS_TRANS_NOFS, not change
the implementation to make the XFS_TRANS_NO_WRITECOUNT flag also mean
XFS_TRANS_NOFS.

Cheers,

Dave.
Oleg Nesterov Oct. 7, 2016, 5:15 p.m. UTC | #2
On 10/07, Dave Chinner wrote:
>
> On Thu, Oct 06, 2016 at 07:17:58PM +0200, Oleg Nesterov wrote:
> > Probably false positive? Although when I look at the comment above xfs_sync_sb()
> > I think that maybe something like the change below makes sense, but I know absolutely nothing
> > about fs/ and XFS in particular.
> >
> > Oleg.
> >
> >
> > --- x/fs/xfs/xfs_trans.c
> > +++ x/fs/xfs/xfs_trans.c
> > @@ -245,7 +245,8 @@ xfs_trans_alloc(
> >  	atomic_inc(&mp->m_active_trans);
> >
> >  	tp = kmem_zone_zalloc(xfs_trans_zone,
> > -		(flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP);
> > +		(flags & (XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT))
> > +			? KM_NOFS : KM_SLEEP);
> >  	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
> >  	tp->t_flags = flags;
> >  	tp->t_mountp = mp;
>
> Brief examination says caller should set XFS_TRANS_NOFS, not change
> the implementation to make the XFS_TRANS_NO_WRITECOUNT flag also mean
> XFS_TRANS_NOFS.

I didn't mean the change above can fix the problem, and I don't really
understand your suggestion. Obviously any GFP_FS allocation in xfs_fs_freeze()
paths will trigger the same warning.

I added this hack

	--- a/fs/xfs/xfs_super.c
	+++ b/fs/xfs/xfs_super.c
	@@ -1333,10 +1333,15 @@ xfs_fs_freeze(
		struct super_block      *sb)
	 {
		struct xfs_mount        *mp = XFS_M(sb);
	+       int ret;
	 
	+       current->flags |= PF_FSTRANS; // tell kmem_flags_convert() to remove GFP_FS
		xfs_save_resvblks(mp);
		xfs_quiesce_attr(mp);
	-       return xfs_sync_sb(mp, true);
	+       ret = xfs_sync_sb(mp, true);
	+       current->flags &= ~PF_FSTRANS;
	+
	+       return ret;
	 }

just for testing purposes and after that I got another warning below. I didn't
read it carefully yet, but _at first glance_ it looks like the lock inversion
uncovered by 2/2, although I can be easily wrong. cancel_delayed_work_sync(l_work)
under sb_internal can hang if xfs_log_worker() waits for this rwsem?

Oleg.

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
4.8.0+ #10 Tainted: G        W      
---------------------------------------------------------
kswapd0/32 just changed the state of lock:
 (sb_internal){++++.?}, at: [<ffffffffb92925c7>] __sb_start_write+0xb7/0xf0
but this lock took another, RECLAIM_FS-unsafe lock in the past:
 ((&(&log->l_work)->work)){+.+.+.}

and interrupts could create inverse lock ordering between them.

other info that might help us debug this:
 Possible interrupt unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock((&(&log->l_work)->work));
                               local_irq_disable();
                               lock(sb_internal);
                               lock((&(&log->l_work)->work));
  <Interrupt>
    lock(sb_internal);

 *** DEADLOCK ***
no locks held by kswapd0/32.

the shortest dependencies between 2nd lock and 1st lock:
 -> ((&(&log->l_work)->work)){+.+.+.} ops: 803 {
    HARDIRQ-ON-W at:
                      [<ffffffffb9107ef1>] __lock_acquire+0x611/0x1870
                      [<ffffffffb91097dd>] lock_acquire+0x10d/0x200
                      [<ffffffffb90cad49>] process_one_work+0x1c9/0x690
                      [<ffffffffb90cb25e>] worker_thread+0x4e/0x480
                      [<ffffffffb90d1d13>] kthread+0xf3/0x110
                      [<ffffffffb9a2146f>] ret_from_fork+0x1f/0x40
    SOFTIRQ-ON-W at:
                      [<ffffffffb9107f23>] __lock_acquire+0x643/0x1870
                      [<ffffffffb91097dd>] lock_acquire+0x10d/0x200
                      [<ffffffffb90cad49>] process_one_work+0x1c9/0x690
                      [<ffffffffb90cb25e>] worker_thread+0x4e/0x480
                      [<ffffffffb90d1d13>] kthread+0xf3/0x110
                      [<ffffffffb9a2146f>] ret_from_fork+0x1f/0x40
    RECLAIM_FS-ON-W at:
                         [<ffffffffb910735f>] mark_held_locks+0x6f/0xa0
                         [<ffffffffb910a5f3>] lockdep_trace_alloc+0xd3/0x120
                         [<ffffffffb92603bf>] kmem_cache_alloc+0x2f/0x280
                         [<ffffffffb941f5e1>] kmem_zone_alloc+0x81/0x120
                         [<ffffffffb941ec4c>] xfs_trans_alloc+0x6c/0x130
                         [<ffffffffb93ef649>] xfs_sync_sb+0x39/0x80
                         [<ffffffffb9423914>] xfs_log_worker+0xd4/0xf0
                         [<ffffffffb90cad70>] process_one_work+0x1f0/0x690
                         [<ffffffffb90cb25e>] worker_thread+0x4e/0x480
                         [<ffffffffb90d1d13>] kthread+0xf3/0x110
                         [<ffffffffb9a2146f>] ret_from_fork+0x1f/0x40
    INITIAL USE at:
                     [<ffffffffb9107b2b>] __lock_acquire+0x24b/0x1870
                     [<ffffffffb91097dd>] lock_acquire+0x10d/0x200
                     [<ffffffffb90cad49>] process_one_work+0x1c9/0x690
                     [<ffffffffb90cb25e>] worker_thread+0x4e/0x480
                     [<ffffffffb90d1d13>] kthread+0xf3/0x110
                     [<ffffffffb9a2146f>] ret_from_fork+0x1f/0x40
  }
  ... key      at: [<ffffffffbb133758>] __key.130101+0x0/0x8
  ... acquired at:
   [<ffffffffb91097dd>] lock_acquire+0x10d/0x200
   [<ffffffffb90c981c>] flush_work+0x4c/0x2c0
   [<ffffffffb90cbe01>] __cancel_work_timer+0x131/0x1f0
   [<ffffffffb90cbef3>] cancel_delayed_work_sync+0x13/0x20
   [<ffffffffb94245a4>] xfs_log_quiesce+0x34/0x3b0
   [<ffffffffb9419c6d>] xfs_quiesce_attr+0x6d/0xb0
   [<ffffffffb9419ce3>] xfs_fs_freeze+0x33/0x50
   [<ffffffffb9291eff>] freeze_super+0xcf/0x190
   [<ffffffffb92a528f>] do_vfs_ioctl+0x55f/0x6c0
   [<ffffffffb92a5469>] SyS_ioctl+0x79/0x90
   [<ffffffffb9a2123c>] entry_SYSCALL_64_fastpath+0x1f/0xbd

-> (sb_internal){++++.?} ops: 28351355 {
   HARDIRQ-ON-W at:
                    [<ffffffffb9107ef1>] __lock_acquire+0x611/0x1870
                    [<ffffffffb91097dd>] lock_acquire+0x10d/0x200
                    [<ffffffffb9a1e3a6>] down_write+0x36/0x70
                    [<ffffffffb9102fb3>] percpu_down_write+0x33/0x100
                    [<ffffffffb9291eed>] freeze_super+0xbd/0x190
                    [<ffffffffb92a528f>] do_vfs_ioctl+0x55f/0x6c0
                    [<ffffffffb92a5469>] SyS_ioctl+0x79/0x90
                    [<ffffffffb9a2123c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
   HARDIRQ-ON-R at:
                    [<ffffffffb9107bed>] __lock_acquire+0x30d/0x1870
                    [<ffffffffb91097dd>] lock_acquire+0x10d/0x200
                    [<ffffffffb91030c8>] percpu_down_read_trylock+0x48/0xa0
                    [<ffffffffb929258a>] __sb_start_write+0x7a/0xf0
                    [<ffffffffb941ecc3>] xfs_trans_alloc+0xe3/0x130
                    [<ffffffffb940dcba>] xfs_vn_update_time+0x7a/0x210
                    [<ffffffffb92b1038>] touch_atime+0xa8/0xd0
                    [<ffffffffb91ed573>] generic_file_read_iter+0x703/0x8d0
                    [<ffffffffb9401513>] xfs_file_buffered_aio_read+0x63/0x170
                    [<ffffffffb9401688>] xfs_file_read_iter+0x68/0xc0
                    [<ffffffffb928de90>] __vfs_read+0xe0/0x150
                    [<ffffffffb928efb5>] vfs_read+0x95/0x140
                    [<ffffffffb92904e8>] SyS_read+0x58/0xc0
                    [<ffffffffb9003edc>] do_syscall_64+0x6c/0x1e0
                    [<ffffffffb9a212ff>] return_from_SYSCALL_64+0x0/0x7a
   SOFTIRQ-ON-W at:
                    [<ffffffffb9107f23>] __lock_acquire+0x643/0x1870
                    [<ffffffffb91097dd>] lock_acquire+0x10d/0x200
                    [<ffffffffb9a1e3a6>] down_write+0x36/0x70
                    [<ffffffffb9102fb3>] percpu_down_write+0x33/0x100
                    [<ffffffffb9291eed>] freeze_super+0xbd/0x190
                    [<ffffffffb92a528f>] do_vfs_ioctl+0x55f/0x6c0
                    [<ffffffffb92a5469>] SyS_ioctl+0x79/0x90
                    [<ffffffffb9a2123c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
   SOFTIRQ-ON-R at:
                    [<ffffffffb9107f23>] __lock_acquire+0x643/0x1870
                    [<ffffffffb91097dd>] lock_acquire+0x10d/0x200
                    [<ffffffffb91030c8>] percpu_down_read_trylock+0x48/0xa0
                    [<ffffffffb929258a>] __sb_start_write+0x7a/0xf0
                    [<ffffffffb941ecc3>] xfs_trans_alloc+0xe3/0x130
                    [<ffffffffb940dcba>] xfs_vn_update_time+0x7a/0x210
                    [<ffffffffb92b1038>] touch_atime+0xa8/0xd0
                    [<ffffffffb91ed573>] generic_file_read_iter+0x703/0x8d0
                    [<ffffffffb9401513>] xfs_file_buffered_aio_read+0x63/0x170
                    [<ffffffffb9401688>] xfs_file_read_iter+0x68/0xc0
                    [<ffffffffb928de90>] __vfs_read+0xe0/0x150
                    [<ffffffffb928efb5>] vfs_read+0x95/0x140
                    [<ffffffffb92904e8>] SyS_read+0x58/0xc0
                    [<ffffffffb9003edc>] do_syscall_64+0x6c/0x1e0
                    [<ffffffffb9a212ff>] return_from_SYSCALL_64+0x0/0x7a
   IN-RECLAIM_FS-R at:
                       [<ffffffffb9107c4d>] __lock_acquire+0x36d/0x1870
                       [<ffffffffb91097dd>] lock_acquire+0x10d/0x200
                       [<ffffffffb9102ecc>] percpu_down_read+0x3c/0x90
                       [<ffffffffb92925c7>] __sb_start_write+0xb7/0xf0
                       [<ffffffffb941ecc3>] xfs_trans_alloc+0xe3/0x130
                       [<ffffffffb940d157>] xfs_iomap_write_allocate+0x1f7/0x380
                       [<ffffffffb93f17fa>] xfs_map_blocks+0x22a/0x380
                       [<ffffffffb93f2f78>] xfs_do_writepage+0x188/0x6c0
                       [<ffffffffb93f34eb>] xfs_vm_writepage+0x3b/0x70
                       [<ffffffffb9204a20>] pageout.isra.46+0x190/0x380
                       [<ffffffffb9207d1b>] shrink_page_list+0x9ab/0xa70
                       [<ffffffffb9208602>] shrink_inactive_list+0x252/0x5d0
                       [<ffffffffb920928f>] shrink_node_memcg+0x5af/0x790
                       [<ffffffffb9209551>] shrink_node+0xe1/0x320
                       [<ffffffffb920aa47>] kswapd+0x387/0x8b0
                       [<ffffffffb90d1d13>] kthread+0xf3/0x110
                       [<ffffffffb9a2146f>] ret_from_fork+0x1f/0x40
   RECLAIM_FS-ON-R at:
                       [<ffffffffb910735f>] mark_held_locks+0x6f/0xa0
                       [<ffffffffb910a5f3>] lockdep_trace_alloc+0xd3/0x120
                       [<ffffffffb92603bf>] kmem_cache_alloc+0x2f/0x280
                       [<ffffffffb941f5e1>] kmem_zone_alloc+0x81/0x120
                       [<ffffffffb941ec4c>] xfs_trans_alloc+0x6c/0x130
                       [<ffffffffb940dcba>] xfs_vn_update_time+0x7a/0x210
                       [<ffffffffb92b1038>] touch_atime+0xa8/0xd0
                       [<ffffffffb91ed573>] generic_file_read_iter+0x703/0x8d0
                       [<ffffffffb9401513>] xfs_file_buffered_aio_read+0x63/0x170
                       [<ffffffffb9401688>] xfs_file_read_iter+0x68/0xc0
                       [<ffffffffb928de90>] __vfs_read+0xe0/0x150
                       [<ffffffffb928efb5>] vfs_read+0x95/0x140
                       [<ffffffffb92904e8>] SyS_read+0x58/0xc0
                       [<ffffffffb9003edc>] do_syscall_64+0x6c/0x1e0
                       [<ffffffffb9a212ff>] return_from_SYSCALL_64+0x0/0x7a
   INITIAL USE at:
                   [<ffffffffb9107b2b>] __lock_acquire+0x24b/0x1870
                   [<ffffffffb91097dd>] lock_acquire+0x10d/0x200
                   [<ffffffffb91030c8>] percpu_down_read_trylock+0x48/0xa0
                   [<ffffffffb929258a>] __sb_start_write+0x7a/0xf0
                   [<ffffffffb941ecc3>] xfs_trans_alloc+0xe3/0x130
                   [<ffffffffb940dcba>] xfs_vn_update_time+0x7a/0x210
                   [<ffffffffb92b1038>] touch_atime+0xa8/0xd0
                   [<ffffffffb91ed573>] generic_file_read_iter+0x703/0x8d0
                   [<ffffffffb9401513>] xfs_file_buffered_aio_read+0x63/0x170
                   [<ffffffffb9401688>] xfs_file_read_iter+0x68/0xc0
                   [<ffffffffb928de90>] __vfs_read+0xe0/0x150
                   [<ffffffffb928efb5>] vfs_read+0x95/0x140
                   [<ffffffffb92904e8>] SyS_read+0x58/0xc0
                   [<ffffffffb9003edc>] do_syscall_64+0x6c/0x1e0
                   [<ffffffffb9a212ff>] return_from_SYSCALL_64+0x0/0x7a
 }
 ... key      at: [<ffffffffba0b1a00>] xfs_fs_type+0x60/0x80
 ... acquired at:
   [<ffffffffb91061d1>] check_usage_forwards+0x151/0x160
   [<ffffffffb9106fea>] mark_lock+0x34a/0x650
   [<ffffffffb9107c4d>] __lock_acquire+0x36d/0x1870
   [<ffffffffb91097dd>] lock_acquire+0x10d/0x200
   [<ffffffffb9102ecc>] percpu_down_read+0x3c/0x90
   [<ffffffffb92925c7>] __sb_start_write+0xb7/0xf0
   [<ffffffffb941ecc3>] xfs_trans_alloc+0xe3/0x130
   [<ffffffffb940d157>] xfs_iomap_write_allocate+0x1f7/0x380
   [<ffffffffb93f17fa>] xfs_map_blocks+0x22a/0x380
   [<ffffffffb93f2f78>] xfs_do_writepage+0x188/0x6c0
   [<ffffffffb93f34eb>] xfs_vm_writepage+0x3b/0x70
   [<ffffffffb9204a20>] pageout.isra.46+0x190/0x380
   [<ffffffffb9207d1b>] shrink_page_list+0x9ab/0xa70
   [<ffffffffb9208602>] shrink_inactive_list+0x252/0x5d0
   [<ffffffffb920928f>] shrink_node_memcg+0x5af/0x790
   [<ffffffffb9209551>] shrink_node+0xe1/0x320
   [<ffffffffb920aa47>] kswapd+0x387/0x8b0
   [<ffffffffb90d1d13>] kthread+0xf3/0x110
   [<ffffffffb9a2146f>] ret_from_fork+0x1f/0x40


stack backtrace:
CPU: 0 PID: 32 Comm: kswapd0 Tainted: G        W       4.8.0+ #10
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 0000000000000086 0000000020cdca74 ffff880136c5b490 ffffffffb95b7a73
 ffffffffbae7ed80 ffff880136c5b4f0 ffff880136c5b4d0 ffffffffb91e7c5a
 ffff880136c3aee8 ffff880136c3aee8 ffff880136c3a680 ffffffffb9e8bbdd
Call Trace:
 [<ffffffffb95b7a73>] dump_stack+0x85/0xc2
 [<ffffffffb91e7c5a>] print_irq_inversion_bug.part.36+0x1a4/0x1b0
 [<ffffffffb91061d1>] check_usage_forwards+0x151/0x160
 [<ffffffffb9106fea>] mark_lock+0x34a/0x650
 [<ffffffffb9106080>] ? print_shortest_lock_dependencies+0x1a0/0x1a0
 [<ffffffffb9107c4d>] __lock_acquire+0x36d/0x1870
 [<ffffffffb9107d55>] ? __lock_acquire+0x475/0x1870
 [<ffffffffb91097dd>] lock_acquire+0x10d/0x200
 [<ffffffffb92925c7>] ? __sb_start_write+0xb7/0xf0
 [<ffffffffb9102ecc>] percpu_down_read+0x3c/0x90
 [<ffffffffb92925c7>] ? __sb_start_write+0xb7/0xf0
 [<ffffffffb92925c7>] __sb_start_write+0xb7/0xf0
 [<ffffffffb941ecc3>] xfs_trans_alloc+0xe3/0x130
 [<ffffffffb940d157>] xfs_iomap_write_allocate+0x1f7/0x380
 [<ffffffffb93f16b3>] ? xfs_map_blocks+0xe3/0x380
 [<ffffffffb91268b8>] ? rcu_read_lock_sched_held+0x58/0x60
 [<ffffffffb93f17fa>] xfs_map_blocks+0x22a/0x380
 [<ffffffffb93f2f78>] xfs_do_writepage+0x188/0x6c0
 [<ffffffffb93f34eb>] xfs_vm_writepage+0x3b/0x70
 [<ffffffffb9204a20>] pageout.isra.46+0x190/0x380
 [<ffffffffb9207d1b>] shrink_page_list+0x9ab/0xa70
 [<ffffffffb9208602>] shrink_inactive_list+0x252/0x5d0
 [<ffffffffb920928f>] shrink_node_memcg+0x5af/0x790
 [<ffffffffb9209551>] shrink_node+0xe1/0x320
 [<ffffffffb920aa47>] kswapd+0x387/0x8b0
 [<ffffffffb920a6c0>] ? mem_cgroup_shrink_node+0x2e0/0x2e0
 [<ffffffffb90d1d13>] kthread+0xf3/0x110
 [<ffffffffb9a2146f>] ret_from_fork+0x1f/0x40
 [<ffffffffb90d1c20>] ? kthread_create_on_node+0x230/0x230

Dave Chinner Oct. 7, 2016, 10:52 p.m. UTC | #3
On Fri, Oct 07, 2016 at 07:15:18PM +0200, Oleg Nesterov wrote:
> On 10/07, Dave Chinner wrote:
> >
> > On Thu, Oct 06, 2016 at 07:17:58PM +0200, Oleg Nesterov wrote:
> > > Probably false positive? Although when I look at the comment above xfs_sync_sb()
> > > I think that maybe something like the change below makes sense, but I know absolutely nothing
> > > about fs/ and XFS in particular.
> > >
> > > Oleg.
> > >
> > >
> > > --- x/fs/xfs/xfs_trans.c
> > > +++ x/fs/xfs/xfs_trans.c
> > > @@ -245,7 +245,8 @@ xfs_trans_alloc(
> > >  	atomic_inc(&mp->m_active_trans);
> > >
> > >  	tp = kmem_zone_zalloc(xfs_trans_zone,
> > > -		(flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP);
> > > +		(flags & (XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT))
> > > +			? KM_NOFS : KM_SLEEP);
> > >  	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
> > >  	tp->t_flags = flags;
> > >  	tp->t_mountp = mp;
> >
> > Brief examination says caller should set XFS_TRANS_NOFS, not change
> > the implementation to make the XFS_TRANS_NO_WRITECOUNT flag also mean
> > XFS_TRANS_NOFS.
> 
> I didn't mean the change above can fix the problem, and I don't really
> understand your suggestion.

xfs_sync_sb() does:

	tp = xfs_trans_alloc(... , XFS_TRANS_NO_WRITECOUNT, ....);

but it's running in a GFP_NOFS context when a freeze is being
finalised. So, rather than changing what XFS_TRANS_NO_WRITECOUNT
does in xfs_trans_alloc(), we should tell it to do a GFP_NOFS
allocation. i.e.

	tp = xfs_trans_alloc(... , XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT, ....);

and nothing inside xfs_trans_alloc() changes at all.
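As a rough sketch of that caller-side change (the xfs_sync_sb() call site is in
fs/xfs/libxfs/xfs_sb.c in this tree; the surrounding arguments are reproduced from
memory and may differ slightly):

	--- a/fs/xfs/libxfs/xfs_sb.c
	+++ b/fs/xfs/libxfs/xfs_sb.c
	@@ xfs_sync_sb()
	-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_sb, 0, 0,
	-			XFS_TRANS_NO_WRITECOUNT, &tp);
	+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_sb, 0, 0,
	+			XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT, &tp);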

> Obviously any GFP_FS allocation in xfs_fs_freeze()
> paths will trigger the same warning.

Of which there should be none except for that xfs_trans_alloc()
call.

> I added this hack
> 
> 	--- a/fs/xfs/xfs_super.c
> 	+++ b/fs/xfs/xfs_super.c
> 	@@ -1333,10 +1333,15 @@ xfs_fs_freeze(
> 		struct super_block      *sb)
> 	 {
> 		struct xfs_mount        *mp = XFS_M(sb);
> 	+       int ret;
> 	 
> 	+       current->flags |= PF_FSTRANS; // tell kmem_flags_convert() to remove GFP_FS
> 		xfs_save_resvblks(mp);
> 		xfs_quiesce_attr(mp);
> 	-       return xfs_sync_sb(mp, true);
> 	+       ret = xfs_sync_sb(mp, true);
> 	+       current->flags &= ~PF_FSTRANS;
> 	+
> 	+       return ret;
> 	 }

/me shudders

> just for testing purposes and after that I got another warning below. I didn't
> read it carefully yet, but _at first glance_ it looks like the lock inversion
> uncovered by 2/2, although I can be easily wrong. cancel_delayed_work_sync(l_work)
> under sb_internal can hang if xfs_log_worker() waits for this rwsem?

Actually: I *can't read it*. I've got no fucking clue what lockdep
is trying to say here. This /looks/ like a lockdep is getting
confused between memory reclaim contexts (which /aren't locks/ but
overload interrupt levels) and freeze contexts (which /aren't locks/)
and workqueue locks which /aren't nested/ inside transactions or
freeze contexts. But, really, I can't follow it because I have to
guess at what "lock contexts" are not locks but are something else.

cheers,

Dave.
Oleg Nesterov Oct. 9, 2016, 4:14 p.m. UTC | #4
On 10/08, Dave Chinner wrote:
>
> On Fri, Oct 07, 2016 at 07:15:18PM +0200, Oleg Nesterov wrote:
> > > >
> > > > --- x/fs/xfs/xfs_trans.c
> > > > +++ x/fs/xfs/xfs_trans.c
> > > > @@ -245,7 +245,8 @@ xfs_trans_alloc(
> > > >  	atomic_inc(&mp->m_active_trans);
> > > >
> > > >  	tp = kmem_zone_zalloc(xfs_trans_zone,
> > > > -		(flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP);
> > > > +		(flags & (XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT))
> > > > +			? KM_NOFS : KM_SLEEP);
> > > >  	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
> > > >  	tp->t_flags = flags;
> > > >  	tp->t_mountp = mp;
> > >
> > > Brief examination says caller should set XFS_TRANS_NOFS, not change
> > > the implementation to make the XFS_TRANS_NO_WRITECOUNT flag also mean
> > > XFS_TRANS_NOFS.
> >
> > I didn't mean the change above can fix the problem, and I don't really
> > understand your suggestion.
>
> xfs_sync_sb() does:
>
> 	tp = xfs_trans_alloc(... , XFS_TRANS_NO_WRITECOUNT, ....);
>
> but it's running in a GFP_NOFS context when a freeze is being
> finalised. So, rather than changing what XFS_TRANS_NO_WRITECOUNT
> does in xfs_trans_alloc(), we should tell it to do a GFP_NOFS
> allocation. i.e.
>
> 	tp = xfs_trans_alloc(... , XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT, ....);

Ah. This is clear but I am not sure it is enough,

> > Obviously any GFP_FS allocation in xfs_fs_freeze()
> > paths will trigger the same warning.
>
> Of which there should be none except for that xfs_trans_alloc()
> call.

Really? Again, I can be easily wrong, but when I look at the xfs_fs_freeze()
paths I can see

xfs_fs_freeze()->xfs_quiesce_attr()->xfs_log_quiesce()->xfs_log_unmount_write()
->xfs_log_reserve()->xlog_ticket_alloc(KM_SLEEP)

at least. But I can test the change above, perhaps this call chain is
not possible...
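(For reference: KM_SLEEP maps to a GFP_KERNEL, i.e. __GFP_FS, allocation unless the
task has PF_FSTRANS set or the caller passes KM_NOFS, so xlog_ticket_alloc(KM_SLEEP)
in the chain above would presumably hit the same check. Roughly, from fs/xfs/kmem.h,
trimmed to the relevant part:)

	static inline gfp_t
	kmem_flags_convert(xfs_km_flags_t flags)
	{
		gfp_t	lflags;

		if (flags & KM_NOSLEEP) {
			lflags = GFP_ATOMIC | __GFP_NOWARN;
		} else {
			lflags = GFP_KERNEL | __GFP_NOWARN;
			/* PF_FSTRANS or KM_NOFS turns this into a NOFS allocation */
			if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
				lflags &= ~__GFP_FS;
		}
		return lflags;
	}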

> > I added this hack
> >
> > 	--- a/fs/xfs/xfs_super.c
> > 	+++ b/fs/xfs/xfs_super.c
> > 	@@ -1333,10 +1333,15 @@ xfs_fs_freeze(
> > 		struct super_block      *sb)
> > 	 {
> > 		struct xfs_mount        *mp = XFS_M(sb);
> > 	+       int ret;
> >
> > 	+       current->flags |= PF_FSTRANS; // tell kmem_flags_convert() to remove GFP_FS
> > 		xfs_save_resvblks(mp);
> > 		xfs_quiesce_attr(mp);
> > 	-       return xfs_sync_sb(mp, true);
> > 	+       ret = xfs_sync_sb(mp, true);
> > 	+       current->flags &= ~PF_FSTRANS;
> > 	+
> > 	+       return ret;
> > 	 }
>
> /me shudders

don't worry, this debugging change won't escape my testing machine!

> > just for testing purposes and after that I got another warning below. I didn't
> > read it carefully yet, but _at first glance_ it looks like the lock inversion
> > uncovered by 2/2, although I can be easily wrong. cancel_delayed_work_sync(l_work)
> > under sb_internal can hang if xfs_log_worker() waits for this rwsem?
>
> Actually: I *can't read it*. I've got no fucking clue what lockdep
> is trying to say here. This /looks/ like a lockdep is getting
> confused

I can almost never understand what lockdep tells me, it is too clever for me.
But this time I think it is right.

Suppose that freeze_super() races with xfs_log_worker() callback.

freeze_super() takes sb_internal lock and xfs_log_quiesce() calls
cancel_delayed_work_sync(l_work). This will sleep until xfs_log_worker()
finishes.

xfs_log_worker() does a __GFP_FS alloc, triggers reclaim, and blocks
on the same sb_internal lock. Say, in xfs_do_writepage()->xfs_trans_alloc()
path.

Deadlock. The worker thread can't take sb_internal, which is held by freeze_super(),
so cancel_delayed_work_sync() will sleep forever because xfs_log_worker()
can't finish.

So xfs_log_worker() should run in a GFP_NOFS context too. And perhaps
the change above in xfs_trans_alloc() or in xfs_sync_sb() can help if
it doesn't do other allocations, I dunno.

Oleg.

Dave Chinner Oct. 10, 2016, 1:02 a.m. UTC | #5
On Sun, Oct 09, 2016 at 06:14:57PM +0200, Oleg Nesterov wrote:
> On 10/08, Dave Chinner wrote:
> >
> > On Fri, Oct 07, 2016 at 07:15:18PM +0200, Oleg Nesterov wrote:
> > > > >
> > > > > --- x/fs/xfs/xfs_trans.c
> > > > > +++ x/fs/xfs/xfs_trans.c
> > > > > @@ -245,7 +245,8 @@ xfs_trans_alloc(
> > > > >  	atomic_inc(&mp->m_active_trans);
> > > > >
> > > > >  	tp = kmem_zone_zalloc(xfs_trans_zone,
> > > > > -		(flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP);
> > > > > +		(flags & (XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT))
> > > > > +			? KM_NOFS : KM_SLEEP);
> > > > >  	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
> > > > >  	tp->t_flags = flags;
> > > > >  	tp->t_mountp = mp;
> > > >
> > > > Brief examination says caller should set XFS_TRANS_NOFS, not change
> > > > the implementation to make the XFS_TRANS_NO_WRITECOUNT flag also mean
> > > > XFS_TRANS_NOFS.
> > >
> > > I didn't mean the change above can fix the problem, and I don't really
> > > understand your suggestion.
> >
> > xfs_sync_sb() does:
> >
> > 	tp = xfs_trans_alloc(... , XFS_TRANS_NO_WRITECOUNT, ....);
> >
> > but it's running in a GFP_NOFS context when a freeze is being
> > finalised. So, rather than changing what XFS_TRANS_NO_WRITECOUNT
> > does in xfs_trans_alloc(), we should tell it to do a GFP_NOFS
> > allocation. i.e.
> >
> > 	tp = xfs_trans_alloc(... , XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT, ....);
> 
> Ah. This is clear but I am not sure it is enough,
> 
> > > Obviously any GFP_FS allocation in xfs_fs_freeze()
> > > paths will trigger the same warning.
> >
> > Of which there should be none except for that xfs_trans_alloc()
> > call.
> 
> Really?

/Should/ is an indication of *intent*. Reality is that there may be
*bugs*. We all know that testing can't prove the absence of bugs, so
even after years and years of exercising the code without producing any
evidence, there may still be problems.

So, it's time to waste more time explaining why lockdep is telling
us about something that *isn't a bug*.

> Again, I can be easily wrong, but when I look at the xfs_fs_freeze()
> paths I can see
> 
> xfs_fs_freeze()->xfs_quiesce_attr()->xfs_log_quiesce()->xfs_log_unmount_write()
> ->xfs_log_reserve()->xlog_ticket_alloc(KM_SLEEP)

So, the problem being indicated here is that memory reclaim might
a) try to write back dirty pages (which requires allocation
transactions), b) run a shrinker that requires running a
transaction, or c) trigger periodic background inode reclaim.

For the case of a), this /can't happen/ because we've already run
the part of a freeze that stops pages being dirtied and then written
them all back and made them clean. So we won't run transactions from
page cache reclaim and so we can't deadlock.

For the case of b), well, that's even easier - the only shrinker
path we care about here is inode reclaim through super_cache_scan().
Before that shrinker runs anything it calls:

	if (!trylock_super(sb))
		return SHRINK_STOP;

Now, we're running memory allocation for the freeze context, which
means we are holding the sb->s_umount semaphore in write mode. That
means the shrinker is going to /fail to lock the superblock/ and
therefore not run any reclaim on that superblock.

IOWs, while we hold the s_umount lock in write mode across a memory
allocation, the shrinkers run in GFP_NOFS mode automatically. So we
can't run transactions from memory reclaim and so we can't
deadlock.
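(trylock_super() is essentially just a down_read_trylock() of sb->s_umount; roughly,
from fs/super.c:)

	static bool trylock_super(struct super_block *sb)
	{
		if (down_read_trylock(&sb->s_umount)) {
			if (!hlist_unhashed(&sb->s_instances) &&
			    sb->s_root && (sb->s_flags & MS_BORN))
				return true;
			up_read(&sb->s_umount);
		}
		return false;
	}

Since freeze_super() holds sb->s_umount for write across this allocation, the trylock
fails and super_cache_scan() bails out with SHRINK_STOP before touching the filesystem.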

For the case of c), xfs_quiesce_attr() does this:

	/* force the log to unpin objects from the now complete transactions */
	xfs_log_force(mp, XFS_LOG_SYNC);

	/* reclaim inodes to do any IO before the freeze completes */
	xfs_reclaim_inodes(mp, 0);
	xfs_reclaim_inodes(mp, SYNC_WAIT);

We basically unpin, clean, and reclaim all the unused inodes in the XFS
inode cache. With the shrinker not reclaiming any inodes, and there
being no cached, dirty, unreclaimed inodes remaining in the cache,
there can be no background memory allocations, transactions or IO
triggered memory reclaim in this filesystem. At this point, memory
reclaim /should never block/ trying to reclaim objects from this
filesystem that require transactions to free.

From this, it should be obvious that we don't even need to
change the code in xfs_sync_sb() - in the freeze context in which the
allocation runs we've got a clean filesystem where memory reclaim
won't block on the filesystem being frozen, so the code is safe as
it stands.

> > > just for testing purposes and after that I got another warning below. I didn't
> > > read it carefully yet, but _at first glance_ it looks like the lock inversion
> > > uncovered by 2/2, although I can be easily wrong. cancel_delayed_work_sync(l_work)
> > > under sb_internal can hang if xfs_log_worker() waits for this rwsem?
> >
> > Actually: I *can't read it*. I've got no fucking clue what lockdep
> > is trying to say here. This /looks/ like a lockdep is getting
> > confused
> 
> I can almost never understand what lockdep tells me, it is too clever for me.
> But this time I think it is right.
> 
> Suppose that freeze_super() races with xfs_log_worker() callback.
> 
> freeze_super() takes sb_internal lock and xfs_log_quiesce() calls
> cancel_delayed_work_sync(l_work). This will sleep until xfs_log_worker()
> finishes.
> 
> xfs_log_worker() does a __GFP_FS alloc, triggers reclaim, and blocks
> on the same sb_internal lock. Say, in xfs_do_writepage()->xfs_trans_alloc()
> path.

See above - xfs_log_worker will not block on memory reclaim on the
filesystem because a) there are no dirty pages and b) the superblock
shrinker will not get the sb->s_umount lock and hence operates for
all contexts as though they are doing GFP_NOFS allocations.

Basically, what we are seeing here is yet another case of "lockdep
is just smart enough to be really dumb" because we cannot fully
express or cleanly annotate the contexts in which it is being asked
to validate. Unless we do something we shouldn't be doing (i.e.
marking GFP_KERNEL allocations that are safe with GFP_NOFS to shut
up lockdep), all we've just done is introduce another vector for
lockdep false positives...

Cheers,

Dave.
Oleg Nesterov Oct. 13, 2016, 4:58 p.m. UTC | #6
Dave, I am sorry for delay.

On 10/10, Dave Chinner wrote:
>
> So, it's time to waste more time explaining why lockdep is telling
> us about something that *isn't a bug*.
> [... snip ...]

OK, thanks. I am not surprised although I have to admit I wasn't sure.

> Basically, what we are seeing here is yet another case of "lockdep
> is just smart enough to be really dumb" because we cannot fully
> express or cleanly annotate the contexts in which it is being asked
> to validate.

Yes... perhaps we can add new lockdep helpers to avoid false
positives like this one, but so far it is not clear to me what we can do.
Somehow we need to tell it to avoid check_prev_add() because we know
that the work function won't take sb_internal, but at the same time we
still want it to complain if the work function actually does take it. Let's ignore this patch for now.

Oleg.


Patch

--- x/fs/xfs/xfs_trans.c
+++ x/fs/xfs/xfs_trans.c
@@ -245,7 +245,8 @@  xfs_trans_alloc(
 	atomic_inc(&mp->m_active_trans);
 
 	tp = kmem_zone_zalloc(xfs_trans_zone,
-		(flags & XFS_TRANS_NOFS) ? KM_NOFS : KM_SLEEP);
+		(flags & (XFS_TRANS_NOFS | XFS_TRANS_NO_WRITECOUNT))
+			? KM_NOFS : KM_SLEEP);
 	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
 	tp->t_flags = flags;
 	tp->t_mountp = mp;