diff mbox series

[2/2] xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim

Message ID 20200615160830.8471-3-longman@redhat.com (mailing list archive)
State Superseded, archived
Headers show
Series sched, xfs: Add PF_MEMALLOC_NOLOCKDEP to fix lockdep problem in xfs | expand

Commit Message

Waiman Long June 15, 2020, 4:08 p.m. UTC
Depending on the workloads, the following circular locking dependency
warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo
lock) may show up:

======================================================
WARNING: possible circular locking dependency detected
5.0.0-rc1+ #60 Tainted: G        W
------------------------------------------------------
fsfreeze/4346 is trying to acquire lock:
0000000026f1d784 (fs_reclaim){+.+.}, at:
fs_reclaim_acquire.part.19+0x5/0x30

but task is already holding lock:
0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650

which lock already depends on the new lock.
  :
 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(sb_internal);
                               lock(fs_reclaim);
                               lock(sb_internal);
  lock(fs_reclaim);

 *** DEADLOCK ***

4 locks held by fsfreeze/4346:
 #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650
 #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290
 #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650
 #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650

stack backtrace:
Call Trace:
 dump_stack+0xe0/0x19a
 print_circular_bug.isra.10.cold.34+0x2f4/0x435
 check_prev_add.constprop.19+0xca1/0x15f0
 validate_chain.isra.14+0x11af/0x3b50
 __lock_acquire+0x728/0x1200
 lock_acquire+0x269/0x5a0
 fs_reclaim_acquire.part.19+0x29/0x30
 fs_reclaim_acquire+0x19/0x20
 kmem_cache_alloc+0x3e/0x3f0
 kmem_zone_alloc+0x79/0x150
 xfs_trans_alloc+0xfa/0x9d0
 xfs_sync_sb+0x86/0x170
 xfs_log_sbcount+0x10f/0x140
 xfs_quiesce_attr+0x134/0x270
 xfs_fs_freeze+0x4a/0x70
 freeze_super+0x1af/0x290
 do_vfs_ioctl+0xedc/0x16c0
 ksys_ioctl+0x41/0x80
 __x64_sys_ioctl+0x73/0xa9
 do_syscall_64+0x18f/0xd23
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

This is a false positive as all the dirty pages are flushed out before
the filesystem can be frozen.

Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock
may fix the issue. However, that will greatly complicate the logic and
may not be worth it.

Another way to fix it is to disable the taking of the fs_reclaim
pseudo lock when in the freezing code path as a reclaim on the
freezed filesystem is not possible. By using the newly introduced
PF_MEMALLOC_NOLOCKDEP flag, lockdep checking is disabled in
xfs_trans_alloc() if XFS_TRANS_NO_WRITECOUNT flag is set.

In the freezing path, there is another path where memory allocation
is being done without the XFS_TRANS_NO_WRITECOUNT flag:

  xfs_fs_freeze()
  => xfs_quiesce_attr()
     => xfs_log_quiesce()
        => xfs_log_unmount_write()
           => xlog_unmount_write()
              => xfs_log_reserve()
	         => xlog_ticket_alloc()

In this case, we just disable fs reclaim for this particular 600 bytes
memory allocation.

Without this patch, the command sequence below will show that the lock
dependency chain sb_internal -> fs_reclaim exists.

 # fsfreeze -f /home
 # fsfreeze --unfreeze /home
 # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal

After applying the patch, such sb_internal -> fs_reclaim lock dependency
chain can no longer be found. Because of that, the locking dependency
warning will not be shown.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 fs/xfs/xfs_log.c   |  9 +++++++++
 fs/xfs/xfs_trans.c | 30 ++++++++++++++++++++++++++----
 2 files changed, 35 insertions(+), 4 deletions(-)

Comments

Darrick J. Wong June 15, 2020, 4:43 p.m. UTC | #1
On Mon, Jun 15, 2020 at 12:08:30PM -0400, Waiman Long wrote:
> Depending on the workloads, the following circular locking dependency
> warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo
> lock) may show up:
> 
> ======================================================
> WARNING: possible circular locking dependency detected
> 5.0.0-rc1+ #60 Tainted: G        W
> ------------------------------------------------------
> fsfreeze/4346 is trying to acquire lock:
> 0000000026f1d784 (fs_reclaim){+.+.}, at:
> fs_reclaim_acquire.part.19+0x5/0x30
> 
> but task is already holding lock:
> 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
> 
> which lock already depends on the new lock.
>   :
>  Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(sb_internal);
>                                lock(fs_reclaim);
>                                lock(sb_internal);
>   lock(fs_reclaim);
> 
>  *** DEADLOCK ***
> 
> 4 locks held by fsfreeze/4346:
>  #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650
>  #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290
>  #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650
>  #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
> 
> stack backtrace:
> Call Trace:
>  dump_stack+0xe0/0x19a
>  print_circular_bug.isra.10.cold.34+0x2f4/0x435
>  check_prev_add.constprop.19+0xca1/0x15f0
>  validate_chain.isra.14+0x11af/0x3b50
>  __lock_acquire+0x728/0x1200
>  lock_acquire+0x269/0x5a0
>  fs_reclaim_acquire.part.19+0x29/0x30
>  fs_reclaim_acquire+0x19/0x20
>  kmem_cache_alloc+0x3e/0x3f0
>  kmem_zone_alloc+0x79/0x150
>  xfs_trans_alloc+0xfa/0x9d0
>  xfs_sync_sb+0x86/0x170
>  xfs_log_sbcount+0x10f/0x140
>  xfs_quiesce_attr+0x134/0x270
>  xfs_fs_freeze+0x4a/0x70
>  freeze_super+0x1af/0x290
>  do_vfs_ioctl+0xedc/0x16c0
>  ksys_ioctl+0x41/0x80
>  __x64_sys_ioctl+0x73/0xa9
>  do_syscall_64+0x18f/0xd23
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> This is a false positive as all the dirty pages are flushed out before
> the filesystem can be frozen.
> 
> Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock
> may fix the issue. However, that will greatly complicate the logic and
> may not be worth it.
> 
> Another way to fix it is to disable the taking of the fs_reclaim
> pseudo lock when in the freezing code path as a reclaim on the
> freezed filesystem is not possible. By using the newly introduced
> PF_MEMALLOC_NOLOCKDEP flag, lockdep checking is disabled in
> xfs_trans_alloc() if XFS_TRANS_NO_WRITECOUNT flag is set.
> 
> In the freezing path, there is another path where memory allocation
> is being done without the XFS_TRANS_NO_WRITECOUNT flag:
> 
>   xfs_fs_freeze()
>   => xfs_quiesce_attr()
>      => xfs_log_quiesce()
>         => xfs_log_unmount_write()
>            => xlog_unmount_write()
>               => xfs_log_reserve()
> 	         => xlog_ticket_alloc()
> 
> In this case, we just disable fs reclaim for this particular 600 bytes
> memory allocation.
> 
> Without this patch, the command sequence below will show that the lock
> dependency chain sb_internal -> fs_reclaim exists.
> 
>  # fsfreeze -f /home
>  # fsfreeze --unfreeze /home
>  # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal
> 
> After applying the patch, such sb_internal -> fs_reclaim lock dependency
> chain can no longer be found. Because of that, the locking dependency
> warning will not be shown.
> 
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  fs/xfs/xfs_log.c   |  9 +++++++++
>  fs/xfs/xfs_trans.c | 30 ++++++++++++++++++++++++++----
>  2 files changed, 35 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 00fda2e8e738..33244680d0d4 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -830,8 +830,17 @@ xlog_unmount_write(
>  	xfs_lsn_t		lsn;
>  	uint			flags = XLOG_UNMOUNT_TRANS;
>  	int			error;
> +	unsigned long		pflags;
>  
> +	/*
> +	 * xfs_log_reserve() allocates memory. This can lead to fs reclaim
> +	 * which may conflicts with the unmount process. To avoid that,
> +	 * disable fs reclaim for this allocation.
> +	 */
> +	current_set_flags_nested(&pflags, PF_MEMALLOC_NOFS);
>  	error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0);
> +	current_restore_flags_nested(&pflags, PF_MEMALLOC_NOFS);
> +
>  	if (error)
>  		goto out_err;
>  
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 3c94e5ff4316..ddb10ad3f51f 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -255,7 +255,27 @@ xfs_trans_alloc(
>  	struct xfs_trans	**tpp)
>  {
>  	struct xfs_trans	*tp;
> -	int			error;
> +	int			error = 0;
> +	unsigned long		pflags = -1;
> +
> +	/*
> +	 * When XFS_TRANS_NO_WRITECOUNT is set, it means there are no dirty
> +	 * data pages in the filesystem at this point.

That's not true.  Look at the other callers of xfs_trans_alloc_empty.

Also: Why not set PF_MEMALLOC_NOFS at the start of the freeze call
chain?

--D

> +	 * So even if fs reclaim
> +	 * is being done, it won't happen to this filesystem. In this case,
> +	 * PF_MEMALLOC_NOLOCKDEP should be set to avoid false positive
> +	 * lockdep splat like:
> +	 *
> +	 *       CPU0                    CPU1
> +	 *       ----                    ----
> +	 *  lock(sb_internal);
> +	 *                               lock(fs_reclaim);
> +	 *                               lock(sb_internal);
> +	 *  lock(fs_reclaim);
> +	 *
> +	 *  *** DEADLOCK ***
> +	 */
> +	if (PF_MEMALLOC_NOLOCKDEP && (flags & XFS_TRANS_NO_WRITECOUNT))
> +		current_set_flags_nested(&pflags, PF_MEMALLOC_NOLOCKDEP);
>  
>  	/*
>  	 * Allocate the handle before we do our freeze accounting and setting up
> @@ -284,13 +304,15 @@ xfs_trans_alloc(
>  	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
>  	if (error) {
>  		xfs_trans_cancel(tp);
> -		return error;
> +		goto out;
>  	}
>  
>  	trace_xfs_trans_alloc(tp, _RET_IP_);
> -
>  	*tpp = tp;
> -	return 0;
> +out:
> +	if (PF_MEMALLOC_NOLOCKDEP && (pflags != -1))
> +		current_restore_flags_nested(&pflags, PF_MEMALLOC_NOLOCKDEP);
> +	return error;
>  }
>  
>  /*
> -- 
> 2.18.1
>
Waiman Long June 15, 2020, 8:53 p.m. UTC | #2
On 6/15/20 12:43 PM, Darrick J. Wong wrote:
> On Mon, Jun 15, 2020 at 12:08:30PM -0400, Waiman Long wrote:
>> Depending on the workloads, the following circular locking dependency
>> warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo
>> lock) may show up:
>>
>> ======================================================
>> WARNING: possible circular locking dependency detected
>> 5.0.0-rc1+ #60 Tainted: G        W
>> ------------------------------------------------------
>> fsfreeze/4346 is trying to acquire lock:
>> 0000000026f1d784 (fs_reclaim){+.+.}, at:
>> fs_reclaim_acquire.part.19+0x5/0x30
>>
>> but task is already holding lock:
>> 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
>>
>> which lock already depends on the new lock.
>>    :
>>   Possible unsafe locking scenario:
>>
>>         CPU0                    CPU1
>>         ----                    ----
>>    lock(sb_internal);
>>                                 lock(fs_reclaim);
>>                                 lock(sb_internal);
>>    lock(fs_reclaim);
>>
>>   *** DEADLOCK ***
>>
>> 4 locks held by fsfreeze/4346:
>>   #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650
>>   #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290
>>   #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650
>>   #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
>>
>> stack backtrace:
>> Call Trace:
>>   dump_stack+0xe0/0x19a
>>   print_circular_bug.isra.10.cold.34+0x2f4/0x435
>>   check_prev_add.constprop.19+0xca1/0x15f0
>>   validate_chain.isra.14+0x11af/0x3b50
>>   __lock_acquire+0x728/0x1200
>>   lock_acquire+0x269/0x5a0
>>   fs_reclaim_acquire.part.19+0x29/0x30
>>   fs_reclaim_acquire+0x19/0x20
>>   kmem_cache_alloc+0x3e/0x3f0
>>   kmem_zone_alloc+0x79/0x150
>>   xfs_trans_alloc+0xfa/0x9d0
>>   xfs_sync_sb+0x86/0x170
>>   xfs_log_sbcount+0x10f/0x140
>>   xfs_quiesce_attr+0x134/0x270
>>   xfs_fs_freeze+0x4a/0x70
>>   freeze_super+0x1af/0x290
>>   do_vfs_ioctl+0xedc/0x16c0
>>   ksys_ioctl+0x41/0x80
>>   __x64_sys_ioctl+0x73/0xa9
>>   do_syscall_64+0x18f/0xd23
>>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>> This is a false positive as all the dirty pages are flushed out before
>> the filesystem can be frozen.
>>
>> Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock
>> may fix the issue. However, that will greatly complicate the logic and
>> may not be worth it.
>>
>> Another way to fix it is to disable the taking of the fs_reclaim
>> pseudo lock when in the freezing code path as a reclaim on the
>> freezed filesystem is not possible. By using the newly introduced
>> PF_MEMALLOC_NOLOCKDEP flag, lockdep checking is disabled in
>> xfs_trans_alloc() if XFS_TRANS_NO_WRITECOUNT flag is set.
>>
>> In the freezing path, there is another path where memory allocation
>> is being done without the XFS_TRANS_NO_WRITECOUNT flag:
>>
>>    xfs_fs_freeze()
>>    => xfs_quiesce_attr()
>>       => xfs_log_quiesce()
>>          => xfs_log_unmount_write()
>>             => xlog_unmount_write()
>>                => xfs_log_reserve()
>> 	         => xlog_ticket_alloc()
>>
>> In this case, we just disable fs reclaim for this particular 600 bytes
>> memory allocation.
>>
>> Without this patch, the command sequence below will show that the lock
>> dependency chain sb_internal -> fs_reclaim exists.
>>
>>   # fsfreeze -f /home
>>   # fsfreeze --unfreeze /home
>>   # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal
>>
>> After applying the patch, such sb_internal -> fs_reclaim lock dependency
>> chain can no longer be found. Because of that, the locking dependency
>> warning will not be shown.
>>
>> Suggested-by: Dave Chinner <david@fromorbit.com>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   fs/xfs/xfs_log.c   |  9 +++++++++
>>   fs/xfs/xfs_trans.c | 30 ++++++++++++++++++++++++++----
>>   2 files changed, 35 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
>> index 00fda2e8e738..33244680d0d4 100644
>> --- a/fs/xfs/xfs_log.c
>> +++ b/fs/xfs/xfs_log.c
>> @@ -830,8 +830,17 @@ xlog_unmount_write(
>>   	xfs_lsn_t		lsn;
>>   	uint			flags = XLOG_UNMOUNT_TRANS;
>>   	int			error;
>> +	unsigned long		pflags;
>>   
>> +	/*
>> +	 * xfs_log_reserve() allocates memory. This can lead to fs reclaim
>> +	 * which may conflicts with the unmount process. To avoid that,
>> +	 * disable fs reclaim for this allocation.
>> +	 */
>> +	current_set_flags_nested(&pflags, PF_MEMALLOC_NOFS);
>>   	error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0);
>> +	current_restore_flags_nested(&pflags, PF_MEMALLOC_NOFS);
>> +
>>   	if (error)
>>   		goto out_err;
>>   
>> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
>> index 3c94e5ff4316..ddb10ad3f51f 100644
>> --- a/fs/xfs/xfs_trans.c
>> +++ b/fs/xfs/xfs_trans.c
>> @@ -255,7 +255,27 @@ xfs_trans_alloc(
>>   	struct xfs_trans	**tpp)
>>   {
>>   	struct xfs_trans	*tp;
>> -	int			error;
>> +	int			error = 0;
>> +	unsigned long		pflags = -1;
>> +
>> +	/*
>> +	 * When XFS_TRANS_NO_WRITECOUNT is set, it means there are no dirty
>> +	 * data pages in the filesystem at this point.
> That's not true.  Look at the other callers of xfs_trans_alloc_empty.
Yes, I am aware of that. I can change it to check the freeze state.
>
> Also: Why not set PF_MEMALLOC_NOFS at the start of the freeze call
> chain?

I guess we can do that, but it eliminates a potential source for memory 
reclaim leading to freeze error when not much free memory is left. We 
can go this route if you think this is not a problem.

Cheers,
Longman
Dave Chinner June 15, 2020, 11:28 p.m. UTC | #3
On Mon, Jun 15, 2020 at 09:43:51AM -0700, Darrick J. Wong wrote:
> Also: Why not set PF_MEMALLOC_NOFS at the start of the freeze call
> chain?

Because there's no guarantee that we are always going to do this
final work in the freeze syscall context? i.e. the state is specific
to the context in which we are running the transaction, not the
task context it is running in...

Cheers,

Dave.
Darrick J. Wong June 16, 2020, 4:29 p.m. UTC | #4
On Mon, Jun 15, 2020 at 04:53:38PM -0400, Waiman Long wrote:
> On 6/15/20 12:43 PM, Darrick J. Wong wrote:
> > On Mon, Jun 15, 2020 at 12:08:30PM -0400, Waiman Long wrote:
> > > Depending on the workloads, the following circular locking dependency
> > > warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo
> > > lock) may show up:
> > > 
> > > ======================================================
> > > WARNING: possible circular locking dependency detected
> > > 5.0.0-rc1+ #60 Tainted: G        W
> > > ------------------------------------------------------
> > > fsfreeze/4346 is trying to acquire lock:
> > > 0000000026f1d784 (fs_reclaim){+.+.}, at:
> > > fs_reclaim_acquire.part.19+0x5/0x30
> > > 
> > > but task is already holding lock:
> > > 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
> > > 
> > > which lock already depends on the new lock.
> > >    :
> > >   Possible unsafe locking scenario:
> > > 
> > >         CPU0                    CPU1
> > >         ----                    ----
> > >    lock(sb_internal);
> > >                                 lock(fs_reclaim);
> > >                                 lock(sb_internal);
> > >    lock(fs_reclaim);
> > > 
> > >   *** DEADLOCK ***
> > > 
> > > 4 locks held by fsfreeze/4346:
> > >   #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650
> > >   #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290
> > >   #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650
> > >   #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
> > > 
> > > stack backtrace:
> > > Call Trace:
> > >   dump_stack+0xe0/0x19a
> > >   print_circular_bug.isra.10.cold.34+0x2f4/0x435
> > >   check_prev_add.constprop.19+0xca1/0x15f0
> > >   validate_chain.isra.14+0x11af/0x3b50
> > >   __lock_acquire+0x728/0x1200
> > >   lock_acquire+0x269/0x5a0
> > >   fs_reclaim_acquire.part.19+0x29/0x30
> > >   fs_reclaim_acquire+0x19/0x20
> > >   kmem_cache_alloc+0x3e/0x3f0
> > >   kmem_zone_alloc+0x79/0x150
> > >   xfs_trans_alloc+0xfa/0x9d0
> > >   xfs_sync_sb+0x86/0x170
> > >   xfs_log_sbcount+0x10f/0x140
> > >   xfs_quiesce_attr+0x134/0x270
> > >   xfs_fs_freeze+0x4a/0x70
> > >   freeze_super+0x1af/0x290
> > >   do_vfs_ioctl+0xedc/0x16c0
> > >   ksys_ioctl+0x41/0x80
> > >   __x64_sys_ioctl+0x73/0xa9
> > >   do_syscall_64+0x18f/0xd23
> > >   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > > 
> > > This is a false positive as all the dirty pages are flushed out before
> > > the filesystem can be frozen.
> > > 
> > > Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock
> > > may fix the issue. However, that will greatly complicate the logic and
> > > may not be worth it.
> > > 
> > > Another way to fix it is to disable the taking of the fs_reclaim
> > > pseudo lock when in the freezing code path as a reclaim on the
> > > freezed filesystem is not possible. By using the newly introduced
> > > PF_MEMALLOC_NOLOCKDEP flag, lockdep checking is disabled in
> > > xfs_trans_alloc() if XFS_TRANS_NO_WRITECOUNT flag is set.
> > > 
> > > In the freezing path, there is another path where memory allocation
> > > is being done without the XFS_TRANS_NO_WRITECOUNT flag:
> > > 
> > >    xfs_fs_freeze()
> > >    => xfs_quiesce_attr()
> > >       => xfs_log_quiesce()
> > >          => xfs_log_unmount_write()
> > >             => xlog_unmount_write()
> > >                => xfs_log_reserve()
> > > 	         => xlog_ticket_alloc()
> > > 
> > > In this case, we just disable fs reclaim for this particular 600 bytes
> > > memory allocation.
> > > 
> > > Without this patch, the command sequence below will show that the lock
> > > dependency chain sb_internal -> fs_reclaim exists.
> > > 
> > >   # fsfreeze -f /home
> > >   # fsfreeze --unfreeze /home
> > >   # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal
> > > 
> > > After applying the patch, such sb_internal -> fs_reclaim lock dependency
> > > chain can no longer be found. Because of that, the locking dependency
> > > warning will not be shown.
> > > 
> > > Suggested-by: Dave Chinner <david@fromorbit.com>
> > > Signed-off-by: Waiman Long <longman@redhat.com>
> > > ---
> > >   fs/xfs/xfs_log.c   |  9 +++++++++
> > >   fs/xfs/xfs_trans.c | 30 ++++++++++++++++++++++++++----
> > >   2 files changed, 35 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > > index 00fda2e8e738..33244680d0d4 100644
> > > --- a/fs/xfs/xfs_log.c
> > > +++ b/fs/xfs/xfs_log.c
> > > @@ -830,8 +830,17 @@ xlog_unmount_write(
> > >   	xfs_lsn_t		lsn;
> > >   	uint			flags = XLOG_UNMOUNT_TRANS;
> > >   	int			error;
> > > +	unsigned long		pflags;
> > > +	/*
> > > +	 * xfs_log_reserve() allocates memory. This can lead to fs reclaim
> > > +	 * which may conflicts with the unmount process. To avoid that,
> > > +	 * disable fs reclaim for this allocation.
> > > +	 */
> > > +	current_set_flags_nested(&pflags, PF_MEMALLOC_NOFS);
> > >   	error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0);
> > > +	current_restore_flags_nested(&pflags, PF_MEMALLOC_NOFS);
> > > +
> > >   	if (error)
> > >   		goto out_err;
> > > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > > index 3c94e5ff4316..ddb10ad3f51f 100644
> > > --- a/fs/xfs/xfs_trans.c
> > > +++ b/fs/xfs/xfs_trans.c
> > > @@ -255,7 +255,27 @@ xfs_trans_alloc(
> > >   	struct xfs_trans	**tpp)
> > >   {
> > >   	struct xfs_trans	*tp;
> > > -	int			error;
> > > +	int			error = 0;
> > > +	unsigned long		pflags = -1;
> > > +
> > > +	/*
> > > +	 * When XFS_TRANS_NO_WRITECOUNT is set, it means there are no dirty
> > > +	 * data pages in the filesystem at this point.
> > That's not true.  Look at the other callers of xfs_trans_alloc_empty.
> Yes, I am aware of that. I can change it to check the freeze state.

<nod>

> > Also: Why not set PF_MEMALLOC_NOFS at the start of the freeze call
> > chain?
> 
> I guess we can do that, but it eliminates a potential source for memory
> reclaim leading to freeze error when not much free memory is left. We can go
> this route if you think this is not a problem.

<shrug> It sounds like you & Dave had already worked that out, so we can
leave this as it is.  I saw the untrue claim in the code comment and
started asking more questions. ;)

(Says me who has been checked out the last few days, not following
the various lockdep shuttup patch threads...)

--D

> Cheers,
> Longman
>
diff mbox series

Patch

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 00fda2e8e738..33244680d0d4 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -830,8 +830,17 @@  xlog_unmount_write(
 	xfs_lsn_t		lsn;
 	uint			flags = XLOG_UNMOUNT_TRANS;
 	int			error;
+	unsigned long		pflags;
 
+	/*
+	 * xfs_log_reserve() allocates memory. This can lead to fs reclaim
+	 * which may conflicts with the unmount process. To avoid that,
+	 * disable fs reclaim for this allocation.
+	 */
+	current_set_flags_nested(&pflags, PF_MEMALLOC_NOFS);
 	error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0);
+	current_restore_flags_nested(&pflags, PF_MEMALLOC_NOFS);
+
 	if (error)
 		goto out_err;
 
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 3c94e5ff4316..ddb10ad3f51f 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -255,7 +255,27 @@  xfs_trans_alloc(
 	struct xfs_trans	**tpp)
 {
 	struct xfs_trans	*tp;
-	int			error;
+	int			error = 0;
+	unsigned long		pflags = -1;
+
+	/*
+	 * When XFS_TRANS_NO_WRITECOUNT is set, it means there are no dirty
+	 * data pages in the filesystem at this point. So even if fs reclaim
+	 * is being done, it won't happen to this filesystem. In this case,
+	 * PF_MEMALLOC_NOLOCKDEP should be set to avoid false positive
+	 * lockdep splat like:
+	 *
+	 *       CPU0                    CPU1
+	 *       ----                    ----
+	 *  lock(sb_internal);
+	 *                               lock(fs_reclaim);
+	 *                               lock(sb_internal);
+	 *  lock(fs_reclaim);
+	 *
+	 *  *** DEADLOCK ***
+	 */
+	if (PF_MEMALLOC_NOLOCKDEP && (flags & XFS_TRANS_NO_WRITECOUNT))
+		current_set_flags_nested(&pflags, PF_MEMALLOC_NOLOCKDEP);
 
 	/*
 	 * Allocate the handle before we do our freeze accounting and setting up
@@ -284,13 +304,15 @@  xfs_trans_alloc(
 	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
 	if (error) {
 		xfs_trans_cancel(tp);
-		return error;
+		goto out;
 	}
 
 	trace_xfs_trans_alloc(tp, _RET_IP_);
-
 	*tpp = tp;
-	return 0;
+out:
+	if (PF_MEMALLOC_NOLOCKDEP && (pflags != -1))
+		current_restore_flags_nested(&pflags, PF_MEMALLOC_NOLOCKDEP);
+	return error;
 }
 
 /*