Message ID | a78874868a1d3d8d78e6b0fdbb97debc88923734.1713573156.git.josef@toxicpanda.com (mailing list archive) |
---|---|
State | New, archived |
Series | btrfs: take the cleaner_mutex earlier in qgroup disable |
On Sat, Apr 20, 2024 at 1:33 AM Josef Bacik <josef@toxicpanda.com> wrote:
>
> One of my CI runs popped the following lockdep splat
>
> [...]
>
>  	case BTRFS_QUOTA_CTL_DISABLE:
> +		/*
> +		 * Lock the cleaner mutex to prevent races with concurrent
> +		 * relocation, because relocation may be building backrefs for
> +		 * blocks of the quota root while we are deleting the root. This
> +		 * is like dropping fs roots of deleted snapshots/subvolumes, we
> +		 * need the same protection.
> +		 *
> +		 * This also prevents races between concurrent tasks trying to
> +		 * disable quotas, because we will unlock and relock
> +		 * qgroup_ioctl_lock across BTRFS_FS_QUOTA_ENABLED changes.
> +		 */
> +		mutex_lock(&fs_info->cleaner_mutex);
> +		down_write(&fs_info->subvol_sem);

Everything is correct and makes sense.

I'm afraid in the future someone looking into this code after the patch
is merged, will wonder why we have this duplicated locking of the
subvol_sem in each case of the switch statement and then decide "ah
let's get rid of this duplicated code and make the locking before the
switch and the unlocking after".

So just a comment saying we must lock the cleaner_mutex before
subvol_sem to avoid a lockdep splat would be enough to discourage that.
Just a suggestion.

Anyway, it looks good so:

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.
On Fri, Apr 19, 2024 at 08:32:48PM -0400, Josef Bacik wrote:
> One of my CI runs popped the following lockdep splat
>
> [...]
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: David Sterba <dsterba@suse.com>
On Mon, Apr 22, 2024 at 04:27:26PM +0100, Filipe Manana wrote:
> >  	case BTRFS_QUOTA_CTL_DISABLE:
> > +		/*
> > +		 * Lock the cleaner mutex to prevent races with concurrent
> > +		 * relocation, because relocation may be building backrefs for
> > +		 * blocks of the quota root while we are deleting the root. This
> > +		 * is like dropping fs roots of deleted snapshots/subvolumes, we
> > +		 * need the same protection.
> > +		 *
> > +		 * This also prevents races between concurrent tasks trying to
> > +		 * disable quotas, because we will unlock and relock
> > +		 * qgroup_ioctl_lock across BTRFS_FS_QUOTA_ENABLED changes.
> > +		 */
> > +		mutex_lock(&fs_info->cleaner_mutex);
> > +		down_write(&fs_info->subvol_sem);
>
> Everything is correct and makes sense.
>
> I'm afraid in the future someone looking into this code after the patch
> is merged, will wonder why we have this duplicated locking of the
> subvol_sem in each case of the switch statement and then decide "ah
> let's get rid of this duplicated code and make the locking before the
> switch and the unlocking after".

I think we'd catch that at review time; moving locks around must be done
carefully. We could improve the lock documentation, either as
lockdep_assert_held() or as comments. I would not mind adding the comment
here, but it also does not seem necessary.
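As a sketch of the caller-side documentation being floated here: the snippet below is the DISABLE case from the patch with one extra comment added purely for illustration. The comment wording is hypothetical and not part of the merged change; the lock calls themselves are the ones the patch adds.

	case BTRFS_QUOTA_CTL_DISABLE:
		/*
		 * Hypothetical comment, not in the merged patch: cleaner_mutex
		 * must be taken before subvol_sem, otherwise lockdep reports an
		 * inversion against relocation, which holds cleaner_mutex while
		 * taking the inode lock and then subvol_sem.
		 */
		mutex_lock(&fs_info->cleaner_mutex);
		down_write(&fs_info->subvol_sem);
		ret = btrfs_quota_disable(fs_info);
		up_write(&fs_info->subvol_sem);
		mutex_unlock(&fs_info->cleaner_mutex);
		break;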
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0c977b7cc253..494f4d1dc38d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3758,15 +3758,30 @@ static long btrfs_ioctl_quota_ctl(struct file *file, void __user *arg)
 		goto drop_write;
 	}
 
-	down_write(&fs_info->subvol_sem);
-
 	switch (sa->cmd) {
 	case BTRFS_QUOTA_CTL_ENABLE:
 	case BTRFS_QUOTA_CTL_ENABLE_SIMPLE_QUOTA:
+		down_write(&fs_info->subvol_sem);
 		ret = btrfs_quota_enable(fs_info, sa);
+		up_write(&fs_info->subvol_sem);
 		break;
 	case BTRFS_QUOTA_CTL_DISABLE:
+		/*
+		 * Lock the cleaner mutex to prevent races with concurrent
+		 * relocation, because relocation may be building backrefs for
+		 * blocks of the quota root while we are deleting the root. This
+		 * is like dropping fs roots of deleted snapshots/subvolumes, we
+		 * need the same protection.
+		 *
+		 * This also prevents races between concurrent tasks trying to
+		 * disable quotas, because we will unlock and relock
+		 * qgroup_ioctl_lock across BTRFS_FS_QUOTA_ENABLED changes.
+		 */
+		mutex_lock(&fs_info->cleaner_mutex);
+		down_write(&fs_info->subvol_sem);
 		ret = btrfs_quota_disable(fs_info);
+		up_write(&fs_info->subvol_sem);
+		mutex_unlock(&fs_info->cleaner_mutex);
 		break;
 	default:
 		ret = -EINVAL;
@@ -3774,7 +3789,6 @@ static long btrfs_ioctl_quota_ctl(struct file *file, void __user *arg)
 	}
 
 	kfree(sa);
-	up_write(&fs_info->subvol_sem);
 drop_write:
 	mnt_drop_write_file(file);
 	return ret;
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 9a9f84c4d3b8..7d48ecf9ee7f 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1342,16 +1342,10 @@ int btrfs_quota_disable(struct btrfs_fs_info *fs_info)
 	lockdep_assert_held_write(&fs_info->subvol_sem);
 
 	/*
-	 * Lock the cleaner mutex to prevent races with concurrent relocation,
-	 * because relocation may be building backrefs for blocks of the quota
-	 * root while we are deleting the root. This is like dropping fs roots
-	 * of deleted snapshots/subvolumes, we need the same protection.
-	 *
-	 * This also prevents races between concurrent tasks trying to disable
-	 * quotas, because we will unlock and relock qgroup_ioctl_lock across
-	 * BTRFS_FS_QUOTA_ENABLED changes.
+	 * Relocation will mess with backrefs, so make sure we have the
+	 * cleaner_mutex held to protect us from relocate.
 	 */
-	mutex_lock(&fs_info->cleaner_mutex);
+	lockdep_assert_held(&fs_info->cleaner_mutex);
 
 	mutex_lock(&fs_info->qgroup_ioctl_lock);
 	if (!fs_info->quota_root)
@@ -1373,9 +1367,13 @@ int btrfs_quota_disable(struct btrfs_fs_info *fs_info)
 	clear_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags);
 	btrfs_qgroup_wait_for_completion(fs_info, false);
 
+	/*
+	 * We have nothing held here and no trans handle, just return the error
+	 * if there is one.
+	 */
 	ret = flush_reservations(fs_info);
 	if (ret)
-		goto out_unlock_cleaner;
+		return ret;
 
 	/*
 	 * 1 For the root item
@@ -1439,9 +1437,6 @@ int btrfs_quota_disable(struct btrfs_fs_info *fs_info)
 		btrfs_end_transaction(trans);
 	else if (trans)
 		ret = btrfs_commit_transaction(trans);
-out_unlock_cleaner:
-	mutex_unlock(&fs_info->cleaner_mutex);
-
 	return ret;
 }
One of my CI runs popped the following lockdep splat

======================================================
WARNING: possible circular locking dependency detected
6.9.0-rc4+ #1 Not tainted
------------------------------------------------------
btrfs/471533 is trying to acquire lock:
ffff92ba46980850 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_quota_disable+0x54/0x4c0

but task is already holding lock:
ffff92ba46980bd0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1c8f/0x2600

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&fs_info->subvol_sem){++++}-{3:3}:
       down_read+0x42/0x170
       btrfs_rename+0x607/0xb00
       btrfs_rename2+0x2e/0x70
       vfs_rename+0xaf8/0xfc0
       do_renameat2+0x586/0x600
       __x64_sys_rename+0x43/0x50
       do_syscall_64+0x95/0x180
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

-> #1 (&sb->s_type->i_mutex_key#16){++++}-{3:3}:
       down_write+0x3f/0xc0
       btrfs_inode_lock+0x40/0x70
       prealloc_file_extent_cluster+0x1b0/0x370
       relocate_file_extent_cluster+0xb2/0x720
       relocate_data_extent+0x107/0x160
       relocate_block_group+0x442/0x550
       btrfs_relocate_block_group+0x2cb/0x4b0
       btrfs_relocate_chunk+0x50/0x1b0
       btrfs_balance+0x92f/0x13d0
       btrfs_ioctl+0x1abf/0x2600
       __x64_sys_ioctl+0x97/0xd0
       do_syscall_64+0x95/0x180
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

-> #0 (&fs_info->cleaner_mutex){+.+.}-{3:3}:
       __lock_acquire+0x13e7/0x2180
       lock_acquire+0xcb/0x2e0
       __mutex_lock+0xbe/0xc00
       btrfs_quota_disable+0x54/0x4c0
       btrfs_ioctl+0x206b/0x2600
       __x64_sys_ioctl+0x97/0xd0
       do_syscall_64+0x95/0x180
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

other info that might help us debug this:

Chain exists of:
  &fs_info->cleaner_mutex --> &sb->s_type->i_mutex_key#16 --> &fs_info->subvol_sem

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&fs_info->subvol_sem);
                               lock(&sb->s_type->i_mutex_key#16);
                               lock(&fs_info->subvol_sem);
  lock(&fs_info->cleaner_mutex);

 *** DEADLOCK ***

2 locks held by btrfs/471533:
 #0: ffff92ba4319e420 (sb_writers#14){.+.+}-{0:0}, at: btrfs_ioctl+0x3b5/0x2600
 #1: ffff92ba46980bd0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1c8f/0x2600

stack backtrace:
CPU: 1 PID: 471533 Comm: btrfs Kdump: loaded Not tainted 6.9.0-rc4+ #1
Call Trace:
 <TASK>
 dump_stack_lvl+0x77/0xb0
 check_noncircular+0x148/0x160
 ? lock_acquire+0xcb/0x2e0
 __lock_acquire+0x13e7/0x2180
 lock_acquire+0xcb/0x2e0
 ? btrfs_quota_disable+0x54/0x4c0
 ? lock_is_held_type+0x9a/0x110
 __mutex_lock+0xbe/0xc00
 ? btrfs_quota_disable+0x54/0x4c0
 ? srso_return_thunk+0x5/0x5f
 ? lock_acquire+0xcb/0x2e0
 ? btrfs_quota_disable+0x54/0x4c0
 ? btrfs_quota_disable+0x54/0x4c0
 btrfs_quota_disable+0x54/0x4c0
 btrfs_ioctl+0x206b/0x2600
 ? srso_return_thunk+0x5/0x5f
 ? __do_sys_statfs+0x61/0x70
 __x64_sys_ioctl+0x97/0xd0
 do_syscall_64+0x95/0x180
 ? srso_return_thunk+0x5/0x5f
 ? reacquire_held_locks+0xd1/0x1f0
 ? do_user_addr_fault+0x307/0x8a0
 ? srso_return_thunk+0x5/0x5f
 ? lock_acquire+0xcb/0x2e0
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 ? find_held_lock+0x2b/0x80
 ? srso_return_thunk+0x5/0x5f
 ? lock_release+0xca/0x2a0
 ? srso_return_thunk+0x5/0x5f
 ? do_user_addr_fault+0x35c/0x8a0
 ? srso_return_thunk+0x5/0x5f
 ? trace_hardirqs_off+0x4b/0xc0
 ? srso_return_thunk+0x5/0x5f
 ? lockdep_hardirqs_on_prepare+0xde/0x190
 ? srso_return_thunk+0x5/0x5f

This happens because when we call rename we already hold the inode mutex,
and then we acquire the subvol_sem if the inode being renamed is a
subvolume. This creates the dependency

inode lock -> subvol_sem

When we're running data relocation we preallocate space for the data
relocation inode, and we always run the relocation under the
->cleaner_mutex. This adds the dependency

cleaner_mutex -> inode lock (from the prealloc) -> subvol_sem

Qgroup disable does this in the opposite order: it acquires the
subvol_sem and then the cleaner_mutex, which results in this lockdep
splat. The deadlock can't happen in reality, because we won't ever
rename the data reloc inode, nor is the data reloc inode a subvolume.

However, this is fairly easy to fix: in the case where we are disabling
qgroups, simply take the cleaner_mutex before we take the subvol_sem.
This resolves the lockdep splat.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ioctl.c  | 20 +++++++++++++++++---
 fs/btrfs/qgroup.c | 21 ++++++++-------------
 2 files changed, 25 insertions(+), 16 deletions(-)
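A condensed reading of the lock ordering this patch establishes, paraphrased from the dependency chains above: the comment block below is a summary sketch written for illustration, not text from the patch, while the two asserts are the ones btrfs_quota_disable() carries after this change.

/*
 * Lock ordering (outermost first), per the chains in the splat above:
 *
 *   fs_info->cleaner_mutex       held across relocation and, now, qgroup disable
 *     -> inode lock (i_rwsem)    taken by relocation's prealloc of the reloc inode
 *       -> fs_info->subvol_sem   taken by rename of a subvolume and by quota ioctls
 *
 * btrfs_quota_disable() no longer takes cleaner_mutex itself; it expects
 * its caller (the quota ioctl) to hold both locks and asserts that:
 */
lockdep_assert_held(&fs_info->cleaner_mutex);
lockdep_assert_held_write(&fs_info->subvol_sem);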