btrfs: fix the race between umount and btrfs-cleaner

Message ID	20240821114628.645455-1-sunjunchao2870@gmail.com (mailing list archive)
State	New, archived
Headers	show Received: from mail-pl1-f170.google.com (mail-pl1-f170.google.com [209.85.214.170]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 03F761BA26C; Wed, 21 Aug 2024 11:46:35 +0000 (UTC) From: Julian Sun <sunjunchao2870@gmail.com> To: linux-btrfs@vger.kernel.org Cc: clm@fb.com, josef@toxicpanda.com, dsterba@suse.com, Julian Sun <sunjunchao2870@gmail.com>, syzbot+67ba3c42bcbb4665d3ad@syzkaller.appspotmail.com, stable@vger.kernel.org Subject: [PATCH] btrfs: fix the race between umount and btrfs-cleaner Date: Wed, 21 Aug 2024 19:46:28 +0800 Message-Id: <20240821114628.645455-1-sunjunchao2870@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	btrfs: fix the race between umount and btrfs-cleaner \| expand btrfs: fix the race between umount and btrfs-cleaner

Julian Sun Aug. 21, 2024, 11:46 a.m. UTC

There is a race condition generic_shutdown_super() and
__btrfs_run_defrag_inode().
Consider the following scenario:

umount thread:             btrfs-cleaner thread:
			     btrfs_run_delayed_iputs()
			       ->run_delayed_iput_locked()
				->iput(inode)
				  // Here the inode (ie ino 261) will be cleared and freed
btrfs_kill_super()
  ->generic_shutdown_super()
    			     btrfs_run_defrag_inodes()
			       ->__btrfs_run_defrag_inode()
				->btrfs_iget(ino)
				// The inode 261 was recreated with i_count=1
				// and added to the sb list
    ->evict_inodes(sb)          // After some work
    // inode 261 was added      ->iput(inode)
    // to the dispose list        ->iput_funal()
      ->evict(inode)                ->evict(inode)

Now, we have two threads simultaneously evicting
the same inode, which led to a bug.

The above behavior can be confirmed by the log I added for debugging
and the log printed when BUG was triggered.
Due to space limitations, I cannot paste the full diff and here is a
brief describtion.

First, within __btrfs_run_defrag_inode(), set inode->i_state |= (1<<19)
just before calling iput().
Within the dispose_list(), check the flag, if the flag was set, then
pr_info("bug! double evict! crash will happen! state is 0x%lx\n", inode->i_state);

Here is the printed log when the BUG was triggered:
[  190.686726][ T2336] bug! double evict! crash will happen! state is 0x80020
[  190.687647][ T2336] ------------[ cut here ]------------
[  190.688294][ T2336] kernel BUG at fs/inode.c:626!
[  190.688939][ T2336] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
[  190.689792][ T2336] CPU: 1 PID: 2336 Comm: a.out Not tainted 6.10.0-rc2-00223-g0c529ab65ef8-dirty #109
[  190.690894][ T2336] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[  190.692111][ T2336] RIP: 0010:clear_inode+0x15b/0x190
// some logs...
[  190.704501][ T2336]  btrfs_evict_inode+0x529/0xe80
[  190.706966][ T2336]  evict+0x2ed/0x6c0
[  190.707209][ T2336]  dispose_list+0x62/0x260
[  190.707490][ T2336]  evict_inodes+0x34e/0x450

To prevent this behavior, we need to set BTRFS_FS_CLOSING_START
before kill_anon_super() to ensure that
btrfs_run_defrag_inodes() doesn't continue working after unmount.

Reported-and-tested-by: syzbot+67ba3c42bcbb4665d3ad@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=67ba3c42bcbb4665d3ad
CC: stable@vger.kernel.org
Fixes: c146afad2c7f ("Btrfs: mount ro and remount support")
Signed-off-by: Julian Sun <sunjunchao2870@gmail.com>
---
 fs/btrfs/super.c | 1 +
 1 file changed, 1 insertion(+)

Josef Bacik Aug. 21, 2024, 6:08 p.m. UTC | #1

On Wed, Aug 21, 2024 at 07:46:28PM +0800, Julian Sun wrote:
> There is a race condition generic_shutdown_super() and
> __btrfs_run_defrag_inode().
> Consider the following scenario:
> 
> umount thread:             btrfs-cleaner thread:
> 			     btrfs_run_delayed_iputs()
> 			       ->run_delayed_iput_locked()
> 				->iput(inode)
> 				  // Here the inode (ie ino 261) will be cleared and freed
> btrfs_kill_super()
>   ->generic_shutdown_super()
>     			     btrfs_run_defrag_inodes()
> 			       ->__btrfs_run_defrag_inode()
> 				->btrfs_iget(ino)
> 				// The inode 261 was recreated with i_count=1
> 				// and added to the sb list
>     ->evict_inodes(sb)          // After some work
>     // inode 261 was added      ->iput(inode)
>     // to the dispose list        ->iput_funal()
>       ->evict(inode)                ->evict(inode)

This is wrong though, evict_inodes() isn't supposed to isolate if i_count == 1,
and iput_final sets I_FREEING, so we won't get the evict() from evict_inodes.
Something else is happening here.  Thanks,

Josef

Julian Sun Aug. 22, 2024, 11:41 a.m. UTC | #2

Hi, Josef.
Thanks for your review and comments.
Yeah, you are correct, sorry for the mistake.

I'm thinking if is there any misunderstanding in the following scenario

umount thread:                     btrfs-cleaner thread:
                                   btrfs_run_delayed_iputs()
                                     ->run_delayed_iput_locked()
                                       ->iput(inode)
                                         ->spin_lock(inode)
 				         // inode->i_count dec to 0
                                         ->iput_final()
                                           ->__inode_add_lru()
                                            // mapping_shrinkable() returned false
                                            // so iput_final() returned with i_count=0
                                            // note here: the inode still
 btrfs_kill_super()                         // in the sb list and I_FREEING
   ->generic_shutdown_super()               // was not set
     ->evict_inodes()                       ->spin_unlock(inode)
      ->spin_lock(inode)
      // i_count is 0 and                // continue running...
      // I_FREEING was not set           ->__btrfs_run_defrag_inode()
      // so the inode was added to         ->btrfs_iget()
      // dispost list                        // note the inode is still
                                             // in the sb list  
                                             // and I_WILL_FREE|I_FREEING
                                             // was not set
  ->spin_unlock()                            ->find_inode()
                                              ->spin_lock(inode)
                                              ->__iget();
                                              ->spin_unlock();
  ->dispose_list()                           
    ->evict()                             ->iput()
     // note: i_lock was not                ->spin_lock(inode)
     // hold here.                          ->iput_final()
                                              ->spin_unlock()
                                              ->evict()

Now, we have two threads simultaneously evicting
the same inode, which led to a bug.

The key points here are:
1. btrfs-cleaner: iput_final() called __inode_add_lru() and then
return because some reason with i_count=0, and I_FREEING was not set, 
and most important, i_lock was unlocked,
which gives evict_inodes() a perfect chance to put the inode into dispose list.
2. umount thread: The inode (which i_count is 0 and I_FREEING was not set) 
was added to dispose list.
3. btrfs-cleaner: Before umount thread set I_FREEING flag, btrfs_iget()
finded the inode and inc i_count
4. umount thread called evict() and btrfs-cleaner called iput(), which leads 
to bug.

The above behavior can be confirmed by the debugging log 
based on commit d30d0e49da71:

diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index 407ccec3e57e..cc50dd1e8beb 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -278,6 +278,8 @@ static int __btrfs_run_defrag_inode(struct
btrfs_fs_info *fs_info,
 	ret = btrfs_defrag_file(inode, NULL, &range, defrag->transid,
 				       BTRFS_DEFRAG_BATCH);
 	sb_end_write(fs_info->sb);
+	if (inode->i_state & (1<<19))
+		pr_info("bug! double evict! crash will happen\n");
 	iput(inode);
 
 	if (ret < 0)
diff --git a/fs/inode.c b/fs/inode.c
index 3a41f83a4ba5..641a828bc2b7 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -461,8 +461,14 @@ static void __inode_add_lru(struct inode *inode,
bool rotate)
 		return;
 	if (!(inode->i_sb->s_flags & SB_ACTIVE))
 		return;
-	if (!mapping_shrinkable(&inode->i_data))
+	if (!mapping_shrinkable(&inode->i_data)) {
+		if (inode->i_ino == 261) {
+			pr_info("set inode 0x%px dbg_flag casue
mapping_shrinkable\n", inode);
+			dump_stack();
+			WRITE_ONCE(inode->i_state, inode-
>i_state|(1<<19));
+		}
 		return;
+	}
 
 	if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode-
>i_lru))
 		this_cpu_inc(nr_unused);
@@ -623,7 +629,10 @@ void clear_inode(struct inode *inode)
 	xa_unlock_irq(&inode->i_data.i_pages);
 	BUG_ON(!list_empty(&inode->i_data.i_private_list));
 	BUG_ON(!(inode->i_state & I_FREEING));
-	BUG_ON(inode->i_state & I_CLEAR);
+	if (inode->i_state & I_CLEAR) {
+		pr_info("inode 0x%px state 0x%lx triggered bug\n",
inode, inode->i_state);
+		BUG();
+	}
 	BUG_ON(!list_empty(&inode->i_wb_list));
 	/* don't need i_lock here, no concurrent mods to i_state */
 	inode->i_state = I_FREEING | I_CLEAR;
@@ -1754,7 +1763,10 @@ void iput(struct inode *inode)
 {
 	if (!inode)
 		return;
-	BUG_ON(inode->i_state & I_CLEAR);
+	if (inode->i_state & I_CLEAR) {
+		pr_info("inode 0x%px state 0x%lx triggered a bug\n",
inode, inode->i_state);
+		BUG();
+	}
 retry:
 	if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) {
 		if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME))
{
@@ -1764,6 +1776,8 @@ void iput(struct inode *inode)
 			mark_inode_dirty_sync(inode);
 			goto retry;
 		}
+		if (inode->i_ino == 261 && (inode->i_state &(1<<19)))
+			pr_info("state is 0x%lx, bug may be
triggered\n", inode->i_state);
 		iput_final(inode);
 	}
 }

The (1<<19) bit was used to debugging.

Here is the log when bug was triggered:

[   33.028024][ T2451] set inode 0xffff88801fb87738 dbg_flag casue
mapping_shrinkable
[   33.028478][ T2451] CPU: 0 PID: 2451 Comm: btrfs-cleaner Not tainted
6.10.0-rc2-00225-gf0deb01d52b3-dirty #132
[   33.029072][ T2451] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[   33.029675][ T2451] Call Trace:
[   33.029873][ T2451]  <TASK>
[   33.030046][ T2451]  dump_stack_lvl+0x16c/0x1f0
[   33.030330][ T2451]  __inode_add_lru.part.0+0x1ea/0x280
[   33.030647][ T2451]  iput+0x7c7/0x920
[   33.030876][ T2451]  run_delayed_iput_locked+0x136/0x1e0
[   33.031199][ T2451]  btrfs_run_delayed_iputs+0x8e/0x120
[   33.031516][ T2451]  cleaner_kthread+0x2d3/0x480
[   33.031801][ T2451]  ? __pfx_cleaner_kthread+0x10/0x10
[   33.032113][ T2451]  kthread+0x2c1/0x3a0
[   33.032354][ T2451]  ? _raw_spin_unlock_irq+0x23/0x50
[   33.032659][ T2451]  ? __pfx_kthread+0x10/0x10
[   33.032929][ T2451]  ret_from_fork+0x45/0x80
[   33.033198][ T2451]  ? __pfx_kthread+0x10/0x10
[   33.033469][ T2451]  ret_from_fork_asm+0x1a/0x30
[   33.033756][ T2451]  </TASK>
[   33.033978][ T2451] bug! double evict! crash will happen
[   33.034168][ T2332] BTRFS info (device loop0): last unmount of
filesystem c9fe44da-de57-406a-8241-57ec7d4412cf
[   33.034295][ T2451] inode 0xffff88801fb87738 state 0x60 triggered a
bug
[   33.034332][ T2451] ------------[ cut here ]------------
[   33.035578][ T2451] kernel BUG at fs/inode.c:1768!
[   33.035873][ T2451] Oops: invalid opcode: 0000 [#1] PREEMPT SMP
KASAN NOPTI
[   33.036286][ T2451] CPU: 0 PID: 2451 Comm: btrfs-cleaner Not tainted
6.10.0-rc2-00225-gf0deb01d52b3-dirty #132
[   33.036870][ T2451] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[   33.037464][ T2451] RIP: 0010:iput+0x896/0x920
// logs ...
[   33.042882][ T2451] PKRU: 55555554
[   33.043092][ T2451] Call Trace:
[   33.043288][ T2451]  <TASK>
[   33.044190][ T2451]  ? iput+0x896/0x920
[   33.047321][ T2451]  ? btrfs_run_defrag_inodes+0xb49/0xe60
[   33.047653][ T2451]  btrfs_run_defrag_inodes+0xa40/0xe60
[   33.047975][ T2451]  ? __pfx_btrfs_run_defrag_inodes+0x10/0x10
[   33.048327][ T2451]  ? __pfx___mutex_unlock_slowpath+0x10/0x10
[   33.048678][ T2451]  ? __pfx_do_raw_spin_lock+0x10/0x10
[   33.048996][ T2451]  ? _raw_spin_unlock+0x28/0x50
[   33.049285][ T2451]  ? btrfs_clean_one_deleted_snapshot+0x2b2/0x420
[   33.049660][ T2451]  cleaner_kthread+0x2ee/0x480
[   33.049941][ T2451]  ? __pfx_cleaner_kthread+0x10/0x10
[   33.050252][ T2451]  kthread+0x2c1/0x3a0
[   33.050492][ T2451]  ? _raw_spin_unlock_irq+0x23/0x50
[   33.050796][ T2451]  ? __pfx_kthread+0x10/0x10
[   33.051067][ T2451]  ret_from_fork+0x45/0x80
[   33.051333][ T2451]  ? __pfx_kthread+0x10/0x10
[   33.051603][ T2451]  ret_from_fork_asm+0x1a/0x30
[   33.051887][ T2451]  </TASK>
[   33.052066][ T2451] Modules linked in:
[   33.052316][ T2451] ---[ end trace 0000000000000000 ]---
[   33.052634][ T2451] RIP: 0010:iput+0x896/0x920

Please note inode 0xffff88801fb87738, which was set debug flag by
__inode_add_lru() in the run_delayed_iput_locked() code path. This 
indicates that the inode was not actually evicted. 
Then, please note the double evict log, which shows that this
inode which was set debug flag and was not be evicted is 
still on the sb list and was successfully acquired by btrfs_iget(). 
Next, take note of the unmount log, indicating that there was 
a concurrent unmount thread which has already called evict_inodes() at that time. 
Finally, observe where the BUG was triggered: the state of inode
0xffff88801fb87738 had been set to 0x60 by the unmount thread calling
inode_clear(), and at the same time, btrfs-cleaner called iput(),
ultimately leading to the BUG at the beginning check of iput(). 
Although the point of trigger differs from the one reported by 
syzbot, there is no doubt that they are the same bug.

You could apply the above diff and try to reproduce the bug using the 
reproducer[1].

If there is anything I'm missing or more info was needed, please let me know.

[1]: https://syzkaller.appspot.com/x/repro.c?x=14c57f16980000

Thanks.

Julian Sun Aug. 22, 2024, 11:51 a.m. UTC | #3

On Thu, 2024-08-22 at 19:41 +0800, Julian Sun wrote:
> Hi, Josef.
> Thanks for your review and comments.
> Yeah, you are correct, sorry for the mistake.
> 
> I'm thinking if is there any misunderstanding in the following
> scenario
> 
> umount thread:                     btrfs-cleaner thread:
>                                    btrfs_run_delayed_iputs()
>                                      ->run_delayed_iput_locked()
>                                        ->iput(inode)
>                                          ->spin_lock(inode)
>                                          // inode->i_count dec to 0
>                                          ->iput_final()
>                                            ->__inode_add_lru()
>                                             // mapping_shrinkable()
> returned false
>                                             // so iput_final()
> returned with i_count=0
>                                             // note here: the inode
> still
>  btrfs_kill_super()                         // in the sb list and
> I_FREEING
>    ->generic_shutdown_super()               // was not set
>      ->evict_inodes()                       ->spin_unlock(inode)
>       ->spin_lock(inode)
>       // i_count is 0 and                // continue running...
>       // I_FREEING was not set           ->__btrfs_run_defrag_inode()
>       // so the inode was added to         ->btrfs_iget()
>       // dispost list                        // note the inode is
> still
>                                              // in the sb list  
>                                              // and
> I_WILL_FREE|I_FREEING
>                                              // was not set
>   ->spin_unlock()                            ->find_inode()
>                                               ->spin_lock(inode)
>                                               ->__iget();
>                                               ->spin_unlock();
>   ->dispose_list()                           
>     ->evict()                             ->iput()
>      // note: i_lock was not                ->spin_lock(inode)
>      // hold here.                          ->iput_final()
>                                               ->spin_unlock()
>                                               ->evict()
> 
The understanding here is not correct, I will try to analyze it more
carefully.

> Now, we have two threads simultaneously evicting
> the same inode, which led to a bug.
> 
> The key points here are:
> 1. btrfs-cleaner: iput_final() called __inode_add_lru() and then
> return because some reason with i_count=0, and I_FREEING was not set,
> and most important, i_lock was unlocked,
> which gives evict_inodes() a perfect chance to put the inode into
> dispose list.
> 2. umount thread: The inode (which i_count is 0 and I_FREEING was not
> set) 
> was added to dispose list.
> 3. btrfs-cleaner: Before umount thread set I_FREEING flag,
> btrfs_iget()
> finded the inode and inc i_count
> 4. umount thread called evict() and btrfs-cleaner called iput(),
> which leads 
> to bug.
> 
> The above behavior can be confirmed by the debugging log 
> based on commit d30d0e49da71:
> 
> diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
> index 407ccec3e57e..cc50dd1e8beb 100644
> --- a/fs/btrfs/defrag.c
> +++ b/fs/btrfs/defrag.c
> @@ -278,6 +278,8 @@ static int __btrfs_run_defrag_inode(struct
> btrfs_fs_info *fs_info,
>         ret = btrfs_defrag_file(inode, NULL, &range, defrag->transid,
>                                        BTRFS_DEFRAG_BATCH);
>         sb_end_write(fs_info->sb);
> +       if (inode->i_state & (1<<19))
> +               pr_info("bug! double evict! crash will happen\n");
>         iput(inode);
>  
>         if (ret < 0)
> diff --git a/fs/inode.c b/fs/inode.c
> index 3a41f83a4ba5..641a828bc2b7 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -461,8 +461,14 @@ static void __inode_add_lru(struct inode *inode,
> bool rotate)
>                 return;
>         if (!(inode->i_sb->s_flags & SB_ACTIVE))
>                 return;
> -       if (!mapping_shrinkable(&inode->i_data))
> +       if (!mapping_shrinkable(&inode->i_data)) {
> +               if (inode->i_ino == 261) {
> +                       pr_info("set inode 0x%px dbg_flag casue
> mapping_shrinkable\n", inode);
> +                       dump_stack();
> +                       WRITE_ONCE(inode->i_state, inode-
> > i_state|(1<<19));
> +               }
>                 return;
> +       }
>  
>         if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode-
> > i_lru))
>                 this_cpu_inc(nr_unused);
> @@ -623,7 +629,10 @@ void clear_inode(struct inode *inode)
>         xa_unlock_irq(&inode->i_data.i_pages);
>         BUG_ON(!list_empty(&inode->i_data.i_private_list));
>         BUG_ON(!(inode->i_state & I_FREEING));
> -       BUG_ON(inode->i_state & I_CLEAR);
> +       if (inode->i_state & I_CLEAR) {
> +               pr_info("inode 0x%px state 0x%lx triggered bug\n",
> inode, inode->i_state);
> +               BUG();
> +       }
>         BUG_ON(!list_empty(&inode->i_wb_list));
>         /* don't need i_lock here, no concurrent mods to i_state */
>         inode->i_state = I_FREEING | I_CLEAR;
> @@ -1754,7 +1763,10 @@ void iput(struct inode *inode)
>  {
>         if (!inode)
>                 return;
> -       BUG_ON(inode->i_state & I_CLEAR);
> +       if (inode->i_state & I_CLEAR) {
> +               pr_info("inode 0x%px state 0x%lx triggered a bug\n",
> inode, inode->i_state);
> +               BUG();
> +       }
>  retry:
>         if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) {
>                 if (inode->i_nlink && (inode->i_state &
> I_DIRTY_TIME))
> {
> @@ -1764,6 +1776,8 @@ void iput(struct inode *inode)
>                         mark_inode_dirty_sync(inode);
>                         goto retry;
>                 }
> +               if (inode->i_ino == 261 && (inode->i_state &(1<<19)))
> +                       pr_info("state is 0x%lx, bug may be
> triggered\n", inode->i_state);
>                 iput_final(inode);
>         }
>  }
> 
> The (1<<19) bit was used to debugging.
> 
> Here is the log when bug was triggered:
> 
> [   33.028024][ T2451] set inode 0xffff88801fb87738 dbg_flag casue
> mapping_shrinkable
> [   33.028478][ T2451] CPU: 0 PID: 2451 Comm: btrfs-cleaner Not
> tainted
> 6.10.0-rc2-00225-gf0deb01d52b3-dirty #132
> [   33.029072][ T2451] Hardware name: QEMU Standard PC (i440FX +
> PIIX,
> 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [   33.029675][ T2451] Call Trace:
> [   33.029873][ T2451]  <TASK>
> [   33.030046][ T2451]  dump_stack_lvl+0x16c/0x1f0
> [   33.030330][ T2451]  __inode_add_lru.part.0+0x1ea/0x280
> [   33.030647][ T2451]  iput+0x7c7/0x920
> [   33.030876][ T2451]  run_delayed_iput_locked+0x136/0x1e0
> [   33.031199][ T2451]  btrfs_run_delayed_iputs+0x8e/0x120
> [   33.031516][ T2451]  cleaner_kthread+0x2d3/0x480
> [   33.031801][ T2451]  ? __pfx_cleaner_kthread+0x10/0x10
> [   33.032113][ T2451]  kthread+0x2c1/0x3a0
> [   33.032354][ T2451]  ? _raw_spin_unlock_irq+0x23/0x50
> [   33.032659][ T2451]  ? __pfx_kthread+0x10/0x10
> [   33.032929][ T2451]  ret_from_fork+0x45/0x80
> [   33.033198][ T2451]  ? __pfx_kthread+0x10/0x10
> [   33.033469][ T2451]  ret_from_fork_asm+0x1a/0x30
> [   33.033756][ T2451]  </TASK>
> [   33.033978][ T2451] bug! double evict! crash will happen
> [   33.034168][ T2332] BTRFS info (device loop0): last unmount of
> filesystem c9fe44da-de57-406a-8241-57ec7d4412cf
> [   33.034295][ T2451] inode 0xffff88801fb87738 state 0x60 triggered
> a
> bug
> [   33.034332][ T2451] ------------[ cut here ]------------
> [   33.035578][ T2451] kernel BUG at fs/inode.c:1768!
> [   33.035873][ T2451] Oops: invalid opcode: 0000 [#1] PREEMPT SMP
> KASAN NOPTI
> [   33.036286][ T2451] CPU: 0 PID: 2451 Comm: btrfs-cleaner Not
> tainted
> 6.10.0-rc2-00225-gf0deb01d52b3-dirty #132
> [   33.036870][ T2451] Hardware name: QEMU Standard PC (i440FX +
> PIIX,
> 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [   33.037464][ T2451] RIP: 0010:iput+0x896/0x920
> // logs ...
> [   33.042882][ T2451] PKRU: 55555554
> [   33.043092][ T2451] Call Trace:
> [   33.043288][ T2451]  <TASK>
> [   33.044190][ T2451]  ? iput+0x896/0x920
> [   33.047321][ T2451]  ? btrfs_run_defrag_inodes+0xb49/0xe60
> [   33.047653][ T2451]  btrfs_run_defrag_inodes+0xa40/0xe60
> [   33.047975][ T2451]  ? __pfx_btrfs_run_defrag_inodes+0x10/0x10
> [   33.048327][ T2451]  ? __pfx___mutex_unlock_slowpath+0x10/0x10
> [   33.048678][ T2451]  ? __pfx_do_raw_spin_lock+0x10/0x10
> [   33.048996][ T2451]  ? _raw_spin_unlock+0x28/0x50
> [   33.049285][ T2451]  ?
> btrfs_clean_one_deleted_snapshot+0x2b2/0x420
> [   33.049660][ T2451]  cleaner_kthread+0x2ee/0x480
> [   33.049941][ T2451]  ? __pfx_cleaner_kthread+0x10/0x10
> [   33.050252][ T2451]  kthread+0x2c1/0x3a0
> [   33.050492][ T2451]  ? _raw_spin_unlock_irq+0x23/0x50
> [   33.050796][ T2451]  ? __pfx_kthread+0x10/0x10
> [   33.051067][ T2451]  ret_from_fork+0x45/0x80
> [   33.051333][ T2451]  ? __pfx_kthread+0x10/0x10
> [   33.051603][ T2451]  ret_from_fork_asm+0x1a/0x30
> [   33.051887][ T2451]  </TASK>
> [   33.052066][ T2451] Modules linked in:
> [   33.052316][ T2451] ---[ end trace 0000000000000000 ]---
> [   33.052634][ T2451] RIP: 0010:iput+0x896/0x920
> 
> Please note inode 0xffff88801fb87738, which was set debug flag by
> __inode_add_lru() in the run_delayed_iput_locked() code path. This 
> indicates that the inode was not actually evicted. 
> Then, please note the double evict log, which shows that this
> inode which was set debug flag and was not be evicted is 
> still on the sb list and was successfully acquired by btrfs_iget(). 
> Next, take note of the unmount log, indicating that there was 
> a concurrent unmount thread which has already called evict_inodes()
> at that time. 
> Finally, observe where the BUG was triggered: the state of inode
> 0xffff88801fb87738 had been set to 0x60 by the unmount thread calling
> inode_clear(), and at the same time, btrfs-cleaner called iput(),
> ultimately leading to the BUG at the beginning check of iput(). 
> Although the point of trigger differs from the one reported by 
> syzbot, there is no doubt that they are the same bug.
> 
> You could apply the above diff and try to reproduce the bug using the
> reproducer[1].
> 
> If there is anything I'm missing or more info was needed, please let
> me know.
> 
> [1]: https://syzkaller.appspot.com/x/repro.c?x=14c57f16980000
> 
> Thanks.

Julian Sun Aug. 22, 2024, 12:45 p.m. UTC | #4

Hi, Josef

I believe there is a bug in the following scenario, but I'm not sure
if it is the same bug reported by syzbot. Do you have any idea?

umount thread:                     btrfs-cleaner thread:
                                   btrfs_run_delayed_iputs()
                                     ->run_delayed_iput_locked()
  btrfs_kill_super()                   ->iput(inode)
    ->generic_shutdown_super()           ->spin_lock(inode)
      ->evict_inodes()               	 // inode->i_count dec to 0
       ->spin_lock(inode)                ->iput_final()
                                          // cause some reason, get into
                                          // __inode_add_lru
       // passed i_count==0 test          ->__inode_add_lru()
       // and then schedule out		  // so iput_final() returned wich I_FREEING was not set
                                          // note here: the inode still in the sb list
				     ->__btrfs_run_defrag_inode()
					    ->btrfs_iget()
  		                              ->find_inode()
                                                ->spin_lock(inode)
                                                ->__iget(); // i_count inc to 1
                                                ->spin_unlock(inode);
     // schedule back
     spin_lock(inode)
     // I_FREEING was not set
     // so we continue
     set I_FREEING flag
     spin_unlock(inode)                   ->iput() 
     // put the inode into dispose          ->spin_lock(inode)
     // list                                    // dec i_count to 0
  ->dispose_list()                          ->iput_final()
    ->evict()                                 ->spin_unlock()
                                              ->evict()
                                              
Now, we have two threads simultaneously evicting
the same inode, which led to a bug.
I think this can be addressed by protecting the 
atomic_read(inode->i_count) in evict_inode() with
inode->i_lock.

David Sterba Aug. 26, 2024, 3:39 p.m. UTC | #5

On Thu, Aug 22, 2024 at 08:45:24PM +0800, Julian Sun wrote:
> Hi, Josef
> 
> I believe there is a bug in the following scenario, but I'm not sure
> if it is the same bug reported by syzbot. Do you have any idea?
> 
> umount thread:                     btrfs-cleaner thread:
>                                    btrfs_run_delayed_iputs()
>                                      ->run_delayed_iput_locked()
>   btrfs_kill_super()                   ->iput(inode)
>     ->generic_shutdown_super()           ->spin_lock(inode)
>       ->evict_inodes()               	 // inode->i_count dec to 0
>        ->spin_lock(inode)                ->iput_final()
>                                           // cause some reason, get into
>                                           // __inode_add_lru
>        // passed i_count==0 test          ->__inode_add_lru()
>        // and then schedule out		  // so iput_final() returned wich I_FREEING was not set
>                                           // note here: the inode still in the sb list
> 				     ->__btrfs_run_defrag_inode()
> 					    ->btrfs_iget()
>   		                              ->find_inode()
>                                                 ->spin_lock(inode)
>                                                 ->__iget(); // i_count inc to 1
>                                                 ->spin_unlock(inode);
>      // schedule back
>      spin_lock(inode)
>      // I_FREEING was not set
>      // so we continue
>      set I_FREEING flag
>      spin_unlock(inode)                   ->iput() 
>      // put the inode into dispose          ->spin_lock(inode)
>      // list                                    // dec i_count to 0
>   ->dispose_list()                          ->iput_final()
>     ->evict()                                 ->spin_unlock()
>                                               ->evict()
>                                               
> Now, we have two threads simultaneously evicting
> the same inode, which led to a bug.
> I think this can be addressed by protecting the 
> atomic_read(inode->i_count) in evict_inode() with
> inode->i_lock.

For reference, this will be fixed in VFS

https://lore.kernel.org/linux-btrfs/20240823130730.658881-1-sunjunchao2870@gmail.com/

btrfs: fix the race between umount and btrfs-cleaner

Commit Message

Comments

Patch