btrfs: fix warning in iput for bad-inode

Message ID	20110817185619.4660.78543.stgit@localhost6 (mailing list archive)
State	New, archived
Headers	show Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by demeter1.kernel.org (8.14.4/8.14.4) with ESMTP id p7HIuZar030130 for <patchwork-linux-btrfs@patchwork.kernel.org>; Wed, 17 Aug 2011 18:56:35 GMT Subject: [PATCH] btrfs: fix warning in iput for bad-inode To: linux-btrfs@vger.kernel.org, Chris Mason <chris.mason@oracle.com> From: Konstantin Khlebnikov <khlebnikov@openvz.org> Date: Wed, 17 Aug 2011 22:56:19 +0400 Message-ID: <20110817185619.4660.78543.stgit@localhost6> User-Agent: StGit/0.15 MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk

Konstantin Khlebnikov Aug. 17, 2011, 6:56 p.m. UTC

iput() shouldn't be called for inodes in I_NEW state,
lets call __destroy_inode() and btrfs_destroy_inode() instead

[    1.871723] WARNING: at fs/inode.c:1309 iput+0x1d9/0x200()
[    1.873722] Modules linked in:
[    1.873722] Pid: 1, comm: swapper Tainted: G        W   3.1.0-rc2-zurg #58
[    1.875722] Call Trace:
[    1.875722]  [<ffffffff8113cb99>] ? iput+0x1d9/0x200
[    1.876722]  [<ffffffff81044c3a>] warn_slowpath_common+0x7a/0xb0
[    1.877722]  [<ffffffff81044c85>] warn_slowpath_null+0x15/0x20
[    1.879722]  [<ffffffff8113cb99>] iput+0x1d9/0x200
[    1.879722]  [<ffffffff81295cf4>] btrfs_iget+0x1c4/0x450
[    1.881721]  [<ffffffff812b7e6b>] ? btrfs_tree_read_unlock_blocking+0x3b/0x60
[    1.882721]  [<ffffffff8111769a>] ? kmem_cache_free+0x2a/0x160
[    1.883721]  [<ffffffff812966f3>] btrfs_lookup_dentry+0x413/0x490
[    1.885721]  [<ffffffff8103b1e1>] ? get_parent_ip+0x11/0x50
[    1.886720]  [<ffffffff81296781>] btrfs_lookup+0x11/0x30
[    1.887720]  [<ffffffff8112de50>] d_alloc_and_lookup+0x40/0x80
[    1.888720]  [<ffffffff8113ac10>] ? d_lookup+0x30/0x50
[    1.889720]  [<ffffffff811301a8>] do_lookup+0x288/0x370
[    1.890720]  [<ffffffff8103b1e1>] ? get_parent_ip+0x11/0x50
[    1.891720]  [<ffffffff81132210>] do_last+0xe0/0x910
[    1.892720]  [<ffffffff81132b4d>] path_openat+0xcd/0x3a0
[    1.893719]  [<ffffffff813bab4b>] ? wait_for_xmitr+0x3b/0xa0
[    1.895719]  [<ffffffff8131c50a>] ? put_dec_full+0x5a/0xb0
[    1.896719]  [<ffffffff813babdb>] ? serial8250_console_putchar+0x2b/0x40
[    1.897719]  [<ffffffff81132e7d>] do_filp_open+0x3d/0xa0
[    1.898719]  [<ffffffff8103b1e1>] ? get_parent_ip+0x11/0x50
[    1.899718]  [<ffffffff8103b1e1>] ? get_parent_ip+0x11/0x50
[    1.900718]  [<ffffffff816e80fd>] ? sub_preempt_count+0x9d/0xd0
[    1.902718]  [<ffffffff8112a09d>] open_exec+0x2d/0xf0
[    1.903718]  [<ffffffff8112aaaf>] do_execve_common.isra.32+0x12f/0x340
[    1.906717]  [<ffffffff8112acd6>] do_execve+0x16/0x20
[    1.907717]  [<ffffffff8100af02>] sys_execve+0x42/0x70
[    1.908717]  [<ffffffff816ed968>] kernel_execve+0x68/0xd0
[    1.909717]  [<ffffffff816d828e>] ? run_init_process+0x1e/0x20
[    1.911717]  [<ffffffff816d831e>] init_post+0x8e/0xc0
[    1.912716]  [<ffffffff81cb8c79>] kernel_init+0x13d/0x13d
[    1.913716]  [<ffffffff816ed8f4>] kernel_thread_helper+0x4/0x10
[    1.914716]  [<ffffffff81cb8b3c>] ? start_kernel+0x33f/0x33f
[    1.915716]  [<ffffffff816ed8f0>] ? gs_change+0xb/0xb

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
---
 fs/btrfs/inode.c |   10 +++-------
 1 files changed, 3 insertions(+), 7 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sergei Trofimovich Aug. 29, 2011, 3:34 a.m. UTC | #1

On Wed, 17 Aug 2011 22:56:19 +0400
Konstantin Khlebnikov <khlebnikov@openvz.org> wrote:

> iput() shouldn't be called for inodes in I_NEW state,
> lets call __destroy_inode() and btrfs_destroy_inode() instead

Hi Konstantin!

Running 'sync' program after the load does not finish and eats 100%CPU
busy-waiting for something in kernel.

It's easy to reproduce hang with patch for me. I just run liferea and sync
after it. Without patch I haven't managed to hang btrfs up.

Sergei Trofimovich Aug. 30, 2011, 4:53 p.m. UTC | #2

> Running 'sync' program after the load does not finish and eats 100%CPU
> busy-waiting for something in kernel.
> 
> It's easy to reproduce hang with patch for me. I just run liferea and sync
> after it. Without patch I haven't managed to hang btrfs up.

And I think it's another btrfs bug. I've managed to reproduce it _without_
your patch and _without_ autodefrag enabled by manually running the
following commands:
    $ btrfs fi defrag file-with-20_000-extents
    $ sync

I think your patch just shuffles things a bit and forces autodefrag races
to pop-up sooner (which is good! :])

Josef Bacik Aug. 30, 2011, 6:02 p.m. UTC | #3

On 08/30/2011 12:53 PM, Sergei Trofimovich wrote:
>> Running 'sync' program after the load does not finish and eats 
>> 100%CPU busy-waiting for something in kernel.
>> 
>> It's easy to reproduce hang with patch for me. I just run
>> liferea and sync after it. Without patch I haven't managed to
>> hang btrfs up.
> 
> And I think it's another btrfs bug. I've managed to reproduce it 
> _without_ your patch and _without_ autodefrag enabled by manually 
> running the following commands: $ btrfs fi defrag 
> file-with-20_000-extents $ sync
> 
> I think your patch just shuffles things a bit and forces
> autodefrag races to pop-up sooner (which is good! :])
> 

Sergei, can you do sysrq+w when this is happening, and maybe turn on
the softlockup detector so we can see where sync is getting stuck?
Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sergei Trofimovich Aug. 30, 2011, 7:31 p.m. UTC | #4

On Tue, 30 Aug 2011 14:02:37 -0400
Josef Bacik <josef@redhat.com> wrote:

> On 08/30/2011 12:53 PM, Sergei Trofimovich wrote:
> >> Running 'sync' program after the load does not finish and eats 
> >> 100%CPU busy-waiting for something in kernel.
> >> 
> >> It's easy to reproduce hang with patch for me. I just run
> >> liferea and sync after it. Without patch I haven't managed to
> >> hang btrfs up.
> > 
> > And I think it's another btrfs bug. I've managed to reproduce it 
> > _without_ your patch and _without_ autodefrag enabled by manually 
> > running the following commands: $ btrfs fi defrag 
> > file-with-20_000-extents $ sync
> > 
> > I think your patch just shuffles things a bit and forces
> > autodefrag races to pop-up sooner (which is good! :])
> > 
> 
> Sergei, can you do sysrq+w when this is happening, and maybe turn on
> the softlockup detector so we can see where sync is getting stuck?
> Thanks,

Sure. As I keep telling about 2 cases in IRC I will state both here explicitely:

==The First Issue (aka "The Hung sync()" case) ==

- it's an unpatched linus's v3.1-rc4-80-g0f43dd5
- /dev/root on / type btrfs (rw,noatime,compress=lzo)
- 50% full 30GB filesystem (usual nonmixed mode)

How I hung it:
    $ /usr/sbin/filefrag ~/.bogofilter/wordlist.db 
        /home/st/.bogofilter/wordlist.db: 19070 extents found
      the file is 138MB sqlite database for bayesian SPAM filter,
      it's being read and written every 20 minutes or so.
      Maybe, it was writtent even in defrag/sync time!
    $~/dev/git/btrfs-progs-unstable/btrfs fi defrag ~/.bogofilter/wordlist.db 
    $ sync
    ^C<hung in D-state>

I didn't try to reproduce it yet.
As for lockdep I'll try but I'm afraid I will fail to reproduce, but I'll try tomorrow.
I suspect I'll need to seriously fragment some file first down to such horrible state.

With help of David I've some (hopefully relevant) info:
    #!/bin/sh -x
    
    for i in $(ps aux|grep " D[+ ]\?"|awk '{print $2}'); do
            ps $i
            sudo cat /proc/$i/stack
    done

  PID TTY      STAT   TIME COMMAND
 1291 ?        D      0:00 [btrfs-endio-wri]
[<ffffffff8130055d>] btrfs_tree_read_lock+0x6d/0x120
[<ffffffff812b8e88>] btrfs_search_slot+0x698/0x8b0
[<ffffffff812c9e18>] btrfs_lookup_csum+0x68/0x190
[<ffffffff812ca10f>] __btrfs_lookup_bio_sums+0x1cf/0x3e0
[<ffffffff812ca371>] btrfs_lookup_bio_sums+0x11/0x20
[<ffffffff812d6a50>] btrfs_submit_bio_hook+0x140/0x170
[<ffffffff812ed594>] submit_one_bio+0x64/0xa0
[<ffffffff812f14f5>] extent_readpages+0xe5/0x100
[<ffffffff812d7aaa>] btrfs_readpages+0x1a/0x20
[<ffffffff810a6a02>] __do_page_cache_readahead+0x1d2/0x280
[<ffffffff810a6d8c>] ra_submit+0x1c/0x20
[<ffffffff810a6ebd>] ondemand_readahead+0x12d/0x270
[<ffffffff810a70cc>] page_cache_sync_readahead+0x2c/0x40
[<ffffffff81309987>] __load_free_space_cache+0x1a7/0x5b0
[<ffffffff81309e61>] load_free_space_cache+0xd1/0x190
[<ffffffff812be07b>] cache_block_group+0xab/0x290
[<ffffffff812c3def>] find_free_extent.clone.71+0x39f/0xab0
[<ffffffff812c5160>] btrfs_reserve_extent+0xe0/0x170
[<ffffffff812c56df>] btrfs_alloc_free_block+0xcf/0x330
[<ffffffff812b498d>] __btrfs_cow_block+0x11d/0x4a0
[<ffffffff812b4df8>] btrfs_cow_block+0xe8/0x1a0
[<ffffffff812b8965>] btrfs_search_slot+0x175/0x8b0
[<ffffffff812c9e18>] btrfs_lookup_csum+0x68/0x190
[<ffffffff812caf6e>] btrfs_csum_file_blocks+0xbe/0x670
[<ffffffff812d7d91>] add_pending_csums.clone.39+0x41/0x60
[<ffffffff812da528>] btrfs_finish_ordered_io+0x218/0x310
[<ffffffff812da635>] btrfs_writepage_end_io_hook+0x15/0x20
[<ffffffff8130c71a>] end_compressed_bio_write+0x7a/0xe0
[<ffffffff811146f8>] bio_endio+0x18/0x30
[<ffffffff812cd8fc>] end_workqueue_fn+0xec/0x120
[<ffffffff812fb0ac>] worker_loop+0xac/0x520
[<ffffffff8105d486>] kthread+0x96/0xa0
[<ffffffff815f9214>] kernel_thread_helper+0x4/0x10
[<ffffffffffffffff>] 0xffffffffffffffff
  PID TTY      STAT   TIME COMMAND
 1296 ?        D      0:00 [btrfs-transacti]
[<ffffffff812d3d6e>] btrfs_commit_transaction+0x22e/0x870
[<ffffffff812cd5bb>] transaction_kthread+0x26b/0x280
[<ffffffff8105d486>] kthread+0x96/0xa0
[<ffffffff815f9214>] kernel_thread_helper+0x4/0x10
[<ffffffffffffffff>] 0xffffffffffffffff
  PID TTY      STAT   TIME COMMAND
 2891 pts/0    D+     0:00 sync
[<ffffffff8109d069>] sleep_on_page+0x9/0x10
[<ffffffff8109d28e>] wait_on_page_bit+0x6e/0x80
[<ffffffff8109d9aa>] filemap_fdatawait_range+0xfa/0x190
[<ffffffff8109da62>] filemap_fdatawait+0x22/0x30
[<ffffffff8110ae5a>] sync_inodes_sb+0x17a/0x1e0
[<ffffffff8110eae8>] __sync_filesystem+0x78/0x80
[<ffffffff8110eb07>] sync_one_sb+0x17/0x20
[<ffffffff810e9cc5>] iterate_supers+0x95/0xf0
[<ffffffff8110ea2b>] sync_filesystems+0x1b/0x20
[<ffffffff8110eb8c>] sys_sync+0x1c/0x40
[<ffffffff815f813b>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
  PID TTY      STAT   TIME COMMAND
cat: /proc/4454/stack: ??? ?????? ????? ??? ????????

disassebbled btrfs_tree_read_lock:

ffffffff813004f0 <btrfs_tree_read_lock> push   %rbp
ffffffff813004f1 <btrfs_tree_read_lock+0x1> mov    %gs:0xb5c0,%rax
ffffffff813004fa <btrfs_tree_read_lock+0xa> mov    %rsp,%rbp
ffffffff813004fd <btrfs_tree_read_lock+0xd> push   %r15
ffffffff813004ff <btrfs_tree_read_lock+0xf> push   %r14
ffffffff81300501 <btrfs_tree_read_lock+0x11> lea    0x6c(%rdi),%r14
ffffffff81300505 <btrfs_tree_read_lock+0x15> push   %r13
ffffffff81300507 <btrfs_tree_read_lock+0x17> lea    0x70(%rdi),%r13
ffffffff8130050b <btrfs_tree_read_lock+0x1b> push   %r12
ffffffff8130050d <btrfs_tree_read_lock+0x1d> lea    -0x60(%rbp),%r12
ffffffff81300511 <btrfs_tree_read_lock+0x21> push   %rbx
ffffffff81300512 <btrfs_tree_read_lock+0x22> lea    0x18(%r12),%r15
ffffffff81300517 <btrfs_tree_read_lock+0x27> lea    0x5c(%rdi),%rbx
ffffffff8130051b <btrfs_tree_read_lock+0x2b> sub    $0x58,%rsp
ffffffff8130051f <btrfs_tree_read_lock+0x2f> mov    %rdi,-0x70(%rbp)
ffffffff81300523 <btrfs_tree_read_lock+0x33> mov    %rax,-0x68(%rbp)
ffffffff81300527 <btrfs_tree_read_lock+0x37> mov    %rax,-0x78(%rbp)
ffffffff8130052b <btrfs_tree_read_lock+0x3b> nopl   0x0(%rax,%rax,1)
ffffffff81300530 <btrfs_tree_read_lock+0x40> mov    (%rbx),%eax
ffffffff81300532 <btrfs_tree_read_lock+0x42> test   %eax,%eax
ffffffff81300534 <btrfs_tree_read_lock+0x44> je     ffffffff8130057e <btrfs_tree_read_lock+0x8e>
ffffffff81300536 <btrfs_tree_read_lock+0x46> mov    -0x68(%rbp),%rax
ffffffff8130053a <btrfs_tree_read_lock+0x4a> movq   $0x0,-0x60(%rbp)
ffffffff81300542 <btrfs_tree_read_lock+0x52> movq   $0xffffffff8105d950,-0x50(%rbp)
ffffffff8130054a <btrfs_tree_read_lock+0x5a> mov    %r15,-0x48(%rbp)
ffffffff8130054e <btrfs_tree_read_lock+0x5e> mov    %r15,-0x40(%rbp)
ffffffff81300552 <btrfs_tree_read_lock+0x62> mov    %rax,-0x58(%rbp)
ffffffff81300556 <btrfs_tree_read_lock+0x66> jmp    ffffffff8130055d <btrfs_tree_read_lock+0x6d>
ffffffff81300558 <btrfs_tree_read_lock+0x68> callq  ffffffff815f4670 <schedule>
ffffffff8130055d <btrfs_tree_read_lock+0x6d> mov    $0x2,%edx
ffffffff81300562 <btrfs_tree_read_lock+0x72> mov    %r12,%rsi
ffffffff81300565 <btrfs_tree_read_lock+0x75> mov    %r13,%rdi
ffffffff81300568 <btrfs_tree_read_lock+0x78> callq  ffffffff8105db90 <prepare_to_wait>
ffffffff8130056d <btrfs_tree_read_lock+0x7d> mov    (%rbx),%eax
ffffffff8130056f <btrfs_tree_read_lock+0x7f> test   %eax,%eax
ffffffff81300571 <btrfs_tree_read_lock+0x81> jne    ffffffff81300558 <btrfs_tree_read_lock+0x68>
ffffffff81300573 <btrfs_tree_read_lock+0x83> mov    %r12,%rsi
ffffffff81300576 <btrfs_tree_read_lock+0x86> mov    %r13,%rdi
ffffffff81300579 <btrfs_tree_read_lock+0x89> callq  ffffffff8105da80 <finish_wait>
ffffffff8130057e <btrfs_tree_read_lock+0x8e> mov    %r14,%rdi
ffffffff81300581 <btrfs_tree_read_lock+0x91> callq  ffffffff815f76d0 <_raw_read_lock>
ffffffff81300586 <btrfs_tree_read_lock+0x96> mov    (%rbx),%eax
ffffffff81300588 <btrfs_tree_read_lock+0x98> test   %eax,%eax
ffffffff8130058a <btrfs_tree_read_lock+0x9a> je     ffffffff813005eb <btrfs_tree_read_lock+0xfb>
ffffffff8130058c <btrfs_tree_read_lock+0x9c> mov    %r14,%rdi
ffffffff8130058f <btrfs_tree_read_lock+0x9f> callq  ffffffff815f73f0 <_raw_read_unlock>
ffffffff81300594 <btrfs_tree_read_lock+0xa4> mov    (%rbx),%eax
ffffffff81300596 <btrfs_tree_read_lock+0xa6> test   %eax,%eax
ffffffff81300598 <btrfs_tree_read_lock+0xa8> je     ffffffff81300530 <btrfs_tree_read_lock+0x40>
ffffffff8130059a <btrfs_tree_read_lock+0xaa> mov    -0x78(%rbp),%rax
ffffffff8130059e <btrfs_tree_read_lock+0xae> movq   $0x0,-0x60(%rbp)
ffffffff813005a6 <btrfs_tree_read_lock+0xb6> movq   $0xffffffff8105d950,-0x50(%rbp)
ffffffff813005ae <btrfs_tree_read_lock+0xbe> mov    %r15,-0x48(%rbp)
ffffffff813005b2 <btrfs_tree_read_lock+0xc2> mov    %r15,-0x40(%rbp)
ffffffff813005b6 <btrfs_tree_read_lock+0xc6> mov    %rax,-0x58(%rbp)
ffffffff813005ba <btrfs_tree_read_lock+0xca> jmp    ffffffff813005c5 <btrfs_tree_read_lock+0xd5>
ffffffff813005bc <btrfs_tree_read_lock+0xcc> nopl   0x0(%rax)
ffffffff813005c0 <btrfs_tree_read_lock+0xd0> callq  ffffffff815f4670 <schedule>
ffffffff813005c5 <btrfs_tree_read_lock+0xd5> mov    $0x2,%edx
ffffffff813005ca <btrfs_tree_read_lock+0xda> mov    %r12,%rsi
ffffffff813005cd <btrfs_tree_read_lock+0xdd> mov    %r13,%rdi
ffffffff813005d0 <btrfs_tree_read_lock+0xe0> callq  ffffffff8105db90 <prepare_to_wait>
ffffffff813005d5 <btrfs_tree_read_lock+0xe5> mov    (%rbx),%eax
ffffffff813005d7 <btrfs_tree_read_lock+0xe7> test   %eax,%eax
ffffffff813005d9 <btrfs_tree_read_lock+0xe9> jne    ffffffff813005c0 <btrfs_tree_read_lock+0xd0>
ffffffff813005db <btrfs_tree_read_lock+0xeb> mov    %r12,%rsi
ffffffff813005de <btrfs_tree_read_lock+0xee> mov    %r13,%rdi
ffffffff813005e1 <btrfs_tree_read_lock+0xf1> callq  ffffffff8105da80 <finish_wait>
ffffffff813005e6 <btrfs_tree_read_lock+0xf6> jmpq   ffffffff81300530 <btrfs_tree_read_lock+0x40>
ffffffff813005eb <btrfs_tree_read_lock+0xfb> mov    -0x70(%rbp),%rax
ffffffff813005ef <btrfs_tree_read_lock+0xff> lock incl 0x58(%rax)
ffffffff813005f3 <btrfs_tree_read_lock+0x103> lock incl 0x64(%rax)
ffffffff813005f7 <btrfs_tree_read_lock+0x107> add    $0x58,%rsp
ffffffff813005fb <btrfs_tree_read_lock+0x10b> pop    %rbx
ffffffff813005fc <btrfs_tree_read_lock+0x10c> pop    %r12
ffffffff813005fe <btrfs_tree_read_lock+0x10e> pop    %r13
ffffffff81300600 <btrfs_tree_read_lock+0x110> pop    %r14
ffffffff81300602 <btrfs_tree_read_lock+0x112> pop    %r15
ffffffff81300604 <btrfs_tree_read_lock+0x114> leaveq 
ffffffff81300605 <btrfs_tree_read_lock+0x115> retq   
ffffffff81300606 <btrfs_tree_read_lock+0x116> nopw   %cs:0x0(%rax,%rax,1)

==The Second Issue (aka "The Busy Looping sync()" case) ==

The box is different from first, so conditions are a bit different.

- it's a _patched_ linus's 3.1.0-rc3-00271-g3230739 (+patch from $SUBJ on top)
- /dev/root on / type btrfs (rw,noatime,autodefrag)
  (note autodefrag!)
- 15% full 594GB filesystem (usual nonmixed mode)

How I hung it:
    $ liferea
    <wait it to calm down. it does a lot of SQLite reads/writes>
    $ sync
    Got CPU is 100% loaded

Perfectly reproducible.
I don't remembed it it exits or hangs. Will send detailed followup today.

Josef Bacik Aug. 30, 2011, 7:40 p.m. UTC | #5

On 08/30/2011 03:31 PM, Sergei Trofimovich wrote:
> On Tue, 30 Aug 2011 14:02:37 -0400 Josef Bacik <josef@redhat.com>
> wrote:
> 
>> On 08/30/2011 12:53 PM, Sergei Trofimovich wrote:
>>>> Running 'sync' program after the load does not finish and
>>>> eats 100%CPU busy-waiting for something in kernel.
>>>> 
>>>> It's easy to reproduce hang with patch for me. I just run 
>>>> liferea and sync after it. Without patch I haven't managed
>>>> to hang btrfs up.
>>> 
>>> And I think it's another btrfs bug. I've managed to reproduce
>>> it _without_ your patch and _without_ autodefrag enabled by
>>> manually running the following commands: $ btrfs fi defrag 
>>> file-with-20_000-extents $ sync
>>> 
>>> I think your patch just shuffles things a bit and forces 
>>> autodefrag races to pop-up sooner (which is good! :])
>>> 
>> 
>> Sergei, can you do sysrq+w when this is happening, and maybe turn
>> on the softlockup detector so we can see where sync is getting
>> stuck? Thanks,
> 
> Sure. As I keep telling about 2 cases in IRC I will state both here
> explicitely:
> 
> ==The First Issue (aka "The Hung sync()" case) ==
> 
> - it's an unpatched linus's v3.1-rc4-80-g0f43dd5 - /dev/root on /
> type btrfs (rw,noatime,compress=lzo) - 50% full 30GB filesystem
> (usual nonmixed mode)
> 
> How I hung it: $ /usr/sbin/filefrag ~/.bogofilter/wordlist.db 
> /home/st/.bogofilter/wordlist.db: 19070 extents found the file is
> 138MB sqlite database for bayesian SPAM filter, it's being read and
> written every 20 minutes or so. Maybe, it was writtent even in
> defrag/sync time! $~/dev/git/btrfs-progs-unstable/btrfs fi defrag
> ~/.bogofilter/wordlist.db $ sync ^C<hung in D-state>
> 
> I didn't try to reproduce it yet. As for lockdep I'll try but I'm
> afraid I will fail to reproduce, but I'll try tomorrow. I suspect
> I'll need to seriously fragment some file first down to such
> horrible state.
> 
> With help of David I've some (hopefully relevant) info: #!/bin/sh
> -x
> 
> for i in $(ps aux|grep " D[+ ]\?"|awk '{print $2}'); do ps $i sudo
> cat /proc/$i/stack done
> 
> PID TTY      STAT   TIME COMMAND 1291 ?        D      0:00
> [btrfs-endio-wri] [<ffffffff8130055d>]
> btrfs_tree_read_lock+0x6d/0x120 [<ffffffff812b8e88>]
> btrfs_search_slot+0x698/0x8b0 [<ffffffff812c9e18>]
> btrfs_lookup_csum+0x68/0x190 [<ffffffff812ca10f>]
> __btrfs_lookup_bio_sums+0x1cf/0x3e0 [<ffffffff812ca371>]
> btrfs_lookup_bio_sums+0x11/0x20 [<ffffffff812d6a50>]
> btrfs_submit_bio_hook+0x140/0x170 [<ffffffff812ed594>]
> submit_one_bio+0x64/0xa0 [<ffffffff812f14f5>]
> extent_readpages+0xe5/0x100 [<ffffffff812d7aaa>]
> btrfs_readpages+0x1a/0x20 [<ffffffff810a6a02>]
> __do_page_cache_readahead+0x1d2/0x280 [<ffffffff810a6d8c>]
> ra_submit+0x1c/0x20 [<ffffffff810a6ebd>]
> ondemand_readahead+0x12d/0x270 [<ffffffff810a70cc>]
> page_cache_sync_readahead+0x2c/0x40 [<ffffffff81309987>]
> __load_free_space_cache+0x1a7/0x5b0 [<ffffffff81309e61>]
> load_free_space_cache+0xd1/0x190 [<ffffffff812be07b>]
> cache_block_group+0xab/0x290 [<ffffffff812c3def>]
> find_free_extent.clone.71+0x39f/0xab0 [<ffffffff812c5160>]
> btrfs_reserve_extent+0xe0/0x170 [<ffffffff812c56df>]
> btrfs_alloc_free_block+0xcf/0x330 [<ffffffff812b498d>]
> __btrfs_cow_block+0x11d/0x4a0 [<ffffffff812b4df8>]
> btrfs_cow_block+0xe8/0x1a0 [<ffffffff812b8965>]
> btrfs_search_slot+0x175/0x8b0 [<ffffffff812c9e18>]
> btrfs_lookup_csum+0x68/0x190 [<ffffffff812caf6e>]
> btrfs_csum_file_blocks+0xbe/0x670 [<ffffffff812d7d91>]
> add_pending_csums.clone.39+0x41/0x60 [<ffffffff812da528>]
> btrfs_finish_ordered_io+0x218/0x310 [<ffffffff812da635>]
> btrfs_writepage_end_io_hook+0x15/0x20 [<ffffffff8130c71a>]
> end_compressed_bio_write+0x7a/0xe0 [<ffffffff811146f8>]
> bio_endio+0x18/0x30 [<ffffffff812cd8fc>]
> end_workqueue_fn+0xec/0x120 [<ffffffff812fb0ac>]
> worker_loop+0xac/0x520 [<ffffffff8105d486>] kthread+0x96/0xa0 
> [<ffffffff815f9214>] kernel_thread_helper+0x4/0x10 
> [<ffffffffffffffff>] 0xffffffffffffffff

Ok this should have been fixed with

Btrfs: use the commit_root for reading free_space_inode crcs

which is commit # 2cf8572dac62cc2ff7e995173e95b6c694401b3f.  Does your
kernel have this commit?  Because if it does then we did something
wrong.  If not it should be in linus's latest tree, so update and it
should go away (hopefully).  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Josef Bacik Aug. 30, 2011, 7:45 p.m. UTC | #6

On 08/30/2011 03:40 PM, Josef Bacik wrote:
> On 08/30/2011 03:31 PM, Sergei Trofimovich wrote:
>> On Tue, 30 Aug 2011 14:02:37 -0400 Josef Bacik <josef@redhat.com>
>> wrote:
>>
>>> On 08/30/2011 12:53 PM, Sergei Trofimovich wrote:
>>>>> Running 'sync' program after the load does not finish and
>>>>> eats 100%CPU busy-waiting for something in kernel.
>>>>>
>>>>> It's easy to reproduce hang with patch for me. I just run 
>>>>> liferea and sync after it. Without patch I haven't managed
>>>>> to hang btrfs up.
>>>>
>>>> And I think it's another btrfs bug. I've managed to reproduce
>>>> it _without_ your patch and _without_ autodefrag enabled by
>>>> manually running the following commands: $ btrfs fi defrag 
>>>> file-with-20_000-extents $ sync
>>>>
>>>> I think your patch just shuffles things a bit and forces 
>>>> autodefrag races to pop-up sooner (which is good! :])
>>>>
>>>
>>> Sergei, can you do sysrq+w when this is happening, and maybe turn
>>> on the softlockup detector so we can see where sync is getting
>>> stuck? Thanks,
>>
>> Sure. As I keep telling about 2 cases in IRC I will state both here
>> explicitely:
>>
>> ==The First Issue (aka "The Hung sync()" case) ==
>>
>> - it's an unpatched linus's v3.1-rc4-80-g0f43dd5 - /dev/root on /
>> type btrfs (rw,noatime,compress=lzo) - 50% full 30GB filesystem
>> (usual nonmixed mode)
>>
>> How I hung it: $ /usr/sbin/filefrag ~/.bogofilter/wordlist.db 
>> /home/st/.bogofilter/wordlist.db: 19070 extents found the file is
>> 138MB sqlite database for bayesian SPAM filter, it's being read and
>> written every 20 minutes or so. Maybe, it was writtent even in
>> defrag/sync time! $~/dev/git/btrfs-progs-unstable/btrfs fi defrag
>> ~/.bogofilter/wordlist.db $ sync ^C<hung in D-state>
>>
>> I didn't try to reproduce it yet. As for lockdep I'll try but I'm
>> afraid I will fail to reproduce, but I'll try tomorrow. I suspect
>> I'll need to seriously fragment some file first down to such
>> horrible state.
>>
>> With help of David I've some (hopefully relevant) info: #!/bin/sh
>> -x
>>
>> for i in $(ps aux|grep " D[+ ]\?"|awk '{print $2}'); do ps $i sudo
>> cat /proc/$i/stack done
>>
>> PID TTY      STAT   TIME COMMAND 1291 ?        D      0:00
>> [btrfs-endio-wri] [<ffffffff8130055d>]
>> btrfs_tree_read_lock+0x6d/0x120 [<ffffffff812b8e88>]
>> btrfs_search_slot+0x698/0x8b0 [<ffffffff812c9e18>]
>> btrfs_lookup_csum+0x68/0x190 [<ffffffff812ca10f>]
>> __btrfs_lookup_bio_sums+0x1cf/0x3e0 [<ffffffff812ca371>]
>> btrfs_lookup_bio_sums+0x11/0x20 [<ffffffff812d6a50>]
>> btrfs_submit_bio_hook+0x140/0x170 [<ffffffff812ed594>]
>> submit_one_bio+0x64/0xa0 [<ffffffff812f14f5>]
>> extent_readpages+0xe5/0x100 [<ffffffff812d7aaa>]
>> btrfs_readpages+0x1a/0x20 [<ffffffff810a6a02>]
>> __do_page_cache_readahead+0x1d2/0x280 [<ffffffff810a6d8c>]
>> ra_submit+0x1c/0x20 [<ffffffff810a6ebd>]
>> ondemand_readahead+0x12d/0x270 [<ffffffff810a70cc>]
>> page_cache_sync_readahead+0x2c/0x40 [<ffffffff81309987>]
>> __load_free_space_cache+0x1a7/0x5b0 [<ffffffff81309e61>]
>> load_free_space_cache+0xd1/0x190 [<ffffffff812be07b>]
>> cache_block_group+0xab/0x290 [<ffffffff812c3def>]
>> find_free_extent.clone.71+0x39f/0xab0 [<ffffffff812c5160>]
>> btrfs_reserve_extent+0xe0/0x170 [<ffffffff812c56df>]
>> btrfs_alloc_free_block+0xcf/0x330 [<ffffffff812b498d>]
>> __btrfs_cow_block+0x11d/0x4a0 [<ffffffff812b4df8>]
>> btrfs_cow_block+0xe8/0x1a0 [<ffffffff812b8965>]
>> btrfs_search_slot+0x175/0x8b0 [<ffffffff812c9e18>]
>> btrfs_lookup_csum+0x68/0x190 [<ffffffff812caf6e>]
>> btrfs_csum_file_blocks+0xbe/0x670 [<ffffffff812d7d91>]
>> add_pending_csums.clone.39+0x41/0x60 [<ffffffff812da528>]
>> btrfs_finish_ordered_io+0x218/0x310 [<ffffffff812da635>]
>> btrfs_writepage_end_io_hook+0x15/0x20 [<ffffffff8130c71a>]
>> end_compressed_bio_write+0x7a/0xe0 [<ffffffff811146f8>]
>> bio_endio+0x18/0x30 [<ffffffff812cd8fc>]
>> end_workqueue_fn+0xec/0x120 [<ffffffff812fb0ac>]
>> worker_loop+0xac/0x520 [<ffffffff8105d486>] kthread+0x96/0xa0 
>> [<ffffffff815f9214>] kernel_thread_helper+0x4/0x10 
>> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> Ok this should have been fixed with
> 
> Btrfs: use the commit_root for reading free_space_inode crcs
> 
> which is commit # 2cf8572dac62cc2ff7e995173e95b6c694401b3f.  Does your
> kernel have this commit?  Because if it does then we did something
> wrong.  If not it should be in linus's latest tree, so update and it
> should go away (hopefully).  Thanks,
> 

Oops looks like that patch won't fix it completely, I just sent another
patch that will fix this problem totally, sorry about that

[PATCH] Btrfs: skip locking if searching the commit root in lookup_csums

Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sergei Trofimovich Aug. 30, 2011, 8:46 p.m. UTC | #7

> On 08/30/2011 03:40 PM, Josef Bacik wrote:
> > On 08/30/2011 03:31 PM, Sergei Trofimovich wrote:
> >> On Tue, 30 Aug 2011 14:02:37 -0400 Josef Bacik <josef@redhat.com>
> >> wrote:
> >>
> >>> On 08/30/2011 12:53 PM, Sergei Trofimovich wrote:
> >>>>> Running 'sync' program after the load does not finish and
> >>>>> eats 100%CPU busy-waiting for something in kernel.
> >>>>>
> >>>>> It's easy to reproduce hang with patch for me. I just run 
> >>>>> liferea and sync after it. Without patch I haven't managed
> >>>>> to hang btrfs up.
> >>>>
> >>>> And I think it's another btrfs bug. I've managed to reproduce
> >>>> it _without_ your patch and _without_ autodefrag enabled by
> >>>> manually running the following commands: $ btrfs fi defrag 
> >>>> file-with-20_000-extents $ sync
> >>>>
> >>>> I think your patch just shuffles things a bit and forces 
> >>>> autodefrag races to pop-up sooner (which is good! :])
> >>>>
> >>>
> >>> Sergei, can you do sysrq+w when this is happening, and maybe turn
> >>> on the softlockup detector so we can see where sync is getting
> >>> stuck? Thanks,
> >>
> >> Sure. As I keep telling about 2 cases in IRC I will state both here
> >> explicitely:
> >>
> >> ==The First Issue (aka "The Hung sync()" case) ==
> >>
> >> - it's an unpatched linus's v3.1-rc4-80-g0f43dd5 - /dev/root on /
> >> type btrfs (rw,noatime,compress=lzo) - 50% full 30GB filesystem
> >> (usual nonmixed mode)
> >>
> >> How I hung it: $ /usr/sbin/filefrag ~/.bogofilter/wordlist.db 
> >> /home/st/.bogofilter/wordlist.db: 19070 extents found the file is
> >> 138MB sqlite database for bayesian SPAM filter, it's being read and
> >> written every 20 minutes or so. Maybe, it was writtent even in
> >> defrag/sync time! $~/dev/git/btrfs-progs-unstable/btrfs fi defrag
> >> ~/.bogofilter/wordlist.db $ sync ^C<hung in D-state>
> >>
> >> I didn't try to reproduce it yet. As for lockdep I'll try but I'm
> >> afraid I will fail to reproduce, but I'll try tomorrow. I suspect
> >> I'll need to seriously fragment some file first down to such
> >> horrible state.
> >>
> >> With help of David I've some (hopefully relevant) info: #!/bin/sh
> >> -x
> >>
> >> for i in $(ps aux|grep " D[+ ]\?"|awk '{print $2}'); do ps $i sudo
> >> cat /proc/$i/stack done
> >>
> >> PID TTY      STAT   TIME COMMAND 1291 ?        D      0:00
> >> [btrfs-endio-wri] [<ffffffff8130055d>]
> >> btrfs_tree_read_lock+0x6d/0x120 [<ffffffff812b8e88>]
> >> btrfs_search_slot+0x698/0x8b0 [<ffffffff812c9e18>]
> >> btrfs_lookup_csum+0x68/0x190 [<ffffffff812ca10f>]
> >> __btrfs_lookup_bio_sums+0x1cf/0x3e0 [<ffffffff812ca371>]
> >> btrfs_lookup_bio_sums+0x11/0x20 [<ffffffff812d6a50>]
> >> btrfs_submit_bio_hook+0x140/0x170 [<ffffffff812ed594>]
> >> submit_one_bio+0x64/0xa0 [<ffffffff812f14f5>]
> >> extent_readpages+0xe5/0x100 [<ffffffff812d7aaa>]
> >> btrfs_readpages+0x1a/0x20 [<ffffffff810a6a02>]
> >> __do_page_cache_readahead+0x1d2/0x280 [<ffffffff810a6d8c>]
> >> ra_submit+0x1c/0x20 [<ffffffff810a6ebd>]
> >> ondemand_readahead+0x12d/0x270 [<ffffffff810a70cc>]
> >> page_cache_sync_readahead+0x2c/0x40 [<ffffffff81309987>]
> >> __load_free_space_cache+0x1a7/0x5b0 [<ffffffff81309e61>]
> >> load_free_space_cache+0xd1/0x190 [<ffffffff812be07b>]
> >> cache_block_group+0xab/0x290 [<ffffffff812c3def>]
> >> find_free_extent.clone.71+0x39f/0xab0 [<ffffffff812c5160>]
> >> btrfs_reserve_extent+0xe0/0x170 [<ffffffff812c56df>]
> >> btrfs_alloc_free_block+0xcf/0x330 [<ffffffff812b498d>]
> >> __btrfs_cow_block+0x11d/0x4a0 [<ffffffff812b4df8>]
> >> btrfs_cow_block+0xe8/0x1a0 [<ffffffff812b8965>]
> >> btrfs_search_slot+0x175/0x8b0 [<ffffffff812c9e18>]
> >> btrfs_lookup_csum+0x68/0x190 [<ffffffff812caf6e>]
> >> btrfs_csum_file_blocks+0xbe/0x670 [<ffffffff812d7d91>]
> >> add_pending_csums.clone.39+0x41/0x60 [<ffffffff812da528>]
> >> btrfs_finish_ordered_io+0x218/0x310 [<ffffffff812da635>]
> >> btrfs_writepage_end_io_hook+0x15/0x20 [<ffffffff8130c71a>]
> >> end_compressed_bio_write+0x7a/0xe0 [<ffffffff811146f8>]
> >> bio_endio+0x18/0x30 [<ffffffff812cd8fc>]
> >> end_workqueue_fn+0xec/0x120 [<ffffffff812fb0ac>]
> >> worker_loop+0xac/0x520 [<ffffffff8105d486>] kthread+0x96/0xa0 
> >> [<ffffffff815f9214>] kernel_thread_helper+0x4/0x10 
> >> [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > Ok this should have been fixed with
> > 
> > Btrfs: use the commit_root for reading free_space_inode crcs
> > 
> > which is commit # 2cf8572dac62cc2ff7e995173e95b6c694401b3f.  Does your
> > kernel have this commit?  Because if it does then we did something
> > wrong.  If not it should be in linus's latest tree, so update and it
> > should go away (hopefully).  Thanks,

Yeah, this one was in my local tree when hung.

> Oops looks like that patch won't fix it completely, I just sent another
> patch that will fix this problem totally, sorry about that
> 
> [PATCH] Btrfs: skip locking if searching the commit root in lookup_csums

I'll try to reproduce/test it tomorrow.

About the second one:

> ==The Second Issue (aka "The Busy Looping sync()" case) ==
> The box is different from first, so conditions are a bit different.
> - /dev/root on / type btrfs (rw,noatime,autodefrag)
>   (note autodefrag!)
> - 15% full 594GB filesystem (usual nonmixed mode)
> 
>    $ liferea
>    <wait it to calm down. it does a lot of SQLite reads/writes>
>    $ sync
>    Got CPU is 100% loaded <hung>

Still reproducible with 2 patches above + $SUBJ one. strace says it hangs in
strace() syscall. Stack trace is odd:
# cat /proc/`pidof sync`/stack
[<ffffffffffffffff>] 0xffffffffffffffff

Sergei Trofimovich Aug. 30, 2011, 9:17 p.m. UTC | #8

> About the second one:
> 
> > ==The Second Issue (aka "The Busy Looping sync()" case) ==
> > The box is different from first, so conditions are a bit different.
> > - /dev/root on / type btrfs (rw,noatime,autodefrag)
> >   (note autodefrag!)
> > - 15% full 594GB filesystem (usual nonmixed mode)
> > 
> >    $ liferea
> >    <wait it to calm down. it does a lot of SQLite reads/writes>
> >    $ sync
> >    Got CPU is 100% loaded <hung>
> 
> Still reproducible with 2 patches above + $SUBJ one. strace says it hangs in
> strace() syscall. Stack trace is odd:
> # cat /proc/`pidof sync`/stack
> [<ffffffffffffffff>] 0xffffffffffffffff

"got" stack of the looper with such hack:
    $ while :; do sudo echo "-"; cat /proc/2304/stack | fgrep -v '[<ffffffffffffffff>] 0xffffffffffffffff'; done | uniq

[<ffffffff8109d39d>] find_get_pages_tag+0x14d/0x1a0
[<ffffffff81076981>] __lock_acquire+0x5a1/0xbd0
[<ffffffff8109e7bd>] filemap_fdatawait_range+0x9d/0x1b0
[<ffffffff8109e8f2>] filemap_fdatawait+0x22/0x30
[<ffffffff81106a04>] sync_inodes_sb+0x1c4/0x250
[<ffffffff8110b398>] __sync_filesystem+0x78/0x80
[<ffffffff8110b3b7>] sync_one_sb+0x17/0x20
[<ffffffff810e3fbe>] iterate_supers+0x7e/0xe0
[<ffffffff8110b400>] sys_sync+0x40/0x60
[<ffffffff81456afb>] system_call_fastpath+0x16/0x1b
-
[<ffffffff811069f1>] sync_inodes_sb+0x1b1/0x250
[<ffffffff8110b398>] __sync_filesystem+0x78/0x80
[<ffffffff8110b3b7>] sync_one_sb+0x17/0x20
[<ffffffff810e3fbe>] iterate_supers+0x7e/0xe0
[<ffffffff8110b400>] sys_sync+0x40/0x60
[<ffffffff81456afb>] system_call_fastpath+0x16/0x1b
-
[<ffffffff811069f1>] sync_inodes_sb+0x1b1/0x250
[<ffffffff8110b398>] __sync_filesystem+0x78/0x80
[<ffffffff8110b3b7>] sync_one_sb+0x17/0x20
[<ffffffff810e3fbe>] iterate_supers+0x7e/0xe0
[<ffffffff8110b400>] sys_sync+0x40/0x60
[<ffffffff81456afb>] system_call_fastpath+0x16/0x1b
-
[<ffffffff814564b6>] retint_kernel+0x26/0x30
[<ffffffff8109d398>] find_get_pages_tag+0x148/0x1a0
[<ffffffff810a7a60>] pagevec_lookup_tag+0x20/0x30
[<ffffffff8109e7bd>] filemap_fdatawait_range+0x9d/0x1b0
[<ffffffff8109e8f2>] filemap_fdatawait+0x22/0x30
[<ffffffff81106a04>] sync_inodes_sb+0x1c4/0x250
[<ffffffff8110b398>] __sync_filesystem+0x78/0x80
[<ffffffff8110b3b7>] sync_one_sb+0x17/0x20
[<ffffffff810e3fbe>] iterate_supers+0x7e/0xe0
[<ffffffff8110b400>] sys_sync+0x40/0x60
[<ffffffff81456afb>] system_call_fastpath+0x16/0x1b

Looks like dirty inode count (or dirty page count) is affected by
the $SUBJ patch.

David Sterba Sept. 1, 2011, 2:45 a.m. UTC | #9

Hi,

I saw something different with your patch. After creating and deleting
lots of subvolumes followed by an umount, there is one unpleasant
interaction with orphans.

After first mount attempt, the mount fails with

# mount /dev/sdb5
mount: Stale NFS file handle

syslog:
[  256.050787] btrfs: could not do orphan cleanup -116
[  256.962781] btrfs: open_ctree failed


the -116 is -ESTALE which your patch adds.

Another mount attempt and success, no relevant message in syslog. Reproducible.


btrfs_orphan_cleanup
  btrfs_iget -> fails

2373                 inode = btrfs_iget(root->fs_info->sb, &found_key, root, NULL);
2374                 if (IS_ERR(inode)) {
2375                         ret = PTR_ERR(inode);
2376                         goto out;
2377                 }
...
2440 out:
2441         if (ret)
2442                 printk(KERN_CRIT "btrfs: could not do orphan cleanup %d\n", ret);
2443         btrfs_free_path(path);
2444         return ret;
2445 }


I'm concerned about leaving some unprocessed orphans "somewhere".


david
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

btrfs: fix warning in iput for bad-inode

Commit Message

Comments

Patch