diff mbox

writeback: Write dirty times for WB_SYNC_ALL writeback

Message ID 1469525900-9212-1-git-send-email-jack@suse.cz (mailing list archive)
State New, archived
Headers show

Commit Message

Jan Kara July 26, 2016, 9:38 a.m. UTC
Currently we take care to handle I_DIRTY_TIME in vfs_fsync() and
queue_io() so that inodes which have only dirty timestamps are properly
written on fsync(2) and sync(2). However there are other call sites -
most notably going through write_inode_now() - which expect inode to be
clean after WB_SYNC_ALL writeback. This is not currently true as we do
not clear I_DIRTY_TIME in __writeback_single_inode() even for
WB_SYNC_ALL writeback in all the cases. This then resulted in the
following oops because bdev_write_inode() did not clean the inode and
writeback code later stumbled over a dirty inode with detached wb.

  general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
  Modules linked in:
  CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Workqueue: writeback wb_workfn (flush-11:0)
  task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000
  RIP: 0010:[<ffffffff818884d2>]  [<ffffffff818884d2>]
  locked_inode_to_wb_and_lock_list+0xa2/0x750
  RSP: 0018:ffff88006cdaf7d0  EFLAGS: 00010246
  RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
  RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
  RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
  R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
  R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
  FS:  0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
  DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
  Stack:
   ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
   ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
   ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
  Call Trace:
   [<     inline     >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
   [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
   [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
   [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
   [<     inline     >] wb_do_writeback fs/fs-writeback.c:1844
   [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
   [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
   [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
   [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
   [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
  Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
  00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42>
  80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
  RIP  [<     inline     >] wb_get include/linux/backing-dev-defs.h:212
  RIP  [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750
  fs/fs-writeback.c:281
   RSP <ffff88006cdaf7d0>
  ---[ end trace 986a4d314dcb2694 ]---

Fix the problem by making sure __writeback_single_inode() writes inode
only with dirty times in WB_SYNC_ALL mode.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c | 1 +
 1 file changed, 1 insertion(+)

Jens, can you please take this patch? Thanks!
								Honza

Comments

Jens Axboe July 26, 2016, 2:16 p.m. UTC | #1
On 07/26/2016 03:38 AM, Jan Kara wrote:
> Currently we take care to handle I_DIRTY_TIME in vfs_fsync() and
> queue_io() so that inodes which have only dirty timestamps are properly
> written on fsync(2) and sync(2). However there are other call sites -
> most notably going through write_inode_now() - which expect inode to be
> clean after WB_SYNC_ALL writeback. This is not currently true as we do
> not clear I_DIRTY_TIME in __writeback_single_inode() even for
> WB_SYNC_ALL writeback in all the cases. This then resulted in the
> following oops because bdev_write_inode() did not clean the inode and
> writeback code later stumbled over a dirty inode with detached wb.
>
>   general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
>   Modules linked in:
>   CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>   Workqueue: writeback wb_workfn (flush-11:0)
>   task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000
>   RIP: 0010:[<ffffffff818884d2>]  [<ffffffff818884d2>]
>   locked_inode_to_wb_and_lock_list+0xa2/0x750
>   RSP: 0018:ffff88006cdaf7d0  EFLAGS: 00010246
>   RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
>   RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
>   RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
>   R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
>   R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
>   FS:  0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
>   DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
>   DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
>   Stack:
>    ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
>    ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
>    ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
>   Call Trace:
>    [<     inline     >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
>    [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
>    [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
>    [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
>    [<     inline     >] wb_do_writeback fs/fs-writeback.c:1844
>    [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
>    [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
>    [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
>    [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
>    [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
>   Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
>   00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42>
>   80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
>   RIP  [<     inline     >] wb_get include/linux/backing-dev-defs.h:212
>   RIP  [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750
>   fs/fs-writeback.c:281
>    RSP <ffff88006cdaf7d0>
>   ---[ end trace 986a4d314dcb2694 ]---
>
> Fix the problem by making sure __writeback_single_inode() writes inode
> only with dirty times in WB_SYNC_ALL mode.
>
> Reported-by: Dmitry Vyukov <dvyukov@google.com>
> Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/fs-writeback.c | 1 +
>  1 file changed, 1 insertion(+)
>
> Jens, can you please take this patch? Thanks!

Yep, applied for 4.8, thanks!
Laurent Dufour Dec. 5, 2016, 2 p.m. UTC | #2
On 26/07/2016 11:38, Jan Kara wrote:
> Currently we take care to handle I_DIRTY_TIME in vfs_fsync() and
> queue_io() so that inodes which have only dirty timestamps are properly
> written on fsync(2) and sync(2). However there are other call sites -
> most notably going through write_inode_now() - which expect inode to be
> clean after WB_SYNC_ALL writeback. This is not currently true as we do
> not clear I_DIRTY_TIME in __writeback_single_inode() even for
> WB_SYNC_ALL writeback in all the cases. This then resulted in the
> following oops because bdev_write_inode() did not clean the inode and
> writeback code later stumbled over a dirty inode with detached wb.
> 
>   general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
>   Modules linked in:
>   CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>   Workqueue: writeback wb_workfn (flush-11:0)
>   task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000
>   RIP: 0010:[<ffffffff818884d2>]  [<ffffffff818884d2>]
>   locked_inode_to_wb_and_lock_list+0xa2/0x750
>   RSP: 0018:ffff88006cdaf7d0  EFLAGS: 00010246
>   RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
>   RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
>   RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
>   R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
>   R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
>   FS:  0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
>   DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
>   DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
>   Stack:
>    ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
>    ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
>    ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
>   Call Trace:
>    [<     inline     >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
>    [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
>    [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
>    [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
>    [<     inline     >] wb_do_writeback fs/fs-writeback.c:1844
>    [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
>    [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
>    [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
>    [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
>    [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
>   Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
>   00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42>
>   80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
>   RIP  [<     inline     >] wb_get include/linux/backing-dev-defs.h:212
>   RIP  [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750
>   fs/fs-writeback.c:281
>    RSP <ffff88006cdaf7d0>
>   ---[ end trace 986a4d314dcb2694 ]---
> 
> Fix the problem by making sure __writeback_single_inode() writes inode
> only with dirty times in WB_SYNC_ALL mode.
> 
> Reported-by: Dmitry Vyukov <dvyukov@google.com>
> Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/fs-writeback.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> Jens, can you please take this patch? Thanks!
> 								Honza
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 989a2cef6b76..51efb644410a 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1291,6 +1291,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
>  	dirty = inode->i_state & I_DIRTY;
>  	if (inode->i_state & I_DIRTY_TIME) {
>  		if ((dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) ||
> +		    wbc->sync_mode == WB_SYNC_ALL ||
>  		    unlikely(inode->i_state & I_DIRTY_TIME_EXPIRED) ||
>  		    unlikely(time_after(jiffies,
>  					(inode->dirtied_time_when +
> 

Hi Jan,

I'm sorry to say that but this bug is surfacing again.

We got it using the latest Ubuntu 16.04 kernel but I did some test using
a 4.8 kernel and I was able to get it again.
It's not easy to recreate, we have to let guest running for a while with
several disks attached and a database test program which trigger this
disk is run to get it.

Here is the stack I got :

[113031.075540] Unable to handle kernel paging request for data at
address 0x00000000
[113031.075614] Faulting instruction address: 0xc0000000003692e0
0:mon> t
[c0000000fb65f900] c00000000036cb6c writeback_sb_inodes+0x30c/0x590
[c0000000fb65fa10] c00000000036ced4 __writeback_inodes_wb+0xe4/0x150
[c0000000fb65fa70] c00000000036d33c wb_writeback+0x30c/0x450
[c0000000fb65fb40] c00000000036e198 wb_workfn+0x268/0x580
[c0000000fb65fc50] c0000000000f3470 process_one_work+0x1e0/0x590
[c0000000fb65fce0] c0000000000f38c8 worker_thread+0xa8/0x660
[c0000000fb65fd80] c0000000000fc4b0 kthread+0x110/0x130
[c0000000fb65fe30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c
--- Exception: 0  at 0000000000000000
0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c0000000fb65f620]
    pc: c0000000003692e0: locked_inode_to_wb_and_lock_list+0x50/0x290
    lr: c00000000036cb6c: writeback_sb_inodes+0x30c/0x590
    sp: c0000000fb65f8a0
   msr: 800000010280b033
   dar: 0
 dsisr: 40000000
  current = 0xc0000001d69be400
  paca    = 0xc000000003480000	 softe: 0	 irq_happened: 0x01
    pid   = 18689, comm = kworker/u16:10
Linux version 4.8.0 (laurent@lucky05) (gcc version 5.4.0 20160609
(Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #1 SMP Thu Dec 1 09:25:13 CST 2016
0:mon> r
R00 = c00000000036cb6c   R16 = c0000001fc0312e8
R01 = c0000000fb65f8a0   R17 = c0000001fc031260
R02 = c000000001471600   R18 = c0000001fc031350
R03 = c0000001fc031260   R19 = 0000000000000000
R04 = c0000001d69beee0   R20 = 0000000000000000
R05 = 0000000000000000   R21 = c0000000fb65c000
R06 = 00000001fed50000   R22 = c000000014960b10
R07 = 00029313df052efd   R23 = c000000014960af0
R08 = 0000000000000000   R24 = 0000000000000000
R09 = 0000000000000000   R25 = c0000001fc0312e8
R10 = 0000000080000000   R26 = 0000000000000000
R11 = 071c71c71c71c71c   R27 = 0000000000000000
R12 = 0000000000000000   R28 = 0000000000000001
R13 = c000000003480000   R29 = c0000001fc031260
R14 = c0000000000fc3a8   R30 = c0000000fb65fba0
R15 = 0000000000000000   R31 = 0000000000000000
pc  = c0000000003692e0 locked_inode_to_wb_and_lock_list+0x50/0x290
cfar= c0000000000a14d0 lparcfg_data+0xc10/0xda0
lr  = c00000000036cb6c writeback_sb_inodes+0x30c/0x590
msr = 800000010280b033   cr  = 24652882
ctr = c000000000126e20   xer = 0000000020000000   trap =  300
dar = 0000000000000000   dsisr = 40000000

The panic is occuring when entering locked_inode_to_wb_and_lock_list(),
the inode->i_wb field is NULL. I'm almost sure that we didn't loop in
locked_inode_to_wb_and_lock_list() because the LR register is still
pointing to the caller.

Note that on another CPU, 'systemd-udevd' was doing this:
2:mon> t
[link register   ] c00000000034db4c __destroy_inode+0xac/0x220
[c0000001dfb57c80] c00000000034db40 __destroy_inode+0xa0/0x220 (unreliable)
[c0000001dfb57cb0] c00000000034e094 destroy_inode+0x44/0xc0
[c0000001dfb57ce0] c000000000346ae8 dentry_unlink_inode+0x138/0x220
[c0000001dfb57d10] c000000000346e58 __dentry_kill+0xe8/0x290
[c0000001dfb57d50] c0000000003289a8 __fput+0x1a8/0x310
[c0000001dfb57db0] c0000000000f90e0 task_work_run+0x130/0x190
[c0000001dfb57e00] c00000000001c124 do_notify_resume+0xc4/0xd0
[c0000001dfb57e30] c000000000009b44 ret_from_except_lite+0x70/0x74
--- Exception: c00 (System Call) at 00003fff87b568a8

I checked for the registers and the 2 CPUs are not dealing with the same
inode so not sure there is a direct link.

I'll try to recreate it on 4.9-rc8 but in the mean time if you have any
idea on what's going wrong here...

Thanks,
Laurent.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara Dec. 7, 2016, 1:43 p.m. UTC | #3
Hi Laurent!

On Mon 05-12-16 15:00:45, Laurent Dufour wrote:
> I'm sorry to say that but this bug is surfacing again.

Well, thanks for report! ;)

> We got it using the latest Ubuntu 16.04 kernel but I did some test using
> a 4.8 kernel and I was able to get it again.
> It's not easy to recreate, we have to let guest running for a while with
> several disks attached and a database test program which trigger this
> disk is run to get it.
> 
> Here is the stack I got :
> 
> [113031.075540] Unable to handle kernel paging request for data at
> address 0x00000000
> [113031.075614] Faulting instruction address: 0xc0000000003692e0
> 0:mon> t
> [c0000000fb65f900] c00000000036cb6c writeback_sb_inodes+0x30c/0x590
> [c0000000fb65fa10] c00000000036ced4 __writeback_inodes_wb+0xe4/0x150
> [c0000000fb65fa70] c00000000036d33c wb_writeback+0x30c/0x450
> [c0000000fb65fb40] c00000000036e198 wb_workfn+0x268/0x580
> [c0000000fb65fc50] c0000000000f3470 process_one_work+0x1e0/0x590
> [c0000000fb65fce0] c0000000000f38c8 worker_thread+0xa8/0x660
> [c0000000fb65fd80] c0000000000fc4b0 kthread+0x110/0x130
> [c0000000fb65fe30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c
> --- Exception: 0  at 0000000000000000
> 0:mon> e
> cpu 0x0: Vector: 300 (Data Access) at [c0000000fb65f620]
>     pc: c0000000003692e0: locked_inode_to_wb_and_lock_list+0x50/0x290
>     lr: c00000000036cb6c: writeback_sb_inodes+0x30c/0x590
>     sp: c0000000fb65f8a0
>    msr: 800000010280b033
>    dar: 0
>  dsisr: 40000000
>   current = 0xc0000001d69be400
>   paca    = 0xc000000003480000	 softe: 0	 irq_happened: 0x01
>     pid   = 18689, comm = kworker/u16:10
> Linux version 4.8.0 (laurent@lucky05) (gcc version 5.4.0 20160609
> (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #1 SMP Thu Dec 1 09:25:13 CST 2016
> 0:mon> r
> R00 = c00000000036cb6c   R16 = c0000001fc0312e8
> R01 = c0000000fb65f8a0   R17 = c0000001fc031260
> R02 = c000000001471600   R18 = c0000001fc031350
> R03 = c0000001fc031260   R19 = 0000000000000000
> R04 = c0000001d69beee0   R20 = 0000000000000000
> R05 = 0000000000000000   R21 = c0000000fb65c000
> R06 = 00000001fed50000   R22 = c000000014960b10
> R07 = 00029313df052efd   R23 = c000000014960af0
> R08 = 0000000000000000   R24 = 0000000000000000
> R09 = 0000000000000000   R25 = c0000001fc0312e8
> R10 = 0000000080000000   R26 = 0000000000000000
> R11 = 071c71c71c71c71c   R27 = 0000000000000000
> R12 = 0000000000000000   R28 = 0000000000000001
> R13 = c000000003480000   R29 = c0000001fc031260
> R14 = c0000000000fc3a8   R30 = c0000000fb65fba0
> R15 = 0000000000000000   R31 = 0000000000000000
> pc  = c0000000003692e0 locked_inode_to_wb_and_lock_list+0x50/0x290
> cfar= c0000000000a14d0 lparcfg_data+0xc10/0xda0
> lr  = c00000000036cb6c writeback_sb_inodes+0x30c/0x590
> msr = 800000010280b033   cr  = 24652882
> ctr = c000000000126e20   xer = 0000000020000000   trap =  300
> dar = 0000000000000000   dsisr = 40000000
> 
> The panic is occuring when entering locked_inode_to_wb_and_lock_list(),
> the inode->i_wb field is NULL. I'm almost sure that we didn't loop in
> locked_inode_to_wb_and_lock_list() because the LR register is still
> pointing to the caller.

So this looks like a different problem (although it manifests similarly).
The problem is that for block device inodes, inode_detach_wb() is called
while inode is still alive from __blkdev_put() and flush worker trips over
the inode when inode->i_wb is set to NULL. It requires relatively precise
timing for __blkdev_put() to hit while writeback_sb_inodes() is just
starting to look at the block device inode before it sets I_SYNC on the
inode but it seems the race window is there. Another example why special
block device lifetime rules cause subtle issues. This needs careful thought
how to design the lifetime rules for a block device so that we just don't
keep stacking hacks on hacks... If anyone has clever idea, speak up. I'll
be looking into this once I sort out other outstanding issues so probably
next week...

								Honza
diff mbox

Patch

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 989a2cef6b76..51efb644410a 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1291,6 +1291,7 @@  __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	dirty = inode->i_state & I_DIRTY;
 	if (inode->i_state & I_DIRTY_TIME) {
 		if ((dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) ||
+		    wbc->sync_mode == WB_SYNC_ALL ||
 		    unlikely(inode->i_state & I_DIRTY_TIME_EXPIRED) ||
 		    unlikely(time_after(jiffies,
 					(inode->dirtied_time_when +