Message ID | 1469525900-9212-1-git-send-email-jack@suse.cz (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 07/26/2016 03:38 AM, Jan Kara wrote: > Currently we take care to handle I_DIRTY_TIME in vfs_fsync() and > queue_io() so that inodes which have only dirty timestamps are properly > written on fsync(2) and sync(2). However there are other call sites - > most notably going through write_inode_now() - which expect inode to be > clean after WB_SYNC_ALL writeback. This is not currently true as we do > not clear I_DIRTY_TIME in __writeback_single_inode() even for > WB_SYNC_ALL writeback in all the cases. This then resulted in the > following oops because bdev_write_inode() did not clean the inode and > writeback code later stumbled over a dirty inode with detached wb. > > general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN > Modules linked in: > CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > Workqueue: writeback wb_workfn (flush-11:0) > task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000 > RIP: 0010:[<ffffffff818884d2>] [<ffffffff818884d2>] > locked_inode_to_wb_and_lock_list+0xa2/0x750 > RSP: 0018:ffff88006cdaf7d0 EFLAGS: 00010246 > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050 > RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286 > RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000 > R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0 > R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40 > FS: 0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0 > DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 > Stack: > ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948 > ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948 > ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e > Call Trace: > [< inline >] inode_to_wb_and_lock_list fs/fs-writeback.c:309 > [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554 > [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600 > [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709 > [< inline >] wb_do_writeback fs/fs-writeback.c:1844 > [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884 > [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094 > [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228 > [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303 > [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392 > Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e > 00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42> > 80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3 > RIP [< inline >] wb_get include/linux/backing-dev-defs.h:212 > RIP [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750 > fs/fs-writeback.c:281 > RSP <ffff88006cdaf7d0> > ---[ end trace 986a4d314dcb2694 ]--- > > Fix the problem by making sure __writeback_single_inode() writes inode > only with dirty times in WB_SYNC_ALL mode. > > Reported-by: Dmitry Vyukov <dvyukov@google.com> > Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> > Signed-off-by: Jan Kara <jack@suse.cz> > --- > fs/fs-writeback.c | 1 + > 1 file changed, 1 insertion(+) > > Jens, can you please take this patch? Thanks! Yep, applied for 4.8, thanks!
On 26/07/2016 11:38, Jan Kara wrote: > Currently we take care to handle I_DIRTY_TIME in vfs_fsync() and > queue_io() so that inodes which have only dirty timestamps are properly > written on fsync(2) and sync(2). However there are other call sites - > most notably going through write_inode_now() - which expect inode to be > clean after WB_SYNC_ALL writeback. This is not currently true as we do > not clear I_DIRTY_TIME in __writeback_single_inode() even for > WB_SYNC_ALL writeback in all the cases. This then resulted in the > following oops because bdev_write_inode() did not clean the inode and > writeback code later stumbled over a dirty inode with detached wb. > > general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN > Modules linked in: > CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > Workqueue: writeback wb_workfn (flush-11:0) > task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000 > RIP: 0010:[<ffffffff818884d2>] [<ffffffff818884d2>] > locked_inode_to_wb_and_lock_list+0xa2/0x750 > RSP: 0018:ffff88006cdaf7d0 EFLAGS: 00010246 > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050 > RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286 > RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000 > R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0 > R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40 > FS: 0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0 > DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 > Stack: > ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948 > ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948 > ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e > Call Trace: > [< inline >] inode_to_wb_and_lock_list fs/fs-writeback.c:309 > [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554 > [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600 > [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709 > [< inline >] wb_do_writeback fs/fs-writeback.c:1844 > [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884 > [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094 > [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228 > [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303 > [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392 > Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e > 00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42> > 80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3 > RIP [< inline >] wb_get include/linux/backing-dev-defs.h:212 > RIP [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750 > fs/fs-writeback.c:281 > RSP <ffff88006cdaf7d0> > ---[ end trace 986a4d314dcb2694 ]--- > > Fix the problem by making sure __writeback_single_inode() writes inode > only with dirty times in WB_SYNC_ALL mode. > > Reported-by: Dmitry Vyukov <dvyukov@google.com> > Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> > Signed-off-by: Jan Kara <jack@suse.cz> > --- > fs/fs-writeback.c | 1 + > 1 file changed, 1 insertion(+) > > Jens, can you please take this patch? Thanks! > Honza > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index 989a2cef6b76..51efb644410a 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -1291,6 +1291,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc) > dirty = inode->i_state & I_DIRTY; > if (inode->i_state & I_DIRTY_TIME) { > if ((dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) || > + wbc->sync_mode == WB_SYNC_ALL || > unlikely(inode->i_state & I_DIRTY_TIME_EXPIRED) || > unlikely(time_after(jiffies, > (inode->dirtied_time_when + > Hi Jan, I'm sorry to say that but this bug is surfacing again. We got it using the latest Ubuntu 16.04 kernel but I did some test using a 4.8 kernel and I was able to get it again. It's not easy to recreate, we have to let guest running for a while with several disks attached and a database test program which trigger this disk is run to get it. Here is the stack I got : [113031.075540] Unable to handle kernel paging request for data at address 0x00000000 [113031.075614] Faulting instruction address: 0xc0000000003692e0 0:mon> t [c0000000fb65f900] c00000000036cb6c writeback_sb_inodes+0x30c/0x590 [c0000000fb65fa10] c00000000036ced4 __writeback_inodes_wb+0xe4/0x150 [c0000000fb65fa70] c00000000036d33c wb_writeback+0x30c/0x450 [c0000000fb65fb40] c00000000036e198 wb_workfn+0x268/0x580 [c0000000fb65fc50] c0000000000f3470 process_one_work+0x1e0/0x590 [c0000000fb65fce0] c0000000000f38c8 worker_thread+0xa8/0x660 [c0000000fb65fd80] c0000000000fc4b0 kthread+0x110/0x130 [c0000000fb65fe30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c --- Exception: 0 at 0000000000000000 0:mon> e cpu 0x0: Vector: 300 (Data Access) at [c0000000fb65f620] pc: c0000000003692e0: locked_inode_to_wb_and_lock_list+0x50/0x290 lr: c00000000036cb6c: writeback_sb_inodes+0x30c/0x590 sp: c0000000fb65f8a0 msr: 800000010280b033 dar: 0 dsisr: 40000000 current = 0xc0000001d69be400 paca = 0xc000000003480000 softe: 0 irq_happened: 0x01 pid = 18689, comm = kworker/u16:10 Linux version 4.8.0 (laurent@lucky05) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #1 SMP Thu Dec 1 09:25:13 CST 2016 0:mon> r R00 = c00000000036cb6c R16 = c0000001fc0312e8 R01 = c0000000fb65f8a0 R17 = c0000001fc031260 R02 = c000000001471600 R18 = c0000001fc031350 R03 = c0000001fc031260 R19 = 0000000000000000 R04 = c0000001d69beee0 R20 = 0000000000000000 R05 = 0000000000000000 R21 = c0000000fb65c000 R06 = 00000001fed50000 R22 = c000000014960b10 R07 = 00029313df052efd R23 = c000000014960af0 R08 = 0000000000000000 R24 = 0000000000000000 R09 = 0000000000000000 R25 = c0000001fc0312e8 R10 = 0000000080000000 R26 = 0000000000000000 R11 = 071c71c71c71c71c R27 = 0000000000000000 R12 = 0000000000000000 R28 = 0000000000000001 R13 = c000000003480000 R29 = c0000001fc031260 R14 = c0000000000fc3a8 R30 = c0000000fb65fba0 R15 = 0000000000000000 R31 = 0000000000000000 pc = c0000000003692e0 locked_inode_to_wb_and_lock_list+0x50/0x290 cfar= c0000000000a14d0 lparcfg_data+0xc10/0xda0 lr = c00000000036cb6c writeback_sb_inodes+0x30c/0x590 msr = 800000010280b033 cr = 24652882 ctr = c000000000126e20 xer = 0000000020000000 trap = 300 dar = 0000000000000000 dsisr = 40000000 The panic is occuring when entering locked_inode_to_wb_and_lock_list(), the inode->i_wb field is NULL. I'm almost sure that we didn't loop in locked_inode_to_wb_and_lock_list() because the LR register is still pointing to the caller. Note that on another CPU, 'systemd-udevd' was doing this: 2:mon> t [link register ] c00000000034db4c __destroy_inode+0xac/0x220 [c0000001dfb57c80] c00000000034db40 __destroy_inode+0xa0/0x220 (unreliable) [c0000001dfb57cb0] c00000000034e094 destroy_inode+0x44/0xc0 [c0000001dfb57ce0] c000000000346ae8 dentry_unlink_inode+0x138/0x220 [c0000001dfb57d10] c000000000346e58 __dentry_kill+0xe8/0x290 [c0000001dfb57d50] c0000000003289a8 __fput+0x1a8/0x310 [c0000001dfb57db0] c0000000000f90e0 task_work_run+0x130/0x190 [c0000001dfb57e00] c00000000001c124 do_notify_resume+0xc4/0xd0 [c0000001dfb57e30] c000000000009b44 ret_from_except_lite+0x70/0x74 --- Exception: c00 (System Call) at 00003fff87b568a8 I checked for the registers and the 2 CPUs are not dealing with the same inode so not sure there is a direct link. I'll try to recreate it on 4.9-rc8 but in the mean time if you have any idea on what's going wrong here... Thanks, Laurent. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Laurent! On Mon 05-12-16 15:00:45, Laurent Dufour wrote: > I'm sorry to say that but this bug is surfacing again. Well, thanks for report! ;) > We got it using the latest Ubuntu 16.04 kernel but I did some test using > a 4.8 kernel and I was able to get it again. > It's not easy to recreate, we have to let guest running for a while with > several disks attached and a database test program which trigger this > disk is run to get it. > > Here is the stack I got : > > [113031.075540] Unable to handle kernel paging request for data at > address 0x00000000 > [113031.075614] Faulting instruction address: 0xc0000000003692e0 > 0:mon> t > [c0000000fb65f900] c00000000036cb6c writeback_sb_inodes+0x30c/0x590 > [c0000000fb65fa10] c00000000036ced4 __writeback_inodes_wb+0xe4/0x150 > [c0000000fb65fa70] c00000000036d33c wb_writeback+0x30c/0x450 > [c0000000fb65fb40] c00000000036e198 wb_workfn+0x268/0x580 > [c0000000fb65fc50] c0000000000f3470 process_one_work+0x1e0/0x590 > [c0000000fb65fce0] c0000000000f38c8 worker_thread+0xa8/0x660 > [c0000000fb65fd80] c0000000000fc4b0 kthread+0x110/0x130 > [c0000000fb65fe30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c > --- Exception: 0 at 0000000000000000 > 0:mon> e > cpu 0x0: Vector: 300 (Data Access) at [c0000000fb65f620] > pc: c0000000003692e0: locked_inode_to_wb_and_lock_list+0x50/0x290 > lr: c00000000036cb6c: writeback_sb_inodes+0x30c/0x590 > sp: c0000000fb65f8a0 > msr: 800000010280b033 > dar: 0 > dsisr: 40000000 > current = 0xc0000001d69be400 > paca = 0xc000000003480000 softe: 0 irq_happened: 0x01 > pid = 18689, comm = kworker/u16:10 > Linux version 4.8.0 (laurent@lucky05) (gcc version 5.4.0 20160609 > (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #1 SMP Thu Dec 1 09:25:13 CST 2016 > 0:mon> r > R00 = c00000000036cb6c R16 = c0000001fc0312e8 > R01 = c0000000fb65f8a0 R17 = c0000001fc031260 > R02 = c000000001471600 R18 = c0000001fc031350 > R03 = c0000001fc031260 R19 = 0000000000000000 > R04 = c0000001d69beee0 R20 = 0000000000000000 > R05 = 0000000000000000 R21 = c0000000fb65c000 > R06 = 00000001fed50000 R22 = c000000014960b10 > R07 = 00029313df052efd R23 = c000000014960af0 > R08 = 0000000000000000 R24 = 0000000000000000 > R09 = 0000000000000000 R25 = c0000001fc0312e8 > R10 = 0000000080000000 R26 = 0000000000000000 > R11 = 071c71c71c71c71c R27 = 0000000000000000 > R12 = 0000000000000000 R28 = 0000000000000001 > R13 = c000000003480000 R29 = c0000001fc031260 > R14 = c0000000000fc3a8 R30 = c0000000fb65fba0 > R15 = 0000000000000000 R31 = 0000000000000000 > pc = c0000000003692e0 locked_inode_to_wb_and_lock_list+0x50/0x290 > cfar= c0000000000a14d0 lparcfg_data+0xc10/0xda0 > lr = c00000000036cb6c writeback_sb_inodes+0x30c/0x590 > msr = 800000010280b033 cr = 24652882 > ctr = c000000000126e20 xer = 0000000020000000 trap = 300 > dar = 0000000000000000 dsisr = 40000000 > > The panic is occuring when entering locked_inode_to_wb_and_lock_list(), > the inode->i_wb field is NULL. I'm almost sure that we didn't loop in > locked_inode_to_wb_and_lock_list() because the LR register is still > pointing to the caller. So this looks like a different problem (although it manifests similarly). The problem is that for block device inodes, inode_detach_wb() is called while inode is still alive from __blkdev_put() and flush worker trips over the inode when inode->i_wb is set to NULL. It requires relatively precise timing for __blkdev_put() to hit while writeback_sb_inodes() is just starting to look at the block device inode before it sets I_SYNC on the inode but it seems the race window is there. Another example why special block device lifetime rules cause subtle issues. This needs careful thought how to design the lifetime rules for a block device so that we just don't keep stacking hacks on hacks... If anyone has clever idea, speak up. I'll be looking into this once I sort out other outstanding issues so probably next week... Honza
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 989a2cef6b76..51efb644410a 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1291,6 +1291,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc) dirty = inode->i_state & I_DIRTY; if (inode->i_state & I_DIRTY_TIME) { if ((dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) || + wbc->sync_mode == WB_SYNC_ALL || unlikely(inode->i_state & I_DIRTY_TIME_EXPIRED) || unlikely(time_after(jiffies, (inode->dirtied_time_when +