Message ID | 20160620133146.GE6882@quack2.suse.cz (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Mon, Jun 20, 2016 at 3:31 PM, Jan Kara <jack@suse.cz> wrote: > Hi, > > On Fri 17-06-16 12:04:05, Tejun Heo wrote: >> 43d1c0eb7e11 ("block: detach bdev inode from its wb in >> __blkdev_put()") detached bdev inode from its wb as the bdev inode may >> outlive the underlying bdi and thus the wb. This is accomplished by >> invoking inode_detach_wb() from __blkdev_put(); however, while the >> inode can't be dirtied by the time control reaches there, that doesn't >> guarantee that writeback isn't already in progress on the inode. This >> can lead to the inode being disassociated from its wb while writeback >> operation is in flight causing oopses like the following. > > <snip> > >> Fix it by flushing the wb dwork before detaching the inode. Combined >> with the fact that the inode can no longer be dirtied, this guarantees >> that no writeback operation can be in flight or initiated. > > Sorry for the late reply but now when thinking about the patch I don't > think it is quite right. Writeback can happen from other contexts than just > the worker one (e.g. kswapd can do writeback, or in some out-of-memory > situations we may punt to doing writeback directly instead of calling the > worker, or sync(2) calls fdatawrite() for block device inode directly when > iterating through blockdev superblock). So flushing the workqueue IMHO is > not covering 100% of cases. > > I wanted to suggest to use inode_wait_for_writeback() which will make sure > writeback is done with the inode. However we effectively already do this > by calling bdev_write_inode() which calls writeback_single_inode() in > WB_SYNC_ALL mode. So by the time that call completes we are sure writeback > code is not looking at the inode. *However* what I think is happening is > that sync_blockdev() writes all the dirty pages, bdev_write_inode() writes > the inode and clears all dirty bits, however the inode still stays in the > b_dirty / b_io list of the wb because it has I_DIRTY_TIME set. Subsequently > once flusher work runs, it finds the inode, looks at it and boom. And the > problem seems to be that write_inode_now(inode, true) does not result in > I_DIRTY_TIME being cleared. > > Attached patch should fix this issue - it is compile-tested only. Dmitry, > can you check whether this patch fixes the issue for you as well? I can't directly test it because crash happened very infrequently. If Tejun/Ted agree that it is the right way to fix it, then I can patch it, restart the fuzzer and leave it running for a while. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello, On Mon, Jun 20, 2016 at 03:38:41PM +0200, Dmitry Vyukov wrote: > > Sorry for the late reply but now when thinking about the patch I don't > > think it is quite right. Writeback can happen from other contexts than just > > the worker one (e.g. kswapd can do writeback, or in some out-of-memory > > situations we may punt to doing writeback directly instead of calling the > > worker, or sync(2) calls fdatawrite() for block device inode directly when > > iterating through blockdev superblock). So flushing the workqueue IMHO is > > not covering 100% of cases. Hmmm, yeah, the patch undoes what the cgroup writeback changes added but it looks like the addition was exposing the existing problem more rather than causing a new one. > > I wanted to suggest to use inode_wait_for_writeback() which will make sure > > writeback is done with the inode. However we effectively already do this > > by calling bdev_write_inode() which calls writeback_single_inode() in > > WB_SYNC_ALL mode. So by the time that call completes we are sure writeback > > code is not looking at the inode. *However* what I think is happening is > > that sync_blockdev() writes all the dirty pages, bdev_write_inode() writes > > the inode and clears all dirty bits, however the inode still stays in the > > b_dirty / b_io list of the wb because it has I_DIRTY_TIME set. Subsequently > > once flusher work runs, it finds the inode, looks at it and boom. And the > > problem seems to be that write_inode_now(inode, true) does not result in > > I_DIRTY_TIME being cleared. > > > > Attached patch should fix this issue - it is compile-tested only. Dmitry, > > can you check whether this patch fixes the issue for you as well? > > I can't directly test it because crash happened very infrequently. > If Tejun/Ted agree that it is the right way to fix it, then I can > patch it, restart the fuzzer and leave it running for a while. Yes, please try out the patch. Thanks a lot.
On Mon, Jun 20, 2016 at 7:40 PM, Tejun Heo <tj@kernel.org> wrote: > Hello, > > On Mon, Jun 20, 2016 at 03:38:41PM +0200, Dmitry Vyukov wrote: >> > Sorry for the late reply but now when thinking about the patch I don't >> > think it is quite right. Writeback can happen from other contexts than just >> > the worker one (e.g. kswapd can do writeback, or in some out-of-memory >> > situations we may punt to doing writeback directly instead of calling the >> > worker, or sync(2) calls fdatawrite() for block device inode directly when >> > iterating through blockdev superblock). So flushing the workqueue IMHO is >> > not covering 100% of cases. > > Hmmm, yeah, the patch undoes what the cgroup writeback changes added > but it looks like the addition was exposing the existing problem more > rather than causing a new one. > >> > I wanted to suggest to use inode_wait_for_writeback() which will make sure >> > writeback is done with the inode. However we effectively already do this >> > by calling bdev_write_inode() which calls writeback_single_inode() in >> > WB_SYNC_ALL mode. So by the time that call completes we are sure writeback >> > code is not looking at the inode. *However* what I think is happening is >> > that sync_blockdev() writes all the dirty pages, bdev_write_inode() writes >> > the inode and clears all dirty bits, however the inode still stays in the >> > b_dirty / b_io list of the wb because it has I_DIRTY_TIME set. Subsequently >> > once flusher work runs, it finds the inode, looks at it and boom. And the >> > problem seems to be that write_inode_now(inode, true) does not result in >> > I_DIRTY_TIME being cleared. >> > >> > Attached patch should fix this issue - it is compile-tested only. Dmitry, >> > can you check whether this patch fixes the issue for you as well? >> >> I can't directly test it because crash happened very infrequently. >> If Tejun/Ted agree that it is the right way to fix it, then I can >> patch it, restart the fuzzer and leave it running for a while. > > Yes, please try out the patch. Done At least it should catch any new introduced bugs. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From 907aceb33f6b598f00c9b1d10a110c74e4a0f954 Mon Sep 17 00:00:00 2001 From: Jan Kara <jack@suse.cz> Date: Mon, 20 Jun 2016 15:13:42 +0200 Subject: [PATCH] writeback: Write dirty times for WB_SYNC_ALL writeback Currently we take care to handle I_DIRTY_TIME in vfs_fsync() and queue_io() so that inodes which have only dirty timestamps are properly written on fsync(2) and sync(2). However there are other call sites - most notably going through write_inode_now() - which expect inode to be clean after WB_SYNC_ALL writeback. This is not currently true as we do not clear I_DIRTY_TIME in __writeback_single_inode() even for WB_SYNC_ALL writeback in all the cases. This then resulted in the following oops because bdev_write_inode() did not clean the inode and writeback code later stumbled over a dirty inode with detached wb. general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN Modules linked in: CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Workqueue: writeback wb_workfn (flush-11:0) task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000 RIP: 0010:[<ffffffff818884d2>] [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750 RSP: 0018:ffff88006cdaf7d0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050 RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286 RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000 R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0 R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40 FS: 0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0 DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 Stack: ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948 ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948 ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e Call Trace: [< inline >] inode_to_wb_and_lock_list fs/fs-writeback.c:309 [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554 [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600 [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709 [< inline >] wb_do_writeback fs/fs-writeback.c:1844 [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884 [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094 [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228 [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303 [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392 Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e 00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42> 80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3 RIP [< inline >] wb_get include/linux/backing-dev-defs.h:212 RIP [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750 fs/fs-writeback.c:281 RSP <ffff88006cdaf7d0> ---[ end trace 986a4d314dcb2694 ]--- Fix the problem by making sure __writeback_single_inode() writes inode only with dirty times in WB_SYNC_ALL mode. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Jan Kara <jack@suse.cz> --- fs/fs-writeback.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 989a2cef6b76..51efb644410a 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1291,6 +1291,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc) dirty = inode->i_state & I_DIRTY; if (inode->i_state & I_DIRTY_TIME) { if ((dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) || + wbc->sync_mode == WB_SYNC_ALL || unlikely(inode->i_state & I_DIRTY_TIME_EXPIRED) || unlikely(time_after(jiffies, (inode->dirtied_time_when + -- 2.6.6