diff mbox

[resend,3/3] writeback: fix false positive WARN in __mark_inode_dirty

Message ID 20160104182016.24118.33718.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Dan Williams Jan. 4, 2016, 6:20 p.m. UTC
This warning was added as a debugging aid way back in commit
500b067c5e6c "writeback: check for registered bdi in flusher add and
inode dirty" when we were switching over to per-bdi writeback.

Once the block device has been torn down it's no longer useful to
complain about unregistered bdi's.  Clear the writeback capability under
the the wb->list_lock(), so that __mark_inode_dirty has no opportunity
to race bdi_unregister() to this WARN() condition.

Alternatively we could just delete the warning...

Found this while testing block device remove from underneath an active
fs triggering traces like:

 WARNING: CPU: 6 PID: 2129 at fs/fs-writeback.c:2065 __mark_inode_dirty+0x261/0x350()
 bdi-block not registered
 [..]
 Call Trace:
  [<ffffffff81459f02>] dump_stack+0x44/0x62
  [<ffffffff810a1f32>] warn_slowpath_common+0x82/0xc0
  [<ffffffff810a1fcc>] warn_slowpath_fmt+0x5c/0x80
  [<ffffffff812830b1>] __mark_inode_dirty+0x261/0x350
  [<ffffffff8126d019>] generic_update_time+0x79/0xd0
  [<ffffffff8126d19d>] file_update_time+0xbd/0x110
  [<ffffffff812e4ab8>] ext4_dax_fault+0x68/0x110
  [<ffffffff811f7f3e>] __do_fault+0x4e/0xf0
  [<ffffffff811fc077>] handle_mm_fault+0x5e7/0x1b50
  [<ffffffff811fbae1>] ? handle_mm_fault+0x51/0x1b50
  [<ffffffff81068981>] __do_page_fault+0x191/0x3f0
  [<ffffffff81068caf>] trace_do_page_fault+0x4f/0x120
  [<ffffffff810630fa>] do_async_page_fault+0x1a/0xa0
  [<ffffffff819023f8>] async_page_fault+0x28/0x30

Cc: Jan Kara <jack@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/backing-dev.c |    7 +++++++
 1 file changed, 7 insertions(+)


--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Dave Chinner Jan. 5, 2016, 4:23 a.m. UTC | #1
On Mon, Jan 04, 2016 at 10:20:16AM -0800, Dan Williams wrote:
> This warning was added as a debugging aid way back in commit
> 500b067c5e6c "writeback: check for registered bdi in flusher add and
> inode dirty" when we were switching over to per-bdi writeback.
> 
> Once the block device has been torn down it's no longer useful to
> complain about unregistered bdi's.  Clear the writeback capability under
> the the wb->list_lock(), so that __mark_inode_dirty has no opportunity
> to race bdi_unregister() to this WARN() condition.
> 
> Alternatively we could just delete the warning...

The warning is correct - the filesytem is trying to mark an inode
dirty on a device that can't do writeback anymore. Seems to me like
it is functioning as it should.

> Found this while testing block device remove from underneath an active
> fs triggering traces like:
> 
>  WARNING: CPU: 6 PID: 2129 at fs/fs-writeback.c:2065 __mark_inode_dirty+0x261/0x350()
>  bdi-block not registered
>  [..]
>  Call Trace:
>   [<ffffffff81459f02>] dump_stack+0x44/0x62
>   [<ffffffff810a1f32>] warn_slowpath_common+0x82/0xc0
>   [<ffffffff810a1fcc>] warn_slowpath_fmt+0x5c/0x80
>   [<ffffffff812830b1>] __mark_inode_dirty+0x261/0x350
>   [<ffffffff8126d019>] generic_update_time+0x79/0xd0
>   [<ffffffff8126d19d>] file_update_time+0xbd/0x110
>   [<ffffffff812e4ab8>] ext4_dax_fault+0x68/0x110
>   [<ffffffff811f7f3e>] __do_fault+0x4e/0xf0

This seems like the problem to me - you haven't implemented a
shutdown hook for ext4, and so it continues to allow page faults to
make progress after the device has been removed. The DAX fault
should have been failed before the filesystem gets to the point of
marking the inode dirty....

> +	/* tell __mark_inode_dirty that writeback is no longer possible */
> +	spin_lock(&wb->list_lock);
> +	wb->bdi->capabilities |= BDI_CAP_NO_WRITEBACK;
> +	spin_unlock(&wb->list_lock);
> +
>  	spin_unlock_bh(&wb->work_lock);

Is that lock ordering safe? i.e. it's inside a section using bh-safe
locking, which tends to imply that it can run from interrupt
contexts. Can we get something like

	spin_lock(&wb->list_lock);
	.....
	<irq>
	.....
	wb_shutdown
	spin_lock_bh(&wb->work_lock);
	spin_lock(&wb->list_lock);

Cheers,

Dave.
Dan Williams Jan. 5, 2016, 7:59 p.m. UTC | #2
On Mon, Jan 4, 2016 at 8:23 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Jan 04, 2016 at 10:20:16AM -0800, Dan Williams wrote:
>> This warning was added as a debugging aid way back in commit
>> 500b067c5e6c "writeback: check for registered bdi in flusher add and
>> inode dirty" when we were switching over to per-bdi writeback.
>>
>> Once the block device has been torn down it's no longer useful to
>> complain about unregistered bdi's.  Clear the writeback capability under
>> the the wb->list_lock(), so that __mark_inode_dirty has no opportunity
>> to race bdi_unregister() to this WARN() condition.
>>
>> Alternatively we could just delete the warning...
>
> The warning is correct - the filesytem is trying to mark an inode
> dirty on a device that can't do writeback anymore. Seems to me like
> it is functioning as it should.
>
>> Found this while testing block device remove from underneath an active
>> fs triggering traces like:
>>
>>  WARNING: CPU: 6 PID: 2129 at fs/fs-writeback.c:2065 __mark_inode_dirty+0x261/0x350()
>>  bdi-block not registered
>>  [..]
>>  Call Trace:
>>   [<ffffffff81459f02>] dump_stack+0x44/0x62
>>   [<ffffffff810a1f32>] warn_slowpath_common+0x82/0xc0
>>   [<ffffffff810a1fcc>] warn_slowpath_fmt+0x5c/0x80
>>   [<ffffffff812830b1>] __mark_inode_dirty+0x261/0x350
>>   [<ffffffff8126d019>] generic_update_time+0x79/0xd0
>>   [<ffffffff8126d19d>] file_update_time+0xbd/0x110
>>   [<ffffffff812e4ab8>] ext4_dax_fault+0x68/0x110
>>   [<ffffffff811f7f3e>] __do_fault+0x4e/0xf0
>
> This seems like the problem to me - you haven't implemented a
> shutdown hook for ext4, and so it continues to allow page faults to
> make progress after the device has been removed. The DAX fault
> should have been failed before the filesystem gets to the point of
> marking the inode dirty....

xfs doesn't check that the fs is still alive before calling
file_update_time().  Also, I don't think we need/want to sprinkle "is
fs alive?" checks to address this when the block device shutdown path
can simply turn off writeback.

>
>> +     /* tell __mark_inode_dirty that writeback is no longer possible */
>> +     spin_lock(&wb->list_lock);
>> +     wb->bdi->capabilities |= BDI_CAP_NO_WRITEBACK;
>> +     spin_unlock(&wb->list_lock);
>> +
>>       spin_unlock_bh(&wb->work_lock);
>
> Is that lock ordering safe? i.e. it's inside a section using bh-safe
> locking, which tends to imply that it can run from interrupt
> contexts. Can we get something like

This would be a problem if wb_shutdown() was called from softirq
context, but it is always called from process context.  So,
->list_lock doesn't currently need to be upgraded to spin_lock_bh and
lockdep will trigger if this situation changes.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Jan. 5, 2016, 9:10 p.m. UTC | #3
On Tue, Jan 05, 2016 at 11:59:41AM -0800, Dan Williams wrote:
> On Mon, Jan 4, 2016 at 8:23 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Jan 04, 2016 at 10:20:16AM -0800, Dan Williams wrote:
> >> This warning was added as a debugging aid way back in commit
> >> 500b067c5e6c "writeback: check for registered bdi in flusher add and
> >> inode dirty" when we were switching over to per-bdi writeback.
> >>
> >> Once the block device has been torn down it's no longer useful to
> >> complain about unregistered bdi's.  Clear the writeback capability under
> >> the the wb->list_lock(), so that __mark_inode_dirty has no opportunity
> >> to race bdi_unregister() to this WARN() condition.
> >>
> >> Alternatively we could just delete the warning...
> >
> > The warning is correct - the filesytem is trying to mark an inode
> > dirty on a device that can't do writeback anymore. Seems to me like
> > it is functioning as it should.
> >
> >> Found this while testing block device remove from underneath an active
> >> fs triggering traces like:
> >>
> >>  WARNING: CPU: 6 PID: 2129 at fs/fs-writeback.c:2065 __mark_inode_dirty+0x261/0x350()
> >>  bdi-block not registered
> >>  [..]
> >>  Call Trace:
> >>   [<ffffffff81459f02>] dump_stack+0x44/0x62
> >>   [<ffffffff810a1f32>] warn_slowpath_common+0x82/0xc0
> >>   [<ffffffff810a1fcc>] warn_slowpath_fmt+0x5c/0x80
> >>   [<ffffffff812830b1>] __mark_inode_dirty+0x261/0x350
> >>   [<ffffffff8126d019>] generic_update_time+0x79/0xd0
> >>   [<ffffffff8126d19d>] file_update_time+0xbd/0x110
> >>   [<ffffffff812e4ab8>] ext4_dax_fault+0x68/0x110
> >>   [<ffffffff811f7f3e>] __do_fault+0x4e/0xf0
> >
> > This seems like the problem to me - you haven't implemented a
> > shutdown hook for ext4, and so it continues to allow page faults to
> > make progress after the device has been removed. The DAX fault
> > should have been failed before the filesystem gets to the point of
> > marking the inode dirty....
> 
> xfs doesn't check that the fs is still alive before calling
> file_update_time().  Also, I don't think we need/want to sprinkle "is
> fs alive?" checks to address this when the block device shutdown path
> can simply turn off writeback.

That's because the timestamp update is aborted in XFS during
transaction commit because xfs_trans_commit has a "is the filesystem
shutdown" check to prevent situations like this occurring. i.e. XFS
has shutdown checks sprinkled all through it in strategic places to
ensure that shutdown does what it is supposed to and does not
trigger warnings in generic/VFS code.

If you've removed a device, those inodes can *never* be written
back, and hence marking new inodes dirty for writeback after a
shutdown is completely wrong. e.g. if ext4 has had some other fatal
corruption error, it will continue to allow timestamp updates and
inode writeback through this mechanism. That's a bug in the
filesystem error handling, not a bug in the writeback code.

Cheers,

Dave.
Dan Williams Jan. 5, 2016, 9:29 p.m. UTC | #4
On Tue, Jan 5, 2016 at 1:10 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, Jan 05, 2016 at 11:59:41AM -0800, Dan Williams wrote:
>> On Mon, Jan 4, 2016 at 8:23 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Mon, Jan 04, 2016 at 10:20:16AM -0800, Dan Williams wrote:
>> >> This warning was added as a debugging aid way back in commit
>> >> 500b067c5e6c "writeback: check for registered bdi in flusher add and
>> >> inode dirty" when we were switching over to per-bdi writeback.
>> >>
>> >> Once the block device has been torn down it's no longer useful to
>> >> complain about unregistered bdi's.  Clear the writeback capability under
>> >> the the wb->list_lock(), so that __mark_inode_dirty has no opportunity
>> >> to race bdi_unregister() to this WARN() condition.
>> >>
>> >> Alternatively we could just delete the warning...
>> >
>> > The warning is correct - the filesytem is trying to mark an inode
>> > dirty on a device that can't do writeback anymore. Seems to me like
>> > it is functioning as it should.
>> >
>> >> Found this while testing block device remove from underneath an active
>> >> fs triggering traces like:
>> >>
>> >>  WARNING: CPU: 6 PID: 2129 at fs/fs-writeback.c:2065 __mark_inode_dirty+0x261/0x350()
>> >>  bdi-block not registered
>> >>  [..]
>> >>  Call Trace:
>> >>   [<ffffffff81459f02>] dump_stack+0x44/0x62
>> >>   [<ffffffff810a1f32>] warn_slowpath_common+0x82/0xc0
>> >>   [<ffffffff810a1fcc>] warn_slowpath_fmt+0x5c/0x80
>> >>   [<ffffffff812830b1>] __mark_inode_dirty+0x261/0x350
>> >>   [<ffffffff8126d019>] generic_update_time+0x79/0xd0
>> >>   [<ffffffff8126d19d>] file_update_time+0xbd/0x110
>> >>   [<ffffffff812e4ab8>] ext4_dax_fault+0x68/0x110
>> >>   [<ffffffff811f7f3e>] __do_fault+0x4e/0xf0
>> >
>> > This seems like the problem to me - you haven't implemented a
>> > shutdown hook for ext4, and so it continues to allow page faults to
>> > make progress after the device has been removed. The DAX fault
>> > should have been failed before the filesystem gets to the point of
>> > marking the inode dirty....
>>
>> xfs doesn't check that the fs is still alive before calling
>> file_update_time().  Also, I don't think we need/want to sprinkle "is
>> fs alive?" checks to address this when the block device shutdown path
>> can simply turn off writeback.
>
> That's because the timestamp update is aborted in XFS during
> transaction commit because xfs_trans_commit has a "is the filesystem
> shutdown" check to prevent situations like this occurring. i.e. XFS
> has shutdown checks sprinkled all through it in strategic places to
> ensure that shutdown does what it is supposed to and does not
> trigger warnings in generic/VFS code.
>
> If you've removed a device, those inodes can *never* be written
> back, and hence marking new inodes dirty for writeback after a
> shutdown is completely wrong. e.g. if ext4 has had some other fatal
> corruption error, it will continue to allow timestamp updates and
> inode writeback through this mechanism. That's a bug in the
> filesystem error handling, not a bug in the writeback code.

True, in that sense the warning is serving a valid purpose to say
"FIXME, implement shutdown checks".  I'll drop this patch.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 7340353f8aea..9ea843bca709 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -343,10 +343,17 @@  static void wb_shutdown(struct bdi_writeback *wb)
 {
 	/* Make sure nobody queues further work */
 	spin_lock_bh(&wb->work_lock);
+
 	if (!test_and_clear_bit(WB_registered, &wb->state)) {
 		spin_unlock_bh(&wb->work_lock);
 		return;
 	}
+
+	/* tell __mark_inode_dirty that writeback is no longer possible */
+	spin_lock(&wb->list_lock);
+	wb->bdi->capabilities |= BDI_CAP_NO_WRITEBACK;
+	spin_unlock(&wb->list_lock);
+
 	spin_unlock_bh(&wb->work_lock);
 
 	/*