diff mbox series

[v3,2/3] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE

Message ID 20220812123727.46397-2-lczerner@redhat.com (mailing list archive)
State New, archived
Headers show
Series [v3,1/3] ext4: don't increase iversion counter for ea_inodes | expand

Commit Message

Lukas Czerner Aug. 12, 2022, 12:37 p.m. UTC
Currently the I_DIRTY_TIME will never get set if the inode already has
I_DIRTY_INODE with assumption that it supersedes I_DIRTY_TIME.  That's
true, however ext4 will only update the on-disk inode in
->dirty_inode(), not on actual writeback. As a result if the inode
already has I_DIRTY_INODE state by the time we get to
__mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled
into on-disk inode and will not get updated until the next I_DIRTY_INODE
update, which might never come if we crash or get a power failure.

The problem can be reproduced on ext4 by running xfstest generic/622
with -o iversion mount option.

Fix it by allowing I_DIRTY_TIME to be set even if the inode already has
I_DIRTY_INODE. Also make sure that the case is properly handled in
writeback_single_inode() as well. Additionally changes in
xfs_fs_dirty_inode() was made to accommodate for I_DIRTY_TIME in flag.

Thanks Jan Kara for suggestions on how to make this work properly.

Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Suggested-by: Jan Kara <jack@suse.cz>
---
v2: Reworked according to suggestions from Jan
v3: Update documentation, add comments, change flag to flags in
    xfs_fs_dirty_inode()

 Documentation/filesystems/vfs.rst |  2 ++
 fs/fs-writeback.c                 | 34 ++++++++++++++++++++-----------
 fs/xfs/xfs_super.c                | 12 +++++++++--
 include/linux/fs.h                |  9 ++++----
 4 files changed, 39 insertions(+), 18 deletions(-)

Comments

Eric Biggers Aug. 12, 2022, 6:01 p.m. UTC | #1
On Fri, Aug 12, 2022 at 02:37:26PM +0200, Lukas Czerner wrote:
> Fix it by allowing I_DIRTY_TIME to be set even if the inode already has
> I_DIRTY_INODE.

How can this be reconciled with the below code in __mark_inode_dirty(), which
this patch doesn't touch?

	/* I_DIRTY_INODE supersedes I_DIRTY_TIME. */
	flags &= ~I_DIRTY_TIME;

Also inode_is_dirtytime_only(), which I thought I mentioned before:

	/*
	 * Returns true if the given inode itself only has dirty timestamps (its pages
	 * may still be dirty) and isn't currently being allocated or freed.
	 * Filesystems should call this if when writing an inode when lazytime is
	 * enabled, they want to opportunistically write the timestamps of other inodes
	 * located very nearby on-disk, e.g. in the same inode block.  This returns true
	 * if the given inode is in need of such an opportunistic update.  Requires
	 * i_lock, or at least later re-checking under i_lock.
	 */
	static inline bool inode_is_dirtytime_only(struct inode *inode)
	{
		return (inode->i_state & (I_DIRTY_TIME | I_NEW |
					I_FREEING | I_WILL_FREE)) == I_DIRTY_TIME;
	}

- Eric
Eric Biggers Aug. 12, 2022, 6:12 p.m. UTC | #2
On Fri, Aug 12, 2022 at 02:37:26PM +0200, Lukas Czerner wrote:
> diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> index 6cd6953e175b..5d72b6ba4e63 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst
> @@ -274,6 +274,8 @@ or bottom half).
>  	This is specifically for the inode itself being marked dirty,
>  	not its data.  If the update needs to be persisted by fdatasync(),
>  	then I_DIRTY_DATASYNC will be set in the flags argument.
> +	If the inode has dirty timestamp and lazytime is enabled
> +	I_DIRTY_TIME will be set in the flags.

The new sentence is not always true, since with this patch if
__mark_inode_dirty(I_DIRTY_INODE) is called twice on an inode that has
I_DIRTY_TIME, the second call will no longer include I_DIRTY_TIME -- even though
the inode still has dirty timestamps.  Please be super clear about what the
flags actually mean -- I'm still struggling to understand this patch...

- Eric
Eric Biggers Aug. 12, 2022, 6:42 p.m. UTC | #3
On Fri, Aug 12, 2022 at 02:37:26PM +0200, Lukas Czerner wrote:
> Currently the I_DIRTY_TIME will never get set if the inode already has
> I_DIRTY_INODE with assumption that it supersedes I_DIRTY_TIME.  That's
> true, however ext4 will only update the on-disk inode in
> ->dirty_inode(), not on actual writeback. As a result if the inode
> already has I_DIRTY_INODE state by the time we get to
> __mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled
> into on-disk inode and will not get updated until the next I_DIRTY_INODE
> update, which might never come if we crash or get a power failure.
> 
> The problem can be reproduced on ext4 by running xfstest generic/622
> with -o iversion mount option.
> 
> Fix it by allowing I_DIRTY_TIME to be set even if the inode already has
> I_DIRTY_INODE. Also make sure that the case is properly handled in
> writeback_single_inode() as well. Additionally changes in
> xfs_fs_dirty_inode() was made to accommodate for I_DIRTY_TIME in flag.
> 
> Thanks Jan Kara for suggestions on how to make this work properly.
> 
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> Suggested-by: Jan Kara <jack@suse.cz>

Sorry for so many separate emails.  One more thought: isn't there a much more
straightforward fix to this bug that wouldn't require changing the semantics of
the inode flags: on __mark_inode_dirty(I_DIRTY_TIME), if the inode already has
i_state & I_DIRTY_INODE, just call ->dirty_inode with i_state & I_DIRTY_INODE?
That would fix the bug by making the filesystem update the on-disk inode.

Perhaps you aren't doing that in order to strictly maintain the semantics of
'lazytime', where timestamp updates are only persisted at certain times?  Is
this useful even in the short window of time that an inode is dirty?

- Eric
Jan Kara Aug. 16, 2022, 11:21 a.m. UTC | #4
On Fri 12-08-22 11:12:27, Eric Biggers wrote:
> On Fri, Aug 12, 2022 at 02:37:26PM +0200, Lukas Czerner wrote:
> > diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> > index 6cd6953e175b..5d72b6ba4e63 100644
> > --- a/Documentation/filesystems/vfs.rst
> > +++ b/Documentation/filesystems/vfs.rst
> > @@ -274,6 +274,8 @@ or bottom half).
> >  	This is specifically for the inode itself being marked dirty,
> >  	not its data.  If the update needs to be persisted by fdatasync(),
> >  	then I_DIRTY_DATASYNC will be set in the flags argument.
> > +	If the inode has dirty timestamp and lazytime is enabled
> > +	I_DIRTY_TIME will be set in the flags.
> 
> The new sentence is not always true, since with this patch if
> __mark_inode_dirty(I_DIRTY_INODE) is called twice on an inode that has
> I_DIRTY_TIME, the second call will no longer include I_DIRTY_TIME -- even though
> the inode still has dirty timestamps.  Please be super clear about what the
> flags actually mean -- I'm still struggling to understand this patch...

Let me chime in here because I was the one who suggested the solution to
Lukas. There are two different things (which is why this is confusing I
guess):

1) I_DIRTY_TIME in the inode->i_state should mean: struct inode has times
updated after we last called ->dirty_inode() callback. Hence
inode_is_dirtytime_only() as well as the chunk:
                /* I_DIRTY_INODE supersedes I_DIRTY_TIME. */
                flags &= ~I_DIRTY_TIME;
you mention in the previous email are compatible with this meaning AFAICT.

2) I_DIRTY_TIME flag passed to ->dirty_inode() callback. This is admittedly
bit of a hack. Currently XFS relies on the fact that the only time its
->dirty_inode() callback needs to do anything is when VFS decides it is
time to writeback timestamps and XFS detects this situation by checking for
I_DIRTY_TIME in inode->i_state. Now to fix the race, we need to first clear
I_DIRTY_TIME in inode->i_state and only then call the ->dirty_inode()
callback (otherwise timestamp update can get lost). So the solution I've
suggested was to propagate the information "timestamp update needed" to XFS
through I_DIRTY_TIME in flags passed to ->dirty_inode().

I hope things are clearer now.

								Honza
Jan Kara Aug. 16, 2022, 11:41 a.m. UTC | #5
On Fri 12-08-22 11:42:21, Eric Biggers wrote:
> On Fri, Aug 12, 2022 at 02:37:26PM +0200, Lukas Czerner wrote:
> > Currently the I_DIRTY_TIME will never get set if the inode already has
> > I_DIRTY_INODE with assumption that it supersedes I_DIRTY_TIME.  That's
> > true, however ext4 will only update the on-disk inode in
> > ->dirty_inode(), not on actual writeback. As a result if the inode
> > already has I_DIRTY_INODE state by the time we get to
> > __mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled
> > into on-disk inode and will not get updated until the next I_DIRTY_INODE
> > update, which might never come if we crash or get a power failure.
> > 
> > The problem can be reproduced on ext4 by running xfstest generic/622
> > with -o iversion mount option.
> > 
> > Fix it by allowing I_DIRTY_TIME to be set even if the inode already has
> > I_DIRTY_INODE. Also make sure that the case is properly handled in
> > writeback_single_inode() as well. Additionally changes in
> > xfs_fs_dirty_inode() was made to accommodate for I_DIRTY_TIME in flag.
> > 
> > Thanks Jan Kara for suggestions on how to make this work properly.
> > 
> > Cc: Dave Chinner <david@fromorbit.com>
> > Cc: Christoph Hellwig <hch@infradead.org>
> > Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> > Suggested-by: Jan Kara <jack@suse.cz>
> 
> Sorry for so many separate emails.  One more thought: isn't there a much more
> straightforward fix to this bug that wouldn't require changing the semantics of
> the inode flags: on __mark_inode_dirty(I_DIRTY_TIME), if the inode already has
> i_state & I_DIRTY_INODE, just call ->dirty_inode with i_state & I_DIRTY_INODE?
> That would fix the bug by making the filesystem update the on-disk inode.

This is a good question and I was also considering this when we first
discussed the problem with Lukas. It should fix the bug for ext4 but
eventually I've decided against this because XFS would still need something
else to fix the problem (see my previous email) and for ext4 it seemed as
unnecessary overhead (see below).

> Perhaps you aren't doing that in order to strictly maintain the semantics of
> 'lazytime', where timestamp updates are only persisted at certain times?  Is
> this useful even in the short window of time that an inode is dirty?

The result of this for ext4 will be that after the inode is dirtied for
some reason, we will be logging every timestamp update to the journal
(effectively disabling lazytime for the inode) for the 30s time window that
the inode stays dirty before writeback code decides to do writeback (which
just results in clearing the I_DIRTY_INODE flag for ext4). Not too bad I
guess but I'd prefer to avoid this overhead.

								Honza
Christoph Hellwig Aug. 21, 2022, 6:14 a.m. UTC | #6
On Tue, Aug 16, 2022 at 01:21:24PM +0200, Jan Kara wrote:
> 2) I_DIRTY_TIME flag passed to ->dirty_inode() callback. This is admittedly
> bit of a hack. Currently XFS relies on the fact that the only time its
> ->dirty_inode() callback needs to do anything is when VFS decides it is
> time to writeback timestamps and XFS detects this situation by checking for
> I_DIRTY_TIME in inode->i_state. Now to fix the race, we need to first clear
> I_DIRTY_TIME in inode->i_state and only then call the ->dirty_inode()
> callback (otherwise timestamp update can get lost). So the solution I've
> suggested was to propagate the information "timestamp update needed" to XFS
> through I_DIRTY_TIME in flags passed to ->dirty_inode().

Maybe we should just add a separate update_lazy_time method to make this
a little more clear?
Jan Kara Aug. 22, 2022, 8:33 a.m. UTC | #7
On Sat 20-08-22 23:14:37, Christoph Hellwig wrote:
> On Tue, Aug 16, 2022 at 01:21:24PM +0200, Jan Kara wrote:
> > 2) I_DIRTY_TIME flag passed to ->dirty_inode() callback. This is admittedly
> > bit of a hack. Currently XFS relies on the fact that the only time its
> > ->dirty_inode() callback needs to do anything is when VFS decides it is
> > time to writeback timestamps and XFS detects this situation by checking for
> > I_DIRTY_TIME in inode->i_state. Now to fix the race, we need to first clear
> > I_DIRTY_TIME in inode->i_state and only then call the ->dirty_inode()
> > callback (otherwise timestamp update can get lost). So the solution I've
> > suggested was to propagate the information "timestamp update needed" to XFS
> > through I_DIRTY_TIME in flags passed to ->dirty_inode().
> 
> Maybe we should just add a separate update_lazy_time method to make this
> a little more clear?

Yes, we could do that if people prefer this. Although I'd say that good
documentation at the place in __mark_inode_dirty() where this gets used and
in documentation of .dirty_inode might clear the confusion as well.

								Honza
diff mbox series

Patch

diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 6cd6953e175b..5d72b6ba4e63 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -274,6 +274,8 @@  or bottom half).
 	This is specifically for the inode itself being marked dirty,
 	not its data.  If the update needs to be persisted by fdatasync(),
 	then I_DIRTY_DATASYNC will be set in the flags argument.
+	If the inode has dirty timestamp and lazytime is enabled
+	I_DIRTY_TIME will be set in the flags.
 
 ``write_inode``
 	this method is called when the VFS needs to write an inode to
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 05221366a16d..638dbf143727 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1718,9 +1718,14 @@  static int writeback_single_inode(struct inode *inode,
 	 */
 	if (!(inode->i_state & I_DIRTY_ALL))
 		inode_cgwb_move_to_attached(inode, wb);
-	else if (!(inode->i_state & I_SYNC_QUEUED) &&
-		 (inode->i_state & I_DIRTY))
-		redirty_tail_locked(inode, wb);
+	else if (!(inode->i_state & I_SYNC_QUEUED)) {
+		if ((inode->i_state & I_DIRTY))
+			redirty_tail_locked(inode, wb);
+		else if (inode->i_state & I_DIRTY_TIME) {
+			inode->dirtied_when = jiffies;
+			inode_io_list_move_locked(inode, wb, &wb->b_dirty_time);
+		}
+	}
 
 	spin_unlock(&wb->list_lock);
 	inode_sync_complete(inode);
@@ -2369,6 +2374,17 @@  void __mark_inode_dirty(struct inode *inode, int flags)
 	trace_writeback_mark_inode_dirty(inode, flags);
 
 	if (flags & I_DIRTY_INODE) {
+
+		/* Inode timestamp update will piggback on this dirtying */
+		if (inode->i_state & I_DIRTY_TIME) {
+			spin_lock(&inode->i_lock);
+			if (inode->i_state & I_DIRTY_TIME) {
+				inode->i_state &= ~I_DIRTY_TIME;
+				flags |= I_DIRTY_TIME;
+			}
+			spin_unlock(&inode->i_lock);
+		}
+
 		/*
 		 * Notify the filesystem about the inode being dirtied, so that
 		 * (if needed) it can update on-disk fields and journal the
@@ -2378,7 +2394,8 @@  void __mark_inode_dirty(struct inode *inode, int flags)
 		 */
 		trace_writeback_dirty_inode_start(inode, flags);
 		if (sb->s_op->dirty_inode)
-			sb->s_op->dirty_inode(inode, flags & I_DIRTY_INODE);
+			sb->s_op->dirty_inode(inode,
+				flags & (I_DIRTY_INODE | I_DIRTY_TIME));
 		trace_writeback_dirty_inode(inode, flags);
 
 		/* I_DIRTY_INODE supersedes I_DIRTY_TIME. */
@@ -2399,21 +2416,15 @@  void __mark_inode_dirty(struct inode *inode, int flags)
 	 */
 	smp_mb();
 
-	if (((inode->i_state & flags) == flags) ||
-	    (dirtytime && (inode->i_state & I_DIRTY_INODE)))
+	if ((inode->i_state & flags) == flags)
 		return;
 
 	spin_lock(&inode->i_lock);
-	if (dirtytime && (inode->i_state & I_DIRTY_INODE))
-		goto out_unlock_inode;
 	if ((inode->i_state & flags) != flags) {
 		const int was_dirty = inode->i_state & I_DIRTY;
 
 		inode_attach_wb(inode, NULL);
 
-		/* I_DIRTY_INODE supersedes I_DIRTY_TIME. */
-		if (flags & I_DIRTY_INODE)
-			inode->i_state &= ~I_DIRTY_TIME;
 		inode->i_state |= flags;
 
 		/*
@@ -2486,7 +2497,6 @@  void __mark_inode_dirty(struct inode *inode, int flags)
 out_unlock:
 	if (wb)
 		spin_unlock(&wb->list_lock);
-out_unlock_inode:
 	spin_unlock(&inode->i_lock);
 }
 EXPORT_SYMBOL(__mark_inode_dirty);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 9ac59814bbb6..13efc77a1e79 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -653,7 +653,7 @@  xfs_fs_destroy_inode(
 static void
 xfs_fs_dirty_inode(
 	struct inode			*inode,
-	int				flag)
+	int				flags)
 {
 	struct xfs_inode		*ip = XFS_I(inode);
 	struct xfs_mount		*mp = ip->i_mount;
@@ -661,7 +661,15 @@  xfs_fs_dirty_inode(
 
 	if (!(inode->i_sb->s_flags & SB_LAZYTIME))
 		return;
-	if (flag != I_DIRTY_SYNC || !(inode->i_state & I_DIRTY_TIME))
+
+	/*
+	 * Only do the timestamp update if the inode is dirty (I_DIRTY_SYNC)
+	 * and has dirty timestamp (I_DIRTY_TIME). I_DIRTY_TIME can be either
+	 * already set in i_state, or passed in flags possibly together with
+	 * I_DIRTY_SYNC.
+	 */
+	if ((flags & ~I_DIRTY_TIME) != I_DIRTY_SYNC ||
+	    !((inode->i_state | flags) & I_DIRTY_TIME))
 		return;
 
 	if (xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5113f65c786f..e220d26d771a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2376,13 +2376,14 @@  static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
  *			don't have to write inode on fdatasync() when only
  *			e.g. the timestamps have changed.
  * I_DIRTY_PAGES	Inode has dirty pages.  Inode itself may be clean.
- * I_DIRTY_TIME		The inode itself only has dirty timestamps, and the
+ * I_DIRTY_TIME		The inode itself has dirty timestamps, and the
  *			lazytime mount option is enabled.  We keep track of this
  *			separately from I_DIRTY_SYNC in order to implement
  *			lazytime.  This gets cleared if I_DIRTY_INODE
- *			(I_DIRTY_SYNC and/or I_DIRTY_DATASYNC) gets set.  I.e.
- *			either I_DIRTY_TIME *or* I_DIRTY_INODE can be set in
- *			i_state, but not both.  I_DIRTY_PAGES may still be set.
+ *			(I_DIRTY_SYNC and/or I_DIRTY_DATASYNC) gets set. But
+ *			I_DIRTY_TIME can still be set if I_DIRTY_SYNC is already
+ *			in place because writeback might already be in progress
+ *			and we don't want to lose the time update
  * I_NEW		Serves as both a mutex and completion notification.
  *			New inodes set I_NEW.  If two processes both create
  *			the same inode, one of them will release its inode and