[2/2] xfs: run blockgc on freeze to avoid iget stalls after reclaim

Message ID 20220113133701.629593-3-bfoster@redhat.com (mailing list archive)
State Superseded, archived
Series xfs: a couple misc/small deferred inactivation tweaks

Commit Message

Brian Foster Jan. 13, 2022, 1:37 p.m. UTC
We've had reports on distro (pre-deferred inactivation) kernels that
inode reclaim (i.e. via drop_caches) can deadlock on the s_umount
lock when invoked on a frozen XFS fs. This occurs because
drop_caches acquires the lock and then blocks in xfs_inactive() on
transaction alloc for an inode that requires an eofb trim. unfreeze
then blocks on the same lock and the fs is deadlocked.

With deferred inactivation, the deadlock problem is no longer
present because ->destroy_inode() no longer blocks whether the fs is
frozen or not. There is still unfortunate behavior in that lookups
of a pending inactive inode spin loop waiting for the pending
inactive state to clear, which won't happen until the fs is
unfrozen. This was always possible to some degree, but is
potentially amplified by the fact that reclaim no longer blocks on
the first inode that requires inactivation work. Instead, we
populate the inactivation queues indefinitely. The side effect can
be observed easily by invoking drop_caches on a frozen fs previously
populated with eofb and/or cowblocks inodes and then running
anything that relies on inode lookup (i.e., ls).

To mitigate this behavior, invoke internal blockgc reclaim during
the freeze sequence to guarantee that inode eviction doesn't lead to
this state due to eofb or cowblocks inodes. This is similar to
current behavior on read-only remount. Since the deadlock issue was
present for such a long time, also document the subtle
->destroy_inode() constraint to avoid unintentional reintroduction
of the deadlock problem in the future.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_super.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

Comments

Darrick J. Wong Jan. 13, 2022, 5:13 p.m. UTC | #1
On Thu, Jan 13, 2022 at 08:37:01AM -0500, Brian Foster wrote:
> We've had reports on distro (pre-deferred inactivation) kernels that
> inode reclaim (i.e. via drop_caches) can deadlock on the s_umount
> lock when invoked on a frozen XFS fs. This occurs because
> drop_caches acquires the lock

Eww, I hadn't even noticed drop_caches as a way into an s_umount
deadlock.  Good catch!

> and then blocks in xfs_inactive() on
> transaction alloc for an inode that requires an eofb trim. unfreeze
> then blocks on the same lock and the fs is deadlocked.
> 
> With deferred inactivation, the deadlock problem is no longer
> present because ->destroy_inode() no longer blocks whether the fs is
> frozen or not. There is still unfortunate behavior in that lookups
> of a pending inactive inode spin loop waiting for the pending
> inactive state to clear, which won't happen until the fs is
> unfrozen. This was always possible to some degree, but is
> potentially amplified by the fact that reclaim no longer blocks on
> the first inode that requires inactivation work. Instead, we
> populate the inactivation queues indefinitely. The side effect can
> be observed easily by invoking drop_caches on a frozen fs previously
> populated with eofb and/or cowblocks inodes and then running
> anything that relies on inode lookup (i.e., ls).
> 
> To mitigate this behavior, invoke internal blockgc reclaim during
> the freeze sequence to guarantee that inode eviction doesn't lead to
> this state due to eofb or cowblocks inodes. This is similar to
> current behavior on read-only remount. Since the deadlock issue was
> present for such a long time, also document the subtle
> ->destroy_inode() constraint to avoid unintentional reintroduction
> of the deadlock problem in the future.

Yay for improved documentation. :)

> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_super.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index c7ac486ca5d3..1d0f87e47fa4 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -623,8 +623,13 @@ xfs_fs_alloc_inode(
>  }
>  
>  /*
> - * Now that the generic code is guaranteed not to be accessing
> - * the linux inode, we can inactivate and reclaim the inode.
> + * Now that the generic code is guaranteed not to be accessing the inode, we can
> + * inactivate and reclaim it.
> + *
> + * NOTE: ->destroy_inode() can be called (with ->s_umount held) while the
> + * filesystem is frozen. Therefore it is generally unsafe to attempt transaction
> + * allocation in this context. A transaction alloc that blocks on frozen state
> + * from a context with ->s_umount held will deadlock with unfreeze.
>   */
>  STATIC void
>  xfs_fs_destroy_inode(
> @@ -764,6 +769,16 @@ xfs_fs_sync_fs(
>  	 * when the state is either SB_FREEZE_FS or SB_FREEZE_COMPLETE.
>  	 */
>  	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
> +		struct xfs_icwalk	icw = {0};
> +
> +		/*
> +		 * Clear out eofb and cowblocks inodes so eviction while frozen
> +		 * doesn't leave them sitting in the inactivation queue where
> +		 * they cannot be processed.

Would you mind adding an explicit link in the comment between needing to
get /all/ the inodes and _FLAG_SYNC?

"We must process every cached inode, so this requires a synchronous
cache scan."

> +		 */
> +		icw.icw_flags = XFS_ICWALK_FLAG_SYNC;
> +		xfs_blockgc_free_space(mp, &icw);

This needs to check the return value, right?

--D

> +
>  		xfs_inodegc_stop(mp);
>  		xfs_blockgc_stop(mp);
>  	}
> -- 
> 2.31.1
>
Brian Foster Jan. 13, 2022, 7:58 p.m. UTC | #2
On Thu, Jan 13, 2022 at 09:13:47AM -0800, Darrick J. Wong wrote:
> On Thu, Jan 13, 2022 at 08:37:01AM -0500, Brian Foster wrote:
> [...]
> > @@ -764,6 +769,16 @@ xfs_fs_sync_fs(
> >  	 * when the state is either SB_FREEZE_FS or SB_FREEZE_COMPLETE.
> >  	 */
> >  	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
> > +		struct xfs_icwalk	icw = {0};
> > +
> > +		/*
> > +		 * Clear out eofb and cowblocks inodes so eviction while frozen
> > +		 * doesn't leave them sitting in the inactivation queue where
> > +		 * they cannot be processed.
> 
> Would you mind adding an explicit link in the comment between needing to
> get /all/ the inodes and _FLAG_SYNC?
> 
> "We must process every cached inode, so this requires a synchronous
> cache scan."
> 

I changed it to the following to hopefully make it more descriptive
without making it longer:

                /*
                 * Run a sync blockgc scan to reclaim all eof and cow blocks so
                 * eviction while frozen doesn't leave inodes sitting in the
                 * inactivation queue where they cannot be processed.
                 */

> > +		 */
> > +		icw.icw_flags = XFS_ICWALK_FLAG_SYNC;
> > +		xfs_blockgc_free_space(mp, &icw);
> 
> This needs to check the return value, right?
> 

What do you want to do with the return value? It looks to me like
nothing actually checks the return value of ->sync_fs(); freeze_super()
calls sync_filesystem() and that doesn't, at least. That suggests the fs
is going to freeze regardless, so we probably don't want to bail out of
here early. We could warn on error or something and then hand it up the
stack anyway... Hm?

Brian

Darrick J. Wong Jan. 13, 2022, 8:43 p.m. UTC | #3
On Thu, Jan 13, 2022 at 02:58:59PM -0500, Brian Foster wrote:
> On Thu, Jan 13, 2022 at 09:13:47AM -0800, Darrick J. Wong wrote:
> > On Thu, Jan 13, 2022 at 08:37:01AM -0500, Brian Foster wrote:
> > [...]
> > 
> > Would you mind adding an explicit link in the comment between needing to
> > get /all/ the inodes and _FLAG_SYNC?
> > 
> > "We must process every cached inode, so this requires a synchronous
> > cache scan."
> > 
> 
> I changed it to the following to hopefully make it more descriptive
> without making it longer:
> 
>                 /*
>                  * Run a sync blockgc scan to reclaim all eof and cow blocks so
>                  * eviction while frozen doesn't leave inodes sitting in the
>                  * inactivation queue where they cannot be processed.
>                  */

Works for me.

> > > +		 */
> > > +		icw.icw_flags = XFS_ICWALK_FLAG_SYNC;
> > > +		xfs_blockgc_free_space(mp, &icw);
> > 
> > This needs to check the return value, right?
> > 
> 
> What do you want to do with the return value? It looks to me like
> nothing actually checks the return value of ->sync_fs(); freeze_super()
> calls sync_filesystem() and that doesn't, at least. That suggests the fs
> is going to freeze regardless, so we probably don't want to bail out of
> here early. We could warn on error or something and then hand it up the
> stack anyway... Hm?

Lovely....

$ git grep -- '->sync_fs('
fs/quota/dquot.c:694:           sb->s_op->sync_fs(sb, 1);
fs/quota/dquot.c:2262:          sb->s_op->sync_fs(sb, 1);
fs/sync.c:56:           sb->s_op->sync_fs(sb, 0);
fs/sync.c:63:           sb->s_op->sync_fs(sb, 1);
fs/sync.c:78:           sb->s_op->sync_fs(sb, *(int *)arg);

Indeed, nobody checks the return value.  Let me do some spelunking...

...ok, so ->sync_fs was introduced in 2.5.52:

https://elixir.bootlin.com/linux/v2.5.52/source/include/linux/fs.h#L814

and everybody has ignored the return code since then, despite syncfs(2)
(which /does/ have a return value) being introduced in 2.6.39.  As you
point out, fsfreeze also ignores the return value, which seems suspect
to me.

I /think/ the correct solution here is to fix the entire syncfs ->
sync_filesystem -> ->sync_fs() path to return error codes; fix fsfreeze
to abort if sync_filesystem returns an error; fix xfs_fs_reconfigure to
stop ignoring the return value when remounting; and then apply this
patch.

However, seeing how vfs debates tend to drag on, I'd be willing to
accept this patch if on error it would force_shutdown the filesystem
(and a third patch containing the xfs_fs_reconfigure fix), and a second
series to fix the vfs and remove that shutdown crutch.

How does that sound?
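(A rough sketch of the interim approach proposed here, for illustration only: this is not a tested patch, and the shutdown reason code is a guess. The structure follows the hunk under discussion.)

```c
	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
		struct xfs_icwalk	icw = {0};
		int			error;

		/*
		 * Run a sync blockgc scan to reclaim all eof and cow blocks
		 * so eviction while frozen doesn't leave inodes sitting in
		 * the inactivation queue where they cannot be processed.
		 */
		icw.icw_flags = XFS_ICWALK_FLAG_SYNC;
		error = xfs_blockgc_free_space(mp, &icw);
		if (error) {
			/*
			 * Crutch until the VFS propagates ->sync_fs() errors:
			 * shut the fs down rather than complete the freeze
			 * with the scan having failed.
			 */
			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
			return error;
		}

		xfs_inodegc_stop(mp);
		xfs_blockgc_stop(mp);
	}
```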

--D

Darrick J. Wong Jan. 13, 2022, 9:01 p.m. UTC | #4
On Thu, Jan 13, 2022 at 12:43:34PM -0800, Darrick J. Wong wrote:
> On Thu, Jan 13, 2022 at 02:58:59PM -0500, Brian Foster wrote:
> [...]
> 
> I /think/ the correct solution here is to fix the entire syncfs ->
> sync_filesystem -> ->sync_fs() path to return error codes; fix fsfreeze
> to abort if sync_filesystem returns an error; fix xfs_fs_reconfigure to
> stop ignoring the return value when remounting; and then apply this
> patch.
> 
> However, seeing how vfs debates tend to drag on, I'd be willing to
> accept this patch if on error it would force_shutdown the filesystem
> (and a third patch containing the xfs_fs_reconfigure fix), and a second
> series to fix the vfs and remove that shutdown crutch.
> 
> How does that sound?

...and now that I've done a more thorough check of the entire call
stack, it looks like the xfs_log_force call in xfs_fs_sync_fs also
drops the error code, so let's just leave this second patch as it is now
(i.e. your submission plus the comment change) and I'll put out a
separate series to fix the vfs and xfs_fs_sync_fs later.

--D

Dave Chinner Jan. 13, 2022, 10:38 p.m. UTC | #5
On Thu, Jan 13, 2022 at 08:37:01AM -0500, Brian Foster wrote:
> [...]
> @@ -764,6 +769,16 @@ xfs_fs_sync_fs(
>  	 * when the state is either SB_FREEZE_FS or SB_FREEZE_COMPLETE.
>  	 */
>  	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
> +		struct xfs_icwalk	icw = {0};
> +
> +		/*
> +		 * Clear out eofb and cowblocks inodes so eviction while frozen
> +		 * doesn't leave them sitting in the inactivation queue where
> +		 * they cannot be processed.
> +		 */
> +		icw.icw_flags = XFS_ICWALK_FLAG_SYNC;
> +		xfs_blockgc_free_space(mp, &icw);

Is a SYNC walk safe to run here? I know we run
xfs_blockgc_free_space() from XFS_IOC_FREE_EOFBLOCKS under
SB_FREEZE_WRITE protection, but here we have both frozen writes and
page faults we're running in a much more constrained freeze context
here.

i.e. the SYNC walk will keep busy looping if it can't get the
IOLOCK_EXCL on an inode that is in cache, so if we end up with an
inode locked and blocked on SB_FREEZE_WRITE or SB_FREEZE_PAGEFAULT
for whatever reason this will never return....

Cheers,

Dave.
Darrick J. Wong Jan. 14, 2022, 5:35 p.m. UTC | #6
On Fri, Jan 14, 2022 at 09:38:10AM +1100, Dave Chinner wrote:
> On Thu, Jan 13, 2022 at 08:37:01AM -0500, Brian Foster wrote:
> > [...]
> > @@ -764,6 +769,16 @@ xfs_fs_sync_fs(
> >  	 * when the state is either SB_FREEZE_FS or SB_FREEZE_COMPLETE.
> >  	 */
> >  	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
> > +		struct xfs_icwalk	icw = {0};
> > +
> > +		/*
> > +		 * Clear out eofb and cowblocks inodes so eviction while frozen
> > +		 * doesn't leave them sitting in the inactivation queue where
> > +		 * they cannot be processed.
> > +		 */
> > +		icw.icw_flags = XFS_ICWALK_FLAG_SYNC;
> > +		xfs_blockgc_free_space(mp, &icw);
> 
> Is a SYNC walk safe to run here? I know we run
> xfs_blockgc_free_space() from XFS_IOC_FREE_EOFBLOCKS under
> SB_FREEZE_WRITE protection, but here we have both frozen writes and
> page faults we're running in a much more constrained freeze context
> here.
> 
> i.e. the SYNC walk will keep busy looping if it can't get the
> IOLOCK_EXCL on an inode that is in cache, so if we end up with an
> inode locked and blocked on SB_FREEZE_WRITE or SB_FREEZE_PAGEFAULT
> for whatever reason this will never return....

Are you referring to the case where one could be read()ing from a file
into a buffer that's really a mmap'd page from another file while the
underlying fs is being frozen?

Also, I added this second patch and fstests runtime went up by 30%.
ISTR Dave commenting that freeze time would go way up when I submitted a
patch to clean out the cow blocks a few years ago.

Also also looking through the archives[1], Brian once commented that
cleaning up all this stuff should be done /if/ one decides to mount the
frozen-snapshot writable at some later point in time.

Maybe this means we ought to find a way to remove inodes from the percpu
inactivation lists?  iget used to be able to pry inodes out of deferred
inactivation...

[1] https://lore.kernel.org/linux-xfs/20190117181406.GF37591@bfoster/

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Brian Foster Jan. 14, 2022, 7:45 p.m. UTC | #7
On Fri, Jan 14, 2022 at 09:35:35AM -0800, Darrick J. Wong wrote:
> On Fri, Jan 14, 2022 at 09:38:10AM +1100, Dave Chinner wrote:
> > On Thu, Jan 13, 2022 at 08:37:01AM -0500, Brian Foster wrote:
> > > We've had reports on distro (pre-deferred inactivation) kernels that
> > > inode reclaim (i.e. via drop_caches) can deadlock on the s_umount
> > > lock when invoked on a frozen XFS fs. This occurs because
> > > drop_caches acquires the lock and then blocks in xfs_inactive() on
> > > transaction alloc for an inode that requires an eofb trim. unfreeze
> > > then blocks on the same lock and the fs is deadlocked.
> > > 
> > > With deferred inactivation, the deadlock problem is no longer
> > > present because ->destroy_inode() no longer blocks whether the fs is
> > > frozen or not. There is still unfortunate behavior in that lookups
> > > of a pending inactive inode spin loop waiting for the pending
> > > inactive state to clear, which won't happen until the fs is
> > > unfrozen. This was always possible to some degree, but is
> > > potentially amplified by the fact that reclaim no longer blocks on
> > > the first inode that requires inactivation work. Instead, we
> > > populate the inactivation queues indefinitely. The side effect can
> > > be observed easily by invoking drop_caches on a frozen fs previously
> > > populated with eofb and/or cowblocks inodes and then running
> > > anything that relies on inode lookup (i.e., ls).
> > > 
> > > To mitigate this behavior, invoke internal blockgc reclaim during
> > > the freeze sequence to guarantee that inode eviction doesn't lead to
> > > this state due to eofb or cowblocks inodes. This is similar to
> > > current behavior on read-only remount. Since the deadlock issue was
> > > present for such a long time, also document the subtle
> > > ->destroy_inode() constraint to avoid unintentional reintroduction
> > > of the deadlock problem in the future.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/xfs_super.c | 19 +++++++++++++++++--
> > >  1 file changed, 17 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > index c7ac486ca5d3..1d0f87e47fa4 100644
> > > --- a/fs/xfs/xfs_super.c
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -623,8 +623,13 @@ xfs_fs_alloc_inode(
> > >  }
> > >  
> > >  /*
> > > - * Now that the generic code is guaranteed not to be accessing
> > > - * the linux inode, we can inactivate and reclaim the inode.
> > > + * Now that the generic code is guaranteed not to be accessing the inode, we can
> > > + * inactivate and reclaim it.
> > > + *
> > > + * NOTE: ->destroy_inode() can be called (with ->s_umount held) while the
> > > + * filesystem is frozen. Therefore it is generally unsafe to attempt transaction
> > > + * allocation in this context. A transaction alloc that blocks on frozen state
> > > + * from a context with ->s_umount held will deadlock with unfreeze.
> > >   */
> > >  STATIC void
> > >  xfs_fs_destroy_inode(
> > > @@ -764,6 +769,16 @@ xfs_fs_sync_fs(
> > >  	 * when the state is either SB_FREEZE_FS or SB_FREEZE_COMPLETE.
> > >  	 */
> > >  	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
> > > +		struct xfs_icwalk	icw = {0};
> > > +
> > > +		/*
> > > +		 * Clear out eofb and cowblocks inodes so eviction while frozen
> > > +		 * doesn't leave them sitting in the inactivation queue where
> > > +		 * they cannot be processed.
> > > +		 */
> > > +		icw.icw_flags = XFS_ICWALK_FLAG_SYNC;
> > > +		xfs_blockgc_free_space(mp, &icw);
> > 
> > Is a SYNC walk safe to run here? I know we run
> > xfs_blockgc_free_space() from XFS_IOC_FREE_EOFBLOCKS under
> > SB_FREEZE_WRITE protection, but here we have both frozen writes and
> > page faults, so we're running in a much more constrained freeze
> > context here.
> > 
> > i.e. the SYNC walk will keep busy looping if it can't get the
> > IOLOCK_EXCL on an inode that is in cache, so if we end up with an
> > inode locked and blocked on SB_FREEZE_WRITE or SB_FREEZE_PAGEFAULT
> > for whatever reason this will never return....
> 
> Are you referring to the case where one could be read()ing from a file
> into a buffer that's really a mmap'd page from another file while the
> underlying fs is being frozen?
> 

I thought this was generally safe as freeze protection sits outside of
the locks, but I'm not terribly sure about the read to a mapped buffer
case. If that allows an iolock holder to block on a pagefault due to
freeze, then SB_FREEZE_PAGEFAULT might be too late for a synchronous
scan (i.e. assuming nothing blocks this earlier or prefaults/pins the
target buffer)..?

> Also, I added this second patch and fstests runtime went up by 30%.
> ISTR Dave commenting that freeze time would go way up when I submitted a
> patch to clean out the cow blocks a few years ago.
> 

Do you have any more detailed data on this? I.e., is this an increase
across the board? A smaller set of tests doing many freezes with a
significant runtime increase?

I'm a little on the fence about this because personally, I'm not
terribly concerned about the performance of a single freeze call. At the
same time, I could see this adding up across hundreds of cycles or
whatever via a test or test suite, and that being potentially annoying
to devs/testers.

> Also also looking through the archives[1], Brian once commented that
> cleaning up all this stuff should be done /if/ one decides to mount the
> frozen-snapshot writable at some later point in time.
> 

That kind of sounds like the tradeoff of impacting the current/active fs
for the benefit of a snapshot that may or may not be used. If not, then
it's a waste of time. If so, it might be beneficial for the snap to more
accurately reflect the "eventual" state of the original. For cowblocks
perhaps it doesn't matter if the mount/recovery will scan and reclaim.
I'm not as sure for eofblocks; wouldn't the snap persist all that
"intended to be transient" speculative prealloc until/unless said files
are reopened/closed?

> Maybe this means we ought to find a way to remove inodes from the percpu
> inactivation lists?  iget used to be able to pry inodes out of deferred
> inactivation...
> 

Seems a reasonable option. Presumably that mitigates the lookup stalling
behavior without the need for any additional scanning work at freeze
time (and maybe eliminates the need for an inodegc flush too), but is
neutral wrt some of the other tradeoffs (like the above). I think the
former is the big question wrt deferred inactivation whereas the
latter has been the case forever.

BTW, should we care at all about somebody dropping the entire cached
working set (worst case) onto these queues if the fs is frozen? Maybe
not if we have to cycle through all these inodes anyways for a full
working set eviction, and presumably a large evictor task (i.e.
drop_caches) will minimize the percpu queue distribution...

Brian

> [1] https://lore.kernel.org/linux-xfs/20190117181406.GF37591@bfoster/
> 
> --D
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
>
Darrick J. Wong Jan. 14, 2022, 9:30 p.m. UTC | #8
On Fri, Jan 14, 2022 at 02:45:10PM -0500, Brian Foster wrote:
> On Fri, Jan 14, 2022 at 09:35:35AM -0800, Darrick J. Wong wrote:
> > On Fri, Jan 14, 2022 at 09:38:10AM +1100, Dave Chinner wrote:
> > > On Thu, Jan 13, 2022 at 08:37:01AM -0500, Brian Foster wrote:
> > > > We've had reports on distro (pre-deferred inactivation) kernels that
> > > > inode reclaim (i.e. via drop_caches) can deadlock on the s_umount
> > > > lock when invoked on a frozen XFS fs. This occurs because
> > > > drop_caches acquires the lock and then blocks in xfs_inactive() on
> > > > transaction alloc for an inode that requires an eofb trim. unfreeze
> > > > then blocks on the same lock and the fs is deadlocked.
> > > > 
> > > > With deferred inactivation, the deadlock problem is no longer
> > > > present because ->destroy_inode() no longer blocks whether the fs is
> > > > frozen or not. There is still unfortunate behavior in that lookups
> > > > of a pending inactive inode spin loop waiting for the pending
> > > > inactive state to clear, which won't happen until the fs is
> > > > unfrozen. This was always possible to some degree, but is
> > > > potentially amplified by the fact that reclaim no longer blocks on
> > > > the first inode that requires inactivation work. Instead, we
> > > > populate the inactivation queues indefinitely. The side effect can
> > > > be observed easily by invoking drop_caches on a frozen fs previously
> > > > populated with eofb and/or cowblocks inodes and then running
> > > > anything that relies on inode lookup (i.e., ls).
> > > > 
> > > > To mitigate this behavior, invoke internal blockgc reclaim during
> > > > the freeze sequence to guarantee that inode eviction doesn't lead to
> > > > this state due to eofb or cowblocks inodes. This is similar to
> > > > current behavior on read-only remount. Since the deadlock issue was
> > > > present for such a long time, also document the subtle
> > > > ->destroy_inode() constraint to avoid unintentional reintroduction
> > > > of the deadlock problem in the future.
> > > > 
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > ---
> > > >  fs/xfs/xfs_super.c | 19 +++++++++++++++++--
> > > >  1 file changed, 17 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > > index c7ac486ca5d3..1d0f87e47fa4 100644
> > > > --- a/fs/xfs/xfs_super.c
> > > > +++ b/fs/xfs/xfs_super.c
> > > > @@ -623,8 +623,13 @@ xfs_fs_alloc_inode(
> > > >  }
> > > >  
> > > >  /*
> > > > - * Now that the generic code is guaranteed not to be accessing
> > > > - * the linux inode, we can inactivate and reclaim the inode.
> > > > + * Now that the generic code is guaranteed not to be accessing the inode, we can
> > > > + * inactivate and reclaim it.
> > > > + *
> > > > + * NOTE: ->destroy_inode() can be called (with ->s_umount held) while the
> > > > + * filesystem is frozen. Therefore it is generally unsafe to attempt transaction
> > > > + * allocation in this context. A transaction alloc that blocks on frozen state
> > > > + * from a context with ->s_umount held will deadlock with unfreeze.
> > > >   */
> > > >  STATIC void
> > > >  xfs_fs_destroy_inode(
> > > > @@ -764,6 +769,16 @@ xfs_fs_sync_fs(
> > > >  	 * when the state is either SB_FREEZE_FS or SB_FREEZE_COMPLETE.
> > > >  	 */
> > > >  	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
> > > > +		struct xfs_icwalk	icw = {0};
> > > > +
> > > > +		/*
> > > > +		 * Clear out eofb and cowblocks inodes so eviction while frozen
> > > > +		 * doesn't leave them sitting in the inactivation queue where
> > > > +		 * they cannot be processed.
> > > > +		 */
> > > > +		icw.icw_flags = XFS_ICWALK_FLAG_SYNC;
> > > > +		xfs_blockgc_free_space(mp, &icw);
> > > 
> > > Is a SYNC walk safe to run here? I know we run
> > > xfs_blockgc_free_space() from XFS_IOC_FREE_EOFBLOCKS under
> > > SB_FREEZE_WRITE protection, but here we have both frozen writes and
> > > page faults, so we're running in a much more constrained freeze
> > > context here.
> > > 
> > > i.e. the SYNC walk will keep busy looping if it can't get the
> > > IOLOCK_EXCL on an inode that is in cache, so if we end up with an
> > > inode locked and blocked on SB_FREEZE_WRITE or SB_FREEZE_PAGEFAULT
> > > for whatever reason this will never return....
> > 
> > Are you referring to the case where one could be read()ing from a file
> > into a buffer that's really a mmap'd page from another file while the
> > underlying fs is being frozen?
> > 
> 
> I thought this was generally safe as freeze protection sits outside of
> the locks, but I'm not terribly sure about the read to a mapped buffer
> case. If that allows an iolock holder to block on a pagefault due to
> freeze, then SB_FREEZE_PAGEFAULT might be too late for a synchronous
> scan (i.e. assuming nothing blocks this earlier or prefaults/pins the
> target buffer)..?

I think so.  xfs_file_buffered_read takes IOLOCK_SHARED and calls
filemap_read, which calls copy_page_to_iter.  I /think/ the messy iovec
code calls copyout, which can then hit a write page fault, which takes
us to __xfs_filemap_fault.  That takes SB_PAGEFAULT, which is the
opposite order of what now goes on during a freeze.
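
That call chain can be sketched roughly as follows (an illustrative
annotation of the paragraph above, not verbatim kernel source;
intermediate layers elided):

	xfs_file_buffered_read()      takes IOLOCK_SHARED
	  filemap_read()
	    copy_page_to_iter()
	      copyout()               user buffer is an mmap'd page
	        <write page fault>
	          __xfs_filemap_fault()
	            sb_start_pagefault()  blocks on SB_FREEZE_PAGEFAULT
	                                  while IOLOCK_SHARED is still held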

> > Also, I added this second patch and fstests runtime went up by 30%.
> > ISTR Dave commenting that freeze time would go way up when I submitted a
> > patch to clean out the cow blocks a few years ago.
> > 
> 
> Do you have any more detailed data on this? I.e., is this an increase
> across the board? A smaller set of tests doing many freezes with a
> significant runtime increase?

I think it's constrained to tests that run freeze and thaw in a loop,
but the increases in those tests are large.

Here's what I see when I run all the tests that mention 'freeze' before
applying the patch:

generic/068      46s
generic/085      9s
generic/280      2s
generic/311      53s
generic/390      3s
generic/459      15s
generic/491      2s
xfs/006  8s
xfs/011  20s
xfs/119  6s
xfs/264  13s
xfs/297  42s
xfs/318  2s
xfs/325  2s
xfs/438  4s
xfs/517  46s

And here's after:

generic/068      47s
generic/085      17s
generic/280      4s
generic/311      81s
generic/390      4s
generic/459      14s
generic/491      2s
xfs/006  9s
xfs/011  21s
xfs/119  11s
xfs/264  18s
xfs/297  31s
xfs/318  3s
xfs/325  2s
xfs/438  5s
xfs/517  46s

Most of those tests run in more or less the same amount of time, except
for generic/085, generic/311, xfs/119, xfs/264, and xfs/297.  Of those,
they all freeze repeatedly except for xfs/119.

I would imagine that the same thing is going on with tests that touch
device-mapper, since a dm suspend also freezes the fs, but I didn't
check those tests all that thoroughly, since most of the dmerror /
dmflakey tests only switch the dm table once or twice per test.

> I'm a little on the fence about this because personally, I'm not
> terribly concerned about the performance of a single freeze call. At the
> same time, I could see this adding up across hundreds of cycles or
> whatever via a test or test suite, and that being potentially annoying
> to devs/testers.

Well... yeah.  The net effect on djwong-dev is that a -g all test run
went from 3.6h to 4.8h.  It was less bad for tip (2.8 to 3h).

> > Also also looking through the archives[1], Brian once commented that
> > cleaning up all this stuff should be done /if/ one decides to mount the
> > frozen-snapshot writable at some later point in time.
> > 
> 
> That kind of sounds like the tradeoff of impacting the current/active fs
> for the benefit of a snapshot that may or may not be used. If not, then
> it's a waste of time. If so, it might be beneficial for the snap to more
> accurately reflect the "eventual" state of the original. For cowblocks
> perhaps it doesn't matter if the mount/recovery will scan and reclaim.
> > I'm not as sure for eofblocks; wouldn't the snap persist all that
> "intended to be transient" speculative prealloc until/unless said files
> are reopened/closed?

Yes, that is the current behavior. :)

I don't know if it's really optimal (it's at least lazy :P) and Eric has
tried to shift the balance away from "xfs snapshots take forever to
mount".

> > Maybe this means we ought to find a way to remove inodes from the percpu
> > inactivation lists?  iget used to be able to pry inodes out of deferred
> > inactivation...
> > 
> 
> Seems a reasonable option. Presumably that mitigates the lookup stalling
> behavior without the need for any additional scanning work at freeze
> time (and maybe eliminates the need for an inodegc flush too), but is
> neutral wrt some of the other tradeoffs (like the above). I think the
> > former is the big question wrt deferred inactivation whereas the
> latter has been the case forever.
> 
> BTW, should we care at all about somebody dropping the entire cached
> working set (worst case) onto these queues if the fs is frozen? Maybe
> not if we have to cycle through all these inodes anyways for a full
> working set eviction, and presumably a large evictor task (i.e.
> drop_caches) will minimize the percpu queue distribution...

I've wondered myself if we really need to dump quite so many inodes onto
the inactivation queue while we're frozen.  For posteof blocks we could
just leave them attached and hope that inactivation eventually gets
them, though that has the unfortunate side effect that space can
disappear into the depths.

COW mappings are attached incore to the inode cow fork but the refcount
btree ondisk, so in principle for frozen inodes we could move the cow
mappings to an internal bitmap, let the inode go, and free the abandoned
cow blocks at thaw time.

Unlinked inodes with no active refcount can of course block in the queue
forever; userspace can't have those back.

--D

> 
> Brian
> 
> > [1] https://lore.kernel.org/linux-xfs/20190117181406.GF37591@bfoster/
> > 
> > --D
> > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
> > 
>
Darrick J. Wong Jan. 15, 2022, 4:09 a.m. UTC | #9
On Fri, Jan 14, 2022 at 01:30:43PM -0800, Darrick J. Wong wrote:
> On Fri, Jan 14, 2022 at 02:45:10PM -0500, Brian Foster wrote:
> > On Fri, Jan 14, 2022 at 09:35:35AM -0800, Darrick J. Wong wrote:
> > > On Fri, Jan 14, 2022 at 09:38:10AM +1100, Dave Chinner wrote:
> > > > On Thu, Jan 13, 2022 at 08:37:01AM -0500, Brian Foster wrote:
> > > > > We've had reports on distro (pre-deferred inactivation) kernels that
> > > > > inode reclaim (i.e. via drop_caches) can deadlock on the s_umount
> > > > > lock when invoked on a frozen XFS fs. This occurs because
> > > > > drop_caches acquires the lock and then blocks in xfs_inactive() on
> > > > > transaction alloc for an inode that requires an eofb trim. unfreeze
> > > > > then blocks on the same lock and the fs is deadlocked.
> > > > > 
> > > > > With deferred inactivation, the deadlock problem is no longer
> > > > > present because ->destroy_inode() no longer blocks whether the fs is
> > > > > frozen or not. There is still unfortunate behavior in that lookups
> > > > > of a pending inactive inode spin loop waiting for the pending
> > > > > inactive state to clear, which won't happen until the fs is
> > > > > unfrozen. This was always possible to some degree, but is
> > > > > potentially amplified by the fact that reclaim no longer blocks on
> > > > > the first inode that requires inactivation work. Instead, we
> > > > > populate the inactivation queues indefinitely. The side effect can
> > > > > be observed easily by invoking drop_caches on a frozen fs previously
> > > > > populated with eofb and/or cowblocks inodes and then running
> > > > > anything that relies on inode lookup (i.e., ls).
> > > > > 
> > > > > To mitigate this behavior, invoke internal blockgc reclaim during
> > > > > the freeze sequence to guarantee that inode eviction doesn't lead to
> > > > > this state due to eofb or cowblocks inodes. This is similar to
> > > > > current behavior on read-only remount. Since the deadlock issue was
> > > > > present for such a long time, also document the subtle
> > > > > ->destroy_inode() constraint to avoid unintentional reintroduction
> > > > > of the deadlock problem in the future.
> > > > > 
> > > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > > ---
> > > > >  fs/xfs/xfs_super.c | 19 +++++++++++++++++--
> > > > >  1 file changed, 17 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > > > index c7ac486ca5d3..1d0f87e47fa4 100644
> > > > > --- a/fs/xfs/xfs_super.c
> > > > > +++ b/fs/xfs/xfs_super.c
> > > > > @@ -623,8 +623,13 @@ xfs_fs_alloc_inode(
> > > > >  }
> > > > >  
> > > > >  /*
> > > > > - * Now that the generic code is guaranteed not to be accessing
> > > > > - * the linux inode, we can inactivate and reclaim the inode.
> > > > > + * Now that the generic code is guaranteed not to be accessing the inode, we can
> > > > > + * inactivate and reclaim it.
> > > > > + *
> > > > > + * NOTE: ->destroy_inode() can be called (with ->s_umount held) while the
> > > > > + * filesystem is frozen. Therefore it is generally unsafe to attempt transaction
> > > > > + * allocation in this context. A transaction alloc that blocks on frozen state
> > > > > + * from a context with ->s_umount held will deadlock with unfreeze.
> > > > >   */
> > > > >  STATIC void
> > > > >  xfs_fs_destroy_inode(
> > > > > @@ -764,6 +769,16 @@ xfs_fs_sync_fs(
> > > > >  	 * when the state is either SB_FREEZE_FS or SB_FREEZE_COMPLETE.
> > > > >  	 */
> > > > >  	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
> > > > > +		struct xfs_icwalk	icw = {0};
> > > > > +
> > > > > +		/*
> > > > > +		 * Clear out eofb and cowblocks inodes so eviction while frozen
> > > > > +		 * doesn't leave them sitting in the inactivation queue where
> > > > > +		 * they cannot be processed.
> > > > > +		 */
> > > > > +		icw.icw_flags = XFS_ICWALK_FLAG_SYNC;
> > > > > +		xfs_blockgc_free_space(mp, &icw);
> > > > 
> > > > Is a SYNC walk safe to run here? I know we run
> > > > xfs_blockgc_free_space() from XFS_IOC_FREE_EOFBLOCKS under
> > > > SB_FREEZE_WRITE protection, but here we have both frozen writes and
> > > > page faults, so we're running in a much more constrained freeze
> > > > context here.
> > > > 
> > > > i.e. the SYNC walk will keep busy looping if it can't get the
> > > > IOLOCK_EXCL on an inode that is in cache, so if we end up with an
> > > > inode locked and blocked on SB_FREEZE_WRITE or SB_FREEZE_PAGEFAULT
> > > > for whatever reason this will never return....
> > > 
> > > Are you referring to the case where one could be read()ing from a file
> > > into a buffer that's really a mmap'd page from another file while the
> > > underlying fs is being frozen?
> > > 
> > 
> > I thought this was generally safe as freeze protection sits outside of
> > the locks, but I'm not terribly sure about the read to a mapped buffer
> > case. If that allows an iolock holder to block on a pagefault due to
> > freeze, then SB_FREEZE_PAGEFAULT might be too late for a synchronous
> > scan (i.e. assuming nothing blocks this earlier or prefaults/pins the
> > target buffer)..?
> 
> I think so.  xfs_file_buffered_read takes IOLOCK_SHARED and calls
> filemap_read, which calls copy_page_to_iter.  I /think/ the messy iovec
> code calls copyout, which can then hit a write page fault, which takes
> us to __xfs_filemap_fault.  That takes SB_PAGEFAULT, which is the
> opposite order of what now goes on during a freeze.
> 
> > > Also, I added this second patch and fstests runtime went up by 30%.
> > > ISTR Dave commenting that freeze time would go way up when I submitted a
> > > patch to clean out the cow blocks a few years ago.
> > > 
> > 
> > Do you have any more detailed data on this? I.e., is this an increase
> > across the board? A smaller set of tests doing many freezes with a
> > significant runtime increase?
> 
> I think it's constrained to tests that run freeze and thaw in a loop,
> but the increases in those tests are large.
> 
> Here's what I see when I run all the tests that mention 'freeze' before
> applying the patch:
> 
> generic/068      46s
> generic/085      9s
> generic/280      2s
> generic/311      53s
> generic/390      3s
> generic/459      15s
> generic/491      2s
> xfs/006  8s
> xfs/011  20s
> xfs/119  6s
> xfs/264  13s
> xfs/297  42s
> xfs/318  2s
> xfs/325  2s
> xfs/438  4s
> xfs/517  46s
> 
> And here's after:
> 
> generic/068      47s
> generic/085      17s
> generic/280      4s
> generic/311      81s
> generic/390      4s
> generic/459      14s
> generic/491      2s
> xfs/006  9s
> xfs/011  21s
> xfs/119  11s
> xfs/264  18s
> xfs/297  31s
> xfs/318  3s
> xfs/325  2s
> xfs/438  5s
> xfs/517  46s
> 
> Most of those tests run in more or less the same amount of time, except
> for generic/085, generic/311, xfs/119, xfs/264, and xfs/297.  Of those,
> they all freeze repeatedly except for xfs/119.
> 
> I would imagine that the same thing is going on with tests that touch
> device-mapper, since a dm suspend also freezes the fs, but I didn't
> check those tests all that thoroughly, since most of the dmerror /
> dmflakey tests only switch the dm table once or twice per test.
> 
> > I'm a little on the fence about this because personally, I'm not
> > terribly concerned about the performance of a single freeze call. At the
> > same time, I could see this adding up across hundreds of cycles or
> > whatever via a test or test suite, and that being potentially annoying
> > to devs/testers.
> 
> Well... yeah.  The net effect on djwong-dev is that a -g all test run
> went from 3.6h to 4.8h.  It was less bad for tip (2.8 to 3h).

...and just to undercut my own statements, it seems that CONFIG_LOCKDEP
got turned on somehow between a kernel from a couple of days ago and
when I rebuilt it with the new patches.  The overall fstests slowdowns
are more on the level of 3.6 to 3.7h and 2.8 to 2.9h, though the runtime
increases of the fsfreeze loop tests are about the same.

--D

> > > Also also looking through the archives[1], Brian once commented that
> > > cleaning up all this stuff should be done /if/ one decides to mount the
> > > frozen-snapshot writable at some later point in time.
> > > 
> > 
> > That kind of sounds like the tradeoff of impacting the current/active fs
> > for the benefit of a snapshot that may or may not be used. If not, then
> > it's a waste of time. If so, it might be beneficial for the snap to more
> > accurately reflect the "eventual" state of the original. For cowblocks
> > perhaps it doesn't matter if the mount/recovery will scan and reclaim.
> > I'm not as sure for eofblocks; wouldn't the snap persist all that
> > "intended to be transient" speculative prealloc until/unless said files
> > are reopened/closed?
> 
> Yes, that is the current behavior. :)
> 
> I don't know if it's really optimal (it's at least lazy :P) and Eric has
> tried to shift the balance away from "xfs snapshots take forever to
> mount".
> 
> > > Maybe this means we ought to find a way to remove inodes from the percpu
> > > inactivation lists?  iget used to be able to pry inodes out of deferred
> > > inactivation...
> > > 
> > 
> > Seems a reasonable option. Presumably that mitigates the lookup stalling
> > behavior without the need for any additional scanning work at freeze
> > time (and maybe eliminates the need for an inodegc flush too), but is
> > neutral wrt some of the other tradeoffs (like the above). I think the
> > former is the big question wrt deferred inactivation whereas the
> > latter has been the case forever.
> > 
> > BTW, should we care at all about somebody dropping the entire cached
> > working set (worst case) onto these queues if the fs is frozen? Maybe
> > not if we have to cycle through all these inodes anyways for a full
> > working set eviction, and presumably a large evictor task (i.e.
> > drop_caches) will minimize the percpu queue distribution...
> 
> I've wondered myself if we really need to dump quite so many inodes onto
> the inactivation queue while we're frozen.  For posteof blocks we could
> just leave them attached and hope that inactivation eventually gets
> them, though that has the unfortunate side effect that space can
> disappear into the depths.
> 
> COW mappings are attached incore to the inode cow fork but the refcount
> btree ondisk, so in principle for frozen inodes we could move the cow
> mappings to an internal bitmap, let the inode go, and free the abandoned
> cow blocks at thaw time.
> 
> Unlinked inodes with no active refcount can of course block in the queue
> forever; userspace can't have those back.
> 
> --D
> 
> > 
> > Brian
> > 
> > > [1] https://lore.kernel.org/linux-xfs/20190117181406.GF37591@bfoster/
> > > 
> > > --D
> > > 
> > > > Cheers,
> > > > 
> > > > Dave.
> > > > -- 
> > > > Dave Chinner
> > > > david@fromorbit.com
> > > 
> >
Dave Chinner Jan. 15, 2022, 10:40 p.m. UTC | #10
On Fri, Jan 14, 2022 at 01:30:43PM -0800, Darrick J. Wong wrote:
> On Fri, Jan 14, 2022 at 02:45:10PM -0500, Brian Foster wrote:
> > On Fri, Jan 14, 2022 at 09:35:35AM -0800, Darrick J. Wong wrote:
> > > On Fri, Jan 14, 2022 at 09:38:10AM +1100, Dave Chinner wrote:
> > > > On Thu, Jan 13, 2022 at 08:37:01AM -0500, Brian Foster wrote:
> > > > >  STATIC void
> > > > >  xfs_fs_destroy_inode(
> > > > > @@ -764,6 +769,16 @@ xfs_fs_sync_fs(
> > > > >  	 * when the state is either SB_FREEZE_FS or SB_FREEZE_COMPLETE.
> > > > >  	 */
> > > > >  	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
> > > > > +		struct xfs_icwalk	icw = {0};
> > > > > +
> > > > > +		/*
> > > > > +		 * Clear out eofb and cowblocks inodes so eviction while frozen
> > > > > +		 * doesn't leave them sitting in the inactivation queue where
> > > > > +		 * they cannot be processed.
> > > > > +		 */
> > > > > +		icw.icw_flags = XFS_ICWALK_FLAG_SYNC;
> > > > > +		xfs_blockgc_free_space(mp, &icw);
> > > > 
> > > > Is a SYNC walk safe to run here? I know we run
> > > > xfs_blockgc_free_space() from XFS_IOC_FREE_EOFBLOCKS under
> > > > SB_FREEZE_WRITE protection, but here we have both frozen writes and
> > > > page faults, so we're running in a much more constrained freeze
> > > > context here.
> > > > 
> > > > i.e. the SYNC walk will keep busy looping if it can't get the
> > > > IOLOCK_EXCL on an inode that is in cache, so if we end up with an
> > > > inode locked and blocked on SB_FREEZE_WRITE or SB_FREEZE_PAGEFAULT
> > > > for whatever reason this will never return....
> > > 
> > > Are you referring to the case where one could be read()ing from a file
> > > into a buffer that's really a mmap'd page from another file while the
> > > underlying fs is being frozen?
> > > 
> > 
> > I thought this was generally safe as freeze protection sits outside of
> > the locks, but I'm not terribly sure about the read to a mapped buffer
> > case. If that allows an iolock holder to block on a pagefault due to
> > freeze, then SB_FREEZE_PAGEFAULT might be too late for a synchronous
> > scan (i.e. assuming nothing blocks this earlier or prefaults/pins the
> > target buffer)..?
> 
> I think so.  xfs_file_buffered_read takes IOLOCK_SHARED and calls
> filemap_read, which calls copy_page_to_iter.  I /think/ the messy iovec
> code calls copyout, which can then hit a write page fault, which takes
> us to __xfs_filemap_fault.  That takes SB_PAGEFAULT, which is the
> opposite order of what now goes on during a freeze.

Yeah, so this was the sort of thing I was concerned about. I
couldn't put my finger on an actual case, but the general locking
situation felt wrong, which is why I asked the question.

That is, normal case for a -write- operation is:

	Prevent write freeze (shared lock!)
	take IO lock (shared or exclusive)
	<page fault>
	Prevent fault freeze (shared)
	take MMAP lock
	do allocation
	Prevent internal freeze (shared)
	run transaction

Which means the stack cannot happen unless a freeze is completely
locked out at the first level and it has to wait for the write
operation to complete before proceeding. So for modification
operations, doing:

	lock out write freeze (excl lock)
	lock out fault freeze (excl lock)
	take IOLOCK excl

would be safe.

However, in the case of a read operation, we don't have that initial
FREEZE_WRITE protection, so the first level an operation might hit
is the page fault freeze protection. Hence we now get inversion
on the write freeze/IOLOCK/fault freeze because we have IOLOCK/fault
freeze without any write freeze protection, so this can happen:

task 1				freeze
				freeze write
				.....
				freeze page fault
read()
IOLOCK_SHARED
<page fault>
sb_start_pagefault()
<blocks on fault freeze>
				xfs_blockgc_free_space(SYNC)
				try IOLOCK_EXCL
				<busy loops on failed IOLOCK_EXCL>

So the sync walk here doesn't seem safe. A non-blocking, best-effort
walk would be fine, but busy-looping on inodes we can't lock looks
to be a deadlock vector.

ISTR that we recently went through a similar readdir vs page fault
locking discussion, so it's not just read() file IO that could
trigger the page fault freeze first with an unprotected IOLOCK
already held.

Cheers,

Dave.
Brian Foster Jan. 17, 2022, 1:37 p.m. UTC | #11
On Fri, Jan 14, 2022 at 01:30:43PM -0800, Darrick J. Wong wrote:
> On Fri, Jan 14, 2022 at 02:45:10PM -0500, Brian Foster wrote:
> > On Fri, Jan 14, 2022 at 09:35:35AM -0800, Darrick J. Wong wrote:
> > > On Fri, Jan 14, 2022 at 09:38:10AM +1100, Dave Chinner wrote:
> > > > On Thu, Jan 13, 2022 at 08:37:01AM -0500, Brian Foster wrote:
> > > > > We've had reports on distro (pre-deferred inactivation) kernels that
> > > > > inode reclaim (i.e. via drop_caches) can deadlock on the s_umount
> > > > > lock when invoked on a frozen XFS fs. This occurs because
> > > > > drop_caches acquires the lock and then blocks in xfs_inactive() on
> > > > > transaction alloc for an inode that requires an eofb trim. unfreeze
> > > > > then blocks on the same lock and the fs is deadlocked.
> > > > > 
> > > > > With deferred inactivation, the deadlock problem is no longer
> > > > > present because ->destroy_inode() no longer blocks whether the fs is
> > > > > frozen or not. There is still unfortunate behavior in that lookups
> > > > > of a pending inactive inode spin loop waiting for the pending
> > > > > inactive state to clear, which won't happen until the fs is
> > > > > unfrozen. This was always possible to some degree, but is
> > > > > potentially amplified by the fact that reclaim no longer blocks on
> > > > > the first inode that requires inactivation work. Instead, we
> > > > > populate the inactivation queues indefinitely. The side effect can
> > > > > be observed easily by invoking drop_caches on a frozen fs previously
> > > > > populated with eofb and/or cowblocks inodes and then running
> > > > > anything that relies on inode lookup (i.e., ls).
> > > > > 
> > > > > To mitigate this behavior, invoke internal blockgc reclaim during
> > > > > the freeze sequence to guarantee that inode eviction doesn't lead to
> > > > > this state due to eofb or cowblocks inodes. This is similar to
> > > > > current behavior on read-only remount. Since the deadlock issue was
> > > > > present for such a long time, also document the subtle
> > > > > ->destroy_inode() constraint to avoid unintentional reintroduction
> > > > > of the deadlock problem in the future.
> > > > > 
> > > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > > ---
> > > > >  fs/xfs/xfs_super.c | 19 +++++++++++++++++--
> > > > >  1 file changed, 17 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > > > index c7ac486ca5d3..1d0f87e47fa4 100644
> > > > > --- a/fs/xfs/xfs_super.c
> > > > > +++ b/fs/xfs/xfs_super.c
> > > > > @@ -623,8 +623,13 @@ xfs_fs_alloc_inode(
> > > > >  }
> > > > >  
> > > > >  /*
> > > > > - * Now that the generic code is guaranteed not to be accessing
> > > > > - * the linux inode, we can inactivate and reclaim the inode.
> > > > > + * Now that the generic code is guaranteed not to be accessing the inode, we can
> > > > > + * inactivate and reclaim it.
> > > > > + *
> > > > > + * NOTE: ->destroy_inode() can be called (with ->s_umount held) while the
> > > > > + * filesystem is frozen. Therefore it is generally unsafe to attempt transaction
> > > > > + * allocation in this context. A transaction alloc that blocks on frozen state
> > > > > + * from a context with ->s_umount held will deadlock with unfreeze.
> > > > >   */
> > > > >  STATIC void
> > > > >  xfs_fs_destroy_inode(
> > > > > @@ -764,6 +769,16 @@ xfs_fs_sync_fs(
> > > > >  	 * when the state is either SB_FREEZE_FS or SB_FREEZE_COMPLETE.
> > > > >  	 */
> > > > >  	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
> > > > > +		struct xfs_icwalk	icw = {0};
> > > > > +
> > > > > +		/*
> > > > > +		 * Clear out eofb and cowblocks inodes so eviction while frozen
> > > > > +		 * doesn't leave them sitting in the inactivation queue where
> > > > > +		 * they cannot be processed.
> > > > > +		 */
> > > > > +		icw.icw_flags = XFS_ICWALK_FLAG_SYNC;
> > > > > +		xfs_blockgc_free_space(mp, &icw);
> > > > 
> > > > Is a SYNC walk safe to run here? I know we run
> > > > xfs_blockgc_free_space() from XFS_IOC_FREE_EOFBLOCKS under
> > > > SB_FREEZE_WRITE protection, but here we have both writes and page
> > > > faults frozen; we're running in a much more constrained freeze
> > > > context here.
> > > > 
> > > > i.e. the SYNC walk will keep busy looping if it can't get the
> > > > IOLOCK_EXCL on an inode that is in cache, so if we end up with an
> > > > inode locked and blocked on SB_FREEZE_WRITE or SB_FREEZE_PAGEFAULT
> > > > for whatever reason this will never return....
> > > 
> > > Are you referring to the case where one could be read()ing from a file
> > > into a buffer that's really a mmap'd page from another file while the
> > > underlying fs is being frozen?
> > > 
> > 
> > I thought this was generally safe as freeze protection sits outside of
> > the locks, but I'm not terribly sure about the read to a mapped buffer
> > case. If that allows an iolock holder to block on a pagefault due to
> > freeze, then SB_FREEZE_PAGEFAULT might be too late for a synchronous
> > scan (i.e. assuming nothing blocks this earlier or prefaults/pins the
> > target buffer)..?
> 
> I think so.  xfs_file_buffered_read takes IOLOCK_SHARED and calls
> filemap_read, which calls copy_page_to_iter.  I /think/ the messy iovec
> code calls copyout, which can then hit a write page fault, which takes
> us to __xfs_filemap_fault.  That takes SB_PAGEFAULT, which is the
> opposite order of what now goes on during a freeze.
> 
> > > Also, I added this second patch and fstests runtime went up by 30%.
> > > ISTR Dave commenting that freeze time would go way up when I submitted a
> > > patch to clean out the cow blocks a few years ago.
> > > 
> > 
> > Do you have any more detailed data on this? I.e., is this an increase
> > across the board? A smaller set of tests doing many freezes with a
> > significant runtime increase?
> 
> I think it's constrained to tests that run freeze and thaw in a loop,
> but the increases in those tests are large.
> 
> Here's what I see when I run all the tests that mention 'freeze' before
> applying the patch:
> 
> generic/068      46s
> generic/085      9s
> generic/280      2s
> generic/311      53s
> generic/390      3s
> generic/459      15s
> generic/491      2s
> xfs/006  8s
> xfs/011  20s
> xfs/119  6s
> xfs/264  13s
> xfs/297  42s
> xfs/318  2s
> xfs/325  2s
> xfs/438  4s
> xfs/517  46s
> 
> And here's after:
> 
> generic/068      47s
> generic/085      17s
> generic/280      4s
> generic/311      81s
> generic/390      4s
> generic/459      14s
> generic/491      2s
> xfs/006  9s
> xfs/011  21s
> xfs/119  11s
> xfs/264  18s
> xfs/297  31s
> xfs/318  3s
> xfs/325  2s
> xfs/438  5s
> xfs/517  46s
> 
> Most of those tests run in more or less the same amount of time, except
> for generic/085, generic/311, xfs/119, xfs/264, and xfs/297.  Of those,
> they all freeze repeatedly except for xfs/119.
> 
> I would imagine that the same thing is going on with tests that touch
> device-mapper, since a dm suspend also freezes the fs, but I didn't
> check those tests all that thoroughly, since most of the dmerror /
> dmflakey tests only switch the dm table once or twice per test.
> 

I think the test performance is more likely to be impacted when there's a
combination of freeze and some workload that results in the blockgc scan
having to do work. Of course there will be some impact even with the
extra call, but that and your followup results that factor out lockdep
seem a much more reasonable impact to me.

The way I see it, we can always optimize looping tests that become too
slow and it's not exactly like tests are designed for runtime efficiency
in the first place. I feel like I run into new tests all the time that
don't really consider the broader runtime impact and whether they do
more work than really necessary. Unless there's some
immediate/unforeseen/disruptive change (like your initial numbers seemed
to suggest), this doesn't seem like a primary concern to me.

> > I'm a little on the fence about this because personally, I'm not
> > terribly concerned about the performance of a single freeze call. At the
> > same time, I could see this adding up across hundreds of cycles or
> > whatever via a test or test suite, and that being potentially annoying
> > to devs/testers.
> 
> Well... yeah.  The net effect on djwong-dev is that a -g all test run
> went from 3.6h to 4.8h.  It was less bad for tip (2.8 to 3h).
> 
> > > Also also looking through the archives[1], Brian once commented that
> > > cleaning up all this stuff should be done /if/ one decides to mount the
> > > frozen-snapshot writable at some later point in time.
> > > 
> > 
> > That kind of sounds like the tradeoff of impacting the current/active fs
> > for the benefit of a snapshot that may or may not be used. If not, then
> > it's a waste of time. If so, it might be beneficial for the snap to more
> > accurately reflect the "eventual" state of the original. For cowblocks
> > perhaps it doesn't matter if the mount/recovery will scan and reclaim.
> > I'm not as sure for eofblocks, wouldn't the snap persist all that
> > "intended to be transient" speculative prealloc until/unless said files
> > are reopened/closed?
> 
> Yes, that is the current behavior. :)
> 
> I don't know if it's really optimal (it's at least lazy :P) and Eric has
> tried to shift the balance away from "xfs snapshots take forever to
> mount".
> 
> > > Maybe this means we ought to find a way to remove inodes from the percpu
> > > inactivation lists?  iget used to be able to pry inodes out of deferred
> > > inactivation...
> > > 
> > 
> > Seems a reasonable option. Presumably that mitigates the lookup stalling
> > behavior without the need for any additional scanning work at freeze
> > time (and maybe eliminates the need for an inodegc flush too), but is
> > neutral wrt some of the other tradeoffs (like the above). I think the
> > former is the big question wrt deferred inactivation whereas the
> > latter has been the case forever.
> > 
> > BTW, should we care at all about somebody dropping the entire cached
> > working set (worst case) onto these queues if the fs is frozen? Maybe
> > not if we have to cycle through all these inodes anyways for a full
> > working set eviction, and presumably a large evictor task (i.e.
> > drop_caches) will minimize the percpu queue distribution...
> 
> I've wondered myself if we really need to dump quite so many inodes onto
> the inactivation queue while we're frozen.  For posteof blocks we could
> just leave them attached and hope that inactivation eventually gets
> them, though that has the unfortunate side effect that space can
> disappear into the depths.
> 

I would really prefer to avoid any sort of approach that leaks post-eof
space as such. As unlikely as this situation might be, that reintroduces
paths to the old "where'd my XFS space go?" class of problems this
infrastructure was originally designed to address. 

ISTM the currently most viable options we've discussed are:

1. Leave things as is, accept potential for lookup stalls while frozen
and wait and see if this ever really becomes a problem for real users.

2. Tweak the scan to be async as Dave suggested in his followup mail.
This changes this patch from a guarantee to more of a mitigation, which
personally I think is fairly reasonable. We'd still have drained writers
and faulters by this point in the freeze, so the impact would probably
be limited to contention with readers of blockgc inodes (though this
probably should be tested).

3. Darrick's earlier thought to reintroduce inode reuse off the
inactivation queues. It's not clear to me how involved this might be.

4-N. I'm sure we can think up any number of other theoretical options
depending on how involved we want to get. The idea below sounds
plausible, but at the same time (and if I'm following correctly)
inventing a new way to free space off inodes purely for the use of
eviction during freeze sounds excessive wrt complexity, future
maintenance burden, etc.
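For concreteness, option 2 would be a small tweak to the hunk in this
patch: drop the SYNC flag so the walk trylocks and skips busy inodes
rather than spinning on them. An untested sketch against the patch as
posted:

```c
	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
		struct xfs_icwalk	icw = {0};

		/*
		 * Best-effort blockgc pass: without XFS_ICWALK_FLAG_SYNC
		 * the walk skips inodes whose IOLOCK it cannot get, so a
		 * reader already blocked on the pagefault freeze can no
		 * longer wedge the freeze itself.  This turns the patch
		 * from a guarantee into a mitigation, as discussed above.
		 */
		xfs_blockgc_free_space(mp, &icw);
	}
```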

Perhaps yet another option could be something like a more granular
freeze callback that, if specified by the fs, invokes the filesystem at
each step of the freeze sequence instead of only at the end like
->freeze_fs() currently does. IIUC this problem mostly goes away if we
can run the scan a bit earlier, we could clean up the existing freeze
wart in ->sync_fs(), and perhaps other filesystems might find that
similarly useful. Of course this requires more changes outside of xfs..
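To sketch the shape of that idea (everything here is hypothetical; no
such VFS hook exists today, and the name and signature are invented):

```c
/* Hypothetical VFS hook; name and signature invented for illustration. */
struct super_operations {
	/* ... existing members ... */
	int (*freeze_stage)(struct super_block *sb, int level);
};

/*
 * freeze_super() would invoke it as each level of s_writers drains,
 * e.g. after SB_FREEZE_WRITE is held but before page faults are
 * frozen, so a reader stuck in a page fault can still make progress
 * and release its IOLOCK while the sync scan runs:
 */
static int xfs_fs_freeze_stage(struct super_block *sb, int level)
{
	if (level == SB_FREEZE_WRITE) {
		struct xfs_icwalk	icw = {
			.icw_flags = XFS_ICWALK_FLAG_SYNC,
		};

		return xfs_blockgc_free_space(XFS_M(sb), &icw);
	}
	return 0;
}
```

This would also let the existing freeze special-casing move out of
->sync_fs(), per the paragraph above.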

Brian

> COW mappings are attached incore to the inode cow fork but the refcount
> btree ondisk, so in principle for frozen inodes we could move the cow
> mappings to an internal bitmap, let the inode go, and free the abandoned
> cow blocks at thaw time.
> 
> Unlinked inodes with no active refcount can of course block in the queue
> forever; userspace can't have those back.
> 
> --D
> 
> > 
> > Brian
> > 
> > > [1] https://lore.kernel.org/linux-xfs/20190117181406.GF37591@bfoster/
> > > 
> > > --D
> > > 
> > > > Cheers,
> > > > 
> > > > Dave.
> > > > -- 
> > > > Dave Chinner
> > > > david@fromorbit.com
> > > 
> > 
>
Darrick J. Wong Jan. 18, 2022, 6:56 p.m. UTC | #12
On Mon, Jan 17, 2022 at 08:37:13AM -0500, Brian Foster wrote:
> On Fri, Jan 14, 2022 at 01:30:43PM -0800, Darrick J. Wong wrote:
> > On Fri, Jan 14, 2022 at 02:45:10PM -0500, Brian Foster wrote:
> > > On Fri, Jan 14, 2022 at 09:35:35AM -0800, Darrick J. Wong wrote:
> > > > On Fri, Jan 14, 2022 at 09:38:10AM +1100, Dave Chinner wrote:
> > > > > On Thu, Jan 13, 2022 at 08:37:01AM -0500, Brian Foster wrote:
> > > > > > We've had reports on distro (pre-deferred inactivation) kernels that
> > > > > > inode reclaim (i.e. via drop_caches) can deadlock on the s_umount
> > > > > > lock when invoked on a frozen XFS fs. This occurs because
> > > > > > drop_caches acquires the lock and then blocks in xfs_inactive() on
> > > > > > transaction alloc for an inode that requires an eofb trim. unfreeze
> > > > > > then blocks on the same lock and the fs is deadlocked.
> > > > > > 
> > > > > > With deferred inactivation, the deadlock problem is no longer
> > > > > > present because ->destroy_inode() no longer blocks whether the fs is
> > > > > > frozen or not. There is still unfortunate behavior in that lookups
> > > > > > of a pending inactive inode spin loop waiting for the pending
> > > > > > inactive state to clear, which won't happen until the fs is
> > > > > > unfrozen. This was always possible to some degree, but is
> > > > > > potentially amplified by the fact that reclaim no longer blocks on
> > > > > > the first inode that requires inactivation work. Instead, we
> > > > > > populate the inactivation queues indefinitely. The side effect can
> > > > > > be observed easily by invoking drop_caches on a frozen fs previously
> > > > > > populated with eofb and/or cowblocks inodes and then running
> > > > > > anything that relies on inode lookup (i.e., ls).
> > > > > > 
> > > > > > To mitigate this behavior, invoke internal blockgc reclaim during
> > > > > > the freeze sequence to guarantee that inode eviction doesn't lead to
> > > > > > this state due to eofb or cowblocks inodes. This is similar to
> > > > > > current behavior on read-only remount. Since the deadlock issue was
> > > > > > present for such a long time, also document the subtle
> > > > > > ->destroy_inode() constraint to avoid unintentional reintroduction
> > > > > > of the deadlock problem in the future.
> > > > > > 
> > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > > > ---
> > > > > >  fs/xfs/xfs_super.c | 19 +++++++++++++++++--
> > > > > >  1 file changed, 17 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > > > > index c7ac486ca5d3..1d0f87e47fa4 100644
> > > > > > --- a/fs/xfs/xfs_super.c
> > > > > > +++ b/fs/xfs/xfs_super.c
> > > > > > @@ -623,8 +623,13 @@ xfs_fs_alloc_inode(
> > > > > >  }
> > > > > >  
> > > > > >  /*
> > > > > > - * Now that the generic code is guaranteed not to be accessing
> > > > > > - * the linux inode, we can inactivate and reclaim the inode.
> > > > > > + * Now that the generic code is guaranteed not to be accessing the inode, we can
> > > > > > + * inactivate and reclaim it.
> > > > > > + *
> > > > > > + * NOTE: ->destroy_inode() can be called (with ->s_umount held) while the
> > > > > > + * filesystem is frozen. Therefore it is generally unsafe to attempt transaction
> > > > > > + * allocation in this context. A transaction alloc that blocks on frozen state
> > > > > > + * from a context with ->s_umount held will deadlock with unfreeze.
> > > > > >   */
> > > > > >  STATIC void
> > > > > >  xfs_fs_destroy_inode(
> > > > > > @@ -764,6 +769,16 @@ xfs_fs_sync_fs(
> > > > > >  	 * when the state is either SB_FREEZE_FS or SB_FREEZE_COMPLETE.
> > > > > >  	 */
> > > > > >  	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
> > > > > > +		struct xfs_icwalk	icw = {0};
> > > > > > +
> > > > > > +		/*
> > > > > > +		 * Clear out eofb and cowblocks inodes so eviction while frozen
> > > > > > +		 * doesn't leave them sitting in the inactivation queue where
> > > > > > +		 * they cannot be processed.
> > > > > > +		 */
> > > > > > +		icw.icw_flags = XFS_ICWALK_FLAG_SYNC;
> > > > > > +		xfs_blockgc_free_space(mp, &icw);
> > > > > 
> > > > > Is a SYNC walk safe to run here? I know we run
> > > > > xfs_blockgc_free_space() from XFS_IOC_FREE_EOFBLOCKS under
> > > > > SB_FREEZE_WRITE protection, but here we have both writes and page
> > > > > faults frozen; we're running in a much more constrained freeze
> > > > > context here.
> > > > > 
> > > > > i.e. the SYNC walk will keep busy looping if it can't get the
> > > > > IOLOCK_EXCL on an inode that is in cache, so if we end up with an
> > > > > inode locked and blocked on SB_FREEZE_WRITE or SB_FREEZE_PAGEFAULT
> > > > > for whatever reason this will never return....
> > > > 
> > > > Are you referring to the case where one could be read()ing from a file
> > > > into a buffer that's really a mmap'd page from another file while the
> > > > underlying fs is being frozen?
> > > > 
> > > 
> > > I thought this was generally safe as freeze protection sits outside of
> > > the locks, but I'm not terribly sure about the read to a mapped buffer
> > > case. If that allows an iolock holder to block on a pagefault due to
> > > freeze, then SB_FREEZE_PAGEFAULT might be too late for a synchronous
> > > scan (i.e. assuming nothing blocks this earlier or prefaults/pins the
> > > target buffer)..?
> > 
> > I think so.  xfs_file_buffered_read takes IOLOCK_SHARED and calls
> > filemap_read, which calls copy_page_to_iter.  I /think/ the messy iovec
> > code calls copyout, which can then hit a write page fault, which takes
> > us to __xfs_filemap_fault.  That takes SB_PAGEFAULT, which is the
> > opposite order of what now goes on during a freeze.
> > 
> > > > Also, I added this second patch and fstests runtime went up by 30%.
> > > > ISTR Dave commenting that freeze time would go way up when I submitted a
> > > > patch to clean out the cow blocks a few years ago.
> > > > 
> > > 
> > > Do you have any more detailed data on this? I.e., is this an increase
> > > across the board? A smaller set of tests doing many freezes with a
> > > significant runtime increase?
> > 
> > I think it's constrained to tests that run freeze and thaw in a loop,
> > but the increases in those tests are large.
> > 
> > Here's what I see when I run all the tests that mention 'freeze' before
> > applying the patch:
> > 
> > generic/068      46s
> > generic/085      9s
> > generic/280      2s
> > generic/311      53s
> > generic/390      3s
> > generic/459      15s
> > generic/491      2s
> > xfs/006  8s
> > xfs/011  20s
> > xfs/119  6s
> > xfs/264  13s
> > xfs/297  42s
> > xfs/318  2s
> > xfs/325  2s
> > xfs/438  4s
> > xfs/517  46s
> > 
> > And here's after:
> > 
> > generic/068      47s
> > generic/085      17s
> > generic/280      4s
> > generic/311      81s
> > generic/390      4s
> > generic/459      14s
> > generic/491      2s
> > xfs/006  9s
> > xfs/011  21s
> > xfs/119  11s
> > xfs/264  18s
> > xfs/297  31s
> > xfs/318  3s
> > xfs/325  2s
> > xfs/438  5s
> > xfs/517  46s
> > 
> > Most of those tests run in more or less the same amount of time, except
> > for generic/085, generic/311, xfs/119, xfs/264, and xfs/297.  Of those,
> > they all freeze repeatedly except for xfs/119.
> > 
> > I would imagine that the same thing is going on with tests that touch
> > device-mapper, since a dm suspend also freezes the fs, but I didn't
> > check those tests all that thoroughly, since most of the dmerror /
> > dmflakey tests only switch the dm table once or twice per test.
> > 
> 
> I think the test performance is more likely to be impacted when there's a
> combination of freeze and some workload that results in the blockgc scan
> having to do work. Of course there will be some impact even with the
> extra call, but that and your followup results that factor out lockdep
> seem a much more reasonable impact to me.
> 
> The way I see it, we can always optimize looping tests that become too
> slow and it's not exactly like tests are designed for runtime efficiency
> in the first place. I feel like I run into new tests all the time that
> don't really consider the broader runtime impact and whether they do
> more work than really necessary. Unless there's some
> immediate/unforeseen/disruptive change (like your initial numbers seemed
> to suggest), this doesn't seem like a primary concern to me.
> 
> > > I'm a little on the fence about this because personally, I'm not
> > > terribly concerned about the performance of a single freeze call. At the
> > > same time, I could see this adding up across hundreds of cycles or
> > > whatever via a test or test suite, and that being potentially annoying
> > > to devs/testers.
> > 
> > Well... yeah.  The net effect on djwong-dev is that a -g all test run
> > went from 3.6h to 4.8h.  It was less bad for tip (2.8 to 3h).
> > 
> > > > Also also looking through the archives[1], Brian once commented that
> > > > cleaning up all this stuff should be done /if/ one decides to mount the
> > > > frozen-snapshot writable at some later point in time.
> > > > 
> > > 
> > > That kind of sounds like the tradeoff of impacting the current/active fs
> > > for the benefit of a snapshot that may or may not be used. If not, then
> > > it's a waste of time. If so, it might be beneficial for the snap to more
> > > accurately reflect the "eventual" state of the original. For cowblocks
> > > perhaps it doesn't matter if the mount/recovery will scan and reclaim.
> > > I'm not as sure for eofblocks, wouldn't the snap persist all that
> > > "intended to be transient" speculative prealloc until/unless said files
> > > are reopened/closed?
> > 
> > Yes, that is the current behavior. :)
> > 
> > I don't know if it's really optimal (it's at least lazy :P) and Eric has
> > tried to shift the balance away from "xfs snapshots take forever to
> > mount".
> > 
> > > > Maybe this means we ought to find a way to remove inodes from the percpu
> > > > inactivation lists?  iget used to be able to pry inodes out of deferred
> > > > inactivation...
> > > > 
> > > 
> > > Seems a reasonable option. Presumably that mitigates the lookup stalling
> > > behavior without the need for any additional scanning work at freeze
> > > time (and maybe eliminates the need for an inodegc flush too), but is
> > > neutral wrt some of the other tradeoffs (like the above). I think the
> > > former is the big question wrt deferred inactivation whereas the
> > > latter has been the case forever.
> > > 
> > > BTW, should we care at all about somebody dropping the entire cached
> > > working set (worst case) onto these queues if the fs is frozen? Maybe
> > > not if we have to cycle through all these inodes anyways for a full
> > > working set eviction, and presumably a large evictor task (i.e.
> > > drop_caches) will minimize the percpu queue distribution...
> > 
> > I've wondered myself if we really need to dump quite so many inodes onto
> > the inactivation queue while we're frozen.  For posteof blocks we could
> > just leave them attached and hope that inactivation eventually gets
> > them, though that has the unfortunate side effect that space can
> > disappear into the depths.
> > 
> 
> I would really prefer to avoid any sort of approach that leaks post-eof
> space as such. As unlikely as this situation might be, that reintroduces
> paths to the old "where'd my XFS space go?" class of problems this
> infrastructure was originally designed to address. 
> 
> ISTM the currently most viable options we've discussed are:
> 
> 1. Leave things as is, accept potential for lookup stalls while frozen
> and wait and see if this ever really becomes a problem for real users.

From a laziness perspective, I like this option. :)

But we /do/ have customers who file escalations for every stack trace
they see in dmesg, even if it's merely the hangcheck timer telling them
that a process has stalled on a frozen fs.

> 2. Tweak the scan to be async as Dave suggested in his followup mail.
> This changes this patch from a guarantee to more of a mitigation, which
> personally I think is fairly reasonable. We'd still have drained writers
> and faulters by this point in the freeze, so the impact would probably
> be limited to contention with readers of blockgc inodes (though this
> probably should be tested).

<nod> Switching to an async scan would ameliorate the situation, but not
solve it.

> 3. Darrick's earlier thought to reintroduce inode reuse off the
> inactivation queues. It's not clear to me how involved this might be.

I /think/ it's fairly simple to teach xfs_iget to call xfs_iget_recycle
and clear NEEDS_INACTIVE from the inode, and to teach the inodegc worker
to re-check whether an inode is NEEDS_INACTIVE before setting
INACTIVATING.

A bigger problem might be how to prevent the inode from being reclaimed
until the i_gclist is clear /and/ we know the worker is done with it.

> 4-N. I'm sure we can think up any number of other theoretical options
> depending on how involved we want to get. The idea below sounds
> plausible, but at the same time (and if I'm following correctly)
> inventing a new way to free space off inodes purely for the use of
> eviction during freeze sounds excessive wrt complexity, future
> maintenance burden, etc.

Yeah.

> Perhaps yet another option could be something like a more granular
> freeze callback that, if specified by the fs, invokes the filesystem at
> each step of the freeze sequence instead of only at the end like
> ->freeze_fs() currently does. IIUC this problem mostly goes away if we
> can run the scan a bit earlier, we could clean up the existing freeze
> wart in ->sync_fs(), and perhaps other filesystems might find that
> similarly useful. Of course this requires more changes outside of xfs..

<nod> I wish we could lock out file reads during whole-fs transitions
like freezing and ro remounts, but I bet the community would not be
amenable to adding mnt_want_read type things to all the vfs functions.

--D

> 
> Brian
> 
> > COW mappings are attached incore to the inode cow fork but the refcount
> > btree ondisk, so in principle for frozen inodes we could move the cow
> > mappings to an internal bitmap, let the inode go, and free the abandoned
> > cow blocks at thaw time.
> > 
> > Unlinked inodes with no active refcount can of course block in the queue
> > forever; userspace can't have those back.
> > 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > > [1] https://lore.kernel.org/linux-xfs/20190117181406.GF37591@bfoster/
> > > > 
> > > > --D
> > > > 
> > > > > Cheers,
> > > > > 
> > > > > Dave.
> > > > > -- 
> > > > > Dave Chinner
> > > > > david@fromorbit.com
> > > > 
> > > 
> > 
>
Brian Foster Jan. 19, 2022, 8:07 p.m. UTC | #13
On Tue, Jan 18, 2022 at 10:56:47AM -0800, Darrick J. Wong wrote:
> On Mon, Jan 17, 2022 at 08:37:13AM -0500, Brian Foster wrote:
> > On Fri, Jan 14, 2022 at 01:30:43PM -0800, Darrick J. Wong wrote:
> > > On Fri, Jan 14, 2022 at 02:45:10PM -0500, Brian Foster wrote:
> > > > On Fri, Jan 14, 2022 at 09:35:35AM -0800, Darrick J. Wong wrote:
> > > > > On Fri, Jan 14, 2022 at 09:38:10AM +1100, Dave Chinner wrote:
> > > > > > On Thu, Jan 13, 2022 at 08:37:01AM -0500, Brian Foster wrote:
> > > > > > > We've had reports on distro (pre-deferred inactivation) kernels that
> > > > > > > inode reclaim (i.e. via drop_caches) can deadlock on the s_umount
> > > > > > > lock when invoked on a frozen XFS fs. This occurs because
> > > > > > > drop_caches acquires the lock and then blocks in xfs_inactive() on
> > > > > > > transaction alloc for an inode that requires an eofb trim. unfreeze
> > > > > > > then blocks on the same lock and the fs is deadlocked.
> > > > > > > 
> > > > > > > With deferred inactivation, the deadlock problem is no longer
> > > > > > > present because ->destroy_inode() no longer blocks whether the fs is
> > > > > > > frozen or not. There is still unfortunate behavior in that lookups
> > > > > > > of a pending inactive inode spin loop waiting for the pending
> > > > > > > inactive state to clear, which won't happen until the fs is
> > > > > > > unfrozen. This was always possible to some degree, but is
> > > > > > > potentially amplified by the fact that reclaim no longer blocks on
> > > > > > > the first inode that requires inactivation work. Instead, we
> > > > > > > populate the inactivation queues indefinitely. The side effect can
> > > > > > > be observed easily by invoking drop_caches on a frozen fs previously
> > > > > > > populated with eofb and/or cowblocks inodes and then running
> > > > > > > anything that relies on inode lookup (i.e., ls).
> > > > > > > 
> > > > > > > To mitigate this behavior, invoke internal blockgc reclaim during
> > > > > > > the freeze sequence to guarantee that inode eviction doesn't lead to
> > > > > > > this state due to eofb or cowblocks inodes. This is similar to
> > > > > > > current behavior on read-only remount. Since the deadlock issue was
> > > > > > > present for such a long time, also document the subtle
> > > > > > > ->destroy_inode() constraint to avoid unintentional reintroduction
> > > > > > > of the deadlock problem in the future.
> > > > > > > 
> > > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > > > > ---
> > > > > > >  fs/xfs/xfs_super.c | 19 +++++++++++++++++--
> > > > > > >  1 file changed, 17 insertions(+), 2 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > > > > > index c7ac486ca5d3..1d0f87e47fa4 100644
> > > > > > > --- a/fs/xfs/xfs_super.c
> > > > > > > +++ b/fs/xfs/xfs_super.c
> > > > > > > @@ -623,8 +623,13 @@ xfs_fs_alloc_inode(
> > > > > > >  }
> > > > > > >  
> > > > > > >  /*
> > > > > > > - * Now that the generic code is guaranteed not to be accessing
> > > > > > > - * the linux inode, we can inactivate and reclaim the inode.
> > > > > > > + * Now that the generic code is guaranteed not to be accessing the inode, we can
> > > > > > > + * inactivate and reclaim it.
> > > > > > > + *
> > > > > > > + * NOTE: ->destroy_inode() can be called (with ->s_umount held) while the
> > > > > > > + * filesystem is frozen. Therefore it is generally unsafe to attempt transaction
> > > > > > > + * allocation in this context. A transaction alloc that blocks on frozen state
> > > > > > > + * from a context with ->s_umount held will deadlock with unfreeze.
> > > > > > >   */
> > > > > > >  STATIC void
> > > > > > >  xfs_fs_destroy_inode(
> > > > > > > @@ -764,6 +769,16 @@ xfs_fs_sync_fs(
> > > > > > >  	 * when the state is either SB_FREEZE_FS or SB_FREEZE_COMPLETE.
> > > > > > >  	 */
> > > > > > >  	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
> > > > > > > +		struct xfs_icwalk	icw = {0};
> > > > > > > +
> > > > > > > +		/*
> > > > > > > +		 * Clear out eofb and cowblocks inodes so eviction while frozen
> > > > > > > +		 * doesn't leave them sitting in the inactivation queue where
> > > > > > > +		 * they cannot be processed.
> > > > > > > +		 */
> > > > > > > +		icw.icw_flags = XFS_ICWALK_FLAG_SYNC;
> > > > > > > +		xfs_blockgc_free_space(mp, &icw);
> > > > > > 
> > > > > > Is a SYNC walk safe to run here? I know we run
> > > > > > xfs_blockgc_free_space() from XFS_IOC_FREE_EOFBLOCKS under
> > > > > > SB_FREEZE_WRITE protection, but here writes and page faults are both
> > > > > > frozen, so we're running in a much more constrained freeze context.
> > > > > > 
> > > > > > i.e. the SYNC walk will keep busy looping if it can't get the
> > > > > > IOLOCK_EXCL on an inode that is in cache, so if we end up with an
> > > > > > inode locked and blocked on SB_FREEZE_WRITE or SB_FREEZE_PAGEFAULT
> > > > > > for whatever reason this will never return....
> > > > > 
> > > > > Are you referring to the case where one could be read()ing from a file
> > > > > into a buffer that's really a mmap'd page from another file while the
> > > > > underlying fs is being frozen?
> > > > > 
> > > > 
> > > > I thought this was generally safe as freeze protection sits outside of
> > > > the locks, but I'm not terribly sure about the read to a mapped buffer
> > > > case. If that allows an iolock holder to block on a pagefault due to
> > > > freeze, then SB_FREEZE_PAGEFAULT might be too late for a synchronous
> > > > scan (i.e. assuming nothing blocks this earlier or prefaults/pins the
> > > > target buffer)..?
> > > 
> > > I think so.  xfs_file_buffered_read takes IOLOCK_SHARED and calls
> > > filemap_read, which calls copy_page_to_iter.  I /think/ the messy iovec
> > > code calls copyout, which can then hit a write page fault, which takes
> > > us to __xfs_filemap_fault.  That takes SB_PAGEFAULT, which is the
> > > opposite order of what now goes on during a freeze.
> > > 
> > > > > Also, I added this second patch and fstests runtime went up by 30%.
> > > > > ISTR Dave commenting that freeze time would go way up when I submitted a
> > > > > patch to clean out the cow blocks a few years ago.
> > > > > 
> > > > 
> > > > Do you have any more detailed data on this? I.e., is this an increase
> > > > across the board? A smaller set of tests doing many freezes with a
> > > > significant runtime increase?
> > > 
> > > I think it's constrained to tests that run freeze and thaw in a loop,
> > > but the increases in those tests are large.
> > > 
> > > Here's what I see when I run all the tests that mention 'freeze' before
> > > applying the patch:
> > > 
> > > generic/068      46s
> > > generic/085      9s
> > > generic/280      2s
> > > generic/311      53s
> > > generic/390      3s
> > > generic/459      15s
> > > generic/491      2s
> > > xfs/006  8s
> > > xfs/011  20s
> > > xfs/119  6s
> > > xfs/264  13s
> > > xfs/297  42s
> > > xfs/318  2s
> > > xfs/325  2s
> > > xfs/438  4s
> > > xfs/517  46s
> > > 
> > > And here's after:
> > > 
> > > generic/068      47s
> > > generic/085      17s
> > > generic/280      4s
> > > generic/311      81s
> > > generic/390      4s
> > > generic/459      14s
> > > generic/491      2s
> > > xfs/006  9s
> > > xfs/011  21s
> > > xfs/119  11s
> > > xfs/264  18s
> > > xfs/297  31s
> > > xfs/318  3s
> > > xfs/325  2s
> > > xfs/438  5s
> > > xfs/517  46s
> > > 
> > > Most of those tests run in more or less the same amount of time, except
> > > for generic/085, generic/311, xfs/119, xfs/264, and xfs/297.  Of those,
> > > they all freeze repeatedly except for xfs/119.
> > > 
> > > I would imagine that the same thing is going on with tests that touch
> > > device-mapper, since a dm suspend also freezes the fs, but I didn't
> > > check those tests all that thoroughly, since most of the dmerror /
> > > dmflakey tests only switch the dm table once or twice per test.
> > > 
> > 
> > I think the test performance is more likely to be impacted when there's a
> > combination of freeze and some workload that results in the blockgc scan
> > having to do work. Of course there will be some impact even with the
> > extra call, but that and your followup results that factor out lockdep
> > seem a much more reasonable impact to me.
> > 
> > The way I see it, we can always optimize looping tests that become too
> > slow and it's not exactly like tests are designed for runtime efficiency
> > in the first place. I feel like I run into new tests all the time that
> > don't really consider the broader runtime impact and whether they do
> > more work than really necessary. Unless there's some
> > immediate/unforeseen/disruptive change (like your initial numbers seemed
> > to suggest), this doesn't seem like a primary concern to me.
> > 
> > > > I'm a little on the fence about this because personally, I'm not
> > > > terribly concerned about the performance of a single freeze call. At the
> > > > same time, I could see this adding up across hundreds of cycles or
> > > > whatever via a test or test suite, and that being potentially annoying
> > > > to devs/testers.
> > > 
> > > Well... yeah.  The net effect on djwong-dev is that a -g all test run
> > > went from 3.6h to 4.8h.  It was less bad for tip (2.8 to 3h).
> > > 
> > > > > Also also looking through the archives[1], Brian once commented that
> > > > > cleaning up all this stuff should be done /if/ one decides to mount the
> > > > > frozen-snapshot writable at some later point in time.
> > > > > 
> > > > 
> > > > That kind of sounds like the tradeoff of impacting the current/active fs
> > > > for the benefit of a snapshot that may or may not be used. If not, then
> > > > it's a waste of time. If so, it might be beneficial for the snap to more
> > > > accurately reflect the "eventual" state of the original. For cowblocks
> > > > perhaps it doesn't matter if the mount/recovery will scan and reclaim.
> > > > I'm not as sure for eofblocks: wouldn't the snap persist all that
> > > > "intended to be transient" speculative prealloc until/unless said files
> > > > are reopened/closed?
> > > 
> > > Yes, that is the current behavior. :)
> > > 
> > > I don't know if it's really optimal (it's at least lazy :P) and Eric has
> > > tried to shift the balance away from "xfs snapshots take forever to
> > > mount".
> > > 
> > > > > Maybe this means we ought to find a way to remove inodes from the percpu
> > > > > inactivation lists?  iget used to be able to pry inodes out of deferred
> > > > > inactivation...
> > > > > 
> > > > 
> > > > Seems a reasonable option. Presumably that mitigates the lookup stalling
> > > > behavior without the need for any additional scanning work at freeze
> > > > time (and maybe eliminates the need for an inodegc flush too), but is
> > > > neutral wrt some of the other tradeoffs (like the above). I think the
> > > > former is the big question wrt deferred inactivation whereas the
> > > > latter has been the case forever.
> > > > 
> > > > BTW, should we care at all about somebody dropping the entire cached
> > > > working set (worst case) onto these queues if the fs is frozen? Maybe
> > > > not if we have to cycle through all these inodes anyways for a full
> > > > working set eviction, and presumably a large evictor task (i.e.
> > > > drop_caches) will minimize the percpu queue distribution...
> > > 
> > > I've wondered myself if we really need to dump quite so many inodes onto
> > > the inactivation queue while we're frozen.  For posteof blocks we could
> > > just leave them attached and hope that inactivation eventually gets
> > > them, though that has the unfortunate side effect that space can
> > > disappear into the depths.
> > > 
> > 
> > I would really prefer to avoid any sort of approach that leaks post-eof
> > space as such. As unlikely as this situation might be, that reintroduces
> > paths to the old "where'd my XFS space go?" class of problems this
> > infrastructure was originally designed to address. 
> > 
> > ISTM the currently most viable options we've discussed are:
> > 
> > 1. Leave things as is, accept potential for lookup stalls while frozen
> > and wait and see if this ever really becomes a problem for real users.
> 
> From a laziness perspective, I like this option. :)
> 
> But we /do/ have customers who file escalations for every stack trace
> they see in dmesg, even if it's merely the hangcheck timer telling them
> that a process has stalled on a frozen fs.
> 
> > 2. Tweak the scan to be async as Dave suggested in his followup mail.
> > This changes this patch from a guarantee to more of a mitigation, which
> > personally I think is fairly reasonable. We'd still have drained writers
> > and faulters by this point in the freeze, so the impact would probably
> > be limited to contention with readers of blockgc inodes (though this
> > probably should be tested).
> 
> <nod> Switching to an async scan would ameliorate the situation, but not
> solve it.
> 
> > 3. Darrick's earlier thought to reintroduce inode reuse off the
> > inactivation queues. It's not clear to me how involved this might be.
> 
> I /think/ it's fairly simple to teach xfs_iget to xfs_iget_recycle and
> clear NEEDS_INACTIVE from the inode and the inodegc worker to re-check
> if an inode is NEEDS_INACTIVE before setting INACTIVATING.
> 
> A bigger problem might be how to prevent the inode from being reclaimed
> until the i_gclist is clear /and/ we know the worker is done with it.
> 
> > 4-N. I'm sure we can think up any number of other theoretical options
> > depending on how involved we want to get. The idea below sounds
> > plausible, but at the same time (and if I'm following correctly)
> > inventing a new way to free space off inodes purely for the use of
> > eviction during freeze sounds excessive wrt complexity, future
> > maintenance burden, etc.
> 
> Yeah.
> 
> > Perhaps yet another option could be something like a more granular
> > freeze callback that, if specified by the fs, invokes the filesystem at
> > each step of the freeze sequence instead of only at the end like
> > ->freeze_fs() currently does. IIUC this problem mostly goes away if we
> > can run the scan a bit earlier, we could clean up the existing freeze
> > wart in ->sync_fs(), and perhaps other filesystems might find that
> > similarly useful. Of course this requires more changes outside of xfs..
> 
> <nod> I wish we could lock out file reads during whole-fs transitions
> like freezing and ro remounts, but I bet the community would not be
> amenable to adding mnt_want_read type things to all the vfs functions.
> 

Err, yeah. That seems potentially more invasive.

Well, if you wanted to go the freeze API route and are OK with pulling
an async scan as an incremental step (given that the end goal with that
approach is a sync blockgc scan), I don't mind taking a stab at the
freeze API thing from there. I'm not sure how likely it might be to land
a new freeze interface, but in thinking about it a bit we might not need
a new callback interface at all since the freeze state is already in the
superblock. E.g., consider an optional superblock flag we could set to
instruct the VFS to call ->freeze_fs() once per state transition rather
than only at the end. With that, we'd check sb->s_writers.frozen at each
invocation (like we already do in ->sync_fs()) and DTRT based on the
current state. E.g., perhaps with high level logic along the lines of:

SB_FREEZE_WRITE:
	- full sync_fs() and sync blockgc scan
SB_FREEZE_PAGEFAULT:
	- async blockgc (for the read -> mapped write case)
	- stop inode/block gc scanners
SB_FREEZE_FS:
	- remaining/existing ->freeze_fs() logic

... and then pull the corresponding hacks out of xfs_fs_sync_fs().

Of course if you wanted to recycle inactive inodes or do something else
entirely, then it's probably not worth going down this path..

Brian

> --D
> 
> > 
> > Brian
> > 
> > > COW mappings are attached incore to the inode cow fork but the refcount
> > > btree ondisk, so in principle for frozen inodes we could move the cow
> > > mappings to an internal bitmap, let the inode go, and free the abandoned
> > > cow blocks at thaw time.
> > > 
> > > Unlinked inodes with no active refcount can of course block in the queue
> > > forever; userspace can't have those back.
> > > 
> > > --D
> > > 
> > > > 
> > > > Brian
> > > > 
> > > > > [1] https://lore.kernel.org/linux-xfs/20190117181406.GF37591@bfoster/
> > > > > 
> > > > > --D
> > > > > 
> > > > > > Cheers,
> > > > > > 
> > > > > > Dave.
> > > > > > -- 
> > > > > > Dave Chinner
> > > > > > david@fromorbit.com
> > > > > 
> > > > 
> > > 
> > 
>
Darrick J. Wong Jan. 20, 2022, 12:36 a.m. UTC | #14
On Wed, Jan 19, 2022 at 03:07:15PM -0500, Brian Foster wrote:
> On Tue, Jan 18, 2022 at 10:56:47AM -0800, Darrick J. Wong wrote:
> > On Mon, Jan 17, 2022 at 08:37:13AM -0500, Brian Foster wrote:
> > > Perhaps yet another option could be something like a more granular
> > > freeze callback that, if specified by the fs, invokes the filesystem at
> > > each step of the freeze sequence instead of only at the end like
> > > ->freeze_fs() currently does. IIUC this problem mostly goes away if we
> > > can run the scan a bit earlier, we could clean up the existing freeze
> > > wart in ->sync_fs(), and perhaps other filesystems might find that
> > > similarly useful. Of course this requires more changes outside of xfs..
> > 
> > <nod> I wish we could lock out file reads during whole-fs transitions
> > like freezing and ro remounts, but I bet the community would not be
> > amenable to adding mnt_want_read type things to all the vfs functions.
> > 
> 
> Err, yeah. That seems potentially more invasive.

Yep.  Freezes should be fairly infrequent, which means we'd be imposing
a pretty high penalty on every read ever just for the sake of a
relatively rare operation.

> Well, if you wanted to go the freeze API route and are OK with pulling
> an async scan as an incremental step (given that the end goal with that
> approach is a sync blockgc scan), I don't mind taking a stab at the
> freeze API thing from there. I'm not sure how likely it might be to land
> a new freeze interface, but in thinking about it a bit we might not need
> a new callback interface at all since the freeze state is already in the
> superblock. E.g., consider an optional superblock flag we could set to
> instruct the VFS to call ->freeze_fs() once per state transition rather
> than only at the end. With that, we'd check sb->s_writers.frozen at each
> invocation (like we already do in ->sync_fs()) and DTRT based on the
> current state. E.g., perhaps with high level logic along the lines of:
> 
> SB_FREEZE_WRITE:
> 	- full sync_fs() and sync blockgc scan
> SB_FREEZE_PAGEFAULT:
> 	- async blockgc (for the read -> mapped write case)
> 	- stop inode/block gc scanners
> SB_FREEZE_FS:
> 	- remaining/existing ->freeze_fs() logic
> 
> ... and then pull the corresponding hacks out of xfs_fs_sync_fs().

That sounds like a reasonable place to start.  At some point after the
merge window closes I'm going to send some patches to -fsdevel to fix
the problem that sync_filesystem ignores the ->sync_fs() return value,
but I think the conflicts should not be insurmountable.

> Of course if you wanted to recycle inactive inodes or do something else
> entirely, then it's probably not worth going down this path..

I'm a bit partial to /trying/ to recycle inactive inodes because (a)
it's less tangling with -fsdevel for you and (b) inode scans in the
online repair patchset got a little weird once xfs_iget lost the ability
to recycle _NEEDS_INACTIVE inodes...

OTOH I tried to figure out how to deal with the lockless list that those
inodes are put on, and I couldn't figure out how to get them off the
list safely, so that might be a dead end.  If you have any ideas I'm all
ears. :)

--D

> Brian
> 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > > COW mappings are attached incore to the inode cow fork but the refcount
> > > > btree ondisk, so in principle for frozen inodes we could move the cow
> > > > mappings to an internal bitmap, let the inode go, and free the abandoned
> > > > cow blocks at thaw time.
> > > > 
> > > > Unlinked inodes with no active refcount can of course block in the queue
> > > > forever; userspace can't have those back.
> > > > 
> > > > --D
> > > > 
> > > > > 
> > > > > Brian
> > > > > 
> > > > > > [1] https://lore.kernel.org/linux-xfs/20190117181406.GF37591@bfoster/
> > > > > > 
> > > > > > --D
> > > > > > 
> > > > > > > Cheers,
> > > > > > > 
> > > > > > > Dave.
> > > > > > > -- 
> > > > > > > Dave Chinner
> > > > > > > david@fromorbit.com
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
>
Dave Chinner Jan. 20, 2022, 5:18 a.m. UTC | #15
On Wed, Jan 19, 2022 at 04:36:36PM -0800, Darrick J. Wong wrote:
> OTOH I tried to figure out how to deal with the lockless list that those
> inodes are put on, and I couldn't figure out how to get them off the
> list safely, so that might be a dead end.  If you have any ideas I'm all
> ears. :)

You can't get them off the middle of the llist without adding
locking to all the llist operations. I chose llist because its
lockless primitives matched the "add single/remove all" pattern of
batch processing that the per-cpu inactive queue implementation
required.  Hence if you want to do anything other that "single
add/remove all" with the inactive queue, you're going to have to
replace it with a different queue implementation....

Cheers,

Dave.
Brian Foster Jan. 24, 2022, 4:57 p.m. UTC | #16
On Wed, Jan 19, 2022 at 04:36:36PM -0800, Darrick J. Wong wrote:
> On Wed, Jan 19, 2022 at 03:07:15PM -0500, Brian Foster wrote:
> > On Tue, Jan 18, 2022 at 10:56:47AM -0800, Darrick J. Wong wrote:
> > > On Mon, Jan 17, 2022 at 08:37:13AM -0500, Brian Foster wrote:
> > > > On Fri, Jan 14, 2022 at 01:30:43PM -0800, Darrick J. Wong wrote:
> > > > > On Fri, Jan 14, 2022 at 02:45:10PM -0500, Brian Foster wrote:
> > > > > > On Fri, Jan 14, 2022 at 09:35:35AM -0800, Darrick J. Wong wrote:
> > > > > > > On Fri, Jan 14, 2022 at 09:38:10AM +1100, Dave Chinner wrote:
> > > > > > > > On Thu, Jan 13, 2022 at 08:37:01AM -0500, Brian Foster wrote:
> > > > > > > > > We've had reports on distro (pre-deferred inactivation) kernels that
> > > > > > > > > inode reclaim (i.e. via drop_caches) can deadlock on the s_umount
> > > > > > > > > lock when invoked on a frozen XFS fs. This occurs because
> > > > > > > > > drop_caches acquires the lock and then blocks in xfs_inactive() on
> > > > > > > > > transaction alloc for an inode that requires an eofb trim. unfreeze
> > > > > > > > > then blocks on the same lock and the fs is deadlocked.
> > > > > > > > > 
> > > > > > > > > With deferred inactivation, the deadlock problem is no longer
> > > > > > > > > present because ->destroy_inode() no longer blocks whether the fs is
> > > > > > > > > frozen or not. There is still unfortunate behavior in that lookups
> > > > > > > > > of a pending inactive inode spin loop waiting for the pending
> > > > > > > > > inactive state to clear, which won't happen until the fs is
> > > > > > > > > unfrozen. This was always possible to some degree, but is
> > > > > > > > > potentially amplified by the fact that reclaim no longer blocks on
> > > > > > > > > the first inode that requires inactivation work. Instead, we
> > > > > > > > > populate the inactivation queues indefinitely. The side effect can
> > > > > > > > > be observed easily by invoking drop_caches on a frozen fs previously
> > > > > > > > > populated with eofb and/or cowblocks inodes and then running
> > > > > > > > > anything that relies on inode lookup (i.e., ls).
> > > > > > > > > 
> > > > > > > > > To mitigate this behavior, invoke internal blockgc reclaim during
> > > > > > > > > the freeze sequence to guarantee that inode eviction doesn't lead to
> > > > > > > > > this state due to eofb or cowblocks inodes. This is similar to
> > > > > > > > > current behavior on read-only remount. Since the deadlock issue was
> > > > > > > > > present for such a long time, also document the subtle
> > > > > > > > > ->destroy_inode() constraint to avoid unintentional reintroduction
> > > > > > > > > of the deadlock problem in the future.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > > > > > > ---
> > > > > > > > >  fs/xfs/xfs_super.c | 19 +++++++++++++++++--
> > > > > > > > >  1 file changed, 17 insertions(+), 2 deletions(-)
> > > > > > > > > 
> > > > > > > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > > > > > > > index c7ac486ca5d3..1d0f87e47fa4 100644
> > > > > > > > > --- a/fs/xfs/xfs_super.c
> > > > > > > > > +++ b/fs/xfs/xfs_super.c
> > > > > > > > > @@ -623,8 +623,13 @@ xfs_fs_alloc_inode(
> > > > > > > > >  }
> > > > > > > > >  
> > > > > > > > >  /*
> > > > > > > > > - * Now that the generic code is guaranteed not to be accessing
> > > > > > > > > - * the linux inode, we can inactivate and reclaim the inode.
> > > > > > > > > + * Now that the generic code is guaranteed not to be accessing the inode, we can
> > > > > > > > > + * inactivate and reclaim it.
> > > > > > > > > + *
> > > > > > > > > + * NOTE: ->destroy_inode() can be called (with ->s_umount held) while the
> > > > > > > > > + * filesystem is frozen. Therefore it is generally unsafe to attempt transaction
> > > > > > > > > + * allocation in this context. A transaction alloc that blocks on frozen state
> > > > > > > > > + * from a context with ->s_umount held will deadlock with unfreeze.
> > > > > > > > >   */
> > > > > > > > >  STATIC void
> > > > > > > > >  xfs_fs_destroy_inode(
> > > > > > > > > @@ -764,6 +769,16 @@ xfs_fs_sync_fs(
> > > > > > > > >  	 * when the state is either SB_FREEZE_FS or SB_FREEZE_COMPLETE.
> > > > > > > > >  	 */
> > > > > > > > >  	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
> > > > > > > > > +		struct xfs_icwalk	icw = {0};
> > > > > > > > > +
> > > > > > > > > +		/*
> > > > > > > > > +		 * Clear out eofb and cowblocks inodes so eviction while frozen
> > > > > > > > > +		 * doesn't leave them sitting in the inactivation queue where
> > > > > > > > > +		 * they cannot be processed.
> > > > > > > > > +		 */
> > > > > > > > > +		icw.icw_flags = XFS_ICWALK_FLAG_SYNC;
> > > > > > > > > +		xfs_blockgc_free_space(mp, &icw);
> > > > > > > > 
> > > > > > > > Is a SYNC walk safe to run here? I know we run
> > > > > > > > xfs_blockgc_free_space() from XFS_IOC_FREE_EOFBLOCKS under
> > > > > > > > SB_FREEZE_WRITE protection, but here we have both frozen writes and
> > > > > > > > page faults we're running in a much more constrained freeze context
> > > > > > > > here.
> > > > > > > > 
> > > > > > > > i.e. the SYNC walk will keep busy looping if it can't get the
> > > > > > > > IOLOCK_EXCL on an inode that is in cache, so if we end up with an
> > > > > > > > inode locked and blocked on SB_FREEZE_WRITE or SB_FREEZE_PAGEFAULT
> > > > > > > > for whatever reason this will never return....
> > > > > > > 
> > > > > > > Are you referring to the case where one could be read()ing from a file
> > > > > > > into a buffer that's really a mmap'd page from another file while the
> > > > > > > underlying fs is being frozen?
> > > > > > > 
> > > > > > 
> > > > > > I thought this was generally safe as freeze protection sits outside of
> > > > > > the locks, but I'm not terribly sure about the read to a mapped buffer
> > > > > > case. If that allows an iolock holder to block on a pagefault due to
> > > > > > freeze, then SB_FREEZE_PAGEFAULT might be too late for a synchronous
> > > > > > scan (i.e. assuming nothing blocks this earlier or prefaults/pins the
> > > > > > target buffer)..?
> > > > > 
> > > > > I think so.  xfs_file_buffered_read takes IOLOCK_SHARED and calls
> > > > > filemap_read, which calls copy_page_to_iter.  I /think/ the messy iovec
> > > > > code calls copyout, which can then hit a write page fault, which takes
> > > > > us to __xfs_filemap_fault.  That takes SB_PAGEFAULT, which is the
> > > > > opposite order of what now goes on during a freeze.
> > > > > 
> > > > > > > Also, I added this second patch and fstests runtime went up by 30%.
> > > > > > > ISTR Dave commenting that freeze time would go way up when I submitted a
> > > > > > > patch to clean out the cow blocks a few years ago.
> > > > > > > 
> > > > > > 
> > > > > > Do you have any more detailed data on this? I.e., is this an increase
> > > > > > across the board? A smaller set of tests doing many freezes with a
> > > > > > significant runtime increase?
> > > > > 
> > > > > I think it's constrained to tests that run freeze and thaw in a loop,
> > > > > but the increases in those tests are large.
> > > > > 
> > > > > Here's what I see when I run all the tests that mention 'freeze' before
> > > > > applying the patch:
> > > > > 
> > > > > generic/068      46s
> > > > > generic/085      9s
> > > > > generic/280      2s
> > > > > generic/311      53s
> > > > > generic/390      3s
> > > > > generic/459      15s
> > > > > generic/491      2s
> > > > > xfs/006  8s
> > > > > xfs/011  20s
> > > > > xfs/119  6s
> > > > > xfs/264  13s
> > > > > xfs/297  42s
> > > > > xfs/318  2s
> > > > > xfs/325  2s
> > > > > xfs/438  4s
> > > > > xfs/517  46s
> > > > > 
> > > > > And here's after:
> > > > > 
> > > > > generic/068      47s
> > > > > generic/085      17s
> > > > > generic/280      4s
> > > > > generic/311      81s
> > > > > generic/390      4s
> > > > > generic/459      14s
> > > > > generic/491      2s
> > > > > xfs/006  9s
> > > > > xfs/011  21s
> > > > > xfs/119  11s
> > > > > xfs/264  18s
> > > > > xfs/297  31s
> > > > > xfs/318  3s
> > > > > xfs/325  2s
> > > > > xfs/438  5s
> > > > > xfs/517  46s
> > > > > 
> > > > > Most of those tests run in more or less the same amount of time, except
> > > > > for generic/085, generic/311, xfs/119, xfs/264, and xfs/297.  Of those,
> > > > > they all freeze repeatedly except for xfs/119.
> > > > > 
> > > > > I would imagine that the same thing is going on with tests that touch
> > > > > device-mapper, since a dm suspend also freezes the fs, but I didn't
> > > > > check those tests all that thoroughly, since most of the dmerror /
> > > > > dmflakey tests only switch the dm table once or twice per test.
> > > > > 
> > > > 
> > > > I think the test performance is more likely to impacted when there's a
> > > > combination of freeze and some workload that results in the blockgc scan
> > > > having to do work. Of course there will be some impact even with the
> > > > extra call, but that and your followup results that factor out lockdep
> > > > seem a much more reasonable impact to me.
> > > > 
> > > > The way I see it, we can always optimize looping tests that become too
> > > > slow and it's not exactly like tests are designed for runtime efficiency
> > > > in the first place. I feel like I run into new tests all the time that
> > > > don't really consider the broader runtime impact and whether they do
> > > > more work than really necessary. Unless there's some
> > > > immediate/unforeseen/disruptive change (like your initial numbers seemed
> > > > to suggest), this doesn't seem like a primary concern to me.
> > > > 
> > > > > > I'm a little on the fence about this because personally, I'm not
> > > > > > terribly concerned about the performance of a single freeze call. At the
> > > > > > same time, I could see this adding up across hundreds of cycles or
> > > > > > whatever via a test or test suite, and that being potentially annoying
> > > > > > to devs/testers.
> > > > > 
> > > > > Well... yeah.  The net effect on djwong-dev is that a -g all test run
> > > > > went from 3.6h to 4.8h.  It was less bad for tip (2.8 to 3h).
> > > > > 
> > > > > > > Also also looking through the archives[1], Brian once commented that
> > > > > > > cleaning up all this stuff should be done /if/ one decides to mount the
> > > > > > > frozen-snapshot writable at some later point in time.
> > > > > > > 
> > > > > > 
> > > > > > That kind of sounds like the tradeoff of impacting the current/active fs
> > > > > > for the benefit of a snapshot that may or may not be used. If not, then
> > > > > > it's a waste of time. If so, it might be beneficial for the snap to more
> > > > > > accurately reflect the "eventual" state of the original. For cowblocks
> > > > > > perhaps it doesn't matter if the mount/recovery will scan and reclaim.
> > > > > > I'm not as sure for eofblocks, wouldn't the snap persist all that
> > > > > > "intended to be transient" speculative prealloc until/unless said files
> > > > > > are reopened/closed?
> > > > > 
> > > > > Yes, that is the current behavior. :)
> > > > > 
> > > > > I don't know if it's really optimal (it's at least lazy :P) and Eric has
> > > > > tried to shift the balance away from "xfs snapshots take forever to
> > > > > mount".
> > > > > 
> > > > > > > Maybe this means we ought to find a way to remove inodes from the percpu
> > > > > > > inactivation lists?  iget used to be able to pry inodes out of deferred
> > > > > > > inactivation...
> > > > > > > 
> > > > > > 
> > > > > > Seems a reasonable option. Presumably that mitigates the lookup stalling
> > > > > > behavior without the need for any additional scanning work at freeze
> > > > > > time (and maybe eliminates the need for an inodegc flush too), but is
> > > > > > neutral wrt some of the other tradeoffs (like the above). I think the
> > > > > > former is the big question wrt to deferred inactivation whereas the
> > > > > > latter has been the case forever.
> > > > > > 
> > > > > > BTW, should we care at all about somebody dropping the entire cached
> > > > > > working set (worst case) onto these queues if the fs is frozen? Maybe
> > > > > > not if we have to cycle through all these inodes anyways for a full
> > > > > > working set eviction, and presumably a large evictor task (i.e.
> > > > > > drop_caches) will minimize the percpu queue distribution...
> > > > > 
> > > > > I've wondered myself if we really need to dump quite so many inodes onto
> > > > > the inactivation queue while we're frozen.  For posteof blocks we could
> > > > > just leave them attached and hope that inactivation eventually gets
> > > > > them, though that has the unfortunate side effect that space can
> > > > > disappear into the depths.
> > > > > 
> > > > 
> > > > I would really prefer to avoid any sort of approach that leaks post-eof
> > > > space as such. As unlikely as this situation might be, that reintroduces
> > > > paths to the old "where'd my XFS space go?" class of problems this
> > > > infrastructure was originally designed to address. 
> > > > 
> > > > ISTM the currently most viable options we've discussed are:
> > > > 
> > > > 1. Leave things as is, accept potential for lookup stalls while frozen
> > > > and wait and see if this ever really becomes a problem for real users.
> > > 
> > > From a laziness perspective, I like this option. :)
> > > 
> > > But we /do/ have customers who file escalations for every stack trace
> > > they see in dmesg, even if it's merely the hangcheck timer telling them
> > > that a process has stalled on a frozen fs.
> > > 
> > > > 2. Tweak the scan to be async as Dave suggested in his followup mail.
> > > > This changes this patch from a guarantee to more of a mitigation, which
> > > > personally I think is fairly reasonable. We'd still have drained writers
> > > > and faulters by this point in the freeze, so the impact would probably
> > > > be limited to contention with readers of blockgc inodes (though this
> > > > probably should be tested).
> > > 
> > > <nod> Switching to an async scan would ameliorate the situation, but not
> > > solve it.
> > > 
> > > > 3. Darrick's earlier thought to reintroduce inode reuse off the
> > > > inactivation queues. It's not clear to me how involved this might be.
> > > 
> > > I /think/ it's fairly simple to teach xfs_iget to xfs_iget_recycle and
> > > clear NEEDS_INACTIVE from the inode and the inodegc worker to re-check
> > > if an inode is NEEDS_INACTIVE before setting INACTIVATING.
> > > 
> > > A bigger problem might be how to prevent the inode from being reclaimed
> > > until the i_gclist is clear /and/ we know the worker is done with it.
> > > 
> > > > 4-N. I'm sure we can think up any number of other theoretical options
> > > > depending on how involved we want to get. The idea below sounds
> > > > plausible, but at the same time (and if I'm following correctly)
> > > > inventing a new way to free space off inodes purely for the use of
> > > > eviction during freeze sounds excessive wrt to complexity, future
> > > > maintenance burden, etc.
> > > 
> > > Yeah.
> > > 
> > > > Perhaps yet another option could be something like a more granular
> > > > freeze callback that, if specified by the fs, invokes the filesystem at
> > > > each step of the freeze sequence instead of only at the end like
> > > > ->freeze_fs() currently does. IIUC this problem mostly goes away if we
> > > > can run the scan a bit earlier, we could clean up the existing freeze
> > > > wart in ->sync_fs(), and perhaps other filesystems might find that
> > > > similarly useful. Of course this requires more changes outside of xfs..
> > > 
> > > <nod> I wish we could lock out file reads during whole-fs transitions
> > > like freezing and ro remounts, but I bet the community would not be
> > > amenable to adding mnt_want_read type things to all the vfs functions.
> > > 
> > 
> > Err, yeah. That seems potentially more invasive.
> 
> Yep.  Freezes should be fairly infrequent, which means we'd be imposing
> a pretty high penalty on every read ever just for the sake of a
> relatively rare operation.
> 
> > Well, if you wanted to go the freeze api route and are Ok with pulling
> > an async scan as an incremental step (given that the end goal with that
> > approach is a sync blockgc scan), I don't mind taking a stab at the
> > freeze API thing from there. I'm not sure how likely it might be to land
> > a new freeze interface, but in thinking about it a bit we might not need
> > a new callback interface at all since the freeze state is already in the
> > superblock. E.g., consider an optional superblock flag we could set to
> > instruct the VFS to call ->freeze_fs() once per state transition rather
> > than only at the end. With that, we'd check sb->s_writers.frozen at each
> > invocation (like we already do in ->sync_fs()) and DTRT based on the
> > current state. E.g., perhaps with high level logic along the lines of:
> > 
> > SB_FREEZE_WRITE:
> > 	- full sync_fs() and sync blockgc scan
> > SB_FREEZE_PAGEFAULT:
> > 	- async blockgc (for the read -> mapped write case)
> > 	- stop inode/block gc scanners
> > SB_FREEZE_FS:
> > 	- remaining/existing ->fs_freeze() logic
> > 
> > ... and then pull the corresponding hacks out of xfs_fs_sync_fs().
> 
> That sounds like a reasonable place to start.  At some point after the
> merge window closes I'm going to send some patches to -fsdevel to fix
> the problem that sync_filesystem ignores the ->sync_fs() return value,
> but think the conflicts should not be insurmountable.
> 
> > Of course if you wanted to recycle inactive inodes or do something else
> > entirely, then it's probably not worth going down this path..
> 
> I'm a bit partial to /trying/ to recycle inactive inodes because (a)
> it's less tangling with -fsdevel for you and (b) inode scans in the
> online repair patchset got a little weird once xfs_iget lost the ability
> to recycle _NEEDS_INACTIVE inodes...
> 
> OTOH I tried to figure out how to deal with the lockless list that those
> inodes are put on, and I couldn't figure out how to get them off the
> list safely, so that might be a dead end.  If you have any ideas I'm all
> ears. :)
> 

So one of the things that I've been kind of unclear on about the current
deferred inactivation implementation is the primary goal of the percpu
optimization. I obviously can see the value of the percpu list in
general, but how much processing needs to occur in percpu context to
achieve the primary goal?

For example, I can see how a single or small multi threaded sustained
file removal might be batched efficiently, but what happens if said task
happens to bounce around many cpus? What if a system has hundreds of
cpus and enough removal tasks to populate most or all of the queues? Is
it worth having 200 percpu workqueue tasks doing block truncations and
inode frees to a filesystem that might have something like 16-32 AGs?

So I dunno, ISTM that the current implementation can be hyper efficient
for some workloads and perhaps unpredictable for others. As Dave already
alluded to, the tradeoff often times for such hyper efficient structures
is functional inflexibility, which is kind of what we're seeing between
the inability to recycle inodes wrt to this topic as well as potential
optimizations on the whole RCU recycling thing. The only real approach
that comes to mind for managing this kind of infrastructure short of
changing data structures is to preserve the ability to drain and quiesce
it regardless of filesystem state.

For example, I wonder if we could do something like have the percpu
queues amortize insertion into lock protected perag lists that can be
managed/processed accordingly rather than complete the entire
inactivation sequence in percpu context. From there, the perag lists
could be processed by an unbound/multithreaded workqueue that's maybe
more aligned with the AG count on the filesystem than cpu count on the
system.

I suspect something like that would allow for a more ideal combination
of performance and flexibility because then the percpu component never
has to freeze at all, it can just continue to drain into the perag lists
based on a specified heuristic. Processing of the aggregated lists would
have to stop on freeze of course, but from there we'd at least have a
lock protected data structure to allow things like lookup to pull inodes
off for recycling purposes, online repair (?), etc., or for the
inactivation workqueue to skip inodes that might still be waiting on a
grace period to expire before they should be freed and made reallocation
candidates (including intelligence on when to most efficiently wait on a
grace period). (And maybe in the future this could tie in with your
earlier thought about making inactivation more granular in terms of
separating inode truncation from block and inode freeing so more work
can be done per AG lock cycle, etc.).

Of course, this would most likely have to come at some performance cost
as compared to the current, lockless implementation. The question to me
is more what is the cost in order to achieve the desired flexibility?
I.e., if we could preserve 80% of the performance benefit of the current
implementation with additional flexibility of lifting some of the work
out of the percpu tasks, would that be worth it? I'm not really sure
what all of the tradeoffs are here because of my questions above around
the goals of the implementation, etc. I'm sure I'm missing other context
as well, so that is mostly just my current handwavy thinking about how
to possibly accomplish some of these functional goals without having to
completely change out the existing data structure...

Brian

> --D
> 
> > Brian
> > 
> > > --D
> > > 
> > > > 
> > > > Brian
> > > > 
> > > > > COW mappings are attached incore to the inode cow fork but the refcount
> > > > > btree ondisk, so in principle for frozen inodes we could move the cow
> > > > > mappings to an internal bitmap, let the inode go, and free the abandoned
> > > > > cow blocks at thaw time.
> > > > > 
> > > > > Unlinked inodes with no active refcount can of course block in the queue
> > > > > forever; userspace can't have those back.
> > > > > 
> > > > > --D
> > > > > 
> > > > > > 
> > > > > > Brian
> > > > > > 
> > > > > > > [1] https://lore.kernel.org/linux-xfs/20190117181406.GF37591@bfoster/
> > > > > > > 
> > > > > > > --D
> > > > > > > 
> > > > > > > > Cheers,
> > > > > > > > 
> > > > > > > > Dave.
> > > > > > > > -- 
> > > > > > > > Dave Chinner
> > > > > > > > david@fromorbit.com
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
>
Dave Chinner Feb. 2, 2022, 2:22 a.m. UTC | #17
On Mon, Jan 24, 2022 at 11:57:12AM -0500, Brian Foster wrote:
> On Wed, Jan 19, 2022 at 04:36:36PM -0800, Darrick J. Wong wrote:
> > > Of course if you wanted to recycle inactive inodes or do something else
> > > entirely, then it's probably not worth going down this path..
> > 
> > I'm a bit partial to /trying/ to recycle inactive inodes because (a)
> > it's less tangling with -fsdevel for you and (b) inode scans in the
> > online repair patchset got a little weird once xfs_iget lost the ability
> > to recycle _NEEDS_INACTIVE inodes...
> > 
> > OTOH I tried to figure out how to deal with the lockless list that those
> > inodes are put on, and I couldn't figure out how to get them off the
> > list safely, so that might be a dead end.  If you have any ideas I'm all
> > ears. :)
> > 
> 
> So one of the things that I've been kind of unclear on about the current
> deferred inactivation implementation is the primary goal of the percpu
> optimization. I obviously can see the value of the percpu list in
> general, but how much processing needs to occur in percpu context to
> achieve the primary goal?
> 
> For example, I can see how a single or small multi threaded sustained
> file removal might be batched efficiently, but what happens if said task
> happens to bounce around many cpus?

In that case we have a scheduler problem, not a per-cpu
infrastructure issue.

> What if a system has hundreds of
> cpus and enough removal tasks to populate most or all of the
> queues?

It behaves identically to before the per-cpu inactivation queues
were added. Essentially, everything serialises and burns CPU
spinning on the CIL push lock regardless of where the work is coming
from. The per-cpu queues do not impact this behaviour at all, nor do
they change the distribution of the work that needs to be done.

Even Darrick's original proposal had this same behaviour:

https://lore.kernel.org/linux-xfs/20210801234709.GD2757197@dread.disaster.area/

> Is
> it worth having 200 percpu workqueue tasks doing block truncations and
> inode frees to a filesystem that might have something like 16-32 AGs?

If you have a workload with 200-way concurrency that hits a
filesystem path with 16-32 way concurrency because of per-AG
locking (e.g. unlink needs to lock the AGI twice - once to put the
inode on the unlinked list, then again to remove and free it),
then you're only going to get 16-32 way concurrency from your
workload regardless of whether you have per-cpu algorithms for parts
of the workload.

From a workload scalability POV, we've always used the rule that the
filesystem AG count needs to be at least 2x the concurrency
requirement of the workload. Generally speaking, that means if you
want the FS to scale to operations on all CPUs at once, you need to
configure the FS with agcount=(2 * ncpus). This was one of the first
things I was taught about performance tuning large CPU count HPC
machines when I first started working at SGI back in 2002, and it's
still true today.

> So I dunno, ISTM that the current implementation can be hyper efficient
> for some workloads and perhaps unpredictable for others. As Dave already
> alluded to, the tradeoff often times for such hyper efficient structures
> is functional inflexibility, which is kind of what we're seeing between
> the inability to recycle inodes wrt to this topic as well as potential
> optimizations on the whole RCU recycling thing. The only real approach
> that comes to mind for managing this kind of infrastructure short of
> changing data structures is to preserve the ability to drain and quiesce
> it regardless of filesystem state.
> 
> For example, I wonder if we could do something like have the percpu
> queues amortize insertion into lock protected perag lists that can be
> managed/processed accordingly rather than complete the entire
> inactivation sequence in percpu context. From there, the perag lists
> could be processed by an unbound/multithreaded workqueue that's maybe
> more aligned with the AG count on the filesystem than cpu count on the
> system.

That per-ag based back end processing is exactly what Darrick's
original proposals did:

https://lore.kernel.org/linux-xfs/162758431072.332903.17159226037941080971.stgit@magnolia/

It used radix tree walks run in background kworker threads to batch
all the inode inactivation in a given AG via radix tree walks to
find them.

It performed poorly at scale because it destroyed the CPU affinity
between the unlink() operation and the inactivation operation of the
unlinked inode when the last reference to the inode is dropped and
the inode evicted from task syscall exit context. Regardless of
whether there is a per-cpu front end or not, the per-ag based
processing will destroy the CPU affinity of the data set being
processed because we cannot guarantee that the per-ag objects are
all local to the CPU that is processing the per-ag objects....

IOWs, all the per-cpu queues are trying to do is keep the inode
inactivation processing local to the CPU where they are already hot
in cache.  The per-cpu queues do not achieve this perfectly in all
cases, but they perform far, far better than the original
parallelised AG affinity inactivation processing that cannot
maintain any cache affinity for the objects being processed at all.

Cheers,

Dave.
Brian Foster Feb. 10, 2022, 7:03 p.m. UTC | #18
On Wed, Feb 02, 2022 at 01:22:40PM +1100, Dave Chinner wrote:
> On Mon, Jan 24, 2022 at 11:57:12AM -0500, Brian Foster wrote:
> > On Wed, Jan 19, 2022 at 04:36:36PM -0800, Darrick J. Wong wrote:
> > > > Of course if you wanted to recycle inactive inodes or do something else
> > > > entirely, then it's probably not worth going down this path..
> > > 
> > > I'm a bit partial to /trying/ to recycle inactive inodes because (a)
> > > it's less tangling with -fsdevel for you and (b) inode scans in the
> > > online repair patchset got a little weird once xfs_iget lost the ability
> > > to recycle _NEEDS_INACTIVE inodes...
> > > 
> > > OTOH I tried to figure out how to deal with the lockless list that those
> > > inodes are put on, and I couldn't figure out how to get them off the
> > > list safely, so that might be a dead end.  If you have any ideas I'm all
> > > ears. :)
> > > 
> > 
> > So one of the things that I've been kind of unclear on about the current
> > deferred inactivation implementation is the primary goal of the percpu
> > optimization. I obviously can see the value of the percpu list in
> > general, but how much processing needs to occur in percpu context to
> > achieve the primary goal?
> > 
> > For example, I can see how a single or small multi threaded sustained
> > file removal might be batched efficiently, but what happens if said task
> > happens to bounce around many cpus?
> 
> In that case we have a scheduler problem, not a per-cpu
> infrastructure issue.
> 

Last I tested on my box, a single threaded rm -rf had executed on
something like 24 of the 80 cpus available after about ~1m of runtime.
Of course the inactivation work for an inode occurs on the cpu that the
rm task was running on when the inode was destroyed, but I don't think
there's any guarantee that kworker task will be the next task selected
on the cpu if the system is loaded with other runnable tasks (and then
whatever task is selected doesn't pollute the cache). For example, I
also noticed rm-<pidX> -> rm-<pidY> context switching on concurrent
remove tests. That is obviously not a caching issue in this case because
both tasks are still doing remove work, but behavior is less
deterministic of the target task happens to be something unrelated. Of
course, that doesn't mean the approach might not otherwise be effective
for the majority of common workloads..

> > What if a system has hundreds of
> > cpus and enough removal tasks to populate most or all of the
> > queues?
> 
> It behaves identically to before the per-cpu inactivation queues
> were added. Essentially, everything serialises and burns CPU
> spinning on the CIL push lock regardless of where the work is coming
> from. The per-cpu queues do not impact this behaviour at all, nor do
> they change the distribution of the work that needs to be done.
> 
> Even Darrick's original proposal had this same behaviour:
> 
> https://lore.kernel.org/linux-xfs/20210801234709.GD2757197@dread.disaster.area/
> 

Ok..

> > Is
> > it worth having 200 percpu workqueue tasks doing block truncations and
> > inode frees to a filesystem that might have something like 16-32 AGs?
> 
> If you have a workload with 200-way concurrency that hits a
> filesystem path with 16-32 way concurrency because of per-AG
> locking (e.g. unlink needs to lock the AGI twice - once to put the
> inode on the unlinked list, then again to remove and free it),
> then you're only going to get 16-32 way concurrency from your
> workload regardless of whether you have per-cpu algorithms for parts
> of the workload.
> 
> From a workload scalability POV, we've always used the rule that the
> filesystem AG count needs to be at least 2x the concurrency
> requirement of the workload. Generally speaking, that means if you
> want the FS to scale to operations on all CPUs at once, you need to
> configure the FS with agcount=(2 * ncpus). This was one of the first
> things I was taught about performance tuning large CPU count HPC
> machines when I first started working at SGI back in 2002, and it's
> still true today.
> 
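
[Editorial aside: the agcount sizing rule described above is easy to apply
at mkfs time. A minimal sketch follows -- the device path /dev/sdX is
hypothetical, so the mkfs.xfs invocation is left commented out and only the
arithmetic actually runs.]

```shell
# Size a new XFS filesystem's AG count to 2x the CPU count, per the
# tuning rule described above. /dev/sdX is a hypothetical device, so
# the mkfs.xfs line is commented out for safety.
ncpus=$(nproc)
agcount=$((2 * ncpus))
echo "would format with agcount=$agcount"
# mkfs.xfs -d agcount="$agcount" /dev/sdX
```

[Note that mkfs.xfs picks its own default agcount; overriding it is only
worth doing when the workload's concurrency requirement is known, as
described above.]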

Makes sense.

> > So I dunno, ISTM that the current implementation can be hyper efficient
> > for some workloads and perhaps unpredictable for others. As Dave already
> > alluded to, the tradeoff often times for such hyper efficient structures
> > is functional inflexibility, which is kind of what we're seeing between
> > the inability to recycle inodes wrt this topic as well as potential
> > optimizations on the whole RCU recycling thing. The only real approach
> > that comes to mind for managing this kind of infrastructure short of
> > changing data structures is to preserve the ability to drain and quiesce
> > it regardless of filesystem state.
> > 
> > For example, I wonder if we could do something like have the percpu
> > queues amortize insertion into lock protected perag lists that can be
> > managed/processed accordingly rather than complete the entire
> > inactivation sequence in percpu context. From there, the perag lists
> > could be processed by an unbound/multithreaded workqueue that's maybe
> > more aligned with the AG count on the filesystem than cpu count on the
> > system.
> 
> That per-ag based back end processing is exactly what Darrick's
> original proposals did:
> 
> https://lore.kernel.org/linux-xfs/162758431072.332903.17159226037941080971.stgit@magnolia/
> 
> It used radix tree walks run in background kworker threads to find
> and batch all the inode inactivation in a given AG.
> 
> It performed poorly at scale because it destroyed the CPU affinity
> between the unlink() operation and the inactivation operation of the
> unlinked inode when the last reference to the inode is dropped and
> the inode evicted from task syscall exit context. Regardless of
> whether there is a per-cpu front end or not, the per-ag based
> processing will destroy the CPU affinity of the data set being
> processed because we cannot guarantee that the per-ag objects are
> all local to the CPU that is processing the per-ag objects....
> 

Ok. The role/significance of cpu caching was not as apparent to me when
I had last replied to this thread. The discussion around the rcu inode
lifecycle issue helped clear some of that up.

That said, why not conditionally tag and divert to a background worker
when the inodegc is disabled? That could allow NEEDS_INACTIVE inodes to
be claimed/recycled from other contexts in scenarios like when the fs is
frozen, since they won't be stuck in inaccessible and inactive percpu
queues, but otherwise preserves current behavior in normal runtime
conditions. Darrick mentioned online repair wanting to do something
similar earlier, but it's not clear to me if scrub could or would want
to disable the percpu inodegc workers in favor of a temporary/background
mode while repair is running. I'm just guessing that performance is
a small enough concern in that situation that it wouldn't be a
deciding factor. Hm?

Brian

> IOWs, all the per-cpu queues are trying to do is keep the inode
> inactivation processing local to the CPU where they are already hot
> in cache.  The per-cpu queues do not achieve this perfectly in all
> cases, but they perform far, far better than the original
> parallelised AG affinity inactivation processing that cannot
> maintain any cache affinity for the objects being processed at all.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner Feb. 10, 2022, 11:08 p.m. UTC | #19
On Thu, Feb 10, 2022 at 02:03:23PM -0500, Brian Foster wrote:
> On Wed, Feb 02, 2022 at 01:22:40PM +1100, Dave Chinner wrote:
> > On Mon, Jan 24, 2022 at 11:57:12AM -0500, Brian Foster wrote:
> > > On Wed, Jan 19, 2022 at 04:36:36PM -0800, Darrick J. Wong wrote:
> > > > > Of course if you wanted to recycle inactive inodes or do something else
> > > > > entirely, then it's probably not worth going down this path..
> > > > 
> > > > I'm a bit partial to /trying/ to recycle inactive inodes because (a)
> > > > it's less tangling with -fsdevel for you and (b) inode scans in the
> > > > online repair patchset got a little weird once xfs_iget lost the ability
> > > > to recycle _NEEDS_INACTIVE inodes...
> > > > 
> > > > OTOH I tried to figure out how to deal with the lockless list that those
> > > > inodes are put on, and I couldn't figure out how to get them off the
> > > > list safely, so that might be a dead end.  If you have any ideas I'm all
> > > > ears. :)
> > > > 
> > > 
> > > So one of the things that I've been kind of unclear on about the current
> > > deferred inactivation implementation is the primary goal of the percpu
> > > optimization. I obviously can see the value of the percpu list in
> > > general, but how much processing needs to occur in percpu context to
> > > achieve the primary goal?
> > > 
> > > For example, I can see how a single or small multi threaded sustained
> > > file removal might be batched efficiently, but what happens if said task
> > > happens to bounce around many cpus?
> > 
> > In that case we have a scheduler problem, not a per-cpu
> > infrastructure issue.
> > 
> 
> Last I tested on my box, a single-threaded rm -rf had executed on
> something like 24 of the 80 cpus available after ~1m of runtime.

Not surprising - the scheduler will select a sibling core that
shares caches when the previous CPU the task was running on is busy
running CPU affine tasks (i.e. the per-cpu inactive kworker thread).

But how many CPUs it bounces the workload around over a long period
is not important. What is important is the cache residency between
the inodes when they are queued and when they are then inactivated.
That's measured in microseconds to single digit milliseconds (i.e.
within a single scheduling tick), and this is the metric that
matters the most. It doesn't matter if the first batch is unlinked
on CPU 1 and then inactivated on CPU 1 while the scheduler moves the
unlink task to CPU 2 where it queues the next batch to be
inactivated on CPU 2, and so on. What matters is the data set in
each batch remains on the same CPU for inactivation processing.

The tracing I've done indicates that the majority of the time
the scheduler bounces the tasks between two or three sibling CPUs
repeatedly. This occurs when the inactivation is keeping up with the
unlink queueing side. When we have lots of extents to remove in
inactivation, the amount of inactivation work is much greater than
the unlink() work, and so we see inactivation batch processing
taking longer and the CPU spread of the unlink task gets larger
because there are more CPUs running CPU-affine tasks doing
background inactivation.

IOWs, having the number of CPUs the unlink task is scheduled on
grow over the long term is not at all unexpected - this is exactly
what we'd expect to see when we move towards async background
processing of complex operations...

> Of course the inactivation work for an inode occurs on the cpu that the
> rm task was running on when the inode was destroyed, but I don't think
> there's any guarantee that kworker task will be the next task selected
> on the cpu if the system is loaded with other runnable tasks (and then
> whatever task is selected doesn't pollute the cache).

The scheduler will select the next CPU based on physical machine
topology - core latencies, shared caches, numa distances, whether
the target CPU has affinity bound tasks already queued, etc.

> For example, I
> also noticed rm-<pidX> -> rm-<pidY> context switching on concurrent
> remove tests. That is obviously not a caching issue in this case because
> both tasks are still doing remove work, but behavior is less
> deterministic if the target task happens to be something unrelated. Of
> course, that doesn't mean the approach might not otherwise be effective
> for the majority of common workloads..

As long as the context switch rate does not substantially increase,
having tasks migrate between sibling cores every so often isn't a
huge problem.

> > That per-ag based back end processing is exactly what Darrick's
> > original proposals did:
> > 
> > https://lore.kernel.org/linux-xfs/162758431072.332903.17159226037941080971.stgit@magnolia/
> > 
> > It used radix tree walks run in background kworker threads to find
> > and batch all the inode inactivation in a given AG.
> > 
> > It performed poorly at scale because it destroyed the CPU affinity
> > between the unlink() operation and the inactivation operation of the
> > unlinked inode when the last reference to the inode is dropped and
> > the inode evicted from task syscall exit context. Regardless of
> > whether there is a per-cpu front end or not, the per-ag based
> > processing will destroy the CPU affinity of the data set being
> > processed because we cannot guarantee that the per-ag objects are
> > all local to the CPU that is processing the per-ag objects....
> > 
> 
> Ok. The role/significance of cpu caching was not as apparent to me when
> I had last replied to this thread. The discussion around the rcu inode
> lifecycle issue helped clear some of that up.
> 
> That said, why not conditionally tag and divert to a background worker
> when the inodegc is disabled? That could allow NEEDS_INACTIVE inodes to
> be claimed/recycled from other contexts in scenarios like when the fs is
> frozen, since they won't be stuck in inaccessible and inactive percpu
> queues, but otherwise preserves current behavior in normal runtime
> conditions. Darrick mentioned online repair wanting to do something
> similar earlier, but it's not clear to me if scrub could or would want
> to disable the percpu inodegc workers in favor of a temporary/background
> mode while repair is running. I'm just guessing that performance is
> a small enough concern in that situation that it wouldn't be a
> deciding factor. Hm?

We probably could do this, but I'm not sure the complexity is
justified by the rarity of the problem it is trying to avoid.
Freezes are not long term, nor are they particularly common for
performance-sensitive workloads. Hence I'm just not sure this corner
case is important enough to justify doing the work given that we've
had similar freeze-will-delay-some-stuff-indefinitely behaviour for
a long time...

Cheers,

Dave.
Darrick J. Wong Feb. 15, 2022, 1:54 a.m. UTC | #20
On Fri, Feb 11, 2022 at 10:08:06AM +1100, Dave Chinner wrote:
> On Thu, Feb 10, 2022 at 02:03:23PM -0500, Brian Foster wrote:
> > On Wed, Feb 02, 2022 at 01:22:40PM +1100, Dave Chinner wrote:
> > > On Mon, Jan 24, 2022 at 11:57:12AM -0500, Brian Foster wrote:
> > > > On Wed, Jan 19, 2022 at 04:36:36PM -0800, Darrick J. Wong wrote:
> > > > > > Of course if you wanted to recycle inactive inodes or do something else
> > > > > > entirely, then it's probably not worth going down this path..
> > > > > 
> > > > > I'm a bit partial to /trying/ to recycle inactive inodes because (a)
> > > > > it's less tangling with -fsdevel for you and (b) inode scans in the
> > > > > online repair patchset got a little weird once xfs_iget lost the ability
> > > > > to recycle _NEEDS_INACTIVE inodes...
> > > > > 
> > > > > OTOH I tried to figure out how to deal with the lockless list that those
> > > > > inodes are put on, and I couldn't figure out how to get them off the
> > > > > list safely, so that might be a dead end.  If you have any ideas I'm all
> > > > > ears. :)
> > > > > 
> > > > 
> > > > So one of the things that I've been kind of unclear on about the current
> > > > deferred inactivation implementation is the primary goal of the percpu
> > > > optimization. I obviously can see the value of the percpu list in
> > > > general, but how much processing needs to occur in percpu context to
> > > > achieve the primary goal?
> > > > 
> > > > For example, I can see how a single or small multi threaded sustained
> > > > file removal might be batched efficiently, but what happens if said task
> > > > happens to bounce around many cpus?
> > > 
> > > In that case we have a scheduler problem, not a per-cpu
> > > infrastructure issue.
> > > 
> > 
> > Last I tested on my box, a single-threaded rm -rf had executed on
> > something like 24 of the 80 cpus available after ~1m of runtime.
> 
> Not surprising - the scheduler will select a sibling core that
> shares caches when the previous CPU the task was running on is busy
> running CPU affine tasks (i.e. the per-cpu inactive kworker thread).
> 
> But how many CPUs it bounces the workload around over a long period
> is not important. What is important is the cache residency between
> the inodes when they are queued and when they are then inactivated.
> That's measured in microseconds to single digit milliseconds (i.e.
> within a single scheduling tick), and this is the metric that
> matters the most. It doesn't matter if the first batch is unlinked
> on CPU 1 and then inactivated on CPU 1 while the scheduler moves the
> unlink task to CPU 2 where it queues the next batch to be
> inactivated on CPU 2, and so on. What matters is the data set in
> each batch remains on the same CPU for inactivation processing.
> 
> The tracing I've done indicates that the majority of the time
> the scheduler bounces the tasks between two or three sibling CPUs
> repeatedly. This occurs when the inactivation is keeping up with the
> unlink queueing side. When we have lots of extents to remove in
> inactivation, the amount of inactivation work is much greater than
> the unlink() work, and so we see inactivation batch processing
> taking longer and the CPU spread of the unlink task gets larger
> because there are more CPUs running CPU-affine tasks doing
> background inactivation.
> 
> IOWs, having the number of CPUs the unlink task is scheduled on
> grow over the long term is not at all unexpected - this is exactly
> what we'd expect to see when we move towards async background
> processing of complex operations...
> 
> > Of course the inactivation work for an inode occurs on the cpu that the
> > rm task was running on when the inode was destroyed, but I don't think
> > there's any guarantee that kworker task will be the next task selected
> > on the cpu if the system is loaded with other runnable tasks (and then
> > whatever task is selected doesn't pollute the cache).
> 
> The scheduler will select the next CPU based on physical machine
> topology - core latencies, shared caches, numa distances, whether
> the target CPU has affinity bound tasks already queued, etc.
> 
> > For example, I
> > also noticed rm-<pidX> -> rm-<pidY> context switching on concurrent
> > remove tests. That is obviously not a caching issue in this case because
> > both tasks are still doing remove work, but behavior is less
> > deterministic if the target task happens to be something unrelated. Of
> > course, that doesn't mean the approach might not otherwise be effective
> > for the majority of common workloads..
> 
> As long as the context switch rate does not substantially increase,
> having tasks migrate between sibling cores every so often isn't a
> huge problem.
> 
> > > That per-ag based back end processing is exactly what Darrick's
> > > original proposals did:
> > > 
> > > https://lore.kernel.org/linux-xfs/162758431072.332903.17159226037941080971.stgit@magnolia/
> > > 
> > > It used radix tree walks run in background kworker threads to find
> > > and batch all the inode inactivation in a given AG.
> > > 
> > > It performed poorly at scale because it destroyed the CPU affinity
> > > between the unlink() operation and the inactivation operation of the
> > > unlinked inode when the last reference to the inode is dropped and
> > > the inode evicted from task syscall exit context. Regardless of
> > > whether there is a per-cpu front end or not, the per-ag based
> > > processing will destroy the CPU affinity of the data set being
> > > processed because we cannot guarantee that the per-ag objects are
> > > all local to the CPU that is processing the per-ag objects....
> > > 
> > 
> > Ok. The role/significance of cpu caching was not as apparent to me when
> > I had last replied to this thread. The discussion around the rcu inode
> > lifecycle issue helped clear some of that up.
> > 
> > That said, why not conditionally tag and divert to a background worker
> > when the inodegc is disabled? That could allow NEEDS_INACTIVE inodes to
> > be claimed/recycled from other contexts in scenarios like when the fs is
> > frozen, since they won't be stuck in inaccessible and inactive percpu
> > queues, but otherwise preserves current behavior in normal runtime
> > conditions. Darrick mentioned online repair wanting to do something
> > similar earlier, but it's not clear to me if scrub could or would want
> > to disable the percpu inodegc workers in favor of a temporary/background
> > mode while repair is running. I'm just guessing that performance is
> > a small enough concern in that situation that it wouldn't be a
> > deciding factor. Hm?
> 
> We probably could do this, but I'm not sure the complexity is
> justified by the rarity of the problem it is trying to avoid.
> Freezes are not long term, nor are they particularly common for
> performance-sensitive workloads. Hence I'm just not sure this corner
> case is important enough to justify doing the work given that we've
> had similar freeze-will-delay-some-stuff-indefinitely behaviour for
> a long time...

We /do/ have a few complaints lodged about hangcheck warnings when the
filesystem has to be frozen for a very long time.  It'd be nice to
unblock the callers that want to grab a still-reachable inode, though.

As for the online repair case that Brian mentioned, I've largely
eliminated the need for scrub freezes by using jump labels, notifier
chains, and an in-memory shadow btree to make it so that full fs scans
can run in the background while the rest of the fs continues running.
That'll be discussed ... soon, when I get to the point where I'm ready
to start a full design review.

(Hilariously, between that and better management of the DONTCACHE flag,
I don't actually need deferred inactivation for repair anymore... :P)

--D

> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Feb. 15, 2022, 9:26 a.m. UTC | #21
On Mon, Feb 14, 2022 at 05:54:16PM -0800, Darrick J. Wong wrote:
> On Fri, Feb 11, 2022 at 10:08:06AM +1100, Dave Chinner wrote:
> > On Thu, Feb 10, 2022 at 02:03:23PM -0500, Brian Foster wrote:
> > > On Wed, Feb 02, 2022 at 01:22:40PM +1100, Dave Chinner wrote:
> > > > On Mon, Jan 24, 2022 at 11:57:12AM -0500, Brian Foster wrote:
> > > That said, why not conditionally tag and divert to a background worker
> > > when the inodegc is disabled? That could allow NEEDS_INACTIVE inodes to
> > > be claimed/recycled from other contexts in scenarios like when the fs is
> > > frozen, since they won't be stuck in inaccessible and inactive percpu
> > > queues, but otherwise preserves current behavior in normal runtime
> > > conditions. Darrick mentioned online repair wanting to do something
> > > similar earlier, but it's not clear to me if scrub could or would want
> > > to disable the percpu inodegc workers in favor of a temporary/background
> > > mode while repair is running. I'm just guessing that performance is
> > > a small enough concern in that situation that it wouldn't be a
> > > deciding factor. Hm?
> > 
> > We probably could do this, but I'm not sure the complexity is
> > justified by the rarity of the problem it is trying to avoid.
> > Freezes are not long term, nor are they particularly common for
> > performance-sensitive workloads. Hence I'm just not sure this corner
> > case is important enough to justify doing the work given that we've
> > had similar freeze-will-delay-some-stuff-indefinitely behaviour for
> > a long time...
> 
> We /do/ have a few complaints lodged about hangcheck warnings when the
> filesystem has to be frozen for a very long time.  It'd be nice to
> unblock the callers that want to grab a still-reachable inode, though.

I suspect this problem largely goes away with moving inactivation
up to the VFS level - we'll still block the background EOF block
trimming work on a freeze, but it won't prevent lookups on those
inodes from taking new references to the inode...

Cheers,

Dave.
diff mbox series

Patch

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index c7ac486ca5d3..1d0f87e47fa4 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -623,8 +623,13 @@  xfs_fs_alloc_inode(
 }
 
 /*
- * Now that the generic code is guaranteed not to be accessing
- * the linux inode, we can inactivate and reclaim the inode.
+ * Now that the generic code is guaranteed not to be accessing the inode, we can
+ * inactivate and reclaim it.
+ *
+ * NOTE: ->destroy_inode() can be called (with ->s_umount held) while the
+ * filesystem is frozen. Therefore it is generally unsafe to attempt transaction
+ * allocation in this context. A transaction alloc that blocks on frozen state
+ * from a context with ->s_umount held will deadlock with unfreeze.
  */
 STATIC void
 xfs_fs_destroy_inode(
@@ -764,6 +769,16 @@  xfs_fs_sync_fs(
 	 * when the state is either SB_FREEZE_FS or SB_FREEZE_COMPLETE.
 	 */
 	if (sb->s_writers.frozen == SB_FREEZE_PAGEFAULT) {
+		struct xfs_icwalk	icw = {0};
+
+		/*
+		 * Clear out eofb and cowblocks inodes so eviction while frozen
+		 * doesn't leave them sitting in the inactivation queue where
+		 * they cannot be processed.
+		 */
+		icw.icw_flags = XFS_ICWALK_FLAG_SYNC;
+		xfs_blockgc_free_space(mp, &icw);
+
 		xfs_inodegc_stop(mp);
 		xfs_blockgc_stop(mp);
 	}