diff mbox

[RFC,4/5] ext4: add fs freezing support on suspend/hibernation

Message ID 20171003185313.1017-5-mcgrof@kernel.org (mailing list archive)
State RFC, archived
Headers show

Commit Message

Luis Chamberlain Oct. 3, 2017, 6:53 p.m. UTC
This also removes the superflous freezer calls as they are no longer
needed. We need to avoid sync call on thaw as otherwise we end up with
a stall on bio_submit().

Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
---
 fs/ext4/super.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

Comments

Theodore Ts'o Oct. 3, 2017, 7:59 p.m. UTC | #1
On Tue, Oct 03, 2017 at 11:53:12AM -0700, Luis R. Rodriguez wrote:
> @@ -4926,7 +4926,7 @@ static int ext4_unfreeze(struct super_block *sb)
>  		ext4_set_feature_journal_needs_recovery(sb);
>  	}
>  
> -	ext4_commit_super(sb, 1);
> +	ext4_commit_super(sb, 0);
>  	return 0;
>  }
>

After we remove add the NEEDS_RECOVERY flag, we need to make sure
recovery flag is pushed out to disk before any other changes are
allowed to be pushed out to disk.  That's why we originally did the
update synchronously.

There are other ways we could fulfill this requirements, but doing a
synchronous update is the simplest way to handle this.  Was it
necessary to change this given the other changes you are making the fs
freeze implementation?

					- Ted
Luis Chamberlain Oct. 3, 2017, 8:13 p.m. UTC | #2
On Tue, Oct 03, 2017 at 03:59:30PM -0400, Theodore Ts'o wrote:
> On Tue, Oct 03, 2017 at 11:53:12AM -0700, Luis R. Rodriguez wrote:
> > @@ -4926,7 +4926,7 @@ static int ext4_unfreeze(struct super_block *sb)
> >  		ext4_set_feature_journal_needs_recovery(sb);
> >  	}
> >  
> > -	ext4_commit_super(sb, 1);
> > +	ext4_commit_super(sb, 0);
> >  	return 0;
> >  }
> >
> 
> After we remove add the NEEDS_RECOVERY flag, we need to make sure
> recovery flag is pushed out to disk before any other changes are
> allowed to be pushed out to disk.  That's why we originally did the
> update synchronously.

I see. I had to try to dig through linux-history to see why this was done, and
I actually could not find an exact commit which explained what you just did.
Thanks!

Hrm, so on freeze we issue the same commit as well, so is a sync *really* needed
on thaw?

> There are other ways we could fulfill this requirements, but doing a
> synchronous update is the simplest way to handle this.  Was it
> necessary to change this given the other changes you are making the fs
> freeze implementation?

No, it was am empirical evaluation done at testing, I observed bio_submit()
stalls, we never get completion from the device. Even if we call thaw at the
very end of resume, after the devices should be up, we still end up in the same
situation. Given how I order freezing filesystems after freezing userspace I do
believe we should thaw filesystems prior unfreezing userspace, its why I placed
the call where it is now.

So this is something I was hoping others could help grok/iron out. I'd prefer
if we could live with thaw'ing unchanged, but this requires to figure this
issue out. Why we can't implicate bio_submit() on fs thaw as-is in this
implementation.

  Luis
Theodore Ts'o Oct. 4, 2017, 1:42 a.m. UTC | #3
On Tue, Oct 03, 2017 at 10:13:21PM +0200, Luis R. Rodriguez wrote:
> > After we remove add the NEEDS_RECOVERY flag, we need to make sure
> > recovery flag is pushed out to disk before any other changes are
> > allowed to be pushed out to disk.  That's why we originally did the
> > update synchronously.
> 
> I see. I had to try to dig through linux-history to see why this was done, and
> I actually could not find an exact commit which explained what you just did.
> Thanks!
> 
> Hrm, so on freeze we issue the same commit as well, so is a sync *really* needed
> on thaw?

So let me explain what's going on.  When we do a freeze, we make sure
all of the blocks that are written to the journal are writen to the
final location on disk, and the journal is is truncated.  (This is
called a "journal checkpoint".)  Then we clear the NEEDS RECOVERY
feature flag and set the EXT4_VALID_FS flags in the superblock.  In
the no journal case, we flush out all metadata writes, and we set the
EXT4_VALID_FS flag.  In both cases, we call ext4_commit_super(sb, 1)
to make sure the flags update is pushed out to disk.  This puts the
file system into a "quiscent" state; in fact, it looks like the file
system has been unmounted, so that it becomes safe to run
dump/restore, run fsck -n on the file system, etc.

The problem on the thaw side is that we need to make sure that
EXT4_VALID_FS flag is removed (and if journaling is enabled, the NEEDS
RECOVERY feature flag is set) and the superblock is flushed out to the
storage device before any other writes are persisted on the disk.  In
the case where we have journalling enabled, we could wait until the
first journal commit to write out the superblock, since in journal
mode all metadata updates to the disk are suppressed until the journal
commit.  We don't do that today, but it is a change we could make.

However, in the no journal node we need to make sure the EXT4_VALID_FS
flag is cleared on disk before any other metadata operations can take
place, and calling ext4_commit_super(sb, 1) is the only real way to do
that.

> No, it was am empirical evaluation done at testing, I observed bio_submit()
> stalls, we never get completion from the device. Even if we call thaw at the
> very end of resume, after the devices should be up, we still end up in the same
> situation. Given how I order freezing filesystems after freezing userspace I do
> believe we should thaw filesystems prior unfreezing userspace, its why I placed
> the call where it is now.

So when we call ext4_commit_super(sb, 1), we issue the bio, and then
we block waiting for the I/O to complete.  That's a blocking call.  Is
the thaw context one which allows blocking?  If userspace is still
frozen, does that imply that the scheduler isn't allow to run?  If
that's the case, then that's probably the problem.

More generally, if the thawing code needs to allocate memory, or do
any number of things that could potentially block, this could
potentially be an issue.  Or if we have a network block device, or
something else in the storage stack that needs to run a kernel thread
context (or a workqueue, etc.) --- is the fact that userspace is
frozen mean the scheduler is going to refuse to schedule()?

I know at one point we made a distinction between freezing userspace
threads and kernel threads, but were there people who didn't like this
because of unnecessary complexity.  It sounds to me like on the thaw
side, we might also need to unfreeze kernel threads first, and then
allow userspace threads to run.  Do we do that today, or in the new
freeze/thaw code?  It's been quite a while since I've looked at that
part of the kernel.

					- Ted
Dave Chinner Oct. 4, 2017, 7:05 a.m. UTC | #4
On Tue, Oct 03, 2017 at 09:42:04PM -0400, Theodore Ts'o wrote:
> On Tue, Oct 03, 2017 at 10:13:21PM +0200, Luis R. Rodriguez wrote:
> > > After we remove add the NEEDS_RECOVERY flag, we need to make sure
> > > recovery flag is pushed out to disk before any other changes are
> > > allowed to be pushed out to disk.  That's why we originally did the
> > > update synchronously.
> > 
> > I see. I had to try to dig through linux-history to see why this was done, and
> > I actually could not find an exact commit which explained what you just did.
> > Thanks!
> > 
> > Hrm, so on freeze we issue the same commit as well, so is a sync *really* needed
> > on thaw?
> 
> So let me explain what's going on.  When we do a freeze, we make sure
> all of the blocks that are written to the journal are writen to the
> final location on disk, and the journal is is truncated.  (This is
> called a "journal checkpoint".)  Then we clear the NEEDS RECOVERY
> feature flag and set the EXT4_VALID_FS flags in the superblock.  In
> the no journal case, we flush out all metadata writes, and we set the
> EXT4_VALID_FS flag.  In both cases, we call ext4_commit_super(sb, 1)
> to make sure the flags update is pushed out to disk.  This puts the
> file system into a "quiscent" state; in fact, it looks like the file
> system has been unmounted, so that it becomes safe to run
> dump/restore, run fsck -n on the file system, etc.
> 
> The problem on the thaw side is that we need to make sure that
> EXT4_VALID_FS flag is removed (and if journaling is enabled, the NEEDS
> RECOVERY feature flag is set) and the superblock is flushed out to the
> storage device before any other writes are persisted on the disk.  In
> the case where we have journalling enabled, we could wait until the
> first journal commit to write out the superblock, since in journal
> mode all metadata updates to the disk are suppressed until the journal
> commit.  We don't do that today, but it is a change we could make.
> 
> However, in the no journal node we need to make sure the EXT4_VALID_FS
> flag is cleared on disk before any other metadata operations can take
> place, and calling ext4_commit_super(sb, 1) is the only real way to do
> that.
> 
> > No, it was am empirical evaluation done at testing, I observed bio_submit()
> > stalls, we never get completion from the device. Even if we call thaw at the
> > very end of resume, after the devices should be up, we still end up in the same
> > situation. Given how I order freezing filesystems after freezing userspace I do
> > believe we should thaw filesystems prior unfreezing userspace, its why I placed
> > the call where it is now.
> 
> So when we call ext4_commit_super(sb, 1), we issue the bio, and then
> we block waiting for the I/O to complete.  That's a blocking call.  Is
> the thaw context one which allows blocking?

thaw_super() does down_write(), so it *must* be able to sleep.

> If userspace is still
> frozen, does that imply that the scheduler isn't allow to run?  If
> that's the case, then that's probably the problem.
> 
> More generally, if the thawing code needs to allocate memory, or do

Which is does. xfs_fs_unfreeze() queues work to a workqueue, so
there's memory allocation there.

> any number of things that could potentially block, this could
> potentially be an issue.

yup, gfs2_unfreeze does even more stuff - it releases glocks which
may end up queuing work, waking other threads and freeing stuff via
call_rcu()...

Basically, before thawing filesystems the rest of the kernel
infrastructure needs to have been restarted. i.e. the order
needs to be:

freeze userspace
freeze filesystems
freeze kernel threads
freeze workqueues

thaw workqueues
thaw kernel threads
thaw filesystems
thaw userspace

and it should end up that way.

> Or if we have a network block device, or
> something else in the storage stack that needs to run a kernel thread
> context (or a workqueue, etc.) --- is the fact that userspace is
> frozen mean the scheduler is going to refuse to schedule()?

No.

Cheers,

Dave.
Bart Van Assche Oct. 4, 2017, 3:25 p.m. UTC | #5
On Wed, 2017-10-04 at 18:05 +1100, Dave Chinner wrote:
> Basically, before thawing filesystems the rest of the kernel

> infrastructure needs to have been restarted. i.e. the order

> needs to be:

> 

> freeze userspace

> freeze filesystems

> freeze kernel threads

> freeze workqueues

> 

> thaw workqueues

> thaw kernel threads

> thaw filesystems

> thaw userspace

> 

> and it should end up that way.


Does this order support filesystems that have been implemented in user space,
e.g. filesystems based on FUSE?

Thanks,

Bart.
Theodore Ts'o Oct. 4, 2017, 4:48 p.m. UTC | #6
On Wed, Oct 04, 2017 at 06:05:23PM +1100, Dave Chinner wrote:
> Basically, before thawing filesystems the rest of the kernel
> infrastructure needs to have been restarted. i.e. the order
> needs to be:
> 
> freeze userspace
> freeze filesystems
> freeze kernel threads
> freeze workqueues
> 
> thaw workqueues
> thaw kernel threads
> thaw filesystems
> thaw userspace
> 
> and it should end up that way.
> 
> > Or if we have a network block device, or
> > something else in the storage stack that needs to run a kernel thread
> > context (or a workqueue, etc.) --- is the fact that userspace is
> > frozen mean the scheduler is going to refuse to schedule()?
> 
> No.

Well, that's what the answer *should* be.  I was asking what this
patch series does, and given that Luis reported that with this patch
series ext4_commit_super(sb, 1) is hanging, I have my suspicions about
what the answer might be with this patch set.  (Especially since the
claimed goal of the patch set is, "kthread freezing with filesystem
freeze/thaw".

Cheers,

						- Ted
Dave Chinner Oct. 4, 2017, 10:22 p.m. UTC | #7
On Wed, Oct 04, 2017 at 12:48:39PM -0400, Theodore Ts'o wrote:
> On Wed, Oct 04, 2017 at 06:05:23PM +1100, Dave Chinner wrote:
> > Basically, before thawing filesystems the rest of the kernel
> > infrastructure needs to have been restarted. i.e. the order
> > needs to be:
> > 
> > freeze userspace
> > freeze filesystems
> > freeze kernel threads
> > freeze workqueues
> > 
> > thaw workqueues
> > thaw kernel threads
> > thaw filesystems
> > thaw userspace
> > 
> > and it should end up that way.
> > 
> > > Or if we have a network block device, or
> > > something else in the storage stack that needs to run a kernel thread
> > > context (or a workqueue, etc.) --- is the fact that userspace is
> > > frozen mean the scheduler is going to refuse to schedule()?
> > 
> > No.
> 
> Well, that's what the answer *should* be.  I was asking what this
> patch series does, and given that Luis reported that with this patch
> series ext4_commit_super(sb, 1) is hanging, I have my suspicions about
> what the answer might be with this patch set.  (Especially since the
> claimed goal of the patch set is, "kthread freezing with filesystem
> freeze/thaw".

There are know bugs in the patchset w.r.t.  workqueues and kernel
threads, and IO completion requires workqueues and kernel threads to
be running correctly. Hence filesystem thaw needs to occur after
workqueues and kernel threads are thawed.

The existing code has this assumption - that filesystem will start
working again the moment workqueues and kernel threads are thawed,
but trying to do IO before that will not work. So it's the same
with freeze/thaw of filesystems - thaw of filesystems will not work
if the rest of the kernel machinery is still frozen...

Cheers,

Dave.
diff mbox

Patch

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3cc5635d00cc..03d3bbd27dae 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -119,7 +119,8 @@  static struct file_system_type ext2_fs_type = {
 	.name		= "ext2",
 	.mount		= ext4_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV |
+			  FS_FREEZE_ON_SUSPEND,
 };
 MODULE_ALIAS_FS("ext2");
 MODULE_ALIAS("ext2");
@@ -134,7 +135,8 @@  static struct file_system_type ext3_fs_type = {
 	.name		= "ext3",
 	.mount		= ext4_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV |
+			  FS_FREEZE_ON_SUSPEND,
 };
 MODULE_ALIAS_FS("ext3");
 MODULE_ALIAS("ext3");
@@ -2979,8 +2981,6 @@  static int ext4_lazyinit_thread(void *arg)
 		}
 		mutex_unlock(&eli->li_list_mtx);
 
-		try_to_freeze();
-
 		cur = jiffies;
 		if ((time_after_eq(cur, next_wakeup)) ||
 		    (MAX_JIFFY_OFFSET == next_wakeup)) {
@@ -4926,7 +4926,7 @@  static int ext4_unfreeze(struct super_block *sb)
 		ext4_set_feature_journal_needs_recovery(sb);
 	}
 
-	ext4_commit_super(sb, 1);
+	ext4_commit_super(sb, 0);
 	return 0;
 }
 
@@ -5771,7 +5771,8 @@  static struct file_system_type ext4_fs_type = {
 	.name		= "ext4",
 	.mount		= ext4_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV |
+			  FS_FREEZE_ON_SUSPEND,
 };
 MODULE_ALIAS_FS("ext4");