Message ID: 20250327140613.25178-2-James.Bottomley@HansenPartnership.com (mailing list archive)
State: New
Series: vfs freeze/thaw on suspend/resume
On Thu, 2025-03-27 at 10:06 -0400, James Bottomley wrote:
[...]
> -static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
> +static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader,
> +			      bool freeze)
> {
> 	DEFINE_WAIT_FUNC(wq_entry, percpu_rwsem_wake_function);
> 	bool wait;
> @@ -156,7 +157,8 @@ static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
> 	spin_unlock_irq(&sem->waiters.lock);
> 
> 	while (wait) {
> -		set_current_state(TASK_UNINTERRUPTIBLE);
> +		set_current_state(TASK_UNINTERRUPTIBLE |
> +				  freeze ? TASK_FREEZABLE : 0);

This is a bit embarrassing, the bug I've been chasing is here: the ?:
operator is lower in precedence than |, meaning this expression always
evaluates to TASK_FREEZABLE and nothing else (which is why the process
goes into R state and never wakes up).

Let me fix that and redo all the testing.

Regards,

James
On Mon, Mar 31, 2025 at 03:51:43PM -0400, James Bottomley wrote:
> On Thu, 2025-03-27 at 10:06 -0400, James Bottomley wrote:
> [...]
> > -static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
> > +static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader,
> > +			      bool freeze)
> > {
> > 	DEFINE_WAIT_FUNC(wq_entry, percpu_rwsem_wake_function);
> > 	bool wait;
> > @@ -156,7 +157,8 @@ static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
> > 	spin_unlock_irq(&sem->waiters.lock);
> > 
> > 	while (wait) {
> > -		set_current_state(TASK_UNINTERRUPTIBLE);
> > +		set_current_state(TASK_UNINTERRUPTIBLE |
> > +				  freeze ? TASK_FREEZABLE : 0);
> 
> This is a bit embarrassing, the bug I've been chasing is here: the ?
> operator is lower in precedence than | meaning this expression always
> evaluates to TASK_FREEZABLE and nothing else (which is why the process
> goes into R state and never wakes up).
> 
> Let me fix that and redo all the testing.

I don't think that's it. I think you're missing making pagefault writers
such as systemd-journald freezable:

diff --git a/include/linux/fs.h b/include/linux/fs.h
index b379a46b5576..528e73f192ac 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1782,7 +1782,8 @@ static inline void __sb_end_write(struct super_block *sb, int level)
 static inline void __sb_start_write(struct super_block *sb, int level)
 {
 	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
-				   level == SB_FREEZE_WRITE);
+				   (level == SB_FREEZE_WRITE ||
+				    level == SB_FREEZE_PAGEFAULT));
 }
 
 static inline bool __sb_start_write_trylock(struct super_block *sb, int level)
On Tue, 2025-04-01 at 01:32 +0200, Christian Brauner wrote:
> On Mon, Mar 31, 2025 at 03:51:43PM -0400, James Bottomley wrote:
> > On Thu, 2025-03-27 at 10:06 -0400, James Bottomley wrote:
> > [...]
> > > -static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
> > > +static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader,
> > > +			      bool freeze)
> > > {
> > > 	DEFINE_WAIT_FUNC(wq_entry, percpu_rwsem_wake_function);
> > > 	bool wait;
> > > @@ -156,7 +157,8 @@ static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
> > > 	spin_unlock_irq(&sem->waiters.lock);
> > > 
> > > 	while (wait) {
> > > -		set_current_state(TASK_UNINTERRUPTIBLE);
> > > +		set_current_state(TASK_UNINTERRUPTIBLE |
> > > +				  freeze ? TASK_FREEZABLE : 0);
> > 
> > This is a bit embarrassing, the bug I've been chasing is here: the ?
> > operator is lower in precedence than | meaning this expression always
> > evaluates to TASK_FREEZABLE and nothing else (which is why the process
> > goes into R state and never wakes up).
> > 
> > Let me fix that and redo all the testing.
> 
> I don't think that's it. I think you're missing making pagefault
> writers such as systemd-journald freezable:
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b379a46b5576..528e73f192ac 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1782,7 +1782,8 @@ static inline void __sb_end_write(struct super_block *sb, int level)
>  static inline void __sb_start_write(struct super_block *sb, int level)
>  {
> 	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
> -				   level == SB_FREEZE_WRITE);
> +				   (level == SB_FREEZE_WRITE ||
> +				    level == SB_FREEZE_PAGEFAULT));
>  }

Yes, I was about to tell Jan that the condition here simply needs to be
true. All our rwsem levels need to be freezable to avoid a hibernation
failure.

Regards,

James
On Mon 31-03-25 21:13:20, James Bottomley wrote:
> On Tue, 2025-04-01 at 01:32 +0200, Christian Brauner wrote:
> > On Mon, Mar 31, 2025 at 03:51:43PM -0400, James Bottomley wrote:
> > > On Thu, 2025-03-27 at 10:06 -0400, James Bottomley wrote:
> > > [...]
> > > > -static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
> > > > +static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader,
> > > > +			      bool freeze)
> > > > {
> > > > 	DEFINE_WAIT_FUNC(wq_entry, percpu_rwsem_wake_function);
> > > > 	bool wait;
> > > > @@ -156,7 +157,8 @@ static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
> > > > 	spin_unlock_irq(&sem->waiters.lock);
> > > > 
> > > > 	while (wait) {
> > > > -		set_current_state(TASK_UNINTERRUPTIBLE);
> > > > +		set_current_state(TASK_UNINTERRUPTIBLE |
> > > > +				  freeze ? TASK_FREEZABLE : 0);
> > > 
> > > This is a bit embarrassing, the bug I've been chasing is here: the ?
> > > operator is lower in precedence than | meaning this expression always
> > > evaluates to TASK_FREEZABLE and nothing else (which is why the process
> > > goes into R state and never wakes up).
> > > 
> > > Let me fix that and redo all the testing.
> > 
> > I don't think that's it. I think you're missing making pagefault
> > writers such as systemd-journald freezable:
> > 
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index b379a46b5576..528e73f192ac 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1782,7 +1782,8 @@ static inline void __sb_end_write(struct super_block *sb, int level)
> >  static inline void __sb_start_write(struct super_block *sb, int level)
> >  {
> > 	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
> > -				   level == SB_FREEZE_WRITE);
> > +				   (level == SB_FREEZE_WRITE ||
> > +				    level == SB_FREEZE_PAGEFAULT));
> >  }
> 
> Yes, I was about to tell Jan that the condition here simply needs to be
> true. All our rwsem levels need to be freezable to avoid a hibernation
> failure.

So there is one snag with this. The SB_FREEZE_PAGEFAULT level is acquired
under mmap_sem, and the SB_FREEZE_INTERNAL level is possibly acquired under
some other filesystem locks. So if you freeze the filesystem, a task can
block on the frozen filesystem with e.g. mmap_sem held, and if some other
task then blocks on grabbing that mmap_sem, hibernation fails because we'll
be unable to hibernate the task waiting for mmap_sem. So if you'd like to
completely avoid these hibernation failures, you'd have to make a slew of
filesystem-related locks use freezable sleeping. I don't think that's
feasible.

I was hoping that failures due to the SB_FREEZE_PAGEFAULT level not being
freezable would be rare enough, but you've proven they are quite frequent.
We can try making the SB_FREEZE_PAGEFAULT level (or even SB_FREEZE_INTERNAL)
freezable and see whether that works well enough...

								Honza
On Tue, Apr 01, 2025 at 01:20:37PM +0200, Jan Kara wrote:
> On Mon 31-03-25 21:13:20, James Bottomley wrote:
> > On Tue, 2025-04-01 at 01:32 +0200, Christian Brauner wrote:
> > > On Mon, Mar 31, 2025 at 03:51:43PM -0400, James Bottomley wrote:
> > > > On Thu, 2025-03-27 at 10:06 -0400, James Bottomley wrote:
> > > > [...]
> > > > > -static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
> > > > > +static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader,
> > > > > +			      bool freeze)
> > > > > {
> > > > > 	DEFINE_WAIT_FUNC(wq_entry, percpu_rwsem_wake_function);
> > > > > 	bool wait;
> > > > > @@ -156,7 +157,8 @@ static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
> > > > > 	spin_unlock_irq(&sem->waiters.lock);
> > > > > 
> > > > > 	while (wait) {
> > > > > -		set_current_state(TASK_UNINTERRUPTIBLE);
> > > > > +		set_current_state(TASK_UNINTERRUPTIBLE |
> > > > > +				  freeze ? TASK_FREEZABLE : 0);
> > > > 
> > > > This is a bit embarrassing, the bug I've been chasing is here: the ?
> > > > operator is lower in precedence than | meaning this expression always
> > > > evaluates to TASK_FREEZABLE and nothing else (which is why the process
> > > > goes into R state and never wakes up).
> > > > 
> > > > Let me fix that and redo all the testing.
> > > 
> > > I don't think that's it. I think you're missing making pagefault
> > > writers such as systemd-journald freezable:
> > > 
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index b379a46b5576..528e73f192ac 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -1782,7 +1782,8 @@ static inline void __sb_end_write(struct super_block *sb, int level)
> > >  static inline void __sb_start_write(struct super_block *sb, int level)
> > >  {
> > > 	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
> > > -				   level == SB_FREEZE_WRITE);
> > > +				   (level == SB_FREEZE_WRITE ||
> > > +				    level == SB_FREEZE_PAGEFAULT));
> > >  }
> > 
> > Yes, I was about to tell Jan that the condition here simply needs to be
> > true. All our rwsem levels need to be freezable to avoid a hibernation
> > failure.
> 
> So there is one snag with this. SB_FREEZE_PAGEFAULT level is acquired under
> mmap_sem, SB_FREEZE_INTERNAL level is possibly acquired under some other
> filesystem locks. So if you freeze the filesystem, a task can block on
> frozen filesystem with e.g. mmap_sem held and if some other task then

Yeah, I wondered about that yesterday.

> blocks on grabbing that mmap_sem, hibernation fails because we'll be unable
> to hibernate the task waiting for mmap_sem. So if you'd like to completely
> avoid these hibernation failures, you'd have to make a slew of filesystem
> related locks use freezable sleeping. I don't think that's feasible.
> 
> I was hoping that failures due to SB_FREEZE_PAGEFAULT level not being
> freezable would be rare enough but you've proven they are quite frequent.
> We can try making SB_FREEZE_PAGEFAULT level (or even SB_FREEZE_INTERNAL)
> freezable and see whether that works good enough...

I think that's fine and we'll see whether this causes a lot of issues.
I've got the patchset written in a way now that userspace can just
enable or disable freeze during migration.
On Tue, 2025-04-01 at 13:20 +0200, Jan Kara wrote:
> On Mon 31-03-25 21:13:20, James Bottomley wrote:
> > On Tue, 2025-04-01 at 01:32 +0200, Christian Brauner wrote:
[...]
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index b379a46b5576..528e73f192ac 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -1782,7 +1782,8 @@ static inline void __sb_end_write(struct super_block *sb, int level)
> > >  static inline void __sb_start_write(struct super_block *sb, int level)
> > >  {
> > > 	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
> > > -				   level == SB_FREEZE_WRITE);
> > > +				   (level == SB_FREEZE_WRITE ||
> > > +				    level == SB_FREEZE_PAGEFAULT));
> > >  }
> > 
> > Yes, I was about to tell Jan that the condition here simply needs
> > to be true. All our rwsem levels need to be freezable to avoid a
> > hibernation failure.
> 
> So there is one snag with this. SB_FREEZE_PAGEFAULT level is acquired
> under mmap_sem, SB_FREEZE_INTERNAL level is possibly acquired under
> some other filesystem locks.

Just for SB_FREEZE_INTERNAL: I think there's no case where
sb_start_intwrite() can ever hold in D wait, because by the time we
acquire the semaphore for write, the internal freeze_fs should have been
called and the filesystem should have quiesced itself. On the other
hand, if that theory is true, there's no real need for
sb_start_intwrite() at all, because it can never conflict.

> So if you freeze the filesystem, a task can block on frozen
> filesystem with e.g. mmap_sem held and if some other task then blocks
> on grabbing that mmap_sem, hibernation fails because we'll be unable
> to hibernate the task waiting for mmap_sem. So if you'd like to
> completely avoid these hibernation failures, you'd have to make a
> slew of filesystem related locks use freezable sleeping. I don't
> think that's feasible.

I wouldn't see that, because I'm on x86_64 and that takes the vma_lock
in page faults, not the mmap_lock. The granularity of all these locks is
process level, so it's hard to see what they'd be racing with ... even
if I conjecture two threads trying to write to something, they'd have to
have some internal co-ordination, which would likely prevent the second
one from writing if the first got stuck on the page fault.

> I was hoping that failures due to SB_FREEZE_PAGEFAULT level not being
> freezable would be rare enough but you've proven they are quite
> frequent. We can try making SB_FREEZE_PAGEFAULT level (or even
> SB_FREEZE_INTERNAL) freezable and see whether that works good
> enough...

I'll try to construct a more severe test than systemd-journald ... it
looks to be single threaded in its operation.

Regards,

James
On Tue 01-04-25 08:52:02, James Bottomley wrote:
> On Tue, 2025-04-01 at 13:20 +0200, Jan Kara wrote:
> > On Mon 31-03-25 21:13:20, James Bottomley wrote:
> > > On Tue, 2025-04-01 at 01:32 +0200, Christian Brauner wrote:
> [...]
> > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > index b379a46b5576..528e73f192ac 100644
> > > > --- a/include/linux/fs.h
> > > > +++ b/include/linux/fs.h
> > > > @@ -1782,7 +1782,8 @@ static inline void __sb_end_write(struct super_block *sb, int level)
> > > >  static inline void __sb_start_write(struct super_block *sb, int level)
> > > >  {
> > > > 	percpu_down_read_freezable(sb->s_writers.rw_sem + level - 1,
> > > > -				   level == SB_FREEZE_WRITE);
> > > > +				   (level == SB_FREEZE_WRITE ||
> > > > +				    level == SB_FREEZE_PAGEFAULT));
> > > >  }
> > > 
> > > Yes, I was about to tell Jan that the condition here simply needs
> > > to be true. All our rwsem levels need to be freezable to avoid a
> > > hibernation failure.
> > 
> > So there is one snag with this. SB_FREEZE_PAGEFAULT level is acquired
> > under mmap_sem, SB_FREEZE_INTERNAL level is possibly acquired under
> > some other filesystem locks.
> 
> Just for SB_FREEZE_INTERNAL, I think there's no case of
> sb_start_intwrite() that can ever hold in D wait because by the time we
> acquire the semaphore for write, the internal freeze_fs should have
> been called and the filesystem should have quiesced itself. On the
> other hand, if that theory itself is true, there's no real need for
> sb_start_intwrite() at all because it can never conflict.

This is not true. Sure, userspace should all be blocked and dirty pages
written back, but you still have filesystem background tasks like lazy
initialization of inode tables, inode garbage collection, and regular
lazy updates of statistics in the superblock. These generally happen
from kthreads / workqueues, and they can still be scheduled and executed
although freeze_super() has started blocking the SB_FREEZE_WRITE and
SB_FREEZE_PAGEFAULT levels... And generally this freeze level is there
exactly because it needs to be acquired from a locking context which
doesn't allow usage of the SB_FREEZE_WRITE or SB_FREEZE_PAGEFAULT
levels.

> > So if you freeze the filesystem, a task can block on frozen
> > filesystem with e.g. mmap_sem held and if some other task then blocks
> > on grabbing that mmap_sem, hibernation fails because we'll be unable
> > to hibernate the task waiting for mmap_sem. So if you'd like to
> > completely avoid these hibernation failures, you'd have to make a
> > slew of filesystem related locks use freezable sleeping. I don't
> > think that's feasible.
> 
> I wouldn't see that because I'm on x86_64 and that takes the vma_lock
> in page faults not the mmap_lock. The granularity of all these locks
> is process level, so it's hard to see what they'd be racing with ...

I agree that because of vma_lock it would be much harder to see this.
But as far as I remember, mmap_sem is still a fallback option when we
race with VMA modification, even for x86, so this problem is possible to
hit, just much more unlikely.

> even if I conjecture two threads trying to write to something, they'd
> have to have some internal co-ordination which would likely prevent the
> second one from writing if the first got stuck on the page fault.
> 
> > I was hoping that failures due to SB_FREEZE_PAGEFAULT level not being
> > freezable would be rare enough but you've proven they are quite
> > frequent. We can try making SB_FREEZE_PAGEFAULT level (or even
> > SB_FREEZE_INTERNAL) freezable and see whether that works good
> > enough...
> 
> I'll try to construct a more severe test than systemd-journald ... it
> looks to be single threaded in its operation.

OK, thanks!

								Honza
diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index c012df33a9f0..a55fe709b832 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -42,9 +42,10 @@ is_static struct percpu_rw_semaphore name = {		\
 #define DEFINE_STATIC_PERCPU_RWSEM(name)	\
 	__DEFINE_PERCPU_RWSEM(name, static)
 
-extern bool __percpu_down_read(struct percpu_rw_semaphore *, bool);
+extern bool __percpu_down_read(struct percpu_rw_semaphore *, bool, bool);
 
-static inline void percpu_down_read(struct percpu_rw_semaphore *sem)
+static inline void percpu_down_read_internal(struct percpu_rw_semaphore *sem,
+					     bool freezable)
 {
 	might_sleep();
 
@@ -62,7 +63,7 @@ static inline void percpu_down_read(struct percpu_rw_semaphore *sem)
 	if (likely(rcu_sync_is_idle(&sem->rss)))
 		this_cpu_inc(*sem->read_count);
 	else
-		__percpu_down_read(sem, false); /* Unconditional memory barrier */
+		__percpu_down_read(sem, false, freezable); /* Unconditional memory barrier */
 	/*
 	 * The preempt_enable() prevents the compiler from
 	 * bleeding the critical section out.
@@ -70,6 +71,17 @@ static inline void percpu_down_read(struct percpu_rw_semaphore *sem)
 	preempt_enable();
 }
 
+static inline void percpu_down_read(struct percpu_rw_semaphore *sem)
+{
+	percpu_down_read_internal(sem, false);
+}
+
+static inline void percpu_down_read_freezable(struct percpu_rw_semaphore *sem,
+					      bool freeze)
+{
+	percpu_down_read_internal(sem, freeze);
+}
+
 static inline bool percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
 {
 	bool ret = true;
@@ -81,7 +93,7 @@ static inline bool percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
 	if (likely(rcu_sync_is_idle(&sem->rss)))
 		this_cpu_inc(*sem->read_count);
 	else
-		ret = __percpu_down_read(sem, true); /* Unconditional memory barrier */
+		ret = __percpu_down_read(sem, true, false); /* Unconditional memory barrier */
 	preempt_enable();
 	/*
 	 * The barrier() from preempt_enable() prevents the compiler from
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 6083883c4fe0..890837b73476 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -138,7 +138,8 @@ static int percpu_rwsem_wake_function(struct wait_queue_entry *wq_entry,
 	return !reader; /* wake (readers until) 1 writer */
 }
 
-static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
+static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader,
+			      bool freeze)
 {
 	DEFINE_WAIT_FUNC(wq_entry, percpu_rwsem_wake_function);
 	bool wait;
@@ -156,7 +157,8 @@ static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
 	spin_unlock_irq(&sem->waiters.lock);
 
 	while (wait) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
+		set_current_state(TASK_UNINTERRUPTIBLE |
+				  freeze ? TASK_FREEZABLE : 0);
 		if (!smp_load_acquire(&wq_entry.private))
 			break;
 		schedule();
@@ -164,7 +166,8 @@ static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader)
 	__set_current_state(TASK_RUNNING);
 }
 
-bool __sched __percpu_down_read(struct percpu_rw_semaphore *sem, bool try)
+bool __sched __percpu_down_read(struct percpu_rw_semaphore *sem, bool try,
+				bool freeze)
 {
 	if (__percpu_down_read_trylock(sem))
 		return true;
@@ -174,7 +177,7 @@ bool __sched __percpu_down_read(struct percpu_rw_semaphore *sem, bool try)
 	trace_contention_begin(sem, LCB_F_PERCPU | LCB_F_READ);
 	preempt_enable();
 
-	percpu_rwsem_wait(sem, /* .reader = */ true);
+	percpu_rwsem_wait(sem, /* .reader = */ true, freeze);
 
 	preempt_disable();
 	trace_contention_end(sem, 0);
@@ -237,7 +240,7 @@ void __sched percpu_down_write(struct percpu_rw_semaphore *sem)
 	 */
 	if (!__percpu_down_write_trylock(sem)) {
 		trace_contention_begin(sem, LCB_F_PERCPU | LCB_F_WRITE);
-		percpu_rwsem_wait(sem, /* .reader = */ false);
+		percpu_rwsem_wait(sem, /* .reader = */ false, false);
 		contended = true;
 	}
Percpu-rwsems are used for superblock locking. However, we know the
read percpu-rwsem we take for sb_start_write() on a frozen filesystem
needs not to inhibit the system from suspending or hibernating. That
means it needs to wait with TASK_UNINTERRUPTIBLE | TASK_FREEZABLE.

Introduce a new percpu_down_read_freezable() that allows us to control
whether TASK_FREEZABLE is added to the wait flags.

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
---
Since this is an RFC, added the percpu-rwsem maintainers for
information and guidance, to check if we're on the right track or
whether they would prefer an alternative API.
---
 include/linux/percpu-rwsem.h  | 20 ++++++++++++++++----
 kernel/locking/percpu-rwsem.c | 13 ++++++++-----
 2 files changed, 24 insertions(+), 9 deletions(-)