btrfs: zoned: mark relocation as writing

Message ID	e935669f435882ba179812a955bcbb89d964db26.1642401849.git.naohiro.aota@wdc.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-btrfs-owner@kernel.org> IronPort-SDR: 1wvJnzwawv1ryKLy/c9ICg53MRDKIxSyoXz6w9MHDyFzIcgJvY+DsabReA23h+csuCmZr2KYU8 rzQQTf1ExrtKXl7HGQXHjykgh8TxXXOf5QqhDpJvXaBZapSSR60MB4j28D739dL2yVfIOjjW1D BdVbJDji77lHhuXFnCyB2kfgsYe/wT9uhfEUByYFW0BG9EjMh41+9PwxItRus9SpYi2NoTZP07 HIJvxpCaATmDHUAoUPTdBxZ3UtX5nnn8wHFc7b1PYTlAhpU+FdN8RTdrCQmP+m+vuYIFuPnR43 80cWt2wUzDA9QRWhimUthftU IronPort-SDR: n/wbqX9d4vw9j8n6wAqbG0dFefT0+DkBDbSRkgPL+kyHKnbbxzsG/pBXqHwHoNRi9giLJBxW6l nycqKPjQzMtK7qtxd3GaVd+ChGsa/solk3PbMk+tkmMTJTkiNbM0bp3wCm9x+JyXJiETjWsLFY xu0QDjRu5U55u/hh4932DgdJpxClpyNfEUlJWUIABcvSuxE2oQAwJLTeNH8qN2W6UTTbAObr+R n2Nkd+yDm4rLdyWD4fEcxWWA+LaZH9QP8TtFz+RRS3Os7ov4lm5qprPVu6f3MuWViS5kGQiig1 COw= WDCIronportException: Internal From: Naohiro Aota <naohiro.aota@wdc.com> To: David Sterba <dsterba@suse.com> Cc: linux-btrfs@vger.kernel.org, johannes.thumshirn@wdc.com, Naohiro Aota <naohiro.aota@wdc.com> Subject: [PATCH] btrfs: zoned: mark relocation as writing Date: Mon, 17 Jan 2022 15:56:52 +0900 Message-Id: <e935669f435882ba179812a955bcbb89d964db26.1642401849.git.naohiro.aota@wdc.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	btrfs: zoned: mark relocation as writing \| expand btrfs: zoned: mark relocation as writing

Message ID

e935669f435882ba179812a955bcbb89d964db26.1642401849.git.naohiro.aota@wdc.com (mailing list archive)

State

New, archived

Headers

IronPort-SDR: 
 1wvJnzwawv1ryKLy/c9ICg53MRDKIxSyoXz6w9MHDyFzIcgJvY+DsabReA23h+csuCmZr2KYU8
 rzQQTf1ExrtKXl7HGQXHjykgh8TxXXOf5QqhDpJvXaBZapSSR60MB4j28D739dL2yVfIOjjW1D
 BdVbJDji77lHhuXFnCyB2kfgsYe/wT9uhfEUByYFW0BG9EjMh41+9PwxItRus9SpYi2NoTZP07
 HIJvxpCaATmDHUAoUPTdBxZ3UtX5nnn8wHFc7b1PYTlAhpU+FdN8RTdrCQmP+m+vuYIFuPnR43
 80cWt2wUzDA9QRWhimUthftU
IronPort-SDR: 
 n/wbqX9d4vw9j8n6wAqbG0dFefT0+DkBDbSRkgPL+kyHKnbbxzsG/pBXqHwHoNRi9giLJBxW6l
 nycqKPjQzMtK7qtxd3GaVd+ChGsa/solk3PbMk+tkmMTJTkiNbM0bp3wCm9x+JyXJiETjWsLFY
 xu0QDjRu5U55u/hh4932DgdJpxClpyNfEUlJWUIABcvSuxE2oQAwJLTeNH8qN2W6UTTbAObr+R
 n2Nkd+yDm4rLdyWD4fEcxWWA+LaZH9QP8TtFz+RRS3Os7ov4lm5qprPVu6f3MuWViS5kGQiig1
 COw=
WDCIronportException: Internal
From: Naohiro Aota <naohiro.aota@wdc.com>
To: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org, johannes.thumshirn@wdc.com,
        Naohiro Aota <naohiro.aota@wdc.com>
Subject: [PATCH] btrfs: zoned: mark relocation as writing
Date: Mon, 17 Jan 2022 15:56:52 +0900
Message-Id: 
 <e935669f435882ba179812a955bcbb89d964db26.1642401849.git.naohiro.aota@wdc.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

btrfs: zoned: mark relocation as writing | expand

Commit Message

Naohiro Aota Jan. 17, 2022, 6:56 a.m. UTC

There is a hung_task issue with running generic/068 on an SMR
device. The hang occurs while a process is trying to thaw the
filesystem. The process is trying to take sb->s_umount to thaw the
FS. The lock is held by fsstress, which calls btrfs_sync_fs() and is
waiting for an ordered extent to finish. However, as the FS is frozen,
the ordered extent never finish.

Having an ordered extent while the FS is frozen is the root cause of
the hang. The ordered extent is initiated from btrfs_relocate_chunk()
which is called from btrfs_reclaim_bgs_work().

This commit add sb_*_write() around btrfs_relocate_chunk() call
site. For the usual "btrfs balance" command, we already call it with
mnt_want_file() in btrfs_ioctl_balance().

Additionally, add an ASSERT in btrfs_relocate_chunk() to check it is
properly called.

Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones")
Cc: stable@vger.kernel.org # 5.13+
Link: https://github.com/naota/linux/issues/56
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c | 3 +++
 fs/btrfs/volumes.c     | 5 +++++
 2 files changed, 8 insertions(+)

Comments

Johannes Thumshirn Jan. 17, 2022, 10:36 a.m. UTC | #1

OK as somehow Thunderbird is broken for me, let’s see if Outlook works: 

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

David Sterba Jan. 17, 2022, 2:40 p.m. UTC | #2

On Mon, Jan 17, 2022 at 03:56:52PM +0900, Naohiro Aota wrote:
> There is a hung_task issue with running generic/068 on an SMR
> device. The hang occurs while a process is trying to thaw the
> filesystem. The process is trying to take sb->s_umount to thaw the
> FS. The lock is held by fsstress, which calls btrfs_sync_fs() and is
> waiting for an ordered extent to finish. However, as the FS is frozen,
> the ordered extent never finish.
> 
> Having an ordered extent while the FS is frozen is the root cause of
> the hang. The ordered extent is initiated from btrfs_relocate_chunk()
> which is called from btrfs_reclaim_bgs_work().
> 
> This commit add sb_*_write() around btrfs_relocate_chunk() call
> site. For the usual "btrfs balance" command, we already call it with
> mnt_want_file() in btrfs_ioctl_balance().
> 
> Additionally, add an ASSERT in btrfs_relocate_chunk() to check it is
> properly called.
> 
> Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones")
> Cc: stable@vger.kernel.org # 5.13+
> Link: https://github.com/naota/linux/issues/56
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/btrfs/block-group.c | 3 +++
>  fs/btrfs/volumes.c     | 5 +++++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 68feabc83a27..913fee0daafd 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1516,11 +1516,13 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
>  	if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE))
>  		return;
>  
> +	sb_start_write(fs_info->sb);
>  	/*
>  	 * Long running balances can keep us blocked here for eternity, so
>  	 * simply skip reclaim if we're unable to get the mutex.
>  	 */
>  	if (!mutex_trylock(&fs_info->reclaim_bgs_lock)) {
> +		sb_end_write(fs_info->sb);

Should this be some sort of try lock as well? IIRC sb_start_write can
block, so this would block the whole thread.

>  		btrfs_exclop_finish(fs_info);
>  		return;
>  	}
> @@ -1595,6 +1597,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
>  	}
>  	spin_unlock(&fs_info->unused_bgs_lock);
>  	mutex_unlock(&fs_info->reclaim_bgs_lock);
> +	sb_end_write(fs_info->sb);
>  	btrfs_exclop_finish(fs_info);
>  }
>  
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index b07d382d53a8..3975511f3201 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -3251,6 +3251,9 @@ int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
>  	u64 length;
>  	int ret;
>  
> +	/* Assert we called sb_start_write(), not to race with FS freezing */
> +	lockdep_assert_held_read(fs_info->sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);

This seems to be peeking into the internals and I can't say if this is
a good idea or not, some wrappe would make more sense.

> +
>  	/*
>  	 * Prevent races with automatic removal of unused block groups.
>  	 * After we relocate and before we remove the chunk with offset
> @@ -8306,6 +8309,7 @@ static int relocating_repair_kthread(void *data)
>  		return -EBUSY;
>  	}
>  
> +	sb_start_write(fs_info->sb);

I was wondering if the trylock semantics should be here as well but
proably not, because the next lock is a big one too.

>  	mutex_lock(&fs_info->reclaim_bgs_lock);
>  
>  	/* Ensure block group still exists */
> @@ -8329,6 +8333,7 @@ static int relocating_repair_kthread(void *data)
>  	if (cache)
>  		btrfs_put_block_group(cache);
>  	mutex_unlock(&fs_info->reclaim_bgs_lock);
> +	sb_end_write(fs_info->sb);
>  	btrfs_exclop_finish(fs_info);
>  
>  	return ret;
> -- 
> 2.34.1

Naohiro Aota Jan. 24, 2022, 4:15 a.m. UTC | #3

On Mon, Jan 17, 2022 at 03:40:14PM +0100, David Sterba wrote:
> On Mon, Jan 17, 2022 at 03:56:52PM +0900, Naohiro Aota wrote:
> > There is a hung_task issue with running generic/068 on an SMR
> > device. The hang occurs while a process is trying to thaw the
> > filesystem. The process is trying to take sb->s_umount to thaw the
> > FS. The lock is held by fsstress, which calls btrfs_sync_fs() and is
> > waiting for an ordered extent to finish. However, as the FS is frozen,
> > the ordered extent never finish.
> > 
> > Having an ordered extent while the FS is frozen is the root cause of
> > the hang. The ordered extent is initiated from btrfs_relocate_chunk()
> > which is called from btrfs_reclaim_bgs_work().
> > 
> > This commit add sb_*_write() around btrfs_relocate_chunk() call
> > site. For the usual "btrfs balance" command, we already call it with
> > mnt_want_file() in btrfs_ioctl_balance().
> > 
> > Additionally, add an ASSERT in btrfs_relocate_chunk() to check it is
> > properly called.
> > 
> > Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones")
> > Cc: stable@vger.kernel.org # 5.13+
> > Link: https://github.com/naota/linux/issues/56
> > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > ---
> >  fs/btrfs/block-group.c | 3 +++
> >  fs/btrfs/volumes.c     | 5 +++++
> >  2 files changed, 8 insertions(+)
> > 
> > diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> > index 68feabc83a27..913fee0daafd 100644
> > --- a/fs/btrfs/block-group.c
> > +++ b/fs/btrfs/block-group.c
> > @@ -1516,11 +1516,13 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> >  	if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE))
> >  		return;
> >  
> > +	sb_start_write(fs_info->sb);
> >  	/*
> >  	 * Long running balances can keep us blocked here for eternity, so
> >  	 * simply skip reclaim if we're unable to get the mutex.
> >  	 */
> >  	if (!mutex_trylock(&fs_info->reclaim_bgs_lock)) {
> > +		sb_end_write(fs_info->sb);
> 
> Should this be some sort of try lock as well? IIRC sb_start_write can
> block, so this would block the whole thread.

Contrary to the word "write," sb_start_write() takes a read lock. And,
the write side is only taken by the FS freeze path. So, I don't think
it's necessary to make it try locking.

I just noticed sb_start_write() should come before
btrfs_exclop_balance(), so that it has the same order as in
btrfs_ioctl_balance(). Will fix that.

> >  		btrfs_exclop_finish(fs_info);
> >  		return;
> >  	}
> > @@ -1595,6 +1597,7 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
> >  	}
> >  	spin_unlock(&fs_info->unused_bgs_lock);
> >  	mutex_unlock(&fs_info->reclaim_bgs_lock);
> > +	sb_end_write(fs_info->sb);
> >  	btrfs_exclop_finish(fs_info);
> >  }
> >  
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index b07d382d53a8..3975511f3201 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -3251,6 +3251,9 @@ int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
> >  	u64 length;
> >  	int ret;
> >  
> > +	/* Assert we called sb_start_write(), not to race with FS freezing */
> > +	lockdep_assert_held_read(fs_info->sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
> 
> This seems to be peeking into the internals and I can't say if this is
> a good idea or not, some wrappe would make more sense.

I agree. I'll make a common wrapper in the next version.

> > +
> >  	/*
> >  	 * Prevent races with automatic removal of unused block groups.
> >  	 * After we relocate and before we remove the chunk with offset
> > @@ -8306,6 +8309,7 @@ static int relocating_repair_kthread(void *data)
> >  		return -EBUSY;
> >  	}
> >  
> > +	sb_start_write(fs_info->sb);
> 
> I was wondering if the trylock semantics should be here as well but
> proably not, because the next lock is a big one too.

I think this can be intact as shown as above.

> >  	mutex_lock(&fs_info->reclaim_bgs_lock);
> >  
> >  	/* Ensure block group still exists */
> > @@ -8329,6 +8333,7 @@ static int relocating_repair_kthread(void *data)
> >  	if (cache)
> >  		btrfs_put_block_group(cache);
> >  	mutex_unlock(&fs_info->reclaim_bgs_lock);
> > +	sb_end_write(fs_info->sb);
> >  	btrfs_exclop_finish(fs_info);
> >  
> >  	return ret;
> > -- 
> > 2.34.1

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 68feabc83a27..913fee0daafd 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1516,11 +1516,13 @@  void btrfs_reclaim_bgs_work(struct work_struct *work)
 	if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE))
 		return;
 
+	sb_start_write(fs_info->sb);
 	/*
 	 * Long running balances can keep us blocked here for eternity, so
 	 * simply skip reclaim if we're unable to get the mutex.
 	 */
 	if (!mutex_trylock(&fs_info->reclaim_bgs_lock)) {
+		sb_end_write(fs_info->sb);
 		btrfs_exclop_finish(fs_info);
 		return;
 	}
@@ -1595,6 +1597,7 @@  void btrfs_reclaim_bgs_work(struct work_struct *work)
 	}
 	spin_unlock(&fs_info->unused_bgs_lock);
 	mutex_unlock(&fs_info->reclaim_bgs_lock);
+	sb_end_write(fs_info->sb);
 	btrfs_exclop_finish(fs_info);
 }
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b07d382d53a8..3975511f3201 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3251,6 +3251,9 @@  int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
 	u64 length;
 	int ret;
 
+	/* Assert we called sb_start_write(), not to race with FS freezing */
+	lockdep_assert_held_read(fs_info->sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
+
 	/*
 	 * Prevent races with automatic removal of unused block groups.
 	 * After we relocate and before we remove the chunk with offset
@@ -8306,6 +8309,7 @@  static int relocating_repair_kthread(void *data)
 		return -EBUSY;
 	}
 
+	sb_start_write(fs_info->sb);
 	mutex_lock(&fs_info->reclaim_bgs_lock);
 
 	/* Ensure block group still exists */
@@ -8329,6 +8333,7 @@  static int relocating_repair_kthread(void *data)
 	if (cache)
 		btrfs_put_block_group(cache);
 	mutex_unlock(&fs_info->reclaim_bgs_lock);
+	sb_end_write(fs_info->sb);
 	btrfs_exclop_finish(fs_info);
 
 	return ret;

btrfs: zoned: mark relocation as writing

Commit Message

Comments

Patch