
btrfs: zoned: fix calc_available_free_space for zoned mode

Message ID 0803c4de21aac935169b8289de1cb71c695452be.1719326990.git.naohiro.aota@wdc.com (mailing list archive)
State New, archived
Series btrfs: zoned: fix calc_available_free_space for zoned mode

Commit Message

Naohiro Aota June 25, 2024, 2:58 p.m. UTC
calc_available_free_space() returns the total size of metadata (or system)
block groups, which can be allocated from unallocated disk space. The logic
is wrong on zoned mode in two places.

First, the calculation of data_chunk_size is wrong. On zoned mode, we
always allocate one zone as one chunk and never allocate a partial
zone. So, we should use zone_size (= data_sinfo->chunk_size) as it is.

Second, the result "avail" may not be zone aligned. Since we always
allocate one zone as one chunk on zoned mode, returning a byte count
that is not aligned to the zone size overstates the really usable
space and puts less pressure on the async metadata reclaim process.

This is serious when the filesystem is nearly full on a device with a
large zone size. Allowing too much over-commit results in less async
reclaim work and ends up in ENOSPC. We can align "avail" down to the
zone size to avoid that.

Fixes: cb6cbab79055 ("btrfs: adjust overcommit logic when very close to full")
CC: stable@vger.kernel.org # 6.9
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
As I mentioned in [1], I based this patch on for-next but before
Boris's "dynamic block_group reclaim threshold" series, because that
makes it easier to backport this patch. I'll resend this patch rebased
on for-next if you'd prefer that.

[1] https://lore.kernel.org/linux-btrfs/avjakfevy3gtwcgxugnzwsfkld35tfgktd5ywpz3kac6gfraxh@uic6zl3jmgbl/
---
 fs/btrfs/space-info.c | 24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)
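
For illustration only (not part of the patch): a minimal userspace
sketch of the first fix, with hypothetical sizes (a 2 GiB zone on a
10 TiB device). The names mirror the kernel ones, but the program is
self-contained and total_rw_bytes / 10 stands in for mult_perc(..., 10):

#include <stdio.h>
#include <stdint.h>

#define SZ_1G (1024ULL * 1024 * 1024)

static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }

int main(void)
{
	uint64_t zone_size = 2 * SZ_1G;          /* hypothetical large zone */
	uint64_t total_rw_bytes = 10240 * SZ_1G; /* hypothetical 10 TiB device */
	uint64_t chunk_size = zone_size;         /* data_sinfo->chunk_size on zoned */

	/* Old code: clamp to 10% of the disk or 1G. With a 2 GiB zone this
	 * under-reserves the next data chunk by a full gigabyte, making
	 * "avail" come out too big. */
	uint64_t old_size = min_u64(min_u64(chunk_size, total_rw_bytes / 10), SZ_1G);

	/* Fixed code: on zoned mode a data chunk is always exactly one zone. */
	uint64_t new_size = chunk_size;

	printf("old:   %llu GiB\n", (unsigned long long)(old_size / SZ_1G)); /* 1 */
	printf("fixed: %llu GiB\n", (unsigned long long)(new_size / SZ_1G)); /* 2 */
	return 0;
}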

Comments

Boris Burkov June 25, 2024, 4:17 p.m. UTC | #1
On Tue, Jun 25, 2024 at 11:58:49PM +0900, Naohiro Aota wrote:
> calc_available_free_space() returns the total size of metadata (or system)
> block groups, which can be allocated from unallocated disk space. The logic
> is wrong on zoned mode in two places.
> 
> First, the calculation of data_chunk_size is wrong. On zoned mode, we
> always allocate one zone as one chunk and never allocate a partial
> zone. So, we should use zone_size (= data_sinfo->chunk_size) as it is.
> 
> Second, the result "avail" may not be zone aligned. Since we always
> allocate one zone as one chunk on zoned mode, returning a byte count
> that is not aligned to the zone size overstates the really usable
> space and puts less pressure on the async metadata reclaim process.
> 
> This is serious when the filesystem is nearly full on a device with a
> large zone size. Allowing too much over-commit results in less async
> reclaim work and ends up in ENOSPC. We can align "avail" down to the
> zone size to avoid that.

I sort of wish we could get rid of the "data_sinfo->chunk_size is wrong"
thing. If we never actually use that value, perhaps we can try to really
fix it? I didn't do it in my patches either, so there must have been
something that made me hesitate that I'm now forgetting.

Obviously not the focus here, so with that said, the fix looks
good.

Reviewed-by: Boris Burkov <boris@bur.io>

> 
> Fixes: cb6cbab79055 ("btrfs: adjust overcommit logic when very close to full")
> CC: stable@vger.kernel.org # 6.9
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
> As I mentioned in [1], I based this patch on for-next but before
> Boris's "dynamic block_group reclaim threshold" series, because that
> makes it easier to backport this patch. I'll resend this patch rebased
> on for-next if you'd prefer that.

I looked at the merge conflict and it's the fact that I pulled the data
chunk logic into a helper function. Just fixing it is easy, but like you
said, makes backporting this fix fussier.

Since the conflict is pretty trivial, I support re-ordering things so
that this goes under the reclaim stuff, whether we pull my
calc_effective_data_chunk_size refactor into this patch and put it
under, or just edit the refactor to bring along the zoned fix. Let me
know if I can help with that!

> 
> [1] https://lore.kernel.org/linux-btrfs/avjakfevy3gtwcgxugnzwsfkld35tfgktd5ywpz3kac6gfraxh@uic6zl3jmgbl/
> ---
>  fs/btrfs/space-info.c | 24 +++++++++++++++++++++---
>  1 file changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 0283ee9bf813..85ff44a74223 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -373,11 +373,18 @@ static u64 calc_available_free_space(struct btrfs_fs_info *fs_info,
>  	 * "optimal" chunk size based on the fs size.  However when we actually
>  	 * allocate the chunk we will strip this down further, making it no more
>  	 * than 10% of the disk or 1G, whichever is smaller.
> +	 *
> +	 * On zoned mode, we need to use zone_size
> +	 * (= data_sinfo->chunk_size) as it is.
>  	 */
>  	data_sinfo = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_DATA);
> -	data_chunk_size = min(data_sinfo->chunk_size,
> -			      mult_perc(fs_info->fs_devices->total_rw_bytes, 10));
> -	data_chunk_size = min_t(u64, data_chunk_size, SZ_1G);
> +	if (!btrfs_is_zoned(fs_info)) {
> +		data_chunk_size = min(data_sinfo->chunk_size,
> +				      mult_perc(fs_info->fs_devices->total_rw_bytes, 10));
> +		data_chunk_size = min_t(u64, data_chunk_size, SZ_1G);
> +	} else {
> +		data_chunk_size = data_sinfo->chunk_size;
> +	}
>  
>  	/*
>  	 * Since data allocations immediately use block groups as part of the
> @@ -405,6 +412,17 @@ static u64 calc_available_free_space(struct btrfs_fs_info *fs_info,
>  		avail >>= 3;
>  	else
>  		avail >>= 1;
> +
> +	/*
> +	 * On zoned mode, we always allocate one zone as one chunk.
> +	 * Returning a byte count that is not aligned to the zone size
> +	 * here will result in less pressure for the async metadata
> +	 * reclaim process, and it will over-commit too much leading to
> +	 * ENOSPC. Align down to the zone size to avoid that.
> +	 */
> +	if (btrfs_is_zoned(fs_info))
> +		avail = ALIGN_DOWN(avail, fs_info->zone_size);
> +
>  	return avail;
>  }
>  
> -- 
> 2.45.2
>
Naohiro Aota June 26, 2024, 6:11 a.m. UTC | #2
On Tue, Jun 25, 2024 at 09:17:14AM GMT, Boris Burkov wrote:
> On Tue, Jun 25, 2024 at 11:58:49PM +0900, Naohiro Aota wrote:
> > calc_available_free_space() returns the total size of metadata (or system)
> > block groups, which can be allocated from unallocated disk space. The logic
> > is wrong on zoned mode in two places.
> > 
> > First, the calculation of data_chunk_size is wrong. On zoned mode, we
> > always allocate one zone as one chunk and never allocate a partial
> > zone. So, we should use zone_size (= data_sinfo->chunk_size) as it is.
> > 
> > Second, the result "avail" may not be zone aligned. Since we always
> > allocate one zone as one chunk on zoned mode, returning a byte count
> > that is not aligned to the zone size overstates the really usable
> > space and puts less pressure on the async metadata reclaim process.
> > 
> > This is serious when the filesystem is nearly full on a device with a
> > large zone size. Allowing too much over-commit results in less async
> > reclaim work and ends up in ENOSPC. We can align "avail" down to the
> > zone size to avoid that.
> 
> I sort of wish we could get rid of the "data_sinfo->chunk_size is wrong"
> thing. If we never actually use that value, perhaps we can try to really
> fix it? I didn't do it in my patches either, so there must have been
> something that made me hesitate that I'm now forgetting.

The "data_sinfo->chunk_size" itself is correct. It contains a preferable
default chunk size and it is used always on zoned mode. OTOH, on regular
mode, we need to tweak the size depending on the situation (unallocated
disk space). So, I don't see anything that needs to be fixed.

> 
> Obviously not the focus here, so with that said, the fix looks
> good.
> 
> Reviewed-by: Boris Burkov <boris@bur.io>
> 
> > 
> > Fixes: cb6cbab79055 ("btrfs: adjust overcommit logic when very close to full")
> > CC: stable@vger.kernel.org # 6.9
> > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > ---
> > As I mentioned in [1], I based this patch on for-next but before
> > Boris's "dynamic block_group reclaim threshold" series, because that
> > makes it easier to backport this patch. I'll resend this patch rebased
> > on for-next if you'd prefer that.
> 
> I looked at the merge conflict and it's the fact that I pulled the data
> chunk logic into a helper function. Just fixing it is easy, but like you
> said, makes backporting this fix fussier.
> 
> Since the conflict is pretty trivial, I support re-ordering things so
> that this goes under the reclaim stuff, whether we pull my
> calc_effective_data_chunk_size refactor into this patch and put it
> under, or just edit the refactor to bring along the zoned fix. Let me
> know if I can help with that!

Thank you. I placed my fix patch before your series and modified your patch
accordingly.

Patch

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 0283ee9bf813..85ff44a74223 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -373,11 +373,18 @@  static u64 calc_available_free_space(struct btrfs_fs_info *fs_info,
 	 * "optimal" chunk size based on the fs size.  However when we actually
 	 * allocate the chunk we will strip this down further, making it no more
 	 * than 10% of the disk or 1G, whichever is smaller.
+	 *
+	 * On zoned mode, we need to use zone_size
+	 * (= data_sinfo->chunk_size) as it is.
 	 */
 	data_sinfo = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_DATA);
-	data_chunk_size = min(data_sinfo->chunk_size,
-			      mult_perc(fs_info->fs_devices->total_rw_bytes, 10));
-	data_chunk_size = min_t(u64, data_chunk_size, SZ_1G);
+	if (!btrfs_is_zoned(fs_info)) {
+		data_chunk_size = min(data_sinfo->chunk_size,
+				      mult_perc(fs_info->fs_devices->total_rw_bytes, 10));
+		data_chunk_size = min_t(u64, data_chunk_size, SZ_1G);
+	} else {
+		data_chunk_size = data_sinfo->chunk_size;
+	}
 
 	/*
 	 * Since data allocations immediately use block groups as part of the
@@ -405,6 +412,17 @@  static u64 calc_available_free_space(struct btrfs_fs_info *fs_info,
 		avail >>= 3;
 	else
 		avail >>= 1;
+
+	/*
+	 * On zoned mode, we always allocate one zone as one chunk.
+	 * Returning a byte count that is not aligned to the zone size
+	 * here will result in less pressure for the async metadata
+	 * reclaim process, and it will over-commit too much leading to
+	 * ENOSPC. Align down to the zone size to avoid that.
+	 */
+	if (btrfs_is_zoned(fs_info))
+		avail = ALIGN_DOWN(avail, fs_info->zone_size);
+
 	return avail;
 }
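
For a worked illustration of the second fix (again with hypothetical
numbers, not taken from the patch): with a 2 GiB zone size, an
intermediate "avail" of 5 GiB + 123 MiB can only back two more whole
2 GiB chunks, so everything past 4 GiB would merely weaken reclaim
pressure. A minimal sketch of the align-down step, assuming a
power-of-two zone size as zoned btrfs requires:

#include <stdio.h>
#include <stdint.h>

/* Power-of-two align-down with the same semantics as the kernel macro. */
#define ALIGN_DOWN(x, a) ((x) & ~((uint64_t)(a) - 1))

int main(void)
{
	const uint64_t SZ_1M = 1024ULL * 1024;
	uint64_t zone_size = 2048 * SZ_1M;                /* hypothetical 2 GiB zone */
	uint64_t avail = 5 * 1024 * SZ_1M + 123 * SZ_1M;  /* unaligned result */

	/* Only whole zones can hold future chunks: report 4096 MiB, not 5243. */
	printf("%llu MiB\n",
	       (unsigned long long)(ALIGN_DOWN(avail, zone_size) / SZ_1M));
	return 0;
}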