btrfs: properly split extent_map for REQ_OP_ZONE_APPEND

Message ID	20210628085728.2813793-1-naohiro.aota@wdc.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-btrfs-owner@kernel.org> IronPort-SDR: GU9QBf/VepvQncWN8lqquefqGwLajwqI9bAEoQflJx5vf5oRoEDlkEqYvayblbCe2yQJYT08UH QQDmmBPrrxfgv6D7QkajdCecZDxVvVUUzhiwCLU9mSisGBI1u0KEM5unuQSXV09m7VuaUH3yod t9I8sqC0IybV3jY9Q+Y7xTG5NvJqiahiWHAKuv7bcUpXGMXOcGR4MmmTutanY4lDCKd/zEQlXN rUawgo+v+gXYhU78ctFu9IveOQvugMdc7VQS/zmoHIhG7+2KcoOmq9v8pKYJYumiXEEWn/hiy7 Yg0= IronPort-SDR: Rg6WV2cpAtNtIfrtV9YLgjGsRLzxuHdighyIxmZXlF+FWd0ngX4RQ3f9FQq1+dPzB3eQYqy+We 0ubbrvuWa+ICNicbXTx/5i36p2u4IYQug0PEMKke6d1VnvLYQ7p6U1XC+LN4ZTqMlLzVEsRL+e Rtgl3XGH9xzfMHN/ANcqkLaridMI852ySt6BiMayKVFix/YiNTjBgTbsLOMPGggI966JfttpIH 0jn0r4HwvhGKIB8XqcGZK+XgrLPApU1XltNkfuYP/X0Y2+HtDYPJm1D14/W7M/w4dymhrAeeCb 3WEklHu5fli6qAT6KRLUfaNK IronPort-SDR: XYrBNVs+a+htwD77tnRPOf+tIXsN+BsuIscl6FyNLdkYofq403yimpus1cUbqQp04znnk6VTu2 7hkedBYLuoU1p2/GXQ4C5nvyScxc5CkmYIFza061/r33bz2MPFXRbuyMcz2Me2Ya7gicR/Uda+ jmrG6Q3PTlGpDvfCB7r8KF4H/6tFcQOvlz1U4i6f+1I5UO8OZkE3bdKdbmW72KQS3E4hvO88HD mdwPlsxYX1kYYJPDz01/Jp5mMro7GNO8QBuG4k6Or97L3Cfjr6cmkXcNJY2jePcZ/96Mf6RY+c 6FU= WDCIronportException: Internal From: Naohiro Aota <naohiro.aota@wdc.com> To: linux-btrfs@vger.kernel.org Cc: David Sterba <dsterba@suse.com>, Naohiro Aota <naohiro.aota@wdc.com>, stable@vger.kernel.org, Damien Le Moal <damien.lemoal@wdc.com>, Johannes Thumshirn <johannes.thumshirn@wdc.com> Subject: [PATCH] btrfs: properly split extent_map for REQ_OP_ZONE_APPEND Date: Mon, 28 Jun 2021 17:57:28 +0900 Message-Id: <20210628085728.2813793-1-naohiro.aota@wdc.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	btrfs: properly split extent_map for REQ_OP_ZONE_APPEND \| expand btrfs: properly split extent_map for REQ_OP_ZONE_APPEND

Naohiro Aota June 28, 2021, 8:57 a.m. UTC

Damien reported a test failure with btrfs/209. The test itself ran fine,
but the fsck run afterwards reported a corrupted filesystem.

The filesystem corruption happens because we're splitting an extent and
then writing the extent twice. We have to split the extent though, because
we're creating too large extents for a REQ_OP_ZONE_APPEND operation.

When dumping the extent tree, we can see two EXTENT_ITEMs at the same
start address but different lengths.

$ btrfs inspect dump-tree /dev/nullb1 -t extent
...
   item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53
           refs 1 gen 7 flags DATA
           extent data backref root FS_TREE objectid 257 offset 786432 count 1
   item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53
           refs 1 gen 7 flags DATA
           extent data backref root FS_TREE objectid 257 offset 786432 count 1

The duplicated EXTENT_ITEMs originally come from wrongly split extent_map in
extract_ordered_extent(). Since extract_ordered_extent() uses
create_io_em() to split an existing extent_map, we will have
split->orig_start != split->start. Then, it will be logged with non-zero
"extent data offset". Finally, the logged entries are replayed into
a duplicated EXTENT_ITEM.

Introduce and use proper splitting function for extent_map. The function is
intended to be simple and specific usage for extract_ordered_extent() e.g.
not supporting compression case (we do not allow splitting compressed
extent_map anyway).

Fixes: d22002fd37bd ("btrfs: zoned: split ordered extent when bio is sent")
Cc: stable@vger.kernel.org # 5.12+
Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/inode.c | 151 ++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 122 insertions(+), 29 deletions(-)

Qu Wenruo June 28, 2021, 11:50 a.m. UTC | #1

On 2021/6/28 下午4:57, Naohiro Aota wrote:
> Damien reported a test failure with btrfs/209. The test itself ran fine,
> but the fsck run afterwards reported a corrupted filesystem.
>
> The filesystem corruption happens because we're splitting an extent and
> then writing the extent twice. We have to split the extent though, because
> we're creating too large extents for a REQ_OP_ZONE_APPEND operation.
>
> When dumping the extent tree, we can see two EXTENT_ITEMs at the same
> start address but different lengths.
>
> $ btrfs inspect dump-tree /dev/nullb1 -t extent
> ...
>     item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53
>             refs 1 gen 7 flags DATA
>             extent data backref root FS_TREE objectid 257 offset 786432 count 1
>     item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53
>             refs 1 gen 7 flags DATA
>             extent data backref root FS_TREE objectid 257 offset 786432 count 1
>
> The duplicated EXTENT_ITEMs originally come from wrongly split extent_map in
> extract_ordered_extent(). Since extract_ordered_extent() uses
> create_io_em() to split an existing extent_map, we will have
> split->orig_start != split->start. Then, it will be logged with non-zero
> "extent data offset". Finally, the logged entries are replayed into
> a duplicated EXTENT_ITEM.
>
> Introduce and use proper splitting function for extent_map. The function is
> intended to be simple and specific usage for extract_ordered_extent() e.g.
> not supporting compression case (we do not allow splitting compressed
> extent_map anyway).

This may be a pretty stupid question, but why do we need to split the
extent map (and extent item) into several more and causing more extent
items?


I understand for zoned write, we have extra limitation on how many bytes
we can submit before reaching the zone limit.

But we also have stripe boundary for non-zoned device.

And in that case, we just split them into different bios, other than
split the extent into smaller extents.

Of course for current zoned support, only SINGLE profile is supported
thus no stripe boundary to bother.

But I'm wondering if we could do the same thing without really splitting
the extent map.

Thanks,
Qu

>
> Fixes: d22002fd37bd ("btrfs: zoned: split ordered extent when bio is sent")
> Cc: stable@vger.kernel.org # 5.12+
> Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>   fs/btrfs/inode.c | 151 ++++++++++++++++++++++++++++++++++++++---------
>   1 file changed, 122 insertions(+), 29 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index e6eb20987351..79cdcaeab8de 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2271,13 +2271,131 @@ static blk_status_t btrfs_submit_bio_start(struct inode *inode, struct bio *bio,
>   	return btrfs_csum_one_bio(BTRFS_I(inode), bio, 0, 0);
>   }
>
> +/*
> + * split_zoned_em - split an extent_map at [start, start+len]
> + *
> + * This function is intended to be used only for extract_ordered_extent().
> + */
> +static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len,
> +			  u64 pre, u64 post)
> +{
> +	struct extent_map_tree *em_tree = &inode->extent_tree;
> +	struct extent_map *em;
> +	struct extent_map *split_pre = NULL;
> +	struct extent_map *split_mid = NULL;
> +	struct extent_map *split_post = NULL;
> +	int ret = 0;
> +	int modified;
> +	unsigned long flags;
> +
> +	/* Sanity check */
> +	if (pre == 0 && post == 0)
> +		return 0;
> +
> +	split_pre = alloc_extent_map();
> +	if (pre)
> +		split_mid = alloc_extent_map();
> +	if (post)
> +		split_post = alloc_extent_map();
> +	if (!split_pre || (pre && !split_mid) || (post && !split_post)) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	ASSERT(pre + post < len);
> +
> +	lock_extent(&inode->io_tree, start, start + len - 1);
> +	write_lock(&em_tree->lock);
> +	em = lookup_extent_mapping(em_tree, start, len);
> +	if (!em) {
> +		ret = -EIO;
> +		goto out_unlock;
> +	}
> +
> +	ASSERT(em->len == len);
> +	ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags));
> +	ASSERT(em->block_start < EXTENT_MAP_LAST_BYTE);
> +
> +	flags = em->flags;
> +	clear_bit(EXTENT_FLAG_PINNED, &em->flags);
> +	clear_bit(EXTENT_FLAG_LOGGING, &flags);
> +	modified = !list_empty(&em->list);
> +
> +	/*
> +	 * First, replace the em with a new extent_map starting from
> +	 * em->start
> +	 */
> +
> +	split_pre->start = em->start;
> +	split_pre->len = pre ? pre : (em->len - post);
> +	split_pre->orig_start = split_pre->start;
> +	split_pre->block_start = em->block_start;
> +	split_pre->block_len = split_pre->len;
> +	split_pre->orig_block_len = split_pre->block_len;
> +	split_pre->ram_bytes = split_pre->len;
> +	split_pre->flags = flags;
> +	split_pre->compress_type = em->compress_type;
> +	split_pre->generation = em->generation;
> +
> +	replace_extent_mapping(em_tree, em, split_pre, modified);
> +
> +	/*
> +	 * Now we only have an extent_map at:
> +	 *     [em->start, em->start + pre] if pre != 0
> +	 *     [em->start, em->start + em->len - post] if pre == 0
> +	 */
> +
> +	if (pre) {
> +		/* Insert the middle extent_map */
> +		split_mid->start = em->start + pre;
> +		split_mid->len = em->len - pre - post;
> +		split_mid->orig_start = split_mid->start;
> +		split_mid->block_start = em->block_start + pre;
> +		split_mid->block_len = split_mid->len;
> +		split_mid->orig_block_len = split_mid->block_len;
> +		split_mid->ram_bytes = split_mid->len;
> +		split_mid->flags = flags;
> +		split_mid->compress_type = em->compress_type;
> +		split_mid->generation = em->generation;
> +		add_extent_mapping(em_tree, split_mid, modified);
> +	}
> +
> +	if (post) {
> +		split_post->start = em->start + em->len - post;
> +		split_post->len = post;
> +		split_post->orig_start = split_post->start;
> +		split_post->block_start = em->block_start + em->len - post;
> +		split_post->block_len = split_post->len;
> +		split_post->orig_block_len = split_post->block_len;
> +		split_post->ram_bytes = split_post->len;
> +		split_post->flags = flags;
> +		split_post->compress_type = em->compress_type;
> +		split_post->generation = em->generation;
> +		add_extent_mapping(em_tree, split_post, modified);
> +	}
> +
> +	/* once for us */
> +	free_extent_map(em);
> +	/* once for the tree */
> +	free_extent_map(em);
> +
> +out_unlock:
> +	write_unlock(&em_tree->lock);
> +	unlock_extent(&inode->io_tree, start, start + len - 1);
> +out:
> +	free_extent_map(split_pre);
> +	free_extent_map(split_mid);
> +	free_extent_map(split_post);
> +
> +	return ret;
> +}
> +
>   static blk_status_t extract_ordered_extent(struct btrfs_inode *inode,
>   					   struct bio *bio, loff_t file_offset)
>   {
>   	struct btrfs_ordered_extent *ordered;
> -	struct extent_map *em = NULL, *em_new = NULL;
> -	struct extent_map_tree *em_tree = &inode->extent_tree;
>   	u64 start = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
> +	u64 file_len;
>   	u64 len = bio->bi_iter.bi_size;
>   	u64 end = start + len;
>   	u64 ordered_end;
> @@ -2317,41 +2435,16 @@ static blk_status_t extract_ordered_extent(struct btrfs_inode *inode,
>   		goto out;
>   	}
>
> +	file_len = ordered->num_bytes;
>   	pre = start - ordered->disk_bytenr;
>   	post = ordered_end - end;
>
>   	ret = btrfs_split_ordered_extent(ordered, pre, post);
>   	if (ret)
>   		goto out;
> -
> -	read_lock(&em_tree->lock);
> -	em = lookup_extent_mapping(em_tree, ordered->file_offset, len);
> -	if (!em) {
> -		read_unlock(&em_tree->lock);
> -		ret = -EIO;
> -		goto out;
> -	}
> -	read_unlock(&em_tree->lock);
> -
> -	ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags));
> -	/*
> -	 * We cannot reuse em_new here but have to create a new one, as
> -	 * unpin_extent_cache() expects the start of the extent map to be the
> -	 * logical offset of the file, which does not hold true anymore after
> -	 * splitting.
> -	 */
> -	em_new = create_io_em(inode, em->start + pre, len,
> -			      em->start + pre, em->block_start + pre, len,
> -			      len, len, BTRFS_COMPRESS_NONE,
> -			      BTRFS_ORDERED_REGULAR);
> -	if (IS_ERR(em_new)) {
> -		ret = PTR_ERR(em_new);
> -		goto out;
> -	}
> -	free_extent_map(em_new);
> +	ret = split_zoned_em(inode, file_offset, file_len, pre, post);
>
>   out:
> -	free_extent_map(em);
>   	btrfs_put_ordered_extent(ordered);
>
>   	return errno_to_blk_status(ret);
>

Damien Le Moal June 28, 2021, 12:08 p.m. UTC | #2

On 2021/06/28 20:50, Qu Wenruo wrote:
> 
> 
> On 2021/6/28 下午4:57, Naohiro Aota wrote:
>> Damien reported a test failure with btrfs/209. The test itself ran fine,
>> but the fsck run afterwards reported a corrupted filesystem.
>>
>> The filesystem corruption happens because we're splitting an extent and
>> then writing the extent twice. We have to split the extent though, because
>> we're creating too large extents for a REQ_OP_ZONE_APPEND operation.
>>
>> When dumping the extent tree, we can see two EXTENT_ITEMs at the same
>> start address but different lengths.
>>
>> $ btrfs inspect dump-tree /dev/nullb1 -t extent
>> ...
>>     item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53
>>             refs 1 gen 7 flags DATA
>>             extent data backref root FS_TREE objectid 257 offset 786432 count 1
>>     item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53
>>             refs 1 gen 7 flags DATA
>>             extent data backref root FS_TREE objectid 257 offset 786432 count 1
>>
>> The duplicated EXTENT_ITEMs originally come from wrongly split extent_map in
>> extract_ordered_extent(). Since extract_ordered_extent() uses
>> create_io_em() to split an existing extent_map, we will have
>> split->orig_start != split->start. Then, it will be logged with non-zero
>> "extent data offset". Finally, the logged entries are replayed into
>> a duplicated EXTENT_ITEM.
>>
>> Introduce and use proper splitting function for extent_map. The function is
>> intended to be simple and specific usage for extract_ordered_extent() e.g.
>> not supporting compression case (we do not allow splitting compressed
>> extent_map anyway).
> 
> This may be a pretty stupid question, but why do we need to split the
> extent map (and extent item) into several more and causing more extent
> items?
> 
> 
> I understand for zoned write, we have extra limitation on how many bytes
> we can submit before reaching the zone limit.
> 
> But we also have stripe boundary for non-zoned device.
> 
> And in that case, we just split them into different bios, other than
> split the extent into smaller extents.
> 
> Of course for current zoned support, only SINGLE profile is supported
> thus no stripe boundary to bother.
> 
> But I'm wondering if we could do the same thing without really splitting
> the extent map.

The problem is not the limit on the zone end, which as you mention is the same
as the block group end. The problem is that data write use zone append (ZA)
operations. ZA BIOs cannot be split so a large extent may need to be processed
with multiple ZA BIOs, While that is also true for regular writes, the major
difference is that ZA are "nameless" write operation giving back the written
sectors on completion. And ZA operations may be reordered by the block layer
(not intentionally though). Combine both of these characteristics and you can
see that the data for a large extent may end up being shuffled when written
resulting in data corruption and the impossibility to map the extent to some
start sector.

To avoid this problem, zoned btrfs uses the principle "one data extent == one ZA
BIO". So large extents need to be split. This is unfortunate, but we can revisit
this later and optimize, e.g. merge back together the fragments of an extent
once written if they actually were written sequentially in the zone.

> 
> Thanks,
> Qu
> 
>>
>> Fixes: d22002fd37bd ("btrfs: zoned: split ordered extent when bio is sent")
>> Cc: stable@vger.kernel.org # 5.12+
>> Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
>> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>> ---
>>   fs/btrfs/inode.c | 151 ++++++++++++++++++++++++++++++++++++++---------
>>   1 file changed, 122 insertions(+), 29 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index e6eb20987351..79cdcaeab8de 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -2271,13 +2271,131 @@ static blk_status_t btrfs_submit_bio_start(struct inode *inode, struct bio *bio,
>>   	return btrfs_csum_one_bio(BTRFS_I(inode), bio, 0, 0);
>>   }
>>
>> +/*
>> + * split_zoned_em - split an extent_map at [start, start+len]
>> + *
>> + * This function is intended to be used only for extract_ordered_extent().
>> + */
>> +static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len,
>> +			  u64 pre, u64 post)
>> +{
>> +	struct extent_map_tree *em_tree = &inode->extent_tree;
>> +	struct extent_map *em;
>> +	struct extent_map *split_pre = NULL;
>> +	struct extent_map *split_mid = NULL;
>> +	struct extent_map *split_post = NULL;
>> +	int ret = 0;
>> +	int modified;
>> +	unsigned long flags;
>> +
>> +	/* Sanity check */
>> +	if (pre == 0 && post == 0)
>> +		return 0;
>> +
>> +	split_pre = alloc_extent_map();
>> +	if (pre)
>> +		split_mid = alloc_extent_map();
>> +	if (post)
>> +		split_post = alloc_extent_map();
>> +	if (!split_pre || (pre && !split_mid) || (post && !split_post)) {
>> +		ret = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	ASSERT(pre + post < len);
>> +
>> +	lock_extent(&inode->io_tree, start, start + len - 1);
>> +	write_lock(&em_tree->lock);
>> +	em = lookup_extent_mapping(em_tree, start, len);
>> +	if (!em) {
>> +		ret = -EIO;
>> +		goto out_unlock;
>> +	}
>> +
>> +	ASSERT(em->len == len);
>> +	ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags));
>> +	ASSERT(em->block_start < EXTENT_MAP_LAST_BYTE);
>> +
>> +	flags = em->flags;
>> +	clear_bit(EXTENT_FLAG_PINNED, &em->flags);
>> +	clear_bit(EXTENT_FLAG_LOGGING, &flags);
>> +	modified = !list_empty(&em->list);
>> +
>> +	/*
>> +	 * First, replace the em with a new extent_map starting from
>> +	 * em->start
>> +	 */
>> +
>> +	split_pre->start = em->start;
>> +	split_pre->len = pre ? pre : (em->len - post);
>> +	split_pre->orig_start = split_pre->start;
>> +	split_pre->block_start = em->block_start;
>> +	split_pre->block_len = split_pre->len;
>> +	split_pre->orig_block_len = split_pre->block_len;
>> +	split_pre->ram_bytes = split_pre->len;
>> +	split_pre->flags = flags;
>> +	split_pre->compress_type = em->compress_type;
>> +	split_pre->generation = em->generation;
>> +
>> +	replace_extent_mapping(em_tree, em, split_pre, modified);
>> +
>> +	/*
>> +	 * Now we only have an extent_map at:
>> +	 *     [em->start, em->start + pre] if pre != 0
>> +	 *     [em->start, em->start + em->len - post] if pre == 0
>> +	 */
>> +
>> +	if (pre) {
>> +		/* Insert the middle extent_map */
>> +		split_mid->start = em->start + pre;
>> +		split_mid->len = em->len - pre - post;
>> +		split_mid->orig_start = split_mid->start;
>> +		split_mid->block_start = em->block_start + pre;
>> +		split_mid->block_len = split_mid->len;
>> +		split_mid->orig_block_len = split_mid->block_len;
>> +		split_mid->ram_bytes = split_mid->len;
>> +		split_mid->flags = flags;
>> +		split_mid->compress_type = em->compress_type;
>> +		split_mid->generation = em->generation;
>> +		add_extent_mapping(em_tree, split_mid, modified);
>> +	}
>> +
>> +	if (post) {
>> +		split_post->start = em->start + em->len - post;
>> +		split_post->len = post;
>> +		split_post->orig_start = split_post->start;
>> +		split_post->block_start = em->block_start + em->len - post;
>> +		split_post->block_len = split_post->len;
>> +		split_post->orig_block_len = split_post->block_len;
>> +		split_post->ram_bytes = split_post->len;
>> +		split_post->flags = flags;
>> +		split_post->compress_type = em->compress_type;
>> +		split_post->generation = em->generation;
>> +		add_extent_mapping(em_tree, split_post, modified);
>> +	}
>> +
>> +	/* once for us */
>> +	free_extent_map(em);
>> +	/* once for the tree */
>> +	free_extent_map(em);
>> +
>> +out_unlock:
>> +	write_unlock(&em_tree->lock);
>> +	unlock_extent(&inode->io_tree, start, start + len - 1);
>> +out:
>> +	free_extent_map(split_pre);
>> +	free_extent_map(split_mid);
>> +	free_extent_map(split_post);
>> +
>> +	return ret;
>> +}
>> +
>>   static blk_status_t extract_ordered_extent(struct btrfs_inode *inode,
>>   					   struct bio *bio, loff_t file_offset)
>>   {
>>   	struct btrfs_ordered_extent *ordered;
>> -	struct extent_map *em = NULL, *em_new = NULL;
>> -	struct extent_map_tree *em_tree = &inode->extent_tree;
>>   	u64 start = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
>> +	u64 file_len;
>>   	u64 len = bio->bi_iter.bi_size;
>>   	u64 end = start + len;
>>   	u64 ordered_end;
>> @@ -2317,41 +2435,16 @@ static blk_status_t extract_ordered_extent(struct btrfs_inode *inode,
>>   		goto out;
>>   	}
>>
>> +	file_len = ordered->num_bytes;
>>   	pre = start - ordered->disk_bytenr;
>>   	post = ordered_end - end;
>>
>>   	ret = btrfs_split_ordered_extent(ordered, pre, post);
>>   	if (ret)
>>   		goto out;
>> -
>> -	read_lock(&em_tree->lock);
>> -	em = lookup_extent_mapping(em_tree, ordered->file_offset, len);
>> -	if (!em) {
>> -		read_unlock(&em_tree->lock);
>> -		ret = -EIO;
>> -		goto out;
>> -	}
>> -	read_unlock(&em_tree->lock);
>> -
>> -	ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags));
>> -	/*
>> -	 * We cannot reuse em_new here but have to create a new one, as
>> -	 * unpin_extent_cache() expects the start of the extent map to be the
>> -	 * logical offset of the file, which does not hold true anymore after
>> -	 * splitting.
>> -	 */
>> -	em_new = create_io_em(inode, em->start + pre, len,
>> -			      em->start + pre, em->block_start + pre, len,
>> -			      len, len, BTRFS_COMPRESS_NONE,
>> -			      BTRFS_ORDERED_REGULAR);
>> -	if (IS_ERR(em_new)) {
>> -		ret = PTR_ERR(em_new);
>> -		goto out;
>> -	}
>> -	free_extent_map(em_new);
>> +	ret = split_zoned_em(inode, file_offset, file_len, pre, post);
>>
>>   out:
>> -	free_extent_map(em);
>>   	btrfs_put_ordered_extent(ordered);
>>
>>   	return errno_to_blk_status(ret);
>>
>

Qu Wenruo June 28, 2021, 12:12 p.m. UTC | #3

On 2021/6/28 下午8:08, Damien Le Moal wrote:
> On 2021/06/28 20:50, Qu Wenruo wrote:
>>
>>
>> On 2021/6/28 下午4:57, Naohiro Aota wrote:
>>> Damien reported a test failure with btrfs/209. The test itself ran fine,
>>> but the fsck run afterwards reported a corrupted filesystem.
>>>
>>> The filesystem corruption happens because we're splitting an extent and
>>> then writing the extent twice. We have to split the extent though, because
>>> we're creating too large extents for a REQ_OP_ZONE_APPEND operation.
>>>
>>> When dumping the extent tree, we can see two EXTENT_ITEMs at the same
>>> start address but different lengths.
>>>
>>> $ btrfs inspect dump-tree /dev/nullb1 -t extent
>>> ...
>>>      item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53
>>>              refs 1 gen 7 flags DATA
>>>              extent data backref root FS_TREE objectid 257 offset 786432 count 1
>>>      item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53
>>>              refs 1 gen 7 flags DATA
>>>              extent data backref root FS_TREE objectid 257 offset 786432 count 1
>>>
>>> The duplicated EXTENT_ITEMs originally come from wrongly split extent_map in
>>> extract_ordered_extent(). Since extract_ordered_extent() uses
>>> create_io_em() to split an existing extent_map, we will have
>>> split->orig_start != split->start. Then, it will be logged with non-zero
>>> "extent data offset". Finally, the logged entries are replayed into
>>> a duplicated EXTENT_ITEM.
>>>
>>> Introduce and use proper splitting function for extent_map. The function is
>>> intended to be simple and specific usage for extract_ordered_extent() e.g.
>>> not supporting compression case (we do not allow splitting compressed
>>> extent_map anyway).
>>
>> This may be a pretty stupid question, but why do we need to split the
>> extent map (and extent item) into several more and causing more extent
>> items?
>>
>>
>> I understand for zoned write, we have extra limitation on how many bytes
>> we can submit before reaching the zone limit.
>>
>> But we also have stripe boundary for non-zoned device.
>>
>> And in that case, we just split them into different bios, other than
>> split the extent into smaller extents.
>>
>> Of course for current zoned support, only SINGLE profile is supported
>> thus no stripe boundary to bother.
>>
>> But I'm wondering if we could do the same thing without really splitting
>> the extent map.
>
> The problem is not the limit on the zone end, which as you mention is the same
> as the block group end. The problem is that data write use zone append (ZA)
> operations. ZA BIOs cannot be split so a large extent may need to be processed
> with multiple ZA BIOs, While that is also true for regular writes, the major
> difference is that ZA are "nameless" write operation giving back the written
> sectors on completion. And ZA operations may be reordered by the block layer
> (not intentionally though). Combine both of these characteristics and you can
> see that the data for a large extent may end up being shuffled when written
> resulting in data corruption and the impossibility to map the extent to some
> start sector.
>
> To avoid this problem, zoned btrfs uses the principle "one data extent == one ZA
> BIO". So large extents need to be split. This is unfortunate, but we can revisit
> this later and optimize, e.g. merge back together the fragments of an extent
> once written if they actually were written sequentially in the zone.

Got it, then the current behavior of splitting into smaller extents is
completely fine.

When we optimize zoned code to something without this behavior, we only
need to defrag them.
So no problem at all, even for the future.

Thanks for the explanation, it really makes me interested in the zoned
code now,
Qu
>
>>
>> Thanks,
>> Qu
>>
>>>
>>> Fixes: d22002fd37bd ("btrfs: zoned: split ordered extent when bio is sent")
>>> Cc: stable@vger.kernel.org # 5.12+
>>> Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
>>> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>> ---
>>>    fs/btrfs/inode.c | 151 ++++++++++++++++++++++++++++++++++++++---------
>>>    1 file changed, 122 insertions(+), 29 deletions(-)
>>>
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>> index e6eb20987351..79cdcaeab8de 100644
>>> --- a/fs/btrfs/inode.c
>>> +++ b/fs/btrfs/inode.c
>>> @@ -2271,13 +2271,131 @@ static blk_status_t btrfs_submit_bio_start(struct inode *inode, struct bio *bio,
>>>    	return btrfs_csum_one_bio(BTRFS_I(inode), bio, 0, 0);
>>>    }
>>>
>>> +/*
>>> + * split_zoned_em - split an extent_map at [start, start+len]
>>> + *
>>> + * This function is intended to be used only for extract_ordered_extent().
>>> + */
>>> +static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len,
>>> +			  u64 pre, u64 post)
>>> +{
>>> +	struct extent_map_tree *em_tree = &inode->extent_tree;
>>> +	struct extent_map *em;
>>> +	struct extent_map *split_pre = NULL;
>>> +	struct extent_map *split_mid = NULL;
>>> +	struct extent_map *split_post = NULL;
>>> +	int ret = 0;
>>> +	int modified;
>>> +	unsigned long flags;
>>> +
>>> +	/* Sanity check */
>>> +	if (pre == 0 && post == 0)
>>> +		return 0;
>>> +
>>> +	split_pre = alloc_extent_map();
>>> +	if (pre)
>>> +		split_mid = alloc_extent_map();
>>> +	if (post)
>>> +		split_post = alloc_extent_map();
>>> +	if (!split_pre || (pre && !split_mid) || (post && !split_post)) {
>>> +		ret = -ENOMEM;
>>> +		goto out;
>>> +	}
>>> +
>>> +	ASSERT(pre + post < len);
>>> +
>>> +	lock_extent(&inode->io_tree, start, start + len - 1);
>>> +	write_lock(&em_tree->lock);
>>> +	em = lookup_extent_mapping(em_tree, start, len);
>>> +	if (!em) {
>>> +		ret = -EIO;
>>> +		goto out_unlock;
>>> +	}
>>> +
>>> +	ASSERT(em->len == len);
>>> +	ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags));
>>> +	ASSERT(em->block_start < EXTENT_MAP_LAST_BYTE);
>>> +
>>> +	flags = em->flags;
>>> +	clear_bit(EXTENT_FLAG_PINNED, &em->flags);
>>> +	clear_bit(EXTENT_FLAG_LOGGING, &flags);
>>> +	modified = !list_empty(&em->list);
>>> +
>>> +	/*
>>> +	 * First, replace the em with a new extent_map starting from
>>> +	 * em->start
>>> +	 */
>>> +
>>> +	split_pre->start = em->start;
>>> +	split_pre->len = pre ? pre : (em->len - post);
>>> +	split_pre->orig_start = split_pre->start;
>>> +	split_pre->block_start = em->block_start;
>>> +	split_pre->block_len = split_pre->len;
>>> +	split_pre->orig_block_len = split_pre->block_len;
>>> +	split_pre->ram_bytes = split_pre->len;
>>> +	split_pre->flags = flags;
>>> +	split_pre->compress_type = em->compress_type;
>>> +	split_pre->generation = em->generation;
>>> +
>>> +	replace_extent_mapping(em_tree, em, split_pre, modified);
>>> +
>>> +	/*
>>> +	 * Now we only have an extent_map at:
>>> +	 *     [em->start, em->start + pre] if pre != 0
>>> +	 *     [em->start, em->start + em->len - post] if pre == 0
>>> +	 */
>>> +
>>> +	if (pre) {
>>> +		/* Insert the middle extent_map */
>>> +		split_mid->start = em->start + pre;
>>> +		split_mid->len = em->len - pre - post;
>>> +		split_mid->orig_start = split_mid->start;
>>> +		split_mid->block_start = em->block_start + pre;
>>> +		split_mid->block_len = split_mid->len;
>>> +		split_mid->orig_block_len = split_mid->block_len;
>>> +		split_mid->ram_bytes = split_mid->len;
>>> +		split_mid->flags = flags;
>>> +		split_mid->compress_type = em->compress_type;
>>> +		split_mid->generation = em->generation;
>>> +		add_extent_mapping(em_tree, split_mid, modified);
>>> +	}
>>> +
>>> +	if (post) {
>>> +		split_post->start = em->start + em->len - post;
>>> +		split_post->len = post;
>>> +		split_post->orig_start = split_post->start;
>>> +		split_post->block_start = em->block_start + em->len - post;
>>> +		split_post->block_len = split_post->len;
>>> +		split_post->orig_block_len = split_post->block_len;
>>> +		split_post->ram_bytes = split_post->len;
>>> +		split_post->flags = flags;
>>> +		split_post->compress_type = em->compress_type;
>>> +		split_post->generation = em->generation;
>>> +		add_extent_mapping(em_tree, split_post, modified);
>>> +	}
>>> +
>>> +	/* once for us */
>>> +	free_extent_map(em);
>>> +	/* once for the tree */
>>> +	free_extent_map(em);
>>> +
>>> +out_unlock:
>>> +	write_unlock(&em_tree->lock);
>>> +	unlock_extent(&inode->io_tree, start, start + len - 1);
>>> +out:
>>> +	free_extent_map(split_pre);
>>> +	free_extent_map(split_mid);
>>> +	free_extent_map(split_post);
>>> +
>>> +	return ret;
>>> +}
>>> +
>>>    static blk_status_t extract_ordered_extent(struct btrfs_inode *inode,
>>>    					   struct bio *bio, loff_t file_offset)
>>>    {
>>>    	struct btrfs_ordered_extent *ordered;
>>> -	struct extent_map *em = NULL, *em_new = NULL;
>>> -	struct extent_map_tree *em_tree = &inode->extent_tree;
>>>    	u64 start = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
>>> +	u64 file_len;
>>>    	u64 len = bio->bi_iter.bi_size;
>>>    	u64 end = start + len;
>>>    	u64 ordered_end;
>>> @@ -2317,41 +2435,16 @@ static blk_status_t extract_ordered_extent(struct btrfs_inode *inode,
>>>    		goto out;
>>>    	}
>>>
>>> +	file_len = ordered->num_bytes;
>>>    	pre = start - ordered->disk_bytenr;
>>>    	post = ordered_end - end;
>>>
>>>    	ret = btrfs_split_ordered_extent(ordered, pre, post);
>>>    	if (ret)
>>>    		goto out;
>>> -
>>> -	read_lock(&em_tree->lock);
>>> -	em = lookup_extent_mapping(em_tree, ordered->file_offset, len);
>>> -	if (!em) {
>>> -		read_unlock(&em_tree->lock);
>>> -		ret = -EIO;
>>> -		goto out;
>>> -	}
>>> -	read_unlock(&em_tree->lock);
>>> -
>>> -	ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags));
>>> -	/*
>>> -	 * We cannot reuse em_new here but have to create a new one, as
>>> -	 * unpin_extent_cache() expects the start of the extent map to be the
>>> -	 * logical offset of the file, which does not hold true anymore after
>>> -	 * splitting.
>>> -	 */
>>> -	em_new = create_io_em(inode, em->start + pre, len,
>>> -			      em->start + pre, em->block_start + pre, len,
>>> -			      len, len, BTRFS_COMPRESS_NONE,
>>> -			      BTRFS_ORDERED_REGULAR);
>>> -	if (IS_ERR(em_new)) {
>>> -		ret = PTR_ERR(em_new);
>>> -		goto out;
>>> -	}
>>> -	free_extent_map(em_new);
>>> +	ret = split_zoned_em(inode, file_offset, file_len, pre, post);
>>>
>>>    out:
>>> -	free_extent_map(em);
>>>    	btrfs_put_ordered_extent(ordered);
>>>
>>>    	return errno_to_blk_status(ret);
>>>
>>
>
>

David Sterba June 30, 2021, 7:25 p.m. UTC | #4

On Mon, Jun 28, 2021 at 05:57:28PM +0900, Naohiro Aota wrote:
> Damien reported a test failure with btrfs/209. The test itself ran fine,
> but the fsck run afterwards reported a corrupted filesystem.
> 
> The filesystem corruption happens because we're splitting an extent and
> then writing the extent twice. We have to split the extent though, because
> we're creating too large extents for a REQ_OP_ZONE_APPEND operation.
> 
> When dumping the extent tree, we can see two EXTENT_ITEMs at the same
> start address but different lengths.
> 
> $ btrfs inspect dump-tree /dev/nullb1 -t extent
> ...
>    item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53
>            refs 1 gen 7 flags DATA
>            extent data backref root FS_TREE objectid 257 offset 786432 count 1
>    item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53
>            refs 1 gen 7 flags DATA
>            extent data backref root FS_TREE objectid 257 offset 786432 count 1
> 
> The duplicated EXTENT_ITEMs originally come from wrongly split extent_map in
> extract_ordered_extent(). Since extract_ordered_extent() uses
> create_io_em() to split an existing extent_map, we will have
> split->orig_start != split->start. Then, it will be logged with non-zero
> "extent data offset". Finally, the logged entries are replayed into
> a duplicated EXTENT_ITEM.
> 
> Introduce and use proper splitting function for extent_map. The function is
> intended to be simple and specific usage for extract_ordered_extent() e.g.
> not supporting compression case (we do not allow splitting compressed
> extent_map anyway).
> 
> Fixes: d22002fd37bd ("btrfs: zoned: split ordered extent when bio is sent")
> Cc: stable@vger.kernel.org # 5.12+
> Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Added to a topic branch, I think I've hit the problem this patch is
supposed to fix so I'll to reproduce it before adding it to misc-next.
I've added Daminen's answer to the changelog as it's really helpful to
understand why it's fixed that way.

Filipe Manana July 1, 2021, 4:42 p.m. UTC | #5

On Mon, Jun 28, 2021 at 10:06 AM Naohiro Aota <naohiro.aota@wdc.com> wrote:
>
> Damien reported a test failure with btrfs/209. The test itself ran fine,
> but the fsck run afterwards reported a corrupted filesystem.
>
> The filesystem corruption happens because we're splitting an extent and
> then writing the extent twice. We have to split the extent though, because
> we're creating too large extents for a REQ_OP_ZONE_APPEND operation.
>
> When dumping the extent tree, we can see two EXTENT_ITEMs at the same
> start address but different lengths.
>
> $ btrfs inspect dump-tree /dev/nullb1 -t extent
> ...
>    item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53
>            refs 1 gen 7 flags DATA
>            extent data backref root FS_TREE objectid 257 offset 786432 count 1
>    item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53
>            refs 1 gen 7 flags DATA
>            extent data backref root FS_TREE objectid 257 offset 786432 count 1
>
> The duplicated EXTENT_ITEMs originally come from wrongly split extent_map in
> extract_ordered_extent(). Since extract_ordered_extent() uses
> create_io_em() to split an existing extent_map, we will have
> split->orig_start != split->start. Then, it will be logged with non-zero
> "extent data offset". Finally, the logged entries are replayed into
> a duplicated EXTENT_ITEM.
>
> Introduce and use proper splitting function for extent_map. The function is
> intended to be simple and specific usage for extract_ordered_extent() e.g.
> not supporting compression case (we do not allow splitting compressed
> extent_map anyway).
>
> Fixes: d22002fd37bd ("btrfs: zoned: split ordered extent when bio is sent")
> Cc: stable@vger.kernel.org # 5.12+
> Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/btrfs/inode.c | 151 ++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 122 insertions(+), 29 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index e6eb20987351..79cdcaeab8de 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2271,13 +2271,131 @@ static blk_status_t btrfs_submit_bio_start(struct inode *inode, struct bio *bio,
>         return btrfs_csum_one_bio(BTRFS_I(inode), bio, 0, 0);
>  }
>
> +/*
> + * split_zoned_em - split an extent_map at [start, start+len]
> + *
> + * This function is intended to be used only for extract_ordered_extent().
> + */
> +static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len,
> +                         u64 pre, u64 post)
> +{
> +       struct extent_map_tree *em_tree = &inode->extent_tree;
> +       struct extent_map *em;
> +       struct extent_map *split_pre = NULL;
> +       struct extent_map *split_mid = NULL;
> +       struct extent_map *split_post = NULL;
> +       int ret = 0;
> +       int modified;
> +       unsigned long flags;
> +
> +       /* Sanity check */
> +       if (pre == 0 && post == 0)
> +               return 0;
> +
> +       split_pre = alloc_extent_map();
> +       if (pre)
> +               split_mid = alloc_extent_map();
> +       if (post)
> +               split_post = alloc_extent_map();
> +       if (!split_pre || (pre && !split_mid) || (post && !split_post)) {
> +               ret = -ENOMEM;
> +               goto out;
> +       }
> +
> +       ASSERT(pre + post < len);
> +
> +       lock_extent(&inode->io_tree, start, start + len - 1);
> +       write_lock(&em_tree->lock);
> +       em = lookup_extent_mapping(em_tree, start, len);
> +       if (!em) {
> +               ret = -EIO;
> +               goto out_unlock;
> +       }
> +
> +       ASSERT(em->len == len);
> +       ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags));
> +       ASSERT(em->block_start < EXTENT_MAP_LAST_BYTE);
> +
> +       flags = em->flags;
> +       clear_bit(EXTENT_FLAG_PINNED, &em->flags);
> +       clear_bit(EXTENT_FLAG_LOGGING, &flags);
> +       modified = !list_empty(&em->list);
> +
> +       /*
> +        * First, replace the em with a new extent_map starting from
> +        * em->start
> +        */
> +
> +       split_pre->start = em->start;
> +       split_pre->len = pre ? pre : (em->len - post);
> +       split_pre->orig_start = split_pre->start;
> +       split_pre->block_start = em->block_start;
> +       split_pre->block_len = split_pre->len;
> +       split_pre->orig_block_len = split_pre->block_len;
> +       split_pre->ram_bytes = split_pre->len;
> +       split_pre->flags = flags;
> +       split_pre->compress_type = em->compress_type;
> +       split_pre->generation = em->generation;
> +
> +       replace_extent_mapping(em_tree, em, split_pre, modified);
> +
> +       /*
> +        * Now we only have an extent_map at:
> +        *     [em->start, em->start + pre] if pre != 0
> +        *     [em->start, em->start + em->len - post] if pre == 0
> +        */
> +
> +       if (pre) {
> +               /* Insert the middle extent_map */
> +               split_mid->start = em->start + pre;
> +               split_mid->len = em->len - pre - post;
> +               split_mid->orig_start = split_mid->start;
> +               split_mid->block_start = em->block_start + pre;
> +               split_mid->block_len = split_mid->len;
> +               split_mid->orig_block_len = split_mid->block_len;
> +               split_mid->ram_bytes = split_mid->len;
> +               split_mid->flags = flags;
> +               split_mid->compress_type = em->compress_type;
> +               split_mid->generation = em->generation;
> +               add_extent_mapping(em_tree, split_mid, modified);
> +       }
> +
> +       if (post) {
> +               split_post->start = em->start + em->len - post;
> +               split_post->len = post;
> +               split_post->orig_start = split_post->start;
> +               split_post->block_start = em->block_start + em->len - post;
> +               split_post->block_len = split_post->len;
> +               split_post->orig_block_len = split_post->block_len;
> +               split_post->ram_bytes = split_post->len;
> +               split_post->flags = flags;
> +               split_post->compress_type = em->compress_type;
> +               split_post->generation = em->generation;
> +               add_extent_mapping(em_tree, split_post, modified);
> +       }

So this happens when running delalloc, after creating the original
extent map and ordered extent, the original "em" must have had the
PINNED flag set.

The "pre" and "post" extent maps should have the PINNED flag set. It's
important to have the flag set to prevent extent map merging, which
could result in a log corruption if the file is being fsync'ed
multiple times in the current transaction and running delalloc was
triggered precisely by an fsync. The corruption result would be
logging extent items with overlapping ranges, since we construct them
based on extent maps, and that's why we have the PINNED flag to
prevent merging.

Or did I miss something?

Thanks.

> +
> +       /* once for us */
> +       free_extent_map(em);
> +       /* once for the tree */
> +       free_extent_map(em);
> +
> +out_unlock:
> +       write_unlock(&em_tree->lock);
> +       unlock_extent(&inode->io_tree, start, start + len - 1);
> +out:
> +       free_extent_map(split_pre);
> +       free_extent_map(split_mid);
> +       free_extent_map(split_post);
> +
> +       return ret;
> +}
> +
>  static blk_status_t extract_ordered_extent(struct btrfs_inode *inode,
>                                            struct bio *bio, loff_t file_offset)
>  {
>         struct btrfs_ordered_extent *ordered;
> -       struct extent_map *em = NULL, *em_new = NULL;
> -       struct extent_map_tree *em_tree = &inode->extent_tree;
>         u64 start = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
> +       u64 file_len;
>         u64 len = bio->bi_iter.bi_size;
>         u64 end = start + len;
>         u64 ordered_end;
> @@ -2317,41 +2435,16 @@ static blk_status_t extract_ordered_extent(struct btrfs_inode *inode,
>                 goto out;
>         }
>
> +       file_len = ordered->num_bytes;
>         pre = start - ordered->disk_bytenr;
>         post = ordered_end - end;
>
>         ret = btrfs_split_ordered_extent(ordered, pre, post);
>         if (ret)
>                 goto out;
> -
> -       read_lock(&em_tree->lock);
> -       em = lookup_extent_mapping(em_tree, ordered->file_offset, len);
> -       if (!em) {
> -               read_unlock(&em_tree->lock);
> -               ret = -EIO;
> -               goto out;
> -       }
> -       read_unlock(&em_tree->lock);
> -
> -       ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags));
> -       /*
> -        * We cannot reuse em_new here but have to create a new one, as
> -        * unpin_extent_cache() expects the start of the extent map to be the
> -        * logical offset of the file, which does not hold true anymore after
> -        * splitting.
> -        */
> -       em_new = create_io_em(inode, em->start + pre, len,
> -                             em->start + pre, em->block_start + pre, len,
> -                             len, len, BTRFS_COMPRESS_NONE,
> -                             BTRFS_ORDERED_REGULAR);
> -       if (IS_ERR(em_new)) {
> -               ret = PTR_ERR(em_new);
> -               goto out;
> -       }
> -       free_extent_map(em_new);
> +       ret = split_zoned_em(inode, file_offset, file_len, pre, post);
>
>  out:
> -       free_extent_map(em);
>         btrfs_put_ordered_extent(ordered);
>
>         return errno_to_blk_status(ret);
> --
> 2.32.0
>

Filipe Manana July 1, 2021, 4:55 p.m. UTC | #6

On Thu, Jul 1, 2021 at 5:42 PM Filipe Manana <fdmanana@gmail.com> wrote:
>
> On Mon, Jun 28, 2021 at 10:06 AM Naohiro Aota <naohiro.aota@wdc.com> wrote:
> >
> > Damien reported a test failure with btrfs/209. The test itself ran fine,
> > but the fsck run afterwards reported a corrupted filesystem.
> >
> > The filesystem corruption happens because we're splitting an extent and
> > then writing the extent twice. We have to split the extent though, because
> > we're creating too large extents for a REQ_OP_ZONE_APPEND operation.
> >
> > When dumping the extent tree, we can see two EXTENT_ITEMs at the same
> > start address but different lengths.
> >
> > $ btrfs inspect dump-tree /dev/nullb1 -t extent
> > ...
> >    item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53
> >            refs 1 gen 7 flags DATA
> >            extent data backref root FS_TREE objectid 257 offset 786432 count 1
> >    item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53
> >            refs 1 gen 7 flags DATA
> >            extent data backref root FS_TREE objectid 257 offset 786432 count 1
> >
> > The duplicated EXTENT_ITEMs originally come from wrongly split extent_map in
> > extract_ordered_extent(). Since extract_ordered_extent() uses
> > create_io_em() to split an existing extent_map, we will have
> > split->orig_start != split->start. Then, it will be logged with non-zero
> > "extent data offset". Finally, the logged entries are replayed into
> > a duplicated EXTENT_ITEM.
> >
> > Introduce and use proper splitting function for extent_map. The function is
> > intended to be simple and specific usage for extract_ordered_extent() e.g.
> > not supporting compression case (we do not allow splitting compressed
> > extent_map anyway).
> >
> > Fixes: d22002fd37bd ("btrfs: zoned: split ordered extent when bio is sent")
> > Cc: stable@vger.kernel.org # 5.12+
> > Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
> > Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > ---
> >  fs/btrfs/inode.c | 151 ++++++++++++++++++++++++++++++++++++++---------
> >  1 file changed, 122 insertions(+), 29 deletions(-)
> >
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index e6eb20987351..79cdcaeab8de 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -2271,13 +2271,131 @@ static blk_status_t btrfs_submit_bio_start(struct inode *inode, struct bio *bio,
> >         return btrfs_csum_one_bio(BTRFS_I(inode), bio, 0, 0);
> >  }
> >
> > +/*
> > + * split_zoned_em - split an extent_map at [start, start+len]
> > + *
> > + * This function is intended to be used only for extract_ordered_extent().
> > + */
> > +static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len,
> > +                         u64 pre, u64 post)
> > +{
> > +       struct extent_map_tree *em_tree = &inode->extent_tree;
> > +       struct extent_map *em;
> > +       struct extent_map *split_pre = NULL;
> > +       struct extent_map *split_mid = NULL;
> > +       struct extent_map *split_post = NULL;
> > +       int ret = 0;
> > +       int modified;
> > +       unsigned long flags;
> > +
> > +       /* Sanity check */
> > +       if (pre == 0 && post == 0)
> > +               return 0;
> > +
> > +       split_pre = alloc_extent_map();
> > +       if (pre)
> > +               split_mid = alloc_extent_map();
> > +       if (post)
> > +               split_post = alloc_extent_map();
> > +       if (!split_pre || (pre && !split_mid) || (post && !split_post)) {
> > +               ret = -ENOMEM;
> > +               goto out;
> > +       }
> > +
> > +       ASSERT(pre + post < len);
> > +
> > +       lock_extent(&inode->io_tree, start, start + len - 1);
> > +       write_lock(&em_tree->lock);
> > +       em = lookup_extent_mapping(em_tree, start, len);
> > +       if (!em) {
> > +               ret = -EIO;
> > +               goto out_unlock;
> > +       }
> > +
> > +       ASSERT(em->len == len);
> > +       ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags));
> > +       ASSERT(em->block_start < EXTENT_MAP_LAST_BYTE);
> > +
> > +       flags = em->flags;
> > +       clear_bit(EXTENT_FLAG_PINNED, &em->flags);
> > +       clear_bit(EXTENT_FLAG_LOGGING, &flags);
> > +       modified = !list_empty(&em->list);
> > +
> > +       /*
> > +        * First, replace the em with a new extent_map starting from
> > +        * em->start
> > +        */
> > +
> > +       split_pre->start = em->start;
> > +       split_pre->len = pre ? pre : (em->len - post);
> > +       split_pre->orig_start = split_pre->start;
> > +       split_pre->block_start = em->block_start;
> > +       split_pre->block_len = split_pre->len;
> > +       split_pre->orig_block_len = split_pre->block_len;
> > +       split_pre->ram_bytes = split_pre->len;
> > +       split_pre->flags = flags;
> > +       split_pre->compress_type = em->compress_type;
> > +       split_pre->generation = em->generation;
> > +
> > +       replace_extent_mapping(em_tree, em, split_pre, modified);
> > +
> > +       /*
> > +        * Now we only have an extent_map at:
> > +        *     [em->start, em->start + pre] if pre != 0
> > +        *     [em->start, em->start + em->len - post] if pre == 0
> > +        */
> > +
> > +       if (pre) {
> > +               /* Insert the middle extent_map */
> > +               split_mid->start = em->start + pre;
> > +               split_mid->len = em->len - pre - post;
> > +               split_mid->orig_start = split_mid->start;
> > +               split_mid->block_start = em->block_start + pre;
> > +               split_mid->block_len = split_mid->len;
> > +               split_mid->orig_block_len = split_mid->block_len;
> > +               split_mid->ram_bytes = split_mid->len;
> > +               split_mid->flags = flags;
> > +               split_mid->compress_type = em->compress_type;
> > +               split_mid->generation = em->generation;
> > +               add_extent_mapping(em_tree, split_mid, modified);
> > +       }
> > +
> > +       if (post) {
> > +               split_post->start = em->start + em->len - post;
> > +               split_post->len = post;
> > +               split_post->orig_start = split_post->start;
> > +               split_post->block_start = em->block_start + em->len - post;
> > +               split_post->block_len = split_post->len;
> > +               split_post->orig_block_len = split_post->block_len;
> > +               split_post->ram_bytes = split_post->len;
> > +               split_post->flags = flags;
> > +               split_post->compress_type = em->compress_type;
> > +               split_post->generation = em->generation;
> > +               add_extent_mapping(em_tree, split_post, modified);
> > +       }
>
> So this happens when running delalloc, after creating the original
> extent map and ordered extent, the original "em" must have had the
> PINNED flag set.
>
> The "pre" and "post" extent maps should have the PINNED flag set. It's
> important to have the flag set to prevent extent map merging, which
> could result in a log corruption if the file is being fsync'ed
> multiple times in the current transaction and running delalloc was
> triggered precisely by an fsync. The corruption result would be
> logging extent items with overlapping ranges, since we construct them
> based on extent maps, and that's why we have the PINNED flag to
> prevent merging.

Well, it actually happens that merging should not happen because the
original extent map was in the list of modified extents, and so should
be the new extent maps.
But we are really supposed to have the PINNED flag from the moment we
run delalloc and create a new extent map until the respective ordered
extent completes and unpins it.

Also EXTENT_FLAG_LOGGING should not be set at this point - if it were
we would screw up with a task logging the extent map.

Maybe assert that it is not set in the original extent map?
And also assert that the original em is in the list of modified
extents and has the PINNED flag set?

Thanks.

>
> Or did I miss something?
>
> Thanks.
>
> > +
> > +       /* once for us */
> > +       free_extent_map(em);
> > +       /* once for the tree */
> > +       free_extent_map(em);
> > +
> > +out_unlock:
> > +       write_unlock(&em_tree->lock);
> > +       unlock_extent(&inode->io_tree, start, start + len - 1);
> > +out:
> > +       free_extent_map(split_pre);
> > +       free_extent_map(split_mid);
> > +       free_extent_map(split_post);
> > +
> > +       return ret;
> > +}
> > +
> >  static blk_status_t extract_ordered_extent(struct btrfs_inode *inode,
> >                                            struct bio *bio, loff_t file_offset)
> >  {
> >         struct btrfs_ordered_extent *ordered;
> > -       struct extent_map *em = NULL, *em_new = NULL;
> > -       struct extent_map_tree *em_tree = &inode->extent_tree;
> >         u64 start = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
> > +       u64 file_len;
> >         u64 len = bio->bi_iter.bi_size;
> >         u64 end = start + len;
> >         u64 ordered_end;
> > @@ -2317,41 +2435,16 @@ static blk_status_t extract_ordered_extent(struct btrfs_inode *inode,
> >                 goto out;
> >         }
> >
> > +       file_len = ordered->num_bytes;
> >         pre = start - ordered->disk_bytenr;
> >         post = ordered_end - end;
> >
> >         ret = btrfs_split_ordered_extent(ordered, pre, post);
> >         if (ret)
> >                 goto out;
> > -
> > -       read_lock(&em_tree->lock);
> > -       em = lookup_extent_mapping(em_tree, ordered->file_offset, len);
> > -       if (!em) {
> > -               read_unlock(&em_tree->lock);
> > -               ret = -EIO;
> > -               goto out;
> > -       }
> > -       read_unlock(&em_tree->lock);
> > -
> > -       ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags));
> > -       /*
> > -        * We cannot reuse em_new here but have to create a new one, as
> > -        * unpin_extent_cache() expects the start of the extent map to be the
> > -        * logical offset of the file, which does not hold true anymore after
> > -        * splitting.
> > -        */
> > -       em_new = create_io_em(inode, em->start + pre, len,
> > -                             em->start + pre, em->block_start + pre, len,
> > -                             len, len, BTRFS_COMPRESS_NONE,
> > -                             BTRFS_ORDERED_REGULAR);
> > -       if (IS_ERR(em_new)) {
> > -               ret = PTR_ERR(em_new);
> > -               goto out;
> > -       }
> > -       free_extent_map(em_new);
> > +       ret = split_zoned_em(inode, file_offset, file_len, pre, post);
> >
> >  out:
> > -       free_extent_map(em);
> >         btrfs_put_ordered_extent(ordered);
> >
> >         return errno_to_blk_status(ret);
> > --
> > 2.32.0
> >
>
>
> --
> Filipe David Manana,
>
> “Whether you think you can, or you think you can't — you're right.”

David Sterba July 28, 2021, 3:47 p.m. UTC | #7

On Thu, Jul 01, 2021 at 05:55:51PM +0100, Filipe Manana wrote:
> > > +       if (pre) {
> > > +               /* Insert the middle extent_map */
> > > +               split_mid->start = em->start + pre;
> > > +               split_mid->len = em->len - pre - post;
> > > +               split_mid->orig_start = split_mid->start;
> > > +               split_mid->block_start = em->block_start + pre;
> > > +               split_mid->block_len = split_mid->len;
> > > +               split_mid->orig_block_len = split_mid->block_len;
> > > +               split_mid->ram_bytes = split_mid->len;
> > > +               split_mid->flags = flags;
> > > +               split_mid->compress_type = em->compress_type;
> > > +               split_mid->generation = em->generation;
> > > +               add_extent_mapping(em_tree, split_mid, modified);
> > > +       }
> > > +
> > > +       if (post) {
> > > +               split_post->start = em->start + em->len - post;
> > > +               split_post->len = post;
> > > +               split_post->orig_start = split_post->start;
> > > +               split_post->block_start = em->block_start + em->len - post;
> > > +               split_post->block_len = split_post->len;
> > > +               split_post->orig_block_len = split_post->block_len;
> > > +               split_post->ram_bytes = split_post->len;
> > > +               split_post->flags = flags;
> > > +               split_post->compress_type = em->compress_type;
> > > +               split_post->generation = em->generation;
> > > +               add_extent_mapping(em_tree, split_post, modified);
> > > +       }
> >
> > So this happens when running delalloc, after creating the original
> > extent map and ordered extent, the original "em" must have had the
> > PINNED flag set.
> >
> > The "pre" and "post" extent maps should have the PINNED flag set. It's
> > important to have the flag set to prevent extent map merging, which
> > could result in a log corruption if the file is being fsync'ed
> > multiple times in the current transaction and running delalloc was
> > triggered precisely by an fsync. The corruption result would be
> > logging extent items with overlapping ranges, since we construct them
> > based on extent maps, and that's why we have the PINNED flag to
> > prevent merging.
> 
> Well, it actually happens that merging should not happen because the
> original extent map was in the list of modified extents, and so should
> be the new extent maps.
> But we are really supposed to have the PINNED flag from the moment we
> run delalloc and create a new extent map until the respective ordered
> extent completes and unpins it.
> 
> Also EXTENT_FLAG_LOGGING should not be set at this point - if it were
> we would screw up with a task logging the extent map.
> 
> Maybe assert that it is not set in the original extent map?
> And also assert that the original em is in the list of modified
> extents and has the PINNED flag set?

Agreed, the asserts should be here, Naohiro, please send a followup,
thanks.

Naohiro Aota Aug. 3, 2021, 11:51 p.m. UTC | #8

On Wed, Jul 28, 2021 at 05:47:03PM +0200, David Sterba wrote:
> On Thu, Jul 01, 2021 at 05:55:51PM +0100, Filipe Manana wrote:
> > > > +       if (pre) {
> > > > +               /* Insert the middle extent_map */
> > > > +               split_mid->start = em->start + pre;
> > > > +               split_mid->len = em->len - pre - post;
> > > > +               split_mid->orig_start = split_mid->start;
> > > > +               split_mid->block_start = em->block_start + pre;
> > > > +               split_mid->block_len = split_mid->len;
> > > > +               split_mid->orig_block_len = split_mid->block_len;
> > > > +               split_mid->ram_bytes = split_mid->len;
> > > > +               split_mid->flags = flags;
> > > > +               split_mid->compress_type = em->compress_type;
> > > > +               split_mid->generation = em->generation;
> > > > +               add_extent_mapping(em_tree, split_mid, modified);
> > > > +       }
> > > > +
> > > > +       if (post) {
> > > > +               split_post->start = em->start + em->len - post;
> > > > +               split_post->len = post;
> > > > +               split_post->orig_start = split_post->start;
> > > > +               split_post->block_start = em->block_start + em->len - post;
> > > > +               split_post->block_len = split_post->len;
> > > > +               split_post->orig_block_len = split_post->block_len;
> > > > +               split_post->ram_bytes = split_post->len;
> > > > +               split_post->flags = flags;
> > > > +               split_post->compress_type = em->compress_type;
> > > > +               split_post->generation = em->generation;
> > > > +               add_extent_mapping(em_tree, split_post, modified);
> > > > +       }
> > >
> > > So this happens when running delalloc, after creating the original
> > > extent map and ordered extent, the original "em" must have had the
> > > PINNED flag set.
> > >
> > > The "pre" and "post" extent maps should have the PINNED flag set. It's
> > > important to have the flag set to prevent extent map merging, which
> > > could result in a log corruption if the file is being fsync'ed
> > > multiple times in the current transaction and running delalloc was
> > > triggered precisely by an fsync. The corruption result would be
> > > logging extent items with overlapping ranges, since we construct them
> > > based on extent maps, and that's why we have the PINNED flag to
> > > prevent merging.
> > 
> > Well, it actually happens that merging should not happen because the
> > original extent map was in the list of modified extents, and so should
> > be the new extent maps.
> > But we are really supposed to have the PINNED flag from the moment we
> > run delalloc and create a new extent map until the respective ordered
> > extent completes and unpins it.
> > 
> > Also EXTENT_FLAG_LOGGING should not be set at this point - if it were
> > we would screw up with a task logging the extent map.
> > 
> > Maybe assert that it is not set in the original extent map?
> > And also assert that the original em is in the list of modified
> > extents and has the PINNED flag set?
> 
> Agreed, the asserts should be here, Naohiro, please send a followup,
> thanks.

Sure. I'll soon send an update.

btrfs: properly split extent_map for REQ_OP_ZONE_APPEND

Commit Message

Comments

Patch