mbox series

[v4,0/3] btrfs: more explaination on extent_map members

Message ID cover.1712875458.git.wqu@suse.com (mailing list archive)
Headers show
Series btrfs: more explaination on extent_map members | expand

Message

Qu Wenruo April 11, 2024, 10:47 p.m. UTC
[REPO]
https://github.com/adam900710/linux.git/ em_cleanup

The first 3 patches only. As later patches are the huge rename part.

[CHANGELOG]
v4:
- Add extra comments for the extent map merging
  Since extent maps can be merged, the members are only matching
  on-disk file extent items before merging.
  This also means users of extent_map should not rely on it for
  a reliable file extent item member.
  (That's also why we use ordered_extent not extent_maps, to update
   file extent items)

v3:
- Rebased to latest for-next branch
- Further comments polishment
- Coding style update to follow the guideline

v2:
- Add Filipe's cleanup on mod_start/mod_len
  These two members are no longer utilized, saving me quite some time on
  digging into their usage.

- Update the comments of the extent_map structure
  To make them more readable and less confusing.

- Further cleanup for inline extent_map reading

- A new patch to do extra sanity checks for create_io_em()
  Firstly pure NOCOW writes should not call create_io_em(), secondly
  with the new knowledge of extent_map, it's easier to do extra sanity
  checks for the already pretty long parameter list.

Btrfs uses extent_map to represent a in-memory file extent.

There are severam members that are 1:1 mappe in on-disk file extent
items and extent maps:

- extent_map::start	==	key.offset
- extent_map::len	==	file_extent_num_bytes
- extent_map::ram_bytes	==	file_extent_ram_bytes

But that's all, the remaining are pretty different:

- Use block_start to indicate holes/inline extents
  Meanwhile btrfs on-disk file extent items go with a dedicated type for
  inline extents, and disk_bytenr 0 for holes.

- Weird block_start/orig_block_len/orig_start
  In theory we can directly go with the same file_extent_disk_bytenr,
  file_extent_disk_num_bytes and file_extent_offset to calculate the
  remaining members (block_start/orig_start/orig_block_len/block_len).

  But for whatever reason, we didn't go that path and have a hell of
  weird and inconsistent calculation for them.

Qu Wenruo (3):
  btrfs: add extra comments on extent_map members
  btrfs: simplify the inline extent map creation
  btrfs: add extra sanity checks for create_io_em()

 fs/btrfs/extent_map.h | 58 ++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/file-item.c  | 20 +++++++--------
 fs/btrfs/inode.c      | 40 ++++++++++++++++++++++++++++-
 3 files changed, 106 insertions(+), 12 deletions(-)

Comments

Filipe Manana April 12, 2024, 10:31 a.m. UTC | #1
On Thu, Apr 11, 2024 at 11:48 PM Qu Wenruo <wqu@suse.com> wrote:
>
> [REPO]
> https://github.com/adam900710/linux.git/ em_cleanup
>
> The first 3 patches only. As later patches are the huge rename part.
>
> [CHANGELOG]
> v4:
> - Add extra comments for the extent map merging
>   Since extent maps can be merged, the members are only matching
>   on-disk file extent items before merging.
>   This also means users of extent_map should not rely on it for
>   a reliable file extent item member.
>   (That's also why we use ordered_extent not extent_maps, to update
>    file extent items)

So this isn't true.
We use information from the ordered extent to create the file extent
item, when finishing the ordered extent, because the ordered extent is
immediately accessible.
We could also use the extent map, we would just have to do a tree search for it.
The extent map created when starting delalloc or at the beginning of a
DIO write is pinned and in the list of modified extents, so it can't
be merged and it represents exactly one file extent item - the one
going to be inserted the inode's subvolume tree when finishing the
ordered extent.

>
> v3:
> - Rebased to latest for-next branch
> - Further comments polishment
> - Coding style update to follow the guideline
>
> v2:
> - Add Filipe's cleanup on mod_start/mod_len
>   These two members are no longer utilized, saving me quite some time on
>   digging into their usage.
>
> - Update the comments of the extent_map structure
>   To make them more readable and less confusing.
>
> - Further cleanup for inline extent_map reading
>
> - A new patch to do extra sanity checks for create_io_em()
>   Firstly pure NOCOW writes should not call create_io_em(), secondly
>   with the new knowledge of extent_map, it's easier to do extra sanity
>   checks for the already pretty long parameter list.
>
> Btrfs uses extent_map to represent a in-memory file extent.
>
> There are severam members that are 1:1 mappe in on-disk file extent
> items and extent maps:
>
> - extent_map::start     ==      key.offset
> - extent_map::len       ==      file_extent_num_bytes
> - extent_map::ram_bytes ==      file_extent_ram_bytes
>
> But that's all, the remaining are pretty different:
>
> - Use block_start to indicate holes/inline extents
>   Meanwhile btrfs on-disk file extent items go with a dedicated type for
>   inline extents, and disk_bytenr 0 for holes.
>
> - Weird block_start/orig_block_len/orig_start
>   In theory we can directly go with the same file_extent_disk_bytenr,
>   file_extent_disk_num_bytes and file_extent_offset to calculate the
>   remaining members (block_start/orig_start/orig_block_len/block_len).
>
>   But for whatever reason, we didn't go that path and have a hell of
>   weird and inconsistent calculation for them.
>
> Qu Wenruo (3):
>   btrfs: add extra comments on extent_map members
>   btrfs: simplify the inline extent map creation
>   btrfs: add extra sanity checks for create_io_em()
>
>  fs/btrfs/extent_map.h | 58 ++++++++++++++++++++++++++++++++++++++++++-
>  fs/btrfs/file-item.c  | 20 +++++++--------
>  fs/btrfs/inode.c      | 40 ++++++++++++++++++++++++++++-
>  3 files changed, 106 insertions(+), 12 deletions(-)
>
> --
> 2.44.0
>
>
Qu Wenruo April 12, 2024, 10:45 a.m. UTC | #2
在 2024/4/12 20:01, Filipe Manana 写道:
> On Thu, Apr 11, 2024 at 11:48 PM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [REPO]
>> https://github.com/adam900710/linux.git/ em_cleanup
>>
>> The first 3 patches only. As later patches are the huge rename part.
>>
>> [CHANGELOG]
>> v4:
>> - Add extra comments for the extent map merging
>>    Since extent maps can be merged, the members are only matching
>>    on-disk file extent items before merging.
>>    This also means users of extent_map should not rely on it for
>>    a reliable file extent item member.
>>    (That's also why we use ordered_extent not extent_maps, to update
>>     file extent items)
> 
> So this isn't true.
> We use information from the ordered extent to create the file extent
> item, when finishing the ordered extent, because the ordered extent is
> immediately accessible.
> We could also use the extent map, we would just have to do a tree search for it.
> The extent map created when starting delalloc or at the beginning of a
> DIO write is pinned and in the list of modified extents, so it can't
> be merged and it represents exactly one file extent item - the one
> going to be inserted the inode's subvolume tree when finishing the
> ordered extent.

If you're doing COW writes, extent map is fine, as it's always new 
extents and they are all pinned thus won't be merged.

For pure NOCOW, it's also fine because there would be no extent map 
inserted, nor updating file extent items.

But for PREALLOC writes (NOCOW into preallocated writes), it's not the 
case as the target extent maps can be merged already.

Thus that's why for NOCOW we go with a subvolume tree search to grab the 
correct original disk_bytenr/disk_num_bytes/offset to go.

If you use extent map instead of search subvolume tree, the following 
layout can screw up everything:

Filepos 0:
disk_bytenr X, disk_num_bytes 64K
num_bytes 64K, offset 0, ram_bytes 64K

Filepos 64K
disk_bytenr X+64K, disk_num_bytes 64K
num_bytes 64K, offset 0, ram_bytes 64K.

The extent map would be (after merge):

filepos 0, len 128K block_start X, block len 128K, orig_start 0.

Even with my new disk_bytenr it would be no difference, the disk_bytenr 
would still be X, and disk_num_bytes would be merged 128K.

So the merging behavior is affecting PREALLOC writes, and we're already 
avoiding the only pitfall.

Thanks,
Qu

> 
>>
>> v3:
>> - Rebased to latest for-next branch
>> - Further comments polishment
>> - Coding style update to follow the guideline
>>
>> v2:
>> - Add Filipe's cleanup on mod_start/mod_len
>>    These two members are no longer utilized, saving me quite some time on
>>    digging into their usage.
>>
>> - Update the comments of the extent_map structure
>>    To make them more readable and less confusing.
>>
>> - Further cleanup for inline extent_map reading
>>
>> - A new patch to do extra sanity checks for create_io_em()
>>    Firstly pure NOCOW writes should not call create_io_em(), secondly
>>    with the new knowledge of extent_map, it's easier to do extra sanity
>>    checks for the already pretty long parameter list.
>>
>> Btrfs uses extent_map to represent a in-memory file extent.
>>
>> There are severam members that are 1:1 mappe in on-disk file extent
>> items and extent maps:
>>
>> - extent_map::start     ==      key.offset
>> - extent_map::len       ==      file_extent_num_bytes
>> - extent_map::ram_bytes ==      file_extent_ram_bytes
>>
>> But that's all, the remaining are pretty different:
>>
>> - Use block_start to indicate holes/inline extents
>>    Meanwhile btrfs on-disk file extent items go with a dedicated type for
>>    inline extents, and disk_bytenr 0 for holes.
>>
>> - Weird block_start/orig_block_len/orig_start
>>    In theory we can directly go with the same file_extent_disk_bytenr,
>>    file_extent_disk_num_bytes and file_extent_offset to calculate the
>>    remaining members (block_start/orig_start/orig_block_len/block_len).
>>
>>    But for whatever reason, we didn't go that path and have a hell of
>>    weird and inconsistent calculation for them.
>>
>> Qu Wenruo (3):
>>    btrfs: add extra comments on extent_map members
>>    btrfs: simplify the inline extent map creation
>>    btrfs: add extra sanity checks for create_io_em()
>>
>>   fs/btrfs/extent_map.h | 58 ++++++++++++++++++++++++++++++++++++++++++-
>>   fs/btrfs/file-item.c  | 20 +++++++--------
>>   fs/btrfs/inode.c      | 40 ++++++++++++++++++++++++++++-
>>   3 files changed, 106 insertions(+), 12 deletions(-)
>>
>> --
>> 2.44.0
>>
>>
Filipe Manana April 12, 2024, 1:16 p.m. UTC | #3
On Fri, Apr 12, 2024 at 11:45 AM Qu Wenruo <wqu@suse.com> wrote:
>
>
>
> 在 2024/4/12 20:01, Filipe Manana 写道:
> > On Thu, Apr 11, 2024 at 11:48 PM Qu Wenruo <wqu@suse.com> wrote:
> >>
> >> [REPO]
> >> https://github.com/adam900710/linux.git/ em_cleanup
> >>
> >> The first 3 patches only. As later patches are the huge rename part.
> >>
> >> [CHANGELOG]
> >> v4:
> >> - Add extra comments for the extent map merging
> >>    Since extent maps can be merged, the members are only matching
> >>    on-disk file extent items before merging.
> >>    This also means users of extent_map should not rely on it for
> >>    a reliable file extent item member.
> >>    (That's also why we use ordered_extent not extent_maps, to update
> >>     file extent items)
> >
> > So this isn't true.
> > We use information from the ordered extent to create the file extent
> > item, when finishing the ordered extent, because the ordered extent is
> > immediately accessible.
> > We could also use the extent map, we would just have to do a tree search for it.
> > The extent map created when starting delalloc or at the beginning of a
> > DIO write is pinned and in the list of modified extents, so it can't
> > be merged and it represents exactly one file extent item - the one
> > going to be inserted the inode's subvolume tree when finishing the
> > ordered extent.
>
> If you're doing COW writes, extent map is fine, as it's always new
> extents and they are all pinned thus won't be merged.
>
> For pure NOCOW, it's also fine because there would be no extent map
> inserted, nor updating file extent items.
>
> But for PREALLOC writes (NOCOW into preallocated writes), it's not the
> case as the target extent maps can be merged already.

Yes, but prealloc is an exception, as we don't insert a new file
extent item, just update an existing one or split and change the type
from prealloc to regular, etc.
My comment was obviously for COW, which is the only case that needs to
insert a new file extent item.

>
> Thus that's why for NOCOW we go with a subvolume tree search to grab the
> correct original disk_bytenr/disk_num_bytes/offset to go.
>
> If you use extent map instead of search subvolume tree, the following
> layout can screw up everything:
>
> Filepos 0:
> disk_bytenr X, disk_num_bytes 64K
> num_bytes 64K, offset 0, ram_bytes 64K
>
> Filepos 64K
> disk_bytenr X+64K, disk_num_bytes 64K
> num_bytes 64K, offset 0, ram_bytes 64K.
>
> The extent map would be (after merge):
>
> filepos 0, len 128K block_start X, block len 128K, orig_start 0.
>
> Even with my new disk_bytenr it would be no difference, the disk_bytenr
> would still be X, and disk_num_bytes would be merged 128K.
>
> So the merging behavior is affecting PREALLOC writes, and we're already
> avoiding the only pitfall.
>
> Thanks,
> Qu
>
> >
> >>
> >> v3:
> >> - Rebased to latest for-next branch
> >> - Further comments polishment
> >> - Coding style update to follow the guideline
> >>
> >> v2:
> >> - Add Filipe's cleanup on mod_start/mod_len
> >>    These two members are no longer utilized, saving me quite some time on
> >>    digging into their usage.
> >>
> >> - Update the comments of the extent_map structure
> >>    To make them more readable and less confusing.
> >>
> >> - Further cleanup for inline extent_map reading
> >>
> >> - A new patch to do extra sanity checks for create_io_em()
> >>    Firstly pure NOCOW writes should not call create_io_em(), secondly
> >>    with the new knowledge of extent_map, it's easier to do extra sanity
> >>    checks for the already pretty long parameter list.
> >>
> >> Btrfs uses extent_map to represent a in-memory file extent.
> >>
> >> There are severam members that are 1:1 mappe in on-disk file extent
> >> items and extent maps:
> >>
> >> - extent_map::start     ==      key.offset
> >> - extent_map::len       ==      file_extent_num_bytes
> >> - extent_map::ram_bytes ==      file_extent_ram_bytes
> >>
> >> But that's all, the remaining are pretty different:
> >>
> >> - Use block_start to indicate holes/inline extents
> >>    Meanwhile btrfs on-disk file extent items go with a dedicated type for
> >>    inline extents, and disk_bytenr 0 for holes.
> >>
> >> - Weird block_start/orig_block_len/orig_start
> >>    In theory we can directly go with the same file_extent_disk_bytenr,
> >>    file_extent_disk_num_bytes and file_extent_offset to calculate the
> >>    remaining members (block_start/orig_start/orig_block_len/block_len).
> >>
> >>    But for whatever reason, we didn't go that path and have a hell of
> >>    weird and inconsistent calculation for them.
> >>
> >> Qu Wenruo (3):
> >>    btrfs: add extra comments on extent_map members
> >>    btrfs: simplify the inline extent map creation
> >>    btrfs: add extra sanity checks for create_io_em()
> >>
> >>   fs/btrfs/extent_map.h | 58 ++++++++++++++++++++++++++++++++++++++++++-
> >>   fs/btrfs/file-item.c  | 20 +++++++--------
> >>   fs/btrfs/inode.c      | 40 ++++++++++++++++++++++++++++-
> >>   3 files changed, 106 insertions(+), 12 deletions(-)
> >>
> >> --
> >> 2.44.0
> >>
> >>