Message ID | cover.1712875458.git.wqu@suse.com (mailing list archive) |
---|---|
Headers | show |
Series | btrfs: more explaination on extent_map members | expand |
On Thu, Apr 11, 2024 at 11:48 PM Qu Wenruo <wqu@suse.com> wrote: > > [REPO] > https://github.com/adam900710/linux.git/ em_cleanup > > The first 3 patches only. As later patches are the huge rename part. > > [CHANGELOG] > v4: > - Add extra comments for the extent map merging > Since extent maps can be merged, the members are only matching > on-disk file extent items before merging. > This also means users of extent_map should not rely on it for > a reliable file extent item member. > (That's also why we use ordered_extent not extent_maps, to update > file extent items) So this isn't true. We use information from the ordered extent to create the file extent item, when finishing the ordered extent, because the ordered extent is immediately accessible. We could also use the extent map, we would just have to do a tree search for it. The extent map created when starting delalloc or at the beginning of a DIO write is pinned and in the list of modified extents, so it can't be merged and it represents exactly one file extent item - the one going to be inserted the inode's subvolume tree when finishing the ordered extent. > > v3: > - Rebased to latest for-next branch > - Further comments polishment > - Coding style update to follow the guideline > > v2: > - Add Filipe's cleanup on mod_start/mod_len > These two members are no longer utilized, saving me quite some time on > digging into their usage. > > - Update the comments of the extent_map structure > To make them more readable and less confusing. > > - Further cleanup for inline extent_map reading > > - A new patch to do extra sanity checks for create_io_em() > Firstly pure NOCOW writes should not call create_io_em(), secondly > with the new knowledge of extent_map, it's easier to do extra sanity > checks for the already pretty long parameter list. > > Btrfs uses extent_map to represent a in-memory file extent. > > There are severam members that are 1:1 mappe in on-disk file extent > items and extent maps: > > - extent_map::start == key.offset > - extent_map::len == file_extent_num_bytes > - extent_map::ram_bytes == file_extent_ram_bytes > > But that's all, the remaining are pretty different: > > - Use block_start to indicate holes/inline extents > Meanwhile btrfs on-disk file extent items go with a dedicated type for > inline extents, and disk_bytenr 0 for holes. > > - Weird block_start/orig_block_len/orig_start > In theory we can directly go with the same file_extent_disk_bytenr, > file_extent_disk_num_bytes and file_extent_offset to calculate the > remaining members (block_start/orig_start/orig_block_len/block_len). > > But for whatever reason, we didn't go that path and have a hell of > weird and inconsistent calculation for them. > > Qu Wenruo (3): > btrfs: add extra comments on extent_map members > btrfs: simplify the inline extent map creation > btrfs: add extra sanity checks for create_io_em() > > fs/btrfs/extent_map.h | 58 ++++++++++++++++++++++++++++++++++++++++++- > fs/btrfs/file-item.c | 20 +++++++-------- > fs/btrfs/inode.c | 40 ++++++++++++++++++++++++++++- > 3 files changed, 106 insertions(+), 12 deletions(-) > > -- > 2.44.0 > >
在 2024/4/12 20:01, Filipe Manana 写道: > On Thu, Apr 11, 2024 at 11:48 PM Qu Wenruo <wqu@suse.com> wrote: >> >> [REPO] >> https://github.com/adam900710/linux.git/ em_cleanup >> >> The first 3 patches only. As later patches are the huge rename part. >> >> [CHANGELOG] >> v4: >> - Add extra comments for the extent map merging >> Since extent maps can be merged, the members are only matching >> on-disk file extent items before merging. >> This also means users of extent_map should not rely on it for >> a reliable file extent item member. >> (That's also why we use ordered_extent not extent_maps, to update >> file extent items) > > So this isn't true. > We use information from the ordered extent to create the file extent > item, when finishing the ordered extent, because the ordered extent is > immediately accessible. > We could also use the extent map, we would just have to do a tree search for it. > The extent map created when starting delalloc or at the beginning of a > DIO write is pinned and in the list of modified extents, so it can't > be merged and it represents exactly one file extent item - the one > going to be inserted the inode's subvolume tree when finishing the > ordered extent. If you're doing COW writes, extent map is fine, as it's always new extents and they are all pinned thus won't be merged. For pure NOCOW, it's also fine because there would be no extent map inserted, nor updating file extent items. But for PREALLOC writes (NOCOW into preallocated writes), it's not the case as the target extent maps can be merged already. Thus that's why for NOCOW we go with a subvolume tree search to grab the correct original disk_bytenr/disk_num_bytes/offset to go. If you use extent map instead of search subvolume tree, the following layout can screw up everything: Filepos 0: disk_bytenr X, disk_num_bytes 64K num_bytes 64K, offset 0, ram_bytes 64K Filepos 64K disk_bytenr X+64K, disk_num_bytes 64K num_bytes 64K, offset 0, ram_bytes 64K. The extent map would be (after merge): filepos 0, len 128K block_start X, block len 128K, orig_start 0. Even with my new disk_bytenr it would be no difference, the disk_bytenr would still be X, and disk_num_bytes would be merged 128K. So the merging behavior is affecting PREALLOC writes, and we're already avoiding the only pitfall. Thanks, Qu > >> >> v3: >> - Rebased to latest for-next branch >> - Further comments polishment >> - Coding style update to follow the guideline >> >> v2: >> - Add Filipe's cleanup on mod_start/mod_len >> These two members are no longer utilized, saving me quite some time on >> digging into their usage. >> >> - Update the comments of the extent_map structure >> To make them more readable and less confusing. >> >> - Further cleanup for inline extent_map reading >> >> - A new patch to do extra sanity checks for create_io_em() >> Firstly pure NOCOW writes should not call create_io_em(), secondly >> with the new knowledge of extent_map, it's easier to do extra sanity >> checks for the already pretty long parameter list. >> >> Btrfs uses extent_map to represent a in-memory file extent. >> >> There are severam members that are 1:1 mappe in on-disk file extent >> items and extent maps: >> >> - extent_map::start == key.offset >> - extent_map::len == file_extent_num_bytes >> - extent_map::ram_bytes == file_extent_ram_bytes >> >> But that's all, the remaining are pretty different: >> >> - Use block_start to indicate holes/inline extents >> Meanwhile btrfs on-disk file extent items go with a dedicated type for >> inline extents, and disk_bytenr 0 for holes. >> >> - Weird block_start/orig_block_len/orig_start >> In theory we can directly go with the same file_extent_disk_bytenr, >> file_extent_disk_num_bytes and file_extent_offset to calculate the >> remaining members (block_start/orig_start/orig_block_len/block_len). >> >> But for whatever reason, we didn't go that path and have a hell of >> weird and inconsistent calculation for them. >> >> Qu Wenruo (3): >> btrfs: add extra comments on extent_map members >> btrfs: simplify the inline extent map creation >> btrfs: add extra sanity checks for create_io_em() >> >> fs/btrfs/extent_map.h | 58 ++++++++++++++++++++++++++++++++++++++++++- >> fs/btrfs/file-item.c | 20 +++++++-------- >> fs/btrfs/inode.c | 40 ++++++++++++++++++++++++++++- >> 3 files changed, 106 insertions(+), 12 deletions(-) >> >> -- >> 2.44.0 >> >>
On Fri, Apr 12, 2024 at 11:45 AM Qu Wenruo <wqu@suse.com> wrote: > > > > 在 2024/4/12 20:01, Filipe Manana 写道: > > On Thu, Apr 11, 2024 at 11:48 PM Qu Wenruo <wqu@suse.com> wrote: > >> > >> [REPO] > >> https://github.com/adam900710/linux.git/ em_cleanup > >> > >> The first 3 patches only. As later patches are the huge rename part. > >> > >> [CHANGELOG] > >> v4: > >> - Add extra comments for the extent map merging > >> Since extent maps can be merged, the members are only matching > >> on-disk file extent items before merging. > >> This also means users of extent_map should not rely on it for > >> a reliable file extent item member. > >> (That's also why we use ordered_extent not extent_maps, to update > >> file extent items) > > > > So this isn't true. > > We use information from the ordered extent to create the file extent > > item, when finishing the ordered extent, because the ordered extent is > > immediately accessible. > > We could also use the extent map, we would just have to do a tree search for it. > > The extent map created when starting delalloc or at the beginning of a > > DIO write is pinned and in the list of modified extents, so it can't > > be merged and it represents exactly one file extent item - the one > > going to be inserted the inode's subvolume tree when finishing the > > ordered extent. > > If you're doing COW writes, extent map is fine, as it's always new > extents and they are all pinned thus won't be merged. > > For pure NOCOW, it's also fine because there would be no extent map > inserted, nor updating file extent items. > > But for PREALLOC writes (NOCOW into preallocated writes), it's not the > case as the target extent maps can be merged already. Yes, but prealloc is an exception, as we don't insert a new file extent item, just update an existing one or split and change the type from prealloc to regular, etc. My comment was obviously for COW, which is the only case that needs to insert a new file extent item. > > Thus that's why for NOCOW we go with a subvolume tree search to grab the > correct original disk_bytenr/disk_num_bytes/offset to go. > > If you use extent map instead of search subvolume tree, the following > layout can screw up everything: > > Filepos 0: > disk_bytenr X, disk_num_bytes 64K > num_bytes 64K, offset 0, ram_bytes 64K > > Filepos 64K > disk_bytenr X+64K, disk_num_bytes 64K > num_bytes 64K, offset 0, ram_bytes 64K. > > The extent map would be (after merge): > > filepos 0, len 128K block_start X, block len 128K, orig_start 0. > > Even with my new disk_bytenr it would be no difference, the disk_bytenr > would still be X, and disk_num_bytes would be merged 128K. > > So the merging behavior is affecting PREALLOC writes, and we're already > avoiding the only pitfall. > > Thanks, > Qu > > > > >> > >> v3: > >> - Rebased to latest for-next branch > >> - Further comments polishment > >> - Coding style update to follow the guideline > >> > >> v2: > >> - Add Filipe's cleanup on mod_start/mod_len > >> These two members are no longer utilized, saving me quite some time on > >> digging into their usage. > >> > >> - Update the comments of the extent_map structure > >> To make them more readable and less confusing. > >> > >> - Further cleanup for inline extent_map reading > >> > >> - A new patch to do extra sanity checks for create_io_em() > >> Firstly pure NOCOW writes should not call create_io_em(), secondly > >> with the new knowledge of extent_map, it's easier to do extra sanity > >> checks for the already pretty long parameter list. > >> > >> Btrfs uses extent_map to represent a in-memory file extent. > >> > >> There are severam members that are 1:1 mappe in on-disk file extent > >> items and extent maps: > >> > >> - extent_map::start == key.offset > >> - extent_map::len == file_extent_num_bytes > >> - extent_map::ram_bytes == file_extent_ram_bytes > >> > >> But that's all, the remaining are pretty different: > >> > >> - Use block_start to indicate holes/inline extents > >> Meanwhile btrfs on-disk file extent items go with a dedicated type for > >> inline extents, and disk_bytenr 0 for holes. > >> > >> - Weird block_start/orig_block_len/orig_start > >> In theory we can directly go with the same file_extent_disk_bytenr, > >> file_extent_disk_num_bytes and file_extent_offset to calculate the > >> remaining members (block_start/orig_start/orig_block_len/block_len). > >> > >> But for whatever reason, we didn't go that path and have a hell of > >> weird and inconsistent calculation for them. > >> > >> Qu Wenruo (3): > >> btrfs: add extra comments on extent_map members > >> btrfs: simplify the inline extent map creation > >> btrfs: add extra sanity checks for create_io_em() > >> > >> fs/btrfs/extent_map.h | 58 ++++++++++++++++++++++++++++++++++++++++++- > >> fs/btrfs/file-item.c | 20 +++++++-------- > >> fs/btrfs/inode.c | 40 ++++++++++++++++++++++++++++- > >> 3 files changed, 106 insertions(+), 12 deletions(-) > >> > >> -- > >> 2.44.0 > >> > >>