Message ID | 20200707035944.15150-1-robbieko@synology.com (mailing list archive)
State      | New, archived
Series     | [v2] btrfs: speedup mount time with readahead chunk tree
On Tue, Jul 07, 2020 at 11:59:44AM +0800, robbieko wrote: > From: Robbie Ko <robbieko@synology.com> > > When mounting, we always need to read the whole chunk tree, > when there are too many chunk items, most of the time is > spent on btrfs_read_chunk_tree, because we only read one > leaf at a time. > > It is unreasonable to limit the readahead mechanism to a > range of 64k, so we have removed that limit. > > In addition we added reada_maximum_size to customize the > size of the pre-reader, The default is 64k to maintain the > original behavior. > > So we fix this by used readahead mechanism, and set readahead > max size to ULLONG_MAX which reads all the leaves after the > key in the node when reading a level 1 node. The readahead of chunk tree is a special case as we know we will need the whole tree, in all other cases the search readahead needs is supposed to read only one leaf. For that reason I don't want to touch the current path readahead logic at all and do the chunk tree readahead in one go instead of the per-search. Also I don't like to see size increase of btrfs_path just to use the custom once. The idea of the whole tree readahead is to do something like: - find first item - start readahead on all leaves from its level 1 node parent (readahead_tree_block) - when the level 1 parent changes during iterating items, start the readahead again This skips readahead of all nodes above level 1, if you find a nicer way to readahead the whole tree I won't object, but for the first implementation the level 1 seems ok to me. > I have a test environment as follows: > > 200TB btrfs volume: used 192TB > > Data, single: total=192.00TiB, used=192.00TiB > System, DUP: total=40.00MiB, used=19.91MiB Can you please check what's the chunk tree height? 'btrfs inspect tree-stats' prints that but it takes long as needs to go through the whole metadata, so extracting it from 'btrfs inspect dump-tree -c chunk' would be faster. Thanks.
David Sterba 於 2020/7/8 上午3:25 寫道: > On Tue, Jul 07, 2020 at 11:59:44AM +0800, robbieko wrote: >> From: Robbie Ko <robbieko@synology.com> >> >> When mounting, we always need to read the whole chunk tree, >> when there are too many chunk items, most of the time is >> spent on btrfs_read_chunk_tree, because we only read one >> leaf at a time. >> >> It is unreasonable to limit the readahead mechanism to a >> range of 64k, so we have removed that limit. >> >> In addition we added reada_maximum_size to customize the >> size of the pre-reader, The default is 64k to maintain the >> original behavior. >> >> So we fix this by used readahead mechanism, and set readahead >> max size to ULLONG_MAX which reads all the leaves after the >> key in the node when reading a level 1 node. > The readahead of chunk tree is a special case as we know we will need > the whole tree, in all other cases the search readahead needs is > supposed to read only one leaf. If, in most cases, readahead requires that only one leaf be read, then reada_ maximum_size should be nodesize instead of 64k, or use reada_maximum_ nr (default:1) seems better. > > For that reason I don't want to touch the current path readahead logic > at all and do the chunk tree readahead in one go instead of the > per-search. I don't know why we don't make the change to readahead, because the current readahead is limited to the logical address in 64k is very unreasonable, and there is a good chance that the logical address of the next leaf node will not appear in 64k, so the existing readahead is almost useless. > > Also I don't like to see size increase of btrfs_path just to use the > custom once. This variable is the parameter that controls the speed of the readahead, and I think it should be adjustable, not the hard code in the readahead function. In the future, more scenarios will be available. For example, BGTREE. will be improved significantly faster, My own tests have improved the speed by almost 500%. Reference: https://lwn.net/Articles/801990 / > > The idea of the whole tree readahead is to do something like: > > - find first item > - start readahead on all leaves from its level 1 node parent > (readahead_tree_block) > - when the level 1 parent changes during iterating items, start the > readahead again > > This skips readahead of all nodes above level 1, if you find a nicer way > to readahead the whole tree I won't object, but for the first > implementation the level 1 seems ok to me. > >> I have a test environment as follows: >> >> 200TB btrfs volume: used 192TB >> >> Data, single: total=192.00TiB, used=192.00TiB >> System, DUP: total=40.00MiB, used=19.91MiB > Can you please check what's the chunk tree height? 'btrfs inspect > tree-stats' prints that but it takes long as needs to go through the > whole metadata, so extracting it from 'btrfs inspect dump-tree -c chunk' > would be faster. Thanks. Chunk tree height 3, level (0-2)
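A minimal sketch of the leaf-count cap suggested above: bound reada_for_search() by a number of leaves rather than by a 64k byte window. The field reada_maximum_nr does not exist in the kernel and the forward-only loop is simplified for illustration; btrfs_header_nritems(), btrfs_node_blockptr() and readahead_tree_block() are the existing btrfs helpers.

/*
 * Illustrative only: cap readahead by leaf count instead of a byte
 * window around the target block.  'reada_maximum_nr' is a
 * hypothetical btrfs_path field (default 1), per the suggestion above.
 */
static void reada_for_search_by_nr(struct btrfs_fs_info *fs_info,
				   struct btrfs_path *path,
				   int level, int slot)
{
	struct extent_buffer *node = path->nodes[level];
	u32 nritems;
	u32 nr = slot;
	u32 nscan = 0;

	if (level != 1 || !node)
		return;

	nritems = btrfs_header_nritems(node);
	while (++nr < nritems) {
		/* prefetch the next leaf referenced by this level 1 node */
		readahead_tree_block(fs_info, btrfs_node_blockptr(node, nr));
		if (++nscan >= path->reada_maximum_nr)
			break;
	}
}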
On Wed, Jul 08, 2020 at 10:19:22AM +0800, Robbie Ko wrote: > David Sterba 於 2020/7/8 上午3:25 寫道: > > On Tue, Jul 07, 2020 at 11:59:44AM +0800, robbieko wrote: > >> From: Robbie Ko <robbieko@synology.com> > >> > >> When mounting, we always need to read the whole chunk tree, > >> when there are too many chunk items, most of the time is > >> spent on btrfs_read_chunk_tree, because we only read one > >> leaf at a time. > >> > >> It is unreasonable to limit the readahead mechanism to a > >> range of 64k, so we have removed that limit. > >> > >> In addition we added reada_maximum_size to customize the > >> size of the pre-reader, The default is 64k to maintain the > >> original behavior. > >> > >> So we fix this by used readahead mechanism, and set readahead > >> max size to ULLONG_MAX which reads all the leaves after the > >> key in the node when reading a level 1 node. > > The readahead of chunk tree is a special case as we know we will need > > the whole tree, in all other cases the search readahead needs is > > supposed to read only one leaf. > > If, in most cases, readahead requires that only one leaf be read, then > reada_ maximum_size should be nodesize instead of 64k, or use > reada_maximum_ nr (default:1) seems better. > > > > > For that reason I don't want to touch the current path readahead logic > > at all and do the chunk tree readahead in one go instead of the > > per-search. > > I don't know why we don't make the change to readahead, because the current > readahead is limited to the logical address in 64k is very unreasonable, > and there is a good chance that the logical address of the next leaf > node will > not appear in 64k, so the existing readahead is almost useless. I see and it seems that the assumption about layout and chances succesfuly read blocks ahead is not valid. The logic of readahead could be improved but that would need more performance evaluation. > > Also I don't like to see size increase of btrfs_path just to use the > > custom once. > > This variable is the parameter that controls the speed of the readahead, > and I think it should be adjustable, not the hard code in the readahead > function. Yes, but it takes 8 bytes and stays constant that does not even need 8 bytes. I just don't want to touch the generic readahead logic that is started by b-tree search because that would affect all workloads, while your're interested in speeding up the chunk tree load. > In the future, more scenarios will be available. > For example, BGTREE. will be improved significantly faster, > My own tests have improved the speed by almost 500%. > Reference: https://lwn.net/Articles/801990 / The optimized block group items whould be a huge win even without the readahead but that's a different problem. > > The idea of the whole tree readahead is to do something like: > > > > - find first item > > - start readahead on all leaves from its level 1 node parent > > (readahead_tree_block) > > - when the level 1 parent changes during iterating items, start the > > readahead again > > > > This skips readahead of all nodes above level 1, if you find a nicer way > > to readahead the whole tree I won't object, but for the first > > implementation the level 1 seems ok to me. > > > >> I have a test environment as follows: > >> > >> 200TB btrfs volume: used 192TB > >> > >> Data, single: total=192.00TiB, used=192.00TiB > >> System, DUP: total=40.00MiB, used=19.91MiB > > Can you please check what's the chunk tree height? 
'btrfs inspect > > tree-stats' prints that but it takes long as needs to go through the > > whole metadata, so extracting it from 'btrfs inspect dump-tree -c chunk' > > would be faster. Thanks. > Chunk tree height 3, level (0-2) Thanks.
On 2020-07-08 16:04, David Sterba wrote: > On Wed, Jul 08, 2020 at 10:19:22AM +0800, Robbie Ko wrote: >> David Sterba 於 2020/7/8 上午3:25 寫道: >>> On Tue, Jul 07, 2020 at 11:59:44AM +0800, robbieko wrote: >>>> From: Robbie Ko <robbieko@synology.com> >>>> >>>> When mounting, we always need to read the whole chunk tree, >>>> when there are too many chunk items, most of the time is >>>> spent on btrfs_read_chunk_tree, because we only read one >>>> leaf at a time. >>>> >>>> It is unreasonable to limit the readahead mechanism to a >>>> range of 64k, so we have removed that limit. >>>> >>>> In addition we added reada_maximum_size to customize the >>>> size of the pre-reader, The default is 64k to maintain the >>>> original behavior. >>>> >>>> So we fix this by used readahead mechanism, and set readahead >>>> max size to ULLONG_MAX which reads all the leaves after the >>>> key in the node when reading a level 1 node. >>> The readahead of chunk tree is a special case as we know we will need >>> the whole tree, in all other cases the search readahead needs is >>> supposed to read only one leaf. >> >> If, in most cases, readahead requires that only one leaf be read, then >> reada_ maximum_size should be nodesize instead of 64k, or use >> reada_maximum_ nr (default:1) seems better. >> >>> >>> For that reason I don't want to touch the current path readahead logic >>> at all and do the chunk tree readahead in one go instead of the >>> per-search. >> >> I don't know why we don't make the change to readahead, because the current >> readahead is limited to the logical address in 64k is very unreasonable, >> and there is a good chance that the logical address of the next leaf >> node will >> not appear in 64k, so the existing readahead is almost useless. > > I see and it seems that the assumption about layout and chances > succesfuly read blocks ahead is not valid. The logic of readahead could > be improved but that would need more performance evaluation. FWIW I gave this a try and see the following numbers, averaged over multiple mount/unmount cycles on spinning rust: without patch : ~2.7s with patch : ~4.5s ..ahem.. -h
On Wed, Jul 08, 2020 at 04:57:56PM +0200, Holger Hoffstätte wrote: > On 2020-07-08 16:04, David Sterba wrote: > > On Wed, Jul 08, 2020 at 10:19:22AM +0800, Robbie Ko wrote: > >> David Sterba 於 2020/7/8 上午3:25 寫道: > >> I don't know why we don't make the change to readahead, because the current > >> readahead is limited to the logical address in 64k is very unreasonable, > >> and there is a good chance that the logical address of the next leaf > >> node will > >> not appear in 64k, so the existing readahead is almost useless. > > > > I see and it seems that the assumption about layout and chances > > succesfuly read blocks ahead is not valid. The logic of readahead could > > be improved but that would need more performance evaluation. > > FWIW I gave this a try and see the following numbers, averaged over multiple > mount/unmount cycles on spinning rust: > > without patch : ~2.7s > with patch : ~4.5s > > ..ahem.. Hard to argue against numbers, thanks for posting that.
On Tue, Jul 07, 2020 at 09:25:11PM +0200, David Sterba wrote: > On Tue, Jul 07, 2020 at 11:59:44AM +0800, robbieko wrote: > > From: Robbie Ko <robbieko@synology.com> > > > > When mounting, we always need to read the whole chunk tree, > > when there are too many chunk items, most of the time is > > spent on btrfs_read_chunk_tree, because we only read one > > leaf at a time. > > > > It is unreasonable to limit the readahead mechanism to a > > range of 64k, so we have removed that limit. > > > > In addition we added reada_maximum_size to customize the > > size of the pre-reader, The default is 64k to maintain the > > original behavior. > > > > So we fix this by used readahead mechanism, and set readahead > > max size to ULLONG_MAX which reads all the leaves after the > > key in the node when reading a level 1 node. > > The readahead of chunk tree is a special case as we know we will need > the whole tree, in all other cases the search readahead needs is > supposed to read only one leaf. > > For that reason I don't want to touch the current path readahead logic > at all and do the chunk tree readahead in one go instead of the > per-search. > > Also I don't like to see size increase of btrfs_path just to use the > custom once. > > The idea of the whole tree readahead is to do something like: > > - find first item > - start readahead on all leaves from its level 1 node parent > (readahead_tree_block) > - when the level 1 parent changes during iterating items, start the > readahead again > > This skips readahead of all nodes above level 1, if you find a nicer way > to readahead the whole tree I won't object, but for the first > implementation the level 1 seems ok to me. Patch below, I tried to create large system chunk by fallocate on a sparse loop device, but got only 1 node on level 1 so the readahead cannot show off. # btrfs fi df . Data, single: total=59.83TiB, used=59.83TiB System, single: total=36.00MiB, used=6.20MiB Metadata, single: total=1.01GiB, used=91.78MiB GlobalReserve, single: total=26.80MiB, used=0.00B There were 395 leaf nodes that got read ahead, time between the first and last is 0.83s and the block group tree read took about 40 seconds. This was in a VM with file-backed images, and the loop device was constructed from these devices so it's spinning rust. I don't have results for non-prefetched mount to compare at the moment. 
----

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c7a3d4d730a3..e19891243199 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7013,6 +7013,19 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
+void readahead_tree_node_children(struct extent_buffer *node)
+{
+	int i;
+	const int nr_items = btrfs_header_nritems(node);
+
+	for (i = 0; i < nr_items; i++) {
+		u64 start;
+
+		start = btrfs_node_blockptr(node, i);
+		readahead_tree_block(node->fs_info, start);
+	}
+}
+
 int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_root *root = fs_info->chunk_root;
@@ -7023,6 +7036,7 @@ int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
 	int ret;
 	int slot;
 	u64 total_dev = 0;
+	u64 last_ra_node = 0;
 
 	path = btrfs_alloc_path();
 	if (!path)
@@ -7048,6 +7062,8 @@ int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
 	if (ret < 0)
 		goto error;
 	while (1) {
+		struct extent_buffer *node;
+
 		leaf = path->nodes[0];
 		slot = path->slots[0];
 		if (slot >= btrfs_header_nritems(leaf)) {
@@ -7058,6 +7074,13 @@ int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
 				goto error;
 			break;
 		}
+		node = path->nodes[1];
+		if (node) {
+			if (last_ra_node != node->start) {
+				readahead_tree_node_children(node);
+				last_ra_node = node->start;
+			}
+		}
 		btrfs_item_key_to_cpu(leaf, &found_key, slot);
 		if (found_key.type == BTRFS_DEV_ITEM_KEY) {
 			struct btrfs_dev_item *dev_item;
Holger Hoffstätte 於 2020/7/8 下午10:57 寫道: > On 2020-07-08 16:04, David Sterba wrote: >> On Wed, Jul 08, 2020 at 10:19:22AM +0800, Robbie Ko wrote: >>> David Sterba 於 2020/7/8 上午3:25 寫道: >>>> On Tue, Jul 07, 2020 at 11:59:44AM +0800, robbieko wrote: >>>>> From: Robbie Ko <robbieko@synology.com> >>>>> >>>>> When mounting, we always need to read the whole chunk tree, >>>>> when there are too many chunk items, most of the time is >>>>> spent on btrfs_read_chunk_tree, because we only read one >>>>> leaf at a time. >>>>> >>>>> It is unreasonable to limit the readahead mechanism to a >>>>> range of 64k, so we have removed that limit. >>>>> >>>>> In addition we added reada_maximum_size to customize the >>>>> size of the pre-reader, The default is 64k to maintain the >>>>> original behavior. >>>>> >>>>> So we fix this by used readahead mechanism, and set readahead >>>>> max size to ULLONG_MAX which reads all the leaves after the >>>>> key in the node when reading a level 1 node. >>>> The readahead of chunk tree is a special case as we know we will need >>>> the whole tree, in all other cases the search readahead needs is >>>> supposed to read only one leaf. >>> >>> If, in most cases, readahead requires that only one leaf be read, then >>> reada_ maximum_size should be nodesize instead of 64k, or use >>> reada_maximum_ nr (default:1) seems better. >>> >>>> >>>> For that reason I don't want to touch the current path readahead logic >>>> at all and do the chunk tree readahead in one go instead of the >>>> per-search. >>> >>> I don't know why we don't make the change to readahead, because the >>> current >>> readahead is limited to the logical address in 64k is very >>> unreasonable, >>> and there is a good chance that the logical address of the next leaf >>> node will >>> not appear in 64k, so the existing readahead is almost useless. >> >> I see and it seems that the assumption about layout and chances >> succesfuly read blocks ahead is not valid. The logic of readahead could >> be improved but that would need more performance evaluation. > > FWIW I gave this a try and see the following numbers, averaged over > multiple > mount/unmount cycles on spinning rust: > > without patch : ~2.7s > with patch : ~4.5s > > ..ahem.. > I have the following two questions for you. 1. What is the version you are using? 2. Can you please measure the time of btrfs_read_chunk_tree alone? I think the problem you are having is that btrfs_read_block_groups is slowing down because it is using the wrong READA flag, which is causing a lot of useless IO's when reading the block group. This can be fixed with the following commit. btrfs: block-group: don't set the wrong READA flag for btrfs_read_block_groups() https://git.kernel.org/pub/scm/linux/kernel /git/torvalds/linux.git/commit/?h=v5.8-rc4& id=83fe9e12b0558eae519351cff00da1e06bc054d2 > -h
David Sterba 於 2020/7/9 上午5:11 寫道: > On Tue, Jul 07, 2020 at 09:25:11PM +0200, David Sterba wrote: >> On Tue, Jul 07, 2020 at 11:59:44AM +0800, robbieko wrote: >>> From: Robbie Ko <robbieko@synology.com> >>> >>> When mounting, we always need to read the whole chunk tree, >>> when there are too many chunk items, most of the time is >>> spent on btrfs_read_chunk_tree, because we only read one >>> leaf at a time. >>> >>> It is unreasonable to limit the readahead mechanism to a >>> range of 64k, so we have removed that limit. >>> >>> In addition we added reada_maximum_size to customize the >>> size of the pre-reader, The default is 64k to maintain the >>> original behavior. >>> >>> So we fix this by used readahead mechanism, and set readahead >>> max size to ULLONG_MAX which reads all the leaves after the >>> key in the node when reading a level 1 node. >> The readahead of chunk tree is a special case as we know we will need >> the whole tree, in all other cases the search readahead needs is >> supposed to read only one leaf. >> >> For that reason I don't want to touch the current path readahead logic >> at all and do the chunk tree readahead in one go instead of the >> per-search. >> >> Also I don't like to see size increase of btrfs_path just to use the >> custom once. >> >> The idea of the whole tree readahead is to do something like: >> >> - find first item >> - start readahead on all leaves from its level 1 node parent >> (readahead_tree_block) >> - when the level 1 parent changes during iterating items, start the >> readahead again >> >> This skips readahead of all nodes above level 1, if you find a nicer way >> to readahead the whole tree I won't object, but for the first >> implementation the level 1 seems ok to me. > Patch below, I tried to create large system chunk by fallocate on a > sparse loop device, but got only 1 node on level 1 so the readahead > cannot show off. > > # btrfs fi df . > Data, single: total=59.83TiB, used=59.83TiB > System, single: total=36.00MiB, used=6.20MiB > Metadata, single: total=1.01GiB, used=91.78MiB > GlobalReserve, single: total=26.80MiB, used=0.00B > > There were 395 leaf nodes that got read ahead, time between the first > and last is 0.83s and the block group tree read took about 40 seconds. > This was in a VM with file-backed images, and the loop device was > constructed from these devices so it's spinning rust. > > I don't have results for non-prefetched mount to compare at the moment. > I think what you're doing is working. But there are many similar problems that need to be improved. 1. load_free_space_tree We need to read all BTRFS_FREE_SPACE_BITMAP_KEY and BTRFS_FREE_SPACE_EXTENT_KEY until the next FREE_SPACE_INFO_KEY. 2. populate_free_space_tree We need to read all BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_ITEM_KEY until the end of the block group 3. btrfs_real_readdir We need as many reads as possible (inode, BTRFS_DIR_INDEX_KEY). 4. btrfs_clone We need as many reads as possible (inode, BTRFS_EXTENT_DATA_KEY). 5. btrfs_verify_dev_extents We need to read all the BTRFS_DEV_EXTENT_KEYs. 6. caching_kthread (inode-map.c) We need all the BTRFS_INODE_ITEM_KEY of fs_tree to build the inode map For the above cases. It is not possible to write a special readahead code for each case. We have to provide a new readaread framework Enable the caller to determine the scope of readaheads needed. The possible parameters of the readahead are as follows 1. reada_maximum_nr : Read a maximum of several leaves at a time. 2. 
reada_max_key : READA_FORWARD Early Suspension Condition 3. reada_min_key : READA_BACK Abort condition ahead of time. We need to review all users of readahead to confirm that the The behavior of readahead. For example, in scrub_enumerate_chunks readahead has the effect of Very small, Because most of the time is spent on scrub_chunk, The processing of scrub_chunk for all DEV_EXTENT in a leaf is A long time. If the dev tree has been modified in the meantime, the previously pre-reading leaf may be useless. > ---- > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index c7a3d4d730a3..e19891243199 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -7013,6 +7013,19 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info, > return ret; > } > > +void readahead_tree_node_children(struct extent_buffer *node) > +{ > + int i; > + const int nr_items = btrfs_header_nritems(node); > + > + for (i = 0; i < nr_items; i++) { > + u64 start; > + > + start = btrfs_node_blockptr(node, i); > + readahead_tree_block(node->fs_info, start); > + } > +} > + > int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info) > { > struct btrfs_root *root = fs_info->chunk_root; > @@ -7023,6 +7036,7 @@ int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info) > int ret; > int slot; > u64 total_dev = 0; > + u64 last_ra_node = 0; > > path = btrfs_alloc_path(); > if (!path) > @@ -7048,6 +7062,8 @@ int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info) > if (ret < 0) > goto error; > while (1) { > + struct extent_buffer *node; > + > leaf = path->nodes[0]; > slot = path->slots[0]; > if (slot >= btrfs_header_nritems(leaf)) { > @@ -7058,6 +7074,13 @@ int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info) > goto error; > break; > } > + node = path->nodes[1]; > + if (node) { > + if (last_ra_node != node->start) { > + readahead_tree_node_children(node); > + last_ra_node = node->start; > + } > + } > btrfs_item_key_to_cpu(leaf, &found_key, slot); > if (found_key.type == BTRFS_DEV_ITEM_KEY) { > struct btrfs_dev_item *dev_item;
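A rough sketch of how the proposed parameters could bound the existing reada_for_search() loop, assuming the three knobs listed above were added to struct btrfs_path. Every name introduced below (reada_maximum_nr, reada_min_key, reada_max_key, reada_past_caller_range) is hypothetical; btrfs_node_key_to_cpu(), btrfs_comp_cpu_keys(), READA_FORWARD and READA_BACK are the existing btrfs definitions.

/*
 * Hypothetical new members of struct btrfs_path (none exist today):
 *   u32 reada_maximum_nr;            read at most this many leaves
 *   struct btrfs_key reada_min_key;  READA_BACK stops below this key
 *   struct btrfs_key reada_max_key;  READA_FORWARD stops past this key
 *
 * reada_for_search() could then stop early once the next node key
 * leaves the range the caller declared interest in:
 */
static bool reada_past_caller_range(struct extent_buffer *node, int nr,
				    const struct btrfs_path *path)
{
	struct btrfs_key key;

	btrfs_node_key_to_cpu(node, &key, nr);
	if (path->reada == READA_FORWARD)
		return btrfs_comp_cpu_keys(&key, &path->reada_max_key) > 0;
	if (path->reada == READA_BACK)
		return btrfs_comp_cpu_keys(&key, &path->reada_min_key) < 0;
	return false;
}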
On 2020-07-09 03:46, Robbie Ko wrote: > > Holger Hoffstätte 於 2020/7/8 下午10:57 寫道: >> On 2020-07-08 16:04, David Sterba wrote: >>> On Wed, Jul 08, 2020 at 10:19:22AM +0800, Robbie Ko wrote: >>>> David Sterba 於 2020/7/8 上午3:25 寫道: >>>>> On Tue, Jul 07, 2020 at 11:59:44AM +0800, robbieko wrote: >>>>>> From: Robbie Ko <robbieko@synology.com> >>>>>> >>>>>> When mounting, we always need to read the whole chunk tree, >>>>>> when there are too many chunk items, most of the time is >>>>>> spent on btrfs_read_chunk_tree, because we only read one >>>>>> leaf at a time. >>>>>> >>>>>> It is unreasonable to limit the readahead mechanism to a >>>>>> range of 64k, so we have removed that limit. >>>>>> >>>>>> In addition we added reada_maximum_size to customize the >>>>>> size of the pre-reader, The default is 64k to maintain the >>>>>> original behavior. >>>>>> >>>>>> So we fix this by used readahead mechanism, and set readahead >>>>>> max size to ULLONG_MAX which reads all the leaves after the >>>>>> key in the node when reading a level 1 node. >>>>> The readahead of chunk tree is a special case as we know we will need >>>>> the whole tree, in all other cases the search readahead needs is >>>>> supposed to read only one leaf. >>>> >>>> If, in most cases, readahead requires that only one leaf be read, then >>>> reada_ maximum_size should be nodesize instead of 64k, or use >>>> reada_maximum_ nr (default:1) seems better. >>>> >>>>> >>>>> For that reason I don't want to touch the current path readahead logic >>>>> at all and do the chunk tree readahead in one go instead of the >>>>> per-search. >>>> >>>> I don't know why we don't make the change to readahead, because the current >>>> readahead is limited to the logical address in 64k is very unreasonable, >>>> and there is a good chance that the logical address of the next leaf >>>> node will >>>> not appear in 64k, so the existing readahead is almost useless. >>> >>> I see and it seems that the assumption about layout and chances >>> succesfuly read blocks ahead is not valid. The logic of readahead could >>> be improved but that would need more performance evaluation. >> >> FWIW I gave this a try and see the following numbers, averaged over multiple >> mount/unmount cycles on spinning rust: >> >> without patch : ~2.7s >> with patch : ~4.5s >> >> ..ahem.. >> > I have the following two questions for you. > 1. What is the version you are using? 5.7.8 + a few select patches from 5.8. > 2. Can you please measure the time of btrfs_read_chunk_tree alone? No perf on this system & not enough time right now, sorry. But it shouldn't matter either way, see below. > I think the problem you are having is that btrfs_read_block_groups is > slowing down because it is using the wrong READA flag, which is causing > a lot of useless IO's when reading the block group. > > This can be fixed with the following commit. > btrfs: block-group: don't set the wrong READA flag for btrfs_read_block_groups() > https://git.kernel.org/pub/scm/linux/kernel /git/torvalds/linux.git/commit/?h=v5.8-rc4& id=83fe9e12b0558eae519351cff00da1e06bc054d2 Ah yes, that was missing. However it doesn't seem to improve things that much either; with 83fe9e12 but with or without your patch I now get ~2.8..~2.9s mount time. Probably because I don't have that many metadata block groups (only 4GB). From a conceptual perspective it it probably much easier just to merge the bgtree patchset, since that does the right thing without upsetting the overall readahead apple cart. thanks, Holger
On Thu, Jul 09, 2020 at 10:38:42AM +0800, Robbie Ko wrote: > David Sterba 於 2020/7/9 上午5:11 寫道: > > On Tue, Jul 07, 2020 at 09:25:11PM +0200, David Sterba wrote: > >> On Tue, Jul 07, 2020 at 11:59:44AM +0800, robbieko wrote: > >> This skips readahead of all nodes above level 1, if you find a nicer way > >> to readahead the whole tree I won't object, but for the first > >> implementation the level 1 seems ok to me. > > Patch below, I tried to create large system chunk by fallocate on a > > sparse loop device, but got only 1 node on level 1 so the readahead > > cannot show off. > > > > # btrfs fi df . > > Data, single: total=59.83TiB, used=59.83TiB > > System, single: total=36.00MiB, used=6.20MiB > > Metadata, single: total=1.01GiB, used=91.78MiB > > GlobalReserve, single: total=26.80MiB, used=0.00B > > > > There were 395 leaf nodes that got read ahead, time between the first > > and last is 0.83s and the block group tree read took about 40 seconds. > > This was in a VM with file-backed images, and the loop device was > > constructed from these devices so it's spinning rust. > > > > I don't have results for non-prefetched mount to compare at the moment. > > > I think what you're doing is working. > > But there are many similar problems that need to be improved. Agreed, but this started from 'let's readahead chunk tree' and now we drifted to fixing or perhaps making better use of readahead in several other areas. > 1. load_free_space_tree > We need to read all BTRFS_FREE_SPACE_BITMAP_KEY and > BTRFS_FREE_SPACE_EXTENT_KEY until the next FREE_SPACE_INFO_KEY. > > 2. populate_free_space_tree > We need to read all BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_ITEM_KEY > until the end of the block group > > 3. btrfs_real_readdir > We need as many reads as possible (inode, BTRFS_DIR_INDEX_KEY). > > 4. btrfs_clone > We need as many reads as possible (inode, BTRFS_EXTENT_DATA_KEY). > > 5. btrfs_verify_dev_extents > We need to read all the BTRFS_DEV_EXTENT_KEYs. Each case needs to be evaluated separately because the items live in different trees and other item types could be scattered among the ones we're interested in. But the list gives an insight in what types of readahead we might need, like full key range [from, to], or just all items of one key type. > 6. caching_kthread (inode-map.c) > We need all the BTRFS_INODE_ITEM_KEY of fs_tree to build the inode map > > For the above cases. > It is not possible to write a special readahead code for each case. > We have to provide a new readaread framework > Enable the caller to determine the scope of readaheads needed. > The possible parameters of the readahead are as follows > 1. reada_maximum_nr : Read a maximum of several leaves at a time. > 2. reada_max_key : READA_FORWARD Early Suspension Condition > 3. reada_min_key : READA_BACK Abort condition ahead of time. Yeah something like that. > We need to review all users of readahead to confirm that the The > behavior of readahead. > For example, in scrub_enumerate_chunks readahead has the effect of Very > small, > Because most of the time is spent on scrub_chunk, > The processing of scrub_chunk for all DEV_EXTENT in a leaf is A long time. > If the dev tree has been modified in the meantime, the previously > pre-reading leaf may be useless. Yes that's another case where doing the readahead is useless. So, now it's a question if we should start with the easy cases with specific readahead and then unify them under a common API, or try to figure out the API and then audit all us. 
I'd be more in favor of the former as it allows to give us a baseline where the readahead would be implemented optimally, the followup API cleanup would need to keep the performance. The latter is IMHO harder just because getting an API right on the first try usually does not work.
David Sterba 於 2020/7/9 下午5:13 寫道: > On Thu, Jul 09, 2020 at 10:38:42AM +0800, Robbie Ko wrote: >> David Sterba 於 2020/7/9 上午5:11 寫道: >>> On Tue, Jul 07, 2020 at 09:25:11PM +0200, David Sterba wrote: >>>> On Tue, Jul 07, 2020 at 11:59:44AM +0800, robbieko wrote: >>>> This skips readahead of all nodes above level 1, if you find a nicer way >>>> to readahead the whole tree I won't object, but for the first >>>> implementation the level 1 seems ok to me. >>> Patch below, I tried to create large system chunk by fallocate on a >>> sparse loop device, but got only 1 node on level 1 so the readahead >>> cannot show off. >>> >>> # btrfs fi df . >>> Data, single: total=59.83TiB, used=59.83TiB >>> System, single: total=36.00MiB, used=6.20MiB >>> Metadata, single: total=1.01GiB, used=91.78MiB >>> GlobalReserve, single: total=26.80MiB, used=0.00B >>> >>> There were 395 leaf nodes that got read ahead, time between the first >>> and last is 0.83s and the block group tree read took about 40 seconds. >>> This was in a VM with file-backed images, and the loop device was >>> constructed from these devices so it's spinning rust. >>> >>> I don't have results for non-prefetched mount to compare at the moment. >>> >> I think what you're doing is working. >> >> But there are many similar problems that need to be improved. > Agreed, but this started from 'let's readahead chunk tree' and now we > drifted to fixing or perhaps making better use of readahead in several > other areas. > >> 1. load_free_space_tree >> We need to read all BTRFS_FREE_SPACE_BITMAP_KEY and >> BTRFS_FREE_SPACE_EXTENT_KEY until the next FREE_SPACE_INFO_KEY. >> >> 2. populate_free_space_tree >> We need to read all BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_ITEM_KEY >> until the end of the block group >> >> 3. btrfs_real_readdir >> We need as many reads as possible (inode, BTRFS_DIR_INDEX_KEY). >> >> 4. btrfs_clone >> We need as many reads as possible (inode, BTRFS_EXTENT_DATA_KEY). >> >> 5. btrfs_verify_dev_extents >> We need to read all the BTRFS_DEV_EXTENT_KEYs. > Each case needs to be evaluated separately because the items live in > different trees and other item types could be scattered among the ones > we're interested in. > > But the list gives an insight in what types of readahead we might need, > like full key range [from, to], or just all items of one key type. > >> 6. caching_kthread (inode-map.c) >> We need all the BTRFS_INODE_ITEM_KEY of fs_tree to build the inode map >> >> For the above cases. >> It is not possible to write a special readahead code for each case. >> We have to provide a new readaread framework >> Enable the caller to determine the scope of readaheads needed. >> The possible parameters of the readahead are as follows >> 1. reada_maximum_nr : Read a maximum of several leaves at a time. >> 2. reada_max_key : READA_FORWARD Early Suspension Condition >> 3. reada_min_key : READA_BACK Abort condition ahead of time. > Yeah something like that. > >> We need to review all users of readahead to confirm that the The >> behavior of readahead. >> For example, in scrub_enumerate_chunks readahead has the effect of Very >> small, >> Because most of the time is spent on scrub_chunk, >> The processing of scrub_chunk for all DEV_EXTENT in a leaf is A long time. >> If the dev tree has been modified in the meantime, the previously >> pre-reading leaf may be useless. > Yes that's another case where doing the readahead is useless. 
> > So, now it's a question if we should start with the easy cases with > specific readahead and then unify them under a common API, or try to > figure out the API and then audit all us. > > I'd be more in favor of the former as it allows to give us a baseline > where the readahead would be implemented optimally, the followup API > cleanup would need to keep the performance. > > The latter is IMHO harder just because getting an API right on the first > try usually does not work. Okay, I agree with you. Please help submit your patch to speedup chunk tree read. Thank you for your help.
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 3a7648bff42c..dc84f526cd93 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -75,7 +75,14 @@ size_t __const btrfs_get_num_csums(void)
 
 struct btrfs_path *btrfs_alloc_path(void)
 {
-	return kmem_cache_zalloc(btrfs_path_cachep, GFP_NOFS);
+	struct btrfs_path *path;
+
+	path = kmem_cache_zalloc(btrfs_path_cachep, GFP_NOFS);
+
+	if (path)
+		path->reada_maximum_size = 65536;
+
+	return path;
 }
 
 /* this also releases the path */
@@ -2161,12 +2168,10 @@ static void reada_for_search(struct btrfs_fs_info *fs_info,
 	struct btrfs_disk_key disk_key;
 	u32 nritems;
 	u64 search;
-	u64 target;
 	u64 nread = 0;
 	struct extent_buffer *eb;
 	u32 nr;
 	u32 blocksize;
-	u32 nscan = 0;
 
 	if (level != 1)
 		return;
@@ -2184,8 +2189,6 @@ static void reada_for_search(struct btrfs_fs_info *fs_info,
 		return;
 	}
 
-	target = search;
-
 	nritems = btrfs_header_nritems(node);
 	nr = slot;
 
@@ -2205,13 +2208,9 @@ static void reada_for_search(struct btrfs_fs_info *fs_info,
 			break;
 		}
 		search = btrfs_node_blockptr(node, nr);
-		if ((search <= target && target - search <= 65536) ||
-		    (search > target && search - target <= 65536)) {
-			readahead_tree_block(fs_info, search);
-			nread += blocksize;
-		}
-		nscan++;
-		if ((nread > 65536 || nscan > 32))
+		readahead_tree_block(fs_info, search);
+		nread += blocksize;
+		if (nread > path->reada_maximum_size)
 			break;
 	}
 }
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d404cce8ae40..ea88a6473eb8 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -360,6 +360,7 @@ struct btrfs_path {
 	/* if there is real range locking, this locks field will change */
 	u8 locks[BTRFS_MAX_LEVEL];
 	u8 reada;
+	u64 reada_maximum_size;
 	/* keep some upper locks as we walk down */
 	u8 lowest_level;
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0d6e785bcb98..fc87b2e9e865 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7043,6 +7043,8 @@ int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
+	path->reada = READA_FORWARD;
+	path->reada_maximum_size = ULLONG_MAX;
 
 	/*
 	 * uuid_mutex is needed only if we are mounting a sprout FS
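With the two fields this patch introduces, another whole-range iteration from the list earlier in the thread could opt in the same way. The fragment below is a sketch only (not part of the posted patch) of how btrfs_verify_dev_extents(), which walks every DEV_EXTENT item, might set up its path; the key values mirror what that function is believed to use and should be treated as an assumption.

	/* sketch: opt a full DEV_EXTENT scan into unbounded readahead */
	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;
	path->reada = READA_FORWARD;
	/* no byte limit: prefetch all remaining leaves of the level 1 node */
	path->reada_maximum_size = ULLONG_MAX;

	key.objectid = 1;
	key.type = BTRFS_DEV_EXTENT_KEY;
	key.offset = 0;
	ret = btrfs_search_slot(NULL, fs_info->dev_root, &key, path, 0, 0);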