From patchwork Wed Oct 5 19:49:18 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boris Burkov X-Patchwork-Id: 12999598 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 46DEDC43217 for ; Wed, 5 Oct 2022 19:49:34 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230242AbiJETtc (ORCPT ); Wed, 5 Oct 2022 15:49:32 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46810 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230221AbiJETt3 (ORCPT ); Wed, 5 Oct 2022 15:49:29 -0400 Received: from out4-smtp.messagingengine.com (out4-smtp.messagingengine.com [66.111.4.28]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A3F637F25C for ; Wed, 5 Oct 2022 12:49:26 -0700 (PDT) Received: from compute1.internal (compute1.nyi.internal [10.202.2.41]) by mailout.nyi.internal (Postfix) with ESMTP id E3B5E5C01C4; Wed, 5 Oct 2022 15:49:25 -0400 (EDT) Received: from mailfrontend2 ([10.202.2.163]) by compute1.internal (MEProxy); Wed, 05 Oct 2022 15:49:25 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bur.io; h=cc :content-transfer-encoding:date:date:from:from:in-reply-to :in-reply-to:message-id:mime-version:references:reply-to:sender :subject:subject:to:to; s=fm2; t=1664999365; x=1665085765; bh=BO 1+CiMC4oqAmR5mpTyhYgLTfZF8TEoQLLolW3UFqOA=; b=DyAAi7W6SMoHK+PLMJ NO7u5ePKGfMxtMsHWlat0lqXofnb4e/yDjF7m7vZqLQ2KoCywxdX8VAfvvZJLoU/ bOHAqVqZwXYcW6ZJIgzz+QFI4Q8hBsA3YuwRavxFxWp2iUXj+Dm2RdpUjEZc7sMG 4ZsEHSTxvmaXmzgL811gdt1RTQ8Iwk4XkkXjuUNKODYIxLBeRBbwwWMxI7Juowdi SyE/j6WB3+vkFlZb3F2kGZVoWIC3Avf4EwQaP56VMvLz+1QzXYpG/BrXnJvy+VPB 1jt8wK20Dh+E2I72FlRWqZyt90r7fGya6ytQFl9C5Mg0ihMzdPOEEip3EdZWswUE yKAQ== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:content-transfer-encoding:date:date :feedback-id:feedback-id:from:from:in-reply-to:in-reply-to :message-id:mime-version:references:reply-to:sender:subject :subject:to:to:x-me-proxy:x-me-proxy:x-me-sender:x-me-sender :x-sasl-enc; s=fm2; t=1664999365; x=1665085765; bh=BO1+CiMC4oqAm R5mpTyhYgLTfZF8TEoQLLolW3UFqOA=; b=J213dAaUlKt3SgAtIYgQ6fuT0SKRc NtvaX4CuiuiC9PmLscooslbwca6Q3FHCdBmheDZWAu7pufIM0qzoqN0t+iQl1AUF KreuJw0TzvxLgDdwq1OKpp4driJunXBtRDoZ9LsrnL4kBwRuu0tGfrpuUaOrarHV 3cSwZn0NO+ly257oOibsPno+EdLUbUwJ6lElTRmxZtYmroq9hgvK9IjfR224JJ2h lGjs9p6tOxqUzfxClK7iUQmjmMqejemt6E5m81A5zP0u8e40ynhq0dLzusgbKfp7 xjnJVbk6vdXH+4V014Z1AW/PwaBocS8sCoVH7jXmAeL+SgCVgBTwV5d2w== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgedvfedrfeeifedgudegtdcutefuodetggdotefrod ftvfcurfhrohhfihhlvgemucfhrghsthforghilhdpqfgfvfdpuffrtefokffrpgfnqfgh necuuegrihhlohhuthemuceftddtnecunecujfgurhephffvufffkffojghfggfgsedtke ertdertddtnecuhfhrohhmpeeuohhrihhsuceuuhhrkhhovhcuoegsohhrihhssegsuhhr rdhioheqnecuggftrfgrthhtvghrnhepieeuffeuvdeiueejhfehiefgkeevudejjeejff evvdehtddufeeihfekgeeuheelnecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghm pehmrghilhhfrhhomhepsghorhhishessghurhdrihho X-ME-Proxy: Feedback-ID: i083147f8:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Wed, 5 Oct 2022 15:49:25 -0400 (EDT) From: Boris Burkov To: linux-btrfs@vger.kernel.org, kernel-team@fb.com Subject: [PATCH 1/5] btrfs: 1G falloc extents Date: Wed, 5 Oct 2022 12:49:18 -0700 Message-Id: X-Mailer: git-send-email 2.37.2 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org When doing a large fallocate, btrfs will break it up into 256MiB extents. Our data block groups are 1GiB, so a more natural maximum size is 1GiB, so that we tend to allocate and fully use block groups rather than fragmenting the file around. This is especially useful if large fallocates tend to be for "round" amounts, which strikes me as a reasonable assumption. While moving to size classes reduces the value of this change, it is also good to compare potential allocator algorithms against just 1G extents. Signed-off-by: Boris Burkov --- fs/btrfs/inode.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 45ebef8d3ea8..fd66586ae2fc 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -9884,7 +9884,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode, if (trans) own_trans = false; while (num_bytes > 0) { - cur_bytes = min_t(u64, num_bytes, SZ_256M); + cur_bytes = min_t(u64, num_bytes, SZ_1G); cur_bytes = max(cur_bytes, min_size); /* * If we are severely fragmented we could end up with really From patchwork Wed Oct 5 19:49:19 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boris Burkov X-Patchwork-Id: 12999597 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D514EC4332F for ; Wed, 5 Oct 2022 19:49:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230484AbiJETtb (ORCPT ); Wed, 5 Oct 2022 15:49:31 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46808 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230009AbiJETt3 (ORCPT ); Wed, 5 Oct 2022 15:49:29 -0400 Received: from out4-smtp.messagingengine.com (out4-smtp.messagingengine.com [66.111.4.28]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1D7037F271 for ; Wed, 5 Oct 2022 12:49:28 -0700 (PDT) Received: from compute2.internal (compute2.nyi.internal [10.202.2.46]) by mailout.nyi.internal (Postfix) with ESMTP id 87EC55C01A4; Wed, 5 Oct 2022 15:49:27 -0400 (EDT) Received: from mailfrontend2 ([10.202.2.163]) by compute2.internal (MEProxy); Wed, 05 Oct 2022 15:49:27 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bur.io; h=cc :content-transfer-encoding:date:date:from:from:in-reply-to :in-reply-to:message-id:mime-version:references:reply-to:sender :subject:subject:to:to; s=fm2; t=1664999367; x=1665085767; bh=yK Ak63ynkVxTGplMWXJbjJiT2C7+t5fRaoihgCz+Nsc=; b=VfEJBwRFrpBAVYyT20 3O3QUiwCgAxA6G09Lb8+8LvC9Q6+vNeDsED6sEhrq6aZXJMFUwcFB2Ta6+ZWVU0p YMf1o5ElScQTXwRcYGD3gZP1bYhzHmDanpXcCOq9R0lcsf5FIgEUeoc50D+xCyFE RFkkEybyEeInUhXOvcPTbLBM5H7ZoK9Fv6fL7qfPIGLHJR784wvKz8ufHX3eILmM 4uG34XRrad8pjo2sr+U4wqBGENpVGTZUdYHgWDDfk3L91JB+sEqr7qG52SJoy2pX qS0oYttbXiaXlMiL8SdEyaazC7Iiiw0LZT1P9wLsdw8H2zju60WE67saFKDDnkdP QAGg== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:content-transfer-encoding:date:date :feedback-id:feedback-id:from:from:in-reply-to:in-reply-to :message-id:mime-version:references:reply-to:sender:subject :subject:to:to:x-me-proxy:x-me-proxy:x-me-sender:x-me-sender :x-sasl-enc; s=fm2; t=1664999367; x=1665085767; bh=yKAk63ynkVxTG plMWXJbjJiT2C7+t5fRaoihgCz+Nsc=; b=JtAVLepJFMBNjr67h+4wOqcPMXY0c 1hjIaf+1EXCJoiuTx/fbHqiXwzeNbc2zynwr08HK/+uL/aFYXBz0iEWvlNp2B765 m5pp0JYvDQAhMoOrdO5Dipi8Tsq/OccHE1Sl1cAsHu4D+OidlKjf8md10ZtlmOme 1Qukas3nalEXDvtWzc5huflfgkBrvjLRwPrKv+Qtte4N+8ZydUi+YJCe6czDaVhq VKgSpVvZdbLlrba24zi8y7Wm5cKA+icQGVSsv5W3hnp2t4llv4vols1fzj1AS1/1 aiGXgYp/Z8DEeWGJ02BsBkhxoeo/RwsMS7k9ncCDnMJGH2zA19Z660J7A== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgedvfedrfeeifedgudegudcutefuodetggdotefrod ftvfcurfhrohhfihhlvgemucfhrghsthforghilhdpqfgfvfdpuffrtefokffrpgfnqfgh necuuegrihhlohhuthemuceftddtnecunecujfgurhephffvufffkffojghfggfgsedtke ertdertddtnecuhfhrohhmpeeuohhrihhsuceuuhhrkhhovhcuoegsohhrihhssegsuhhr rdhioheqnecuggftrfgrthhtvghrnhepieeuffeuvdeiueejhfehiefgkeevudejjeejff evvdehtddufeeihfekgeeuheelnecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghm pehmrghilhhfrhhomhepsghorhhishessghurhdrihho X-ME-Proxy: Feedback-ID: i083147f8:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Wed, 5 Oct 2022 15:49:27 -0400 (EDT) From: Boris Burkov To: linux-btrfs@vger.kernel.org, kernel-team@fb.com Subject: [PATCH 2/5] btrfs: use ffe_ctl in btrfs allocator tracepoints Date: Wed, 5 Oct 2022 12:49:19 -0700 Message-Id: <22c0c12d9fb7750d21fe2e7ad566bcc49856bf5a.1664999303.git.boris@bur.io> X-Mailer: git-send-email 2.37.2 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org The allocator tracepoints currently have a pile of values from ffe_ctl. In modifying the allocator and adding more tracepoints, I found myself adding to the already long argument list of the tracepoints. It makes it a lot simpler to just send in the ffe_ctl itself. Signed-off-by: Boris Burkov --- fs/btrfs/extent-tree.c | 91 ++---------------------------------- fs/btrfs/extent-tree.h | 80 +++++++++++++++++++++++++++++++ fs/btrfs/super.c | 1 + include/trace/events/btrfs.h | 41 ++++++++-------- 4 files changed, 107 insertions(+), 106 deletions(-) create mode 100644 fs/btrfs/extent-tree.h diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index cd2d36580f1a..0fe1e8eb10cf 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -16,7 +16,7 @@ #include #include #include -#include "misc.h" +#include "extent-tree.h" #include "tree-log.h" #include "disk-io.h" #include "print-tree.h" @@ -31,7 +31,6 @@ #include "space-info.h" #include "block-rsv.h" #include "delalloc-space.h" -#include "block-group.h" #include "discard.h" #include "rcu-string.h" #include "zoned.h" @@ -3442,81 +3441,6 @@ btrfs_release_block_group(struct btrfs_block_group *cache, btrfs_put_block_group(cache); } -enum btrfs_extent_allocation_policy { - BTRFS_EXTENT_ALLOC_CLUSTERED, - BTRFS_EXTENT_ALLOC_ZONED, -}; - -/* - * Structure used internally for find_free_extent() function. Wraps needed - * parameters. - */ -struct find_free_extent_ctl { - /* Basic allocation info */ - u64 ram_bytes; - u64 num_bytes; - u64 min_alloc_size; - u64 empty_size; - u64 flags; - int delalloc; - - /* Where to start the search inside the bg */ - u64 search_start; - - /* For clustered allocation */ - u64 empty_cluster; - struct btrfs_free_cluster *last_ptr; - bool use_cluster; - - bool have_caching_bg; - bool orig_have_caching_bg; - - /* Allocation is called for tree-log */ - bool for_treelog; - - /* Allocation is called for data relocation */ - bool for_data_reloc; - - /* RAID index, converted from flags */ - int index; - - /* - * Current loop number, check find_free_extent_update_loop() for details - */ - int loop; - - /* - * Whether we're refilling a cluster, if true we need to re-search - * current block group but don't try to refill the cluster again. - */ - bool retry_clustered; - - /* - * Whether we're updating free space cache, if true we need to re-search - * current block group but don't try updating free space cache again. - */ - bool retry_unclustered; - - /* If current block group is cached */ - int cached; - - /* Max contiguous hole found */ - u64 max_extent_size; - - /* Total free space from free space cache, not always contiguous */ - u64 total_free_space; - - /* Found result */ - u64 found_offset; - - /* Hint where to start looking for an empty space */ - u64 hint_byte; - - /* Allocation policy */ - enum btrfs_extent_allocation_policy policy; -}; - - /* * Helper function for find_free_extent(). * @@ -3548,8 +3472,7 @@ static int find_free_extent_clustered(struct btrfs_block_group *bg, if (offset) { /* We have a block, we're done */ spin_unlock(&last_ptr->refill_lock); - trace_btrfs_reserve_extent_cluster(cluster_bg, - ffe_ctl->search_start, ffe_ctl->num_bytes); + trace_btrfs_reserve_extent_cluster(cluster_bg, ffe_ctl); *cluster_bg_ret = cluster_bg; ffe_ctl->found_offset = offset; return 0; @@ -3599,10 +3522,8 @@ static int find_free_extent_clustered(struct btrfs_block_group *bg, if (offset) { /* We found one, proceed */ spin_unlock(&last_ptr->refill_lock); - trace_btrfs_reserve_extent_cluster(bg, - ffe_ctl->search_start, - ffe_ctl->num_bytes); ffe_ctl->found_offset = offset; + trace_btrfs_reserve_extent_cluster(bg, ffe_ctl); return 0; } } else if (!ffe_ctl->cached && ffe_ctl->loop > LOOP_CACHING_NOWAIT && @@ -4285,8 +4206,7 @@ static noinline int find_free_extent(struct btrfs_root *root, ins->objectid = 0; ins->offset = 0; - trace_find_free_extent(root, ffe_ctl->num_bytes, ffe_ctl->empty_size, - ffe_ctl->flags); + trace_find_free_extent(root, ffe_ctl); space_info = btrfs_find_space_info(fs_info, ffe_ctl->flags); if (!space_info) { @@ -4457,8 +4377,7 @@ static noinline int find_free_extent(struct btrfs_root *root, ins->objectid = ffe_ctl->search_start; ins->offset = ffe_ctl->num_bytes; - trace_btrfs_reserve_extent(block_group, ffe_ctl->search_start, - ffe_ctl->num_bytes); + trace_btrfs_reserve_extent(block_group, ffe_ctl); btrfs_release_block_group(block_group, ffe_ctl->delalloc); break; loop: diff --git a/fs/btrfs/extent-tree.h b/fs/btrfs/extent-tree.h new file mode 100644 index 000000000000..7d3bb9c60fbe --- /dev/null +++ b/fs/btrfs/extent-tree.h @@ -0,0 +1,80 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef BTRFS_EXTENT_TREE_H +#define BTRFS_EXTENT_TREE_H + +#include "ctree.h" +#include "misc.h" +#include "block-group.h" + +enum btrfs_extent_allocation_policy { + BTRFS_EXTENT_ALLOC_CLUSTERED, + BTRFS_EXTENT_ALLOC_ZONED, +}; + +struct find_free_extent_ctl { + /* Basic allocation info */ + u64 ram_bytes; + u64 num_bytes; + u64 min_alloc_size; + u64 empty_size; + u64 flags; + int delalloc; + + /* Where to start the search inside the bg */ + u64 search_start; + + /* For clustered allocation */ + u64 empty_cluster; + struct btrfs_free_cluster *last_ptr; + bool use_cluster; + + bool have_caching_bg; + bool orig_have_caching_bg; + + /* Allocation is called for tree-log */ + bool for_treelog; + + /* Allocation is called for data relocation */ + bool for_data_reloc; + + /* RAID index, converted from flags */ + int index; + + /* + * Current loop number, check find_free_extent_update_loop() for details + */ + int loop; + + /* + * Whether we're refilling a cluster, if true we need to re-search + * current block group but don't try to refill the cluster again. + */ + bool retry_clustered; + + /* + * Whether we're updating free space cache, if true we need to re-search + * current block group but don't try updating free space cache again. + */ + bool retry_unclustered; + + /* If current block group is cached */ + int cached; + + /* Max contiguous hole found */ + u64 max_extent_size; + + /* Total free space from free space cache, not always contiguous */ + u64 total_free_space; + + /* Found result */ + u64 found_offset; + + /* Hint where to start looking for an empty space */ + u64 hint_byte; + + /* Allocation policy */ + enum btrfs_extent_allocation_policy policy; +}; + +#endif /* BTRFS_EXTENT_TREE_H */ diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 9be4fd2db0f4..2bf1fadcb94e 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -49,6 +49,7 @@ #include "discard.h" #include "qgroup.h" #include "raid56.h" +#include "extent-tree.h" #define CREATE_TRACE_POINTS #include diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index ed50e81174bf..ad50af497e59 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -32,6 +32,7 @@ struct prelim_ref; struct btrfs_space_info; struct btrfs_raid_bio; struct raid56_bio_trace_info; +struct find_free_extent_ctl; #define show_ref_type(type) \ __print_symbolic(type, \ @@ -1241,38 +1242,38 @@ DEFINE_EVENT(btrfs__reserved_extent, btrfs_reserved_extent_free, TRACE_EVENT(find_free_extent, - TP_PROTO(const struct btrfs_root *root, u64 num_bytes, - u64 empty_size, u64 data), + TP_PROTO(const struct btrfs_root *root, + const struct find_free_extent_ctl *ffe_ctl), - TP_ARGS(root, num_bytes, empty_size, data), + TP_ARGS(root, ffe_ctl), TP_STRUCT__entry_btrfs( __field( u64, root_objectid ) __field( u64, num_bytes ) __field( u64, empty_size ) - __field( u64, data ) + __field( u64, flags ) ), TP_fast_assign_btrfs(root->fs_info, __entry->root_objectid = root->root_key.objectid; - __entry->num_bytes = num_bytes; - __entry->empty_size = empty_size; - __entry->data = data; + __entry->num_bytes = ffe_ctl->num_bytes; + __entry->empty_size = ffe_ctl->empty_size; + __entry->flags = ffe_ctl->flags; ), TP_printk_btrfs("root=%llu(%s) len=%llu empty_size=%llu flags=%llu(%s)", show_root_type(__entry->root_objectid), - __entry->num_bytes, __entry->empty_size, __entry->data, - __print_flags((unsigned long)__entry->data, "|", + __entry->num_bytes, __entry->empty_size, __entry->flags, + __print_flags((unsigned long)__entry->flags, "|", BTRFS_GROUP_FLAGS)) ); DECLARE_EVENT_CLASS(btrfs__reserve_extent, - TP_PROTO(const struct btrfs_block_group *block_group, u64 start, - u64 len), + TP_PROTO(const struct btrfs_block_group *block_group, + const struct find_free_extent_ctl *ffe_ctl), - TP_ARGS(block_group, start, len), + TP_ARGS(block_group, ffe_ctl), TP_STRUCT__entry_btrfs( __field( u64, bg_objectid ) @@ -1284,8 +1285,8 @@ DECLARE_EVENT_CLASS(btrfs__reserve_extent, TP_fast_assign_btrfs(block_group->fs_info, __entry->bg_objectid = block_group->start; __entry->flags = block_group->flags; - __entry->start = start; - __entry->len = len; + __entry->start = ffe_ctl->search_start; + __entry->len = ffe_ctl->num_bytes; ), TP_printk_btrfs("root=%llu(%s) block_group=%llu flags=%llu(%s) " @@ -1299,18 +1300,18 @@ DECLARE_EVENT_CLASS(btrfs__reserve_extent, DEFINE_EVENT(btrfs__reserve_extent, btrfs_reserve_extent, - TP_PROTO(const struct btrfs_block_group *block_group, u64 start, - u64 len), + TP_PROTO(const struct btrfs_block_group *block_group, + const struct find_free_extent_ctl *ffe_ctl), - TP_ARGS(block_group, start, len) + TP_ARGS(block_group, ffe_ctl) ); DEFINE_EVENT(btrfs__reserve_extent, btrfs_reserve_extent_cluster, - TP_PROTO(const struct btrfs_block_group *block_group, u64 start, - u64 len), + TP_PROTO(const struct btrfs_block_group *block_group, + const struct find_free_extent_ctl *ffe_ctl), - TP_ARGS(block_group, start, len) + TP_ARGS(block_group, ffe_ctl) ); TRACE_EVENT(btrfs_find_cluster, From patchwork Wed Oct 5 19:49:20 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boris Burkov X-Patchwork-Id: 12999599 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 43388C433FE for ; Wed, 5 Oct 2022 19:49:35 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230494AbiJETte (ORCPT ); Wed, 5 Oct 2022 15:49:34 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46822 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230483AbiJETtb (ORCPT ); Wed, 5 Oct 2022 15:49:31 -0400 Received: from out4-smtp.messagingengine.com (out4-smtp.messagingengine.com [66.111.4.28]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 12ED07F246 for ; Wed, 5 Oct 2022 12:49:30 -0700 (PDT) Received: from compute5.internal (compute5.nyi.internal [10.202.2.45]) by mailout.nyi.internal (Postfix) with ESMTP id 7DE505C00A3; Wed, 5 Oct 2022 15:49:29 -0400 (EDT) Received: from mailfrontend1 ([10.202.2.162]) by compute5.internal (MEProxy); Wed, 05 Oct 2022 15:49:29 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bur.io; h=cc :content-transfer-encoding:date:date:from:from:in-reply-to :in-reply-to:message-id:mime-version:references:reply-to:sender :subject:subject:to:to; s=fm2; t=1664999369; x=1665085769; bh=wa pGKDGJGFsFZz0rEaiFzOvxhiqBvhhrrrt3A4CXCzA=; b=IVJwrdff/dcbkM2hQG UBr3NDY6xW5hUG3BooDhTCtq0KTBL1gMr31KpxjNMAPa8a7TSVrwIauKLtBIWu0e 82UIhgQ7f2JK8yWS3X/TmE73N9gwvz1PRtNjvvyW/nXAhizb+24FCXFiP74K7NOX NmJ7PExwHK+RrEJ4p/oR9eO4pcUsGDBIy41F/eNNgnCLExSt2boR1yPgFBH8VKlG 6nDBjCxrQG4gs+jkBoFtFKewdmYFm/uUmtL3826GXyZdGpgpqcdZsFvnCPVB2RqL I0P/nUg+GTaTmc0cySFCdiA4rnBFlfDNiJDazmm8QqfwacJ2CprbXEWtkl2JtdxG jlFA== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:content-transfer-encoding:date:date :feedback-id:feedback-id:from:from:in-reply-to:in-reply-to :message-id:mime-version:references:reply-to:sender:subject :subject:to:to:x-me-proxy:x-me-proxy:x-me-sender:x-me-sender :x-sasl-enc; s=fm2; t=1664999369; x=1665085769; bh=wapGKDGJGFsFZ z0rEaiFzOvxhiqBvhhrrrt3A4CXCzA=; b=wgwHqU0FarP1DCEDumVaUvYCLCjQL AQSUl/UszDEhum9YctJVdIzzlaHy2wvVXNxp6nO0B6mhaX1Zm2isdEAj3egie5j0 jBYHhcPk84wMBuEGuJ8EBgqrvDBc9DV7JBbzscU1J1RuG/vzLXp8EXvH/fBIWyyV uZ4m/xHeX9MRXLfUjQEHC5gq/tzTtY+1RLQqkM2D8C3sZygeVxdtgLs4VdAVeemA ATQ6tdTZOdudb7m4VccZpZdUjzo74F8GhT6VeBgkBCCwqCfd8Rf9wIv1UzXlLgEG HXa8nbh0cKM/Lp/LW7oImT5prpfwlJ+auvnhH5lmzOhU5bBypSw5jro7w== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgedvfedrfeeifedgudegtdcutefuodetggdotefrod ftvfcurfhrohhfihhlvgemucfhrghsthforghilhdpqfgfvfdpuffrtefokffrpgfnqfgh necuuegrihhlohhuthemuceftddtnecunecujfgurhephffvufffkffojghfggfgsedtke ertdertddtnecuhfhrohhmpeeuohhrihhsuceuuhhrkhhovhcuoegsohhrihhssegsuhhr rdhioheqnecuggftrfgrthhtvghrnhepieeuffeuvdeiueejhfehiefgkeevudejjeejff evvdehtddufeeihfekgeeuheelnecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghm pehmrghilhhfrhhomhepsghorhhishessghurhdrihho X-ME-Proxy: Feedback-ID: i083147f8:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Wed, 5 Oct 2022 15:49:28 -0400 (EDT) From: Boris Burkov To: linux-btrfs@vger.kernel.org, kernel-team@fb.com Subject: [PATCH 3/5] btrfs: add more ffe tracepoints Date: Wed, 5 Oct 2022 12:49:20 -0700 Message-Id: <068ceb5b24da5ef6b1b68afb210a2e294bb31e5b.1664999303.git.boris@bur.io> X-Mailer: git-send-email 2.37.2 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org find_free_extent is a complicated function. It consists (at least) of: - a hint that jumps into the middle of a for loop macro - a middle loop trying every raid level - an outer loop ascending through ffe loop levels - complicated logic for skipping some of those ffe loop levels - multiple underlying in-bg allocators (zoned, cluster, no cluster) Which is all to say that more tracing is helpful for debugging its behavior. Add two new tracepoints: at the entrance to the block_groups loop (hit for every raid level and every ffe_ctl loop) and at the point we seriously consider a block_group for allocation. This way we can see the whole path through the algorithm, including hints, multiple loops, etc. Signed-off-by: Boris Burkov --- fs/btrfs/extent-tree.c | 4 ++ fs/btrfs/extent-tree.h | 3 ++ include/trace/events/btrfs.h | 81 +++++++++++++++++++++++++++++++++++- 3 files changed, 86 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 0fe1e8eb10cf..069761529398 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4249,6 +4249,7 @@ static noinline int find_free_extent(struct btrfs_root *root, block_group->flags); btrfs_lock_block_group(block_group, ffe_ctl->delalloc); + ffe_ctl->hinted = true; goto have_block_group; } } else if (block_group) { @@ -4256,6 +4257,7 @@ static noinline int find_free_extent(struct btrfs_root *root, } } search: + trace_find_free_extent_search_loop(root, ffe_ctl); ffe_ctl->have_caching_bg = false; if (ffe_ctl->index == btrfs_bg_flags_to_raid_index(ffe_ctl->flags) || ffe_ctl->index == 0) @@ -4265,6 +4267,7 @@ static noinline int find_free_extent(struct btrfs_root *root, &space_info->block_groups[ffe_ctl->index], list) { struct btrfs_block_group *bg_ret; + ffe_ctl->hinted = false; /* If the block group is read-only, we can skip it entirely. */ if (unlikely(block_group->ro)) { if (ffe_ctl->for_treelog) @@ -4306,6 +4309,7 @@ static noinline int find_free_extent(struct btrfs_root *root, } have_block_group: + trace_find_free_extent_have_block_group(root, ffe_ctl, block_group); ffe_ctl->cached = btrfs_block_group_done(block_group); if (unlikely(!ffe_ctl->cached)) { ffe_ctl->have_caching_bg = true; diff --git a/fs/btrfs/extent-tree.h b/fs/btrfs/extent-tree.h index 7d3bb9c60fbe..38bddb3ed224 100644 --- a/fs/btrfs/extent-tree.h +++ b/fs/btrfs/extent-tree.h @@ -75,6 +75,9 @@ struct find_free_extent_ctl { /* Allocation policy */ enum btrfs_extent_allocation_policy policy; + + /* Whether or not the allocator is currently following a hint */ + bool hinted; }; #endif /* BTRFS_EXTENT_TREE_H */ diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index ad50af497e59..caf0ce6ce718 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -1268,6 +1268,79 @@ TRACE_EVENT(find_free_extent, BTRFS_GROUP_FLAGS)) ); +TRACE_EVENT(find_free_extent_search_loop, + + TP_PROTO(const struct btrfs_root *root, + const struct find_free_extent_ctl *ffe_ctl), + + TP_ARGS(root, ffe_ctl), + + TP_STRUCT__entry_btrfs( + __field( u64, root_objectid ) + __field( u64, num_bytes ) + __field( u64, empty_size ) + __field( u64, flags ) + __field( u64, loop ) + ), + + TP_fast_assign_btrfs(root->fs_info, + __entry->root_objectid = root->root_key.objectid; + __entry->num_bytes = ffe_ctl->num_bytes; + __entry->empty_size = ffe_ctl->empty_size; + __entry->flags = ffe_ctl->flags; + __entry->loop = ffe_ctl->loop; + ), + + TP_printk_btrfs("root=%llu(%s) len=%llu empty_size=%llu flags=%llu(%s) loop=%llu", + show_root_type(__entry->root_objectid), + __entry->num_bytes, __entry->empty_size, __entry->flags, + __print_flags((unsigned long)__entry->flags, "|", + BTRFS_GROUP_FLAGS), + __entry->loop) +); + +TRACE_EVENT(find_free_extent_have_block_group, + + TP_PROTO(const struct btrfs_root *root, + const struct find_free_extent_ctl *ffe_ctl, + const struct btrfs_block_group *block_group), + + TP_ARGS(root, ffe_ctl, block_group), + + TP_STRUCT__entry_btrfs( + __field( u64, root_objectid ) + __field( u64, num_bytes ) + __field( u64, empty_size ) + __field( u64, flags ) + __field( u64, loop ) + __field( bool, hinted ) + __field( u64, bg_start ) + __field( u64, bg_flags ) + ), + + TP_fast_assign_btrfs(root->fs_info, + __entry->root_objectid = root->root_key.objectid; + __entry->num_bytes = ffe_ctl->num_bytes; + __entry->empty_size = ffe_ctl->empty_size; + __entry->flags = ffe_ctl->flags; + __entry->loop = ffe_ctl->loop; + __entry->hinted = ffe_ctl->hinted; + __entry->bg_start = block_group->start; + __entry->bg_flags = block_group->flags; + ), + + TP_printk_btrfs("root=%llu(%s) len=%llu empty_size=%llu flags=%llu(%s) " + "loop=%llu hinted=%d block_group=%llu bg_flags=%llu(%s)", + show_root_type(__entry->root_objectid), + __entry->num_bytes, __entry->empty_size, __entry->flags, + __print_flags((unsigned long)__entry->flags, "|", + BTRFS_GROUP_FLAGS), + __entry->loop, __entry->hinted, + __entry->bg_start, __entry->bg_flags, + __print_flags((unsigned long)__entry->bg_flags, "|", + BTRFS_GROUP_FLAGS)) +); + DECLARE_EVENT_CLASS(btrfs__reserve_extent, TP_PROTO(const struct btrfs_block_group *block_group, @@ -1280,6 +1353,8 @@ DECLARE_EVENT_CLASS(btrfs__reserve_extent, __field( u64, flags ) __field( u64, start ) __field( u64, len ) + __field( u64, loop ) + __field( bool, hinted ) ), TP_fast_assign_btrfs(block_group->fs_info, @@ -1287,15 +1362,17 @@ DECLARE_EVENT_CLASS(btrfs__reserve_extent, __entry->flags = block_group->flags; __entry->start = ffe_ctl->search_start; __entry->len = ffe_ctl->num_bytes; + __entry->loop = ffe_ctl->loop; + __entry->hinted = ffe_ctl->hinted; ), TP_printk_btrfs("root=%llu(%s) block_group=%llu flags=%llu(%s) " - "start=%llu len=%llu", + "start=%llu len=%llu loop=%llu hinted=%d", show_root_type(BTRFS_EXTENT_TREE_OBJECTID), __entry->bg_objectid, __entry->flags, __print_flags((unsigned long)__entry->flags, "|", BTRFS_GROUP_FLAGS), - __entry->start, __entry->len) + __entry->start, __entry->len, __entry->loop, __entry->hinted) ); DEFINE_EVENT(btrfs__reserve_extent, btrfs_reserve_extent, From patchwork Wed Oct 5 19:49:21 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boris Burkov X-Patchwork-Id: 12999600 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 69851C433FE for ; Wed, 5 Oct 2022 19:49:42 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230495AbiJETtk (ORCPT ); Wed, 5 Oct 2022 15:49:40 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:47142 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230476AbiJETtj (ORCPT ); Wed, 5 Oct 2022 15:49:39 -0400 Received: from out4-smtp.messagingengine.com (out4-smtp.messagingengine.com [66.111.4.28]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B7AA37F246 for ; Wed, 5 Oct 2022 12:49:31 -0700 (PDT) Received: from compute2.internal (compute2.nyi.internal [10.202.2.46]) by mailout.nyi.internal (Postfix) with ESMTP id 2CB135C00A3; Wed, 5 Oct 2022 15:49:31 -0400 (EDT) Received: from mailfrontend1 ([10.202.2.162]) by compute2.internal (MEProxy); Wed, 05 Oct 2022 15:49:31 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bur.io; h=cc :content-transfer-encoding:date:date:from:from:in-reply-to :in-reply-to:message-id:mime-version:references:reply-to:sender :subject:subject:to:to; s=fm2; t=1664999371; x=1665085771; bh=LU bpPjaeocUTtwGMjPaXdCryHa2KfojK4IcZ/nNn+WQ=; b=fVzpOR4KfC4HzGe5hh 72oVIlF9zTiCOM1WlKI/doHT2KHL+ecggy3efOQV28ORSrrYa2zjKuTW/qj/cPCf kZGcuDjjIxpesRqOFdfilnmJQ6MvpfPTwLHXHE+rr5MMEoubAmaVpIHydfruECLS 2AuAztEmW9I0tyi4eaqIBx2a5JQ1l9XKdTwulbC8zhRuVvUfs0CdMDwNPX59rPPk UIR8DrHmqnXh+AO78XvFprl90vjL5Nvwt2VJprEIay+tZEzVEudQuvyb2CUb4iN4 LIk/nC8iVnK/u6RYep2VOZKEVBdU51hHyXOP9pYCedBifsun40pLgX10sfnhU38e SULQ== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:content-transfer-encoding:date:date :feedback-id:feedback-id:from:from:in-reply-to:in-reply-to :message-id:mime-version:references:reply-to:sender:subject :subject:to:to:x-me-proxy:x-me-proxy:x-me-sender:x-me-sender :x-sasl-enc; s=fm2; t=1664999371; x=1665085771; bh=LUbpPjaeocUTt wGMjPaXdCryHa2KfojK4IcZ/nNn+WQ=; b=hw7HQozGe2a6NFfmbjkR8mwljhA6y 6e6qkvsrjaUJsOy7y/oOi30C5UaAlnhwuc1LLEIRz9lEfiqzK99YrSo6btr0uODV NjEc6xVGvirbzcMQbG9s2SXsZqYJQZt46KoarJN11nYy0UED6Hbw4g7AEMHJ+zC8 AHurUAbW+5aS12RVvQZN7229wtVmh34qi/Wb0nTodu1W/y4bZcORhVAOKbo7J0y/ fIBV9jPeBnm65Da9WY7hd6GHJlqMY8VdfCLWIEodE+7ZZiOudgsrxpX6v7CUHObx 66yhj0KSPyL8xb3qbKA24+MN3vFGUa9rJSJgh6ih7No4rWBkQjJIGUGDQ== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgedvfedrfeeifedgudegudcutefuodetggdotefrod ftvfcurfhrohhfihhlvgemucfhrghsthforghilhdpqfgfvfdpuffrtefokffrpgfnqfgh necuuegrihhlohhuthemuceftddtnecunecujfgurhephffvufffkffojghfggfgsedtke ertdertddtnecuhfhrohhmpeeuohhrihhsuceuuhhrkhhovhcuoegsohhrihhssegsuhhr rdhioheqnecuggftrfgrthhtvghrnhepkeekteffteefvedtvedtudethefgtedvjeffue euheekvdeftdelieejjeegudejnecuffhomhgrihhnpehgihhthhhusgdrtghomhdpsghu rhdrihhonecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehmrghilhhfrhhomh epsghorhhishessghurhdrihho X-ME-Proxy: Feedback-ID: i083147f8:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Wed, 5 Oct 2022 15:49:30 -0400 (EDT) From: Boris Burkov To: linux-btrfs@vger.kernel.org, kernel-team@fb.com Subject: [PATCH 4/5] btrfs: introduce size class to block group allocator Date: Wed, 5 Oct 2022 12:49:21 -0700 Message-Id: X-Mailer: git-send-email 2.37.2 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org The aim of this patch is to reduce the fragmentation of block groups under certain unhappy workloads. It is particularly effective when the size of extents correlates with their lifetime, which is something we have observed causing fragmentation in the fleet at Meta. This patch categorizes extents into size classes: x < 128KiB: "small" 128KiB < x < 8MiB: "medium" x > 8MiB: "large" and as much as possible reduces allocations of extents into block groups that don't match the size class. This takes advantage of any (possible) correlation between size and lifetime and also leaves behind predictable re-usable gaps when extents are freed; small writes don't gum up bigger holes. Size classes are implemented in the following way: - Mark each new block group with a size class of the first allocation that goes into it. - Add two new passes to ffe: "unset size class" and "wrong size class". First, try only matching block groups, then try unset ones, then allow allocation of new ones, and finally allow mismatched block groups. - Filtering is done just by skipping inappropriate ones, there is no special size class indexing. Other solutions I considered were: - A best fit allocator with an rb-tree. This worked well, as small writes didn't leak big holes from large freed extents, but led to regressions in ffe and write performance due to lock contention on the rb-tree with every allocation possibly updating it in parallel. Perhaps something clever could be done to do the updates in the background while being "right enough". - A fixed size "working set". This prevents freeing an extent drastically changing where writes currently land, and seems like a good option too. Doesn't take advantage of size in any way. - The same size class idea, but implemented with xarray marks. This turned out to be slower than looping the linked list and skipping wrong block groups, and is also less flexible since we must have only 3 size classes (max #marks). With the current approach we can have as many as we like. Performance testing was done via: https://github.com/josefbacik/fsperf Of particular relevance are the new fragmentation specific tests. A brief summary of the testing results: - Neutral results on existing tests. There are some minor regressions and improvements here and there, but nothing that truly stands out as notable. - Improvement on new tests where size class and extent lifetime are correlated. Fragmentation in these cases is completely eliminated and write performance is generally a little better. There is also significant improvement where extent sizes are just a bit larger than the size class boundaries. - Regression on one new tests: where the allocations are sized intentionally a hair under the borders of the size classes. Results are neutral on the test that intentionally attacks this new scheme by mixing extent size and lifetime. The full dump of the performance results can be found here: https://bur.io/fsperf/size-class-2022-10-04.txt (there are ansi escape codes, so best to curl and view in terminal) On the balance, the drastic 1->0 improvement in the happy cases seems worth the regressions we do observe. Some considerations for future work: - Experimenting with more size classes - More hinting/search ordering work to approximate a best-fit allocator Signed-off-by: Boris Burkov --- fs/btrfs/block-group.c | 104 +++++++++++++++++++++++++++++------ fs/btrfs/block-group.h | 15 ++++- fs/btrfs/extent-tree.c | 71 ++++++++++++------------ fs/btrfs/extent-tree.h | 3 + include/trace/events/btrfs.h | 10 +++- 5 files changed, 150 insertions(+), 53 deletions(-) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 32c415cfbdfe..d16a982aa593 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -1,5 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 +#include "linux/sizes.h" #include #include "misc.h" #include "ctree.h" @@ -3309,6 +3310,7 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans, cache->space_info->disk_used -= num_bytes * factor; reclaim = should_reclaim_block_group(cache, num_bytes); + spin_unlock(&cache->lock); spin_unlock(&cache->space_info->lock); @@ -3362,32 +3364,42 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans, * reservation and return -EAGAIN, otherwise this function always succeeds. */ int btrfs_add_reserved_bytes(struct btrfs_block_group *cache, - u64 ram_bytes, u64 num_bytes, int delalloc) + u64 ram_bytes, u64 num_bytes, int delalloc, + bool force_wrong_size_class) { struct btrfs_space_info *space_info = cache->space_info; + enum btrfs_block_group_size_class size_class; int ret = 0; spin_lock(&space_info->lock); spin_lock(&cache->lock); if (cache->ro) { ret = -EAGAIN; - } else { - cache->reserved += num_bytes; - space_info->bytes_reserved += num_bytes; - trace_btrfs_space_reservation(cache->fs_info, "space_info", - space_info->flags, num_bytes, 1); - btrfs_space_info_update_bytes_may_use(cache->fs_info, - space_info, -ram_bytes); - if (delalloc) - cache->delalloc_bytes += num_bytes; + goto out; + } - /* - * Compression can use less space than we reserved, so wake - * tickets if that happens - */ - if (num_bytes < ram_bytes) - btrfs_try_granting_tickets(cache->fs_info, space_info); + if (btrfs_is_block_group_data_only(cache)) { + size_class = btrfs_calc_block_group_size_class(num_bytes); + ret = btrfs_use_block_group_size_class(cache, size_class, force_wrong_size_class); + if (ret) + goto out; } + cache->reserved += num_bytes; + space_info->bytes_reserved += num_bytes; + trace_btrfs_space_reservation(cache->fs_info, "space_info", + space_info->flags, num_bytes, 1); + btrfs_space_info_update_bytes_may_use(cache->fs_info, + space_info, -ram_bytes); + if (delalloc) + cache->delalloc_bytes += num_bytes; + + /* + * Compression can use less space than we reserved, so wake + * tickets if that happens + */ + if (num_bytes < ram_bytes) + btrfs_try_granting_tickets(cache->fs_info, space_info); +out: spin_unlock(&cache->lock); spin_unlock(&space_info->lock); return ret; @@ -4146,3 +4158,63 @@ void btrfs_dec_block_group_swap_extents(struct btrfs_block_group *bg, int amount bg->swap_extents -= amount; spin_unlock(&bg->lock); } + +enum btrfs_block_group_size_class btrfs_calc_block_group_size_class(u64 size) +{ + if (size <= SZ_128K) + return BTRFS_BG_SZ_SMALL; + if (size <= SZ_8M) + return BTRFS_BG_SZ_MEDIUM; + return BTRFS_BG_SZ_LARGE; +} + +/* + * Handle a block group allocating an extent in a size class + * + * @bg: The block group we allocated in. + * @size_class: The size class of the allocation. + * @force_wrong_size_class: Whether we are desperate enough to allow + * mismatched size classes. + * + * Returns: 0 if the size class was valid for this block_group, -EAGAIN in the + * case of a race that leads to the wrong size class without + * force_wrong_size_class set. + * + * find_free_extent will skip block groups with a mismatched size class until + * it really needs to avoid ENOSPC. In that case it will set + * force_wrong_size_class. However, if a block group is newly allocated and + * doesn't yet have a size class, then it is possible for two allocations of + * different sizes to race and both try to use it. The loser is caught here and + * has to retry. + */ +int btrfs_use_block_group_size_class(struct btrfs_block_group *bg, + enum btrfs_block_group_size_class size_class, + bool force_wrong_size_class) +{ + ASSERT(size_class != BTRFS_BG_SZ_NONE); + + /* The new allocation is in the right size class, do nothing */ + if (bg->size_class == size_class) + return 0; + /* + * The new allocation is in a mismatched size class. + * This means one of two things: + * 1. Two tasks in find_free_extent for different size_classes raced + * and hit the same empty block_group. Make the loser try again. + * 2. A call to find_free_extent got desperate enough to set + * 'force_wrong_slab'. Don't change the size_class, but allow the + * allocation. + */ + if (bg->size_class != BTRFS_BG_SZ_NONE) { + if (force_wrong_size_class) + return 0; + return -EAGAIN; + } + /* + * The happy new block group case: the new allocation is the first + * one in the block_group so we set size_class. + */ + bg->size_class = size_class; + + return 0; +} diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h index 8fb14b99a1d1..650a066c03d5 100644 --- a/fs/btrfs/block-group.h +++ b/fs/btrfs/block-group.h @@ -12,6 +12,13 @@ enum btrfs_disk_cache_state { BTRFS_DC_SETUP, }; +enum btrfs_block_group_size_class { + BTRFS_BG_SZ_NONE, /* unset */ + BTRFS_BG_SZ_SMALL, /* 0 < sz <= 128K */ + BTRFS_BG_SZ_MEDIUM, /* 128K < sz <= 8M */ + BTRFS_BG_SZ_LARGE, /* 8M < sz < BG_LEN */ +}; + /* * This describes the state of the block_group for async discard. This is due * to the two pass nature of it where extent discarding is prioritized over @@ -232,6 +239,7 @@ struct btrfs_block_group { struct list_head active_bg_list; struct work_struct zone_finish_work; struct extent_buffer *last_eb; + enum btrfs_block_group_size_class size_class; }; static inline u64 btrfs_block_group_end(struct btrfs_block_group *block_group) @@ -310,7 +318,8 @@ int btrfs_setup_space_cache(struct btrfs_trans_handle *trans); int btrfs_update_block_group(struct btrfs_trans_handle *trans, u64 bytenr, u64 num_bytes, bool alloc); int btrfs_add_reserved_bytes(struct btrfs_block_group *cache, - u64 ram_bytes, u64 num_bytes, int delalloc); + u64 ram_bytes, u64 num_bytes, int delalloc, + bool force_wrong_size_class); void btrfs_free_reserved_bytes(struct btrfs_block_group *cache, u64 num_bytes, int delalloc); int btrfs_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags, @@ -354,4 +363,8 @@ void btrfs_unfreeze_block_group(struct btrfs_block_group *cache); bool btrfs_inc_block_group_swap_extents(struct btrfs_block_group *bg); void btrfs_dec_block_group_swap_extents(struct btrfs_block_group *bg, int amount); +enum btrfs_block_group_size_class btrfs_calc_block_group_size_class(u64 size); +int btrfs_use_block_group_size_class(struct btrfs_block_group *bg, + enum btrfs_block_group_size_class size_class, + bool force_wrong_size_class); #endif /* BTRFS_BLOCK_GROUP_H */ diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 069761529398..63219887f534 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3373,7 +3373,9 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_ref *ref) enum btrfs_loop_type { LOOP_CACHING_NOWAIT, LOOP_CACHING_WAIT, + LOOP_UNSET_SIZE_CLASS, LOOP_ALLOC_CHUNK, + LOOP_WRONG_SIZE_CLASS, LOOP_NO_EMPTY_SIZE, }; @@ -3938,24 +3940,6 @@ static int can_allocate_chunk(struct btrfs_fs_info *fs_info, } } -static int chunk_allocation_failed(struct find_free_extent_ctl *ffe_ctl) -{ - switch (ffe_ctl->policy) { - case BTRFS_EXTENT_ALLOC_CLUSTERED: - /* - * If we can't allocate a new chunk we've already looped through - * at least once, move on to the NO_EMPTY_SIZE case. - */ - ffe_ctl->loop = LOOP_NO_EMPTY_SIZE; - return 0; - case BTRFS_EXTENT_ALLOC_ZONED: - /* Give up here */ - return -ENOSPC; - default: - BUG(); - } -} - /* * Return >0 means caller needs to re-search for free extent * Return 0 means we have the needed free extent. @@ -3989,31 +3973,28 @@ static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info, * LOOP_CACHING_NOWAIT, search partially cached block groups, kicking * caching kthreads as we move along * LOOP_CACHING_WAIT, search everything, and wait if our bg is caching + * LOOP_UNSET_SIZE_CLASS, allow unset size class * LOOP_ALLOC_CHUNK, force a chunk allocation and try again * LOOP_NO_EMPTY_SIZE, set empty_size and empty_cluster to 0 and try * again */ if (ffe_ctl->loop < LOOP_NO_EMPTY_SIZE) { ffe_ctl->index = 0; - if (ffe_ctl->loop == LOOP_CACHING_NOWAIT) { - /* - * We want to skip the LOOP_CACHING_WAIT step if we - * don't have any uncached bgs and we've already done a - * full search through. - */ - if (ffe_ctl->orig_have_caching_bg || !full_search) - ffe_ctl->loop = LOOP_CACHING_WAIT; - else - ffe_ctl->loop = LOOP_ALLOC_CHUNK; - } else { + /* + * We want to skip the LOOP_CACHING_WAIT step if we + * don't have any uncached bgs and we've already done a + * full search through. + */ + if (ffe_ctl->loop == LOOP_CACHING_NOWAIT && + (!ffe_ctl->orig_have_caching_bg && full_search)) ffe_ctl->loop++; - } + ffe_ctl->loop++; if (ffe_ctl->loop == LOOP_ALLOC_CHUNK) { struct btrfs_trans_handle *trans; int exist = 0; - /*Check if allocation policy allows to create a new chunk */ + /* Check if allocation policy allows to create a new chunk */ ret = can_allocate_chunk(fs_info, ffe_ctl); if (ret) return ret; @@ -4033,8 +4014,10 @@ static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info, CHUNK_ALLOC_FORCE_FOR_EXTENT); /* Do not bail out on ENOSPC since we can do more. */ - if (ret == -ENOSPC) - ret = chunk_allocation_failed(ffe_ctl); + if (ret == -ENOSPC) { + ret = 0; + ffe_ctl->loop++; + } else if (ret < 0) btrfs_abort_transaction(trans, ret); else @@ -4064,6 +4047,21 @@ static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info, return -ENOSPC; } +static bool find_free_extent_check_size_class(struct find_free_extent_ctl *ffe_ctl, + struct btrfs_block_group *bg) +{ + if (ffe_ctl->policy == BTRFS_EXTENT_ALLOC_ZONED) + return true; + if (!btrfs_is_block_group_data_only(bg)) + return true; + if (ffe_ctl->loop >= LOOP_WRONG_SIZE_CLASS) + return true; + if (ffe_ctl->loop >= LOOP_UNSET_SIZE_CLASS && + bg->size_class == BTRFS_BG_SZ_NONE) + return true; + return ffe_ctl->size_class == bg->size_class; +} + static int prepare_allocation_clustered(struct btrfs_fs_info *fs_info, struct find_free_extent_ctl *ffe_ctl, struct btrfs_space_info *space_info, @@ -4198,6 +4196,7 @@ static noinline int find_free_extent(struct btrfs_root *root, ffe_ctl->total_free_space = 0; ffe_ctl->found_offset = 0; ffe_ctl->policy = BTRFS_EXTENT_ALLOC_CLUSTERED; + ffe_ctl->size_class = btrfs_calc_block_group_size_class(ffe_ctl->num_bytes); if (btrfs_is_zoned(fs_info)) ffe_ctl->policy = BTRFS_EXTENT_ALLOC_ZONED; @@ -4334,6 +4333,9 @@ static noinline int find_free_extent(struct btrfs_root *root, if (unlikely(block_group->cached == BTRFS_CACHE_ERROR)) goto loop; + if (!find_free_extent_check_size_class(ffe_ctl, block_group)) + goto loop; + bg_ret = NULL; ret = do_allocation(block_group, ffe_ctl, &bg_ret); if (ret == 0) { @@ -4368,7 +4370,8 @@ static noinline int find_free_extent(struct btrfs_root *root, ret = btrfs_add_reserved_bytes(block_group, ffe_ctl->ram_bytes, ffe_ctl->num_bytes, - ffe_ctl->delalloc); + ffe_ctl->delalloc, + ffe_ctl->loop >= LOOP_WRONG_SIZE_CLASS); if (ret == -EAGAIN) { btrfs_add_free_space_unused(block_group, ffe_ctl->found_offset, diff --git a/fs/btrfs/extent-tree.h b/fs/btrfs/extent-tree.h index 38bddb3ed224..a3e42e54e00b 100644 --- a/fs/btrfs/extent-tree.h +++ b/fs/btrfs/extent-tree.h @@ -78,6 +78,9 @@ struct find_free_extent_ctl { /* Whether or not the allocator is currently following a hint */ bool hinted; + + /* Size class of block groups to prefer in early loops */ + enum btrfs_block_group_size_class size_class; }; #endif /* BTRFS_EXTENT_TREE_H */ diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index caf0ce6ce718..39ead38f7765 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -1351,28 +1351,34 @@ DECLARE_EVENT_CLASS(btrfs__reserve_extent, TP_STRUCT__entry_btrfs( __field( u64, bg_objectid ) __field( u64, flags ) + __field( int, bg_size_class ) __field( u64, start ) __field( u64, len ) __field( u64, loop ) __field( bool, hinted ) + __field( int, size_class ) ), TP_fast_assign_btrfs(block_group->fs_info, __entry->bg_objectid = block_group->start; __entry->flags = block_group->flags; + __entry->bg_size_class = block_group->size_class; __entry->start = ffe_ctl->search_start; __entry->len = ffe_ctl->num_bytes; __entry->loop = ffe_ctl->loop; __entry->hinted = ffe_ctl->hinted; + __entry->size_class = ffe_ctl->size_class; ), TP_printk_btrfs("root=%llu(%s) block_group=%llu flags=%llu(%s) " - "start=%llu len=%llu loop=%llu hinted=%d", + "bg_size_class=%d start=%llu len=%llu loop=%llu " + "hinted=%d size_class=%d", show_root_type(BTRFS_EXTENT_TREE_OBJECTID), __entry->bg_objectid, __entry->flags, __print_flags((unsigned long)__entry->flags, "|", BTRFS_GROUP_FLAGS), - __entry->start, __entry->len, __entry->loop, __entry->hinted) + __entry->bg_size_class, __entry->start, __entry->len, + __entry->loop, __entry->hinted, __entry->size_class) ); DEFINE_EVENT(btrfs__reserve_extent, btrfs_reserve_extent, From patchwork Wed Oct 5 19:49:22 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boris Burkov X-Patchwork-Id: 12999601 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 45A05C433F5 for ; Wed, 5 Oct 2022 19:49:43 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230497AbiJETtm (ORCPT ); Wed, 5 Oct 2022 15:49:42 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:47154 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230483AbiJETtj (ORCPT ); Wed, 5 Oct 2022 15:49:39 -0400 Received: from out4-smtp.messagingengine.com (out4-smtp.messagingengine.com [66.111.4.28]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6CF68CC for ; Wed, 5 Oct 2022 12:49:33 -0700 (PDT) Received: from compute5.internal (compute5.nyi.internal [10.202.2.45]) by mailout.nyi.internal (Postfix) with ESMTP id D7DFA5C0121; Wed, 5 Oct 2022 15:49:32 -0400 (EDT) Received: from mailfrontend1 ([10.202.2.162]) by compute5.internal (MEProxy); Wed, 05 Oct 2022 15:49:32 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bur.io; h=cc :content-transfer-encoding:date:date:from:from:in-reply-to :in-reply-to:message-id:mime-version:references:reply-to:sender :subject:subject:to:to; s=fm2; t=1664999372; x=1665085772; bh=5u 8fAVxRoI8o5bWXQiIYZVMC3/nQuZLdnjSVvLYjGQc=; b=sd4G7T8reXFEkn3MIT ddoYcvnifvNBru01718L+JkXPH5fzfsTbUnSFNrnueKi9sq+9K/oPO5EwWZftWEx RtXHMPfNW3/1uBijGrWtVbWy/CkNtB7WzIjvHKRVn+a8Z/55/DO97cGrIL/q2kwv LDX/gSpKXLznGz58LGEAz9IVahTjaVzCw8bpax05PP0VawS+ZfH0bKlp+IHwnOId pinKKT9Ymo22oLoSzHP1LtnbjQEYSfjLPAGfaANEtpM3Z0kP7oN58weBB7zIGZHF ESNh1cqcdfC16k/3GOaLVu0ojj/cLH1nLY8vYMWMakpT+V/emM3yiE4icUTmrrnf NPMw== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:content-transfer-encoding:date:date :feedback-id:feedback-id:from:from:in-reply-to:in-reply-to :message-id:mime-version:references:reply-to:sender:subject :subject:to:to:x-me-proxy:x-me-proxy:x-me-sender:x-me-sender :x-sasl-enc; s=fm2; t=1664999372; x=1665085772; bh=5u8fAVxRoI8o5 bWXQiIYZVMC3/nQuZLdnjSVvLYjGQc=; b=b8D85MoLylqsfpNgSkmOfzzqA//by pFRHiBOS+Dn/N6ofG20jlElZQn5/USJMLNdbFIjUD4K1o1cti57k454065QfSD11 EZ7QjCI87lUxA0tbWVJQcjPUNBMn8VQeg/15jYgXM4ffHDC2Me7IJ0g2XLsfH2wE UAuqrgioW1tNLxZfamqpitSN99hGEk8wHeNnkZ6kQrzneBdB0Knj4z6qWQKJboVZ 4F/MEKPjNBzioZ1v5SQH0TczY/d+dCUxoJZHAHsEDNI1KkRuf0AZq9/+KDitRaXA Ho/eYT7BBvOGLuI2UIN6Hf44a9akTpAn2z2TKZ05v1DuY8LujVx/Tcd9g== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgedvfedrfeeifedgudegtdcutefuodetggdotefrod ftvfcurfhrohhfihhlvgemucfhrghsthforghilhdpqfgfvfdpuffrtefokffrpgfnqfgh necuuegrihhlohhuthemuceftddtnecunecujfgurhephffvufffkffojghfggfgsedtke ertdertddtnecuhfhrohhmpeeuohhrihhsuceuuhhrkhhovhcuoegsohhrihhssegsuhhr rdhioheqnecuggftrfgrthhtvghrnhepieeuffeuvdeiueejhfehiefgkeevudejjeejff evvdehtddufeeihfekgeeuheelnecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghm pehmrghilhhfrhhomhepsghorhhishessghurhdrihho X-ME-Proxy: Feedback-ID: i083147f8:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Wed, 5 Oct 2022 15:49:32 -0400 (EDT) From: Boris Burkov To: linux-btrfs@vger.kernel.org, kernel-team@fb.com Subject: [PATCH 5/5] btrfs: load block group size class when caching Date: Wed, 5 Oct 2022 12:49:22 -0700 Message-Id: <9ba155f3d9ed5c1025dc0f497d621171df06d95d.1664999303.git.boris@bur.io> X-Mailer: git-send-email 2.37.2 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org Since the size class is an artifact of an arbitrary anti fragmentation strategy, it doesn't really make sense to persist it. Furthermore, most of the size class logic assumes fresh block groups. That is of course not a reasonable assumption -- we will be upgrading kernels with existing filesystems whose block groups are not classified. To work around those issues, implement logic to compute the size class of the block groups as we cache them in. To perfectly assess the state of a block group, we would have to read the entire extent tree (since the free space cache mashes together contiguous extent items) which would be prohibitively expensive for larger file systems with more extents. We can do it relatively cheaply by implementing a simple heuristic of sampling a handful of extents and picking the smallest one we see. In the happy case where the block group was classified, we will only see extents of the correct size. In the unhappy case, we will hopefully find one of the smaller extents, but there is no perfect answer anyway. Autorelocation will eventually churn up the block group if there is significant free-ing anyway. The work is done in the caching thread but after marking the block group cached, as we tradeoff classification accuracy vs. slowing down allocations. There was no regression in mount performance at end state of the fsperf test suite. Signed-off-by: Boris Burkov --- fs/btrfs/block-group.c | 130 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 130 insertions(+) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index d16a982aa593..26cae88d3659 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -527,6 +527,134 @@ u64 add_new_free_space(struct btrfs_block_group *block_group, u64 start, u64 end return total_added; } +/* + * Get an arbitrary extent item index / max_index through the block group + * + * @block_group: the block group to sample from + * @index: the integral step through the block group to grab from + * @max_index: the granularity of the sampling + * @key: return value parameter for the item we find + * + * pre-conditions on indices: + * 0 <= index <= max_index + * 0 < max_index + * + * Returns: 0 on success, 1 if the search didn't yield a useful item, negative + * error code on error. + */ +static int sample_block_group_extent_item(struct btrfs_block_group *block_group, + int index, int max_index, + struct btrfs_key *key) +{ + struct btrfs_fs_info *fs_info = block_group->fs_info; + struct btrfs_root *extent_root; + int ret = 0; + u64 search_offset; + struct btrfs_path *path; + + ASSERT(index >= 0); + ASSERT(index <= max_index); + ASSERT(max_index > 0); + + path = btrfs_alloc_path(); + if (!path) + return -ENOMEM; + + down_read(&fs_info->commit_root_sem); + extent_root = btrfs_extent_root(fs_info, max_t(u64, block_group->start, + BTRFS_SUPER_INFO_OFFSET)); + + path->skip_locking = 1; + path->search_commit_root = 1; + path->reada = READA_FORWARD; + + search_offset = index * (block_group->length / max_index); + key->objectid = block_group->start + search_offset; + key->offset = 0; + key->type = BTRFS_EXTENT_ITEM_KEY; + + ret = btrfs_search_slot(NULL, extent_root, key, path, 0, 0); + if (ret != 0) + goto out; + if (key->objectid < block_group->start || + key->objectid > block_group->start + block_group->length) { + ret = 1; + goto out; + } + if (key->type != BTRFS_EXTENT_ITEM_KEY) { + ret = 1; + goto out; + } +out: + btrfs_free_path(path); + up_read(&fs_info->commit_root_sem); + return ret; +} + +/* + * Best effort attempt to compute a block group's size class while caching it. + * + * @block_group: the block group we are caching + * + * We cannot infer the size class while adding free space extents, because that + * logic doesn't care about contiguous file extents (it doesn't differentiate + * between a 100M extent and 100 contiguous 1M extents). So we need to read the + * file extent items. Reading all of them is quite wasteful, because usually + * only a handful are enough to give a good answer. Therefore, we just grab 5 of + * them at even steps through the block group and pick the smallest size class + * we see. Since size class is best effort, and not guaranteed in general, + * inaccuracy is acceptable. + * + * To be more explicit about why this algorithm makes sense: + * + * If we are caching in a block group from disk, then there are three major cases + * to consider: + * 1. the block group is well behaved and all extents in it are the same size + * class. + * 2. the block group is mostly one size class with rare exceptions for last + * ditch allocations + * 3. the block group was populated before size classes and can have a totally + * arbitrary mix of size classes. + * + * In case 1, looking at any extent in the block group will yield the correct + * result. For the mixed cases, taking the minimum size class seems like a good + * approximation, since gaps from frees will be usable to the size class. For + * 2., a small handful of file extents is likely to yield the right answer. For + * 3, we can either read every file extent, or admit that this is best effort + * anyway and try to stay fast. + * + * Returns: 0 on success, negative error code on error. + */ +static int load_block_group_size_class(struct btrfs_block_group *block_group) +{ + struct btrfs_key key; + int i; + u64 min_size = block_group->length; + enum btrfs_block_group_size_class size_class = BTRFS_BG_SZ_NONE; + int ret; + + if (!btrfs_is_block_group_data_only(block_group)) + return 0; + + for (i = 0; i < 5; ++i) { + ret = sample_block_group_extent_item(block_group, i, 5, &key); + if (ret < 0) + goto out; + if (ret > 0) + continue; + min_size = min_t(u64, min_size, key.offset); + size_class = btrfs_calc_block_group_size_class(min_size); + } + if (size_class != BTRFS_BG_SZ_NONE) { + spin_lock(&block_group->lock); + block_group->size_class = size_class; + spin_unlock(&block_group->lock); + } + +out: + return ret; +} + static int load_extent_tree_free(struct btrfs_caching_control *caching_ctl) { struct btrfs_block_group *block_group = caching_ctl->block_group; @@ -726,6 +854,8 @@ static noinline void caching_thread(struct btrfs_work *work) wake_up(&caching_ctl->wait); + load_block_group_size_class(block_group); + btrfs_put_caching_control(caching_ctl); btrfs_put_block_group(block_group); }