From patchwork Thu Jul 12 01:25:41 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lu Fengqi
X-Patchwork-Id: 10520783
From: Lu Fengqi
To:
CC: Wang Xiaoguang, Qu Wenruo
Subject: [PATCH v14.8 02/14] btrfs: Introduce COMPRESS reserve type to fix false enospc for compression
Date: Thu, 12 Jul 2018 09:25:41 +0800
Message-ID: <20180712012553.29431-3-lufq.fnst@cn.fujitsu.com>
X-Mailer: git-send-email 2.18.0
In-Reply-To: <20180712012553.29431-1-lufq.fnst@cn.fujitsu.com>
References: <20180712012553.29431-1-lufq.fnst@cn.fujitsu.com>
X-Mailing-List: linux-btrfs@vger.kernel.org

From: Wang Xiaoguang

When testing btrfs compression, we sometimes hit ENOSPC errors even
though the filesystem still had plenty of free space. With compression
enabled, xfstests generic/171, generic/172, generic/173, generic/174
and generic/175 can reveal this bug in my test environment.

After some debugging, we found that btrfs_delalloc_reserve_metadata()
sometimes tries to reserve far too much metadata space, even for a very
small data range.

In btrfs_delalloc_reserve_metadata(), the number of metadata bytes to
reserve is derived from the difference between outstanding extents and
reserved extents. But because drop_outstanding_extent() is badly
designed, that difference can become far too large and cause the
problem.

The problem happens in the following flow with compression enabled:
1) Buffered write of 128M data in 128K chunks
   outstanding_extents = 1
   reserved_extents = 1024 (128M / 128K, each 128K block gets one
   reserved_extent)
   Note: btrfs_merge_extent_hook() merges the outstanding extents into
   one, but the reserved extents are still 1024.

2) Allocate extents for the dirty range
   cow_file_range_async() splits the above large extent into small 128K
   extents.
   Let's assume 2 compressed extents have been split off, so we have:
   outstanding_extents = 3
   reserved_extents = 1024
   range [0, 256K) has extents allocated

3) One ordered extent gets finished
   btrfs_finish_ordered_io()
   |- btrfs_delalloc_release_metadata()
      |- drop_outstanding_extent()
   drop_outstanding_extent() frees *ALL* redundant reserved extents,
   so we have:
   outstanding_extents = 2 (one has finished)
   reserved_extents = 2

4) Continue allocating extents for the dirty range
   cow_file_range_async() continues handling the remaining range.
   When the whole 128M range is done, and assuming no more ordered
   extents have finished:
   outstanding_extents = 1023 (one has finished in step 3)
   reserved_extents = 2 (*ALL* redundant ones were freed in step 3)

5) Another buffered write happens to the file
   btrfs_delalloc_reserve_metadata() calculates the metadata space as:
   meta_to_reserve = (outstanding_extents - reserved_extents) *
                     nodesize * max_tree_level(8) * 2
   If nodesize is 16K, that is 1021 * 16K * 8 * 2, nearly 256M.
   If nodesize is 64K, it is about 1G.
   That is totally insane. (A small sketch of this arithmetic follows
   the changelog below.)

The fix is to introduce a new reserve type, COMPRESS, which tells the
outstanding extents calculation algorithm the correct maximum extent
size, so it computes the right number of outstanding extents.
So in step 1):
   outstanding_extents = 1024
   reserved_extents = 1024

Step 2):
   outstanding_extents = 1024
   reserved_extents = 1024

Step 3):
   outstanding_extents = 1023
   reserved_extents = 1023

And in step 5) we reserve the correct amount of metadata space.
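Purely as an illustration (not part of the patch): a minimal user-space
sketch of the step 5) arithmetic above. meta_to_reserve_bytes() is a
made-up helper, not a btrfs function; the 8 is max_tree_level and the 2
is the extra factor from the formula in the changelog.

#include <stdio.h>
#include <inttypes.h>

/* Made-up helper reproducing the changelog formula, not btrfs code. */
static uint64_t meta_to_reserve_bytes(uint64_t outstanding_extents,
                                      uint64_t reserved_extents,
                                      uint64_t nodesize)
{
        return (outstanding_extents - reserved_extents) * nodesize * 8 * 2;
}

int main(void)
{
        /* Step 5) numbers: 1023 outstanding extents vs. 2 reserved extents. */
        printf("nodesize 16K: %" PRIu64 " MiB\n",
               meta_to_reserve_bytes(1023, 2, 16 * 1024) >> 20);
        printf("nodesize 64K: %" PRIu64 " MiB\n",
               meta_to_reserve_bytes(1023, 2, 64 * 1024) >> 20);
        return 0;
}

This prints 255 MiB for a 16K nodesize and 1021 MiB for a 64K nodesize,
matching the "nearly 256M" and "about 1G" figures quoted above.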
Signed-off-by: Wang Xiaoguang
Signed-off-by: Qu Wenruo
Signed-off-by: Lu Fengqi
---
 fs/btrfs/ctree.h       |  2 ++
 fs/btrfs/extent-tree.c |  2 ++
 fs/btrfs/extent_io.c   |  7 ++--
 fs/btrfs/extent_io.h   |  1 +
 fs/btrfs/file.c        |  3 ++
 fs/btrfs/inode.c       | 81 +++++++++++++++++++++++++++++++++++-------
 fs/btrfs/ioctl.c       |  2 ++
 fs/btrfs/relocation.c  |  3 ++
 8 files changed, 86 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f906aab71116..8743fdcfe139 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -106,9 +106,11 @@ static inline u32 count_max_extents(u64 size, u64 max_extent_size)
  */
 enum btrfs_metadata_reserve_type {
         BTRFS_RESERVE_NORMAL,
+        BTRFS_RESERVE_COMPRESS,
 };
 
 u64 btrfs_max_extent_size(enum btrfs_metadata_reserve_type reserve_type);
+int inode_need_compress(struct inode *inode, u64 start, u64 end);
 
 struct btrfs_mapping_tree {
         struct extent_map_tree map_tree;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8e7ad123aa95..225ebcb1fd09 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6021,6 +6021,8 @@ u64 btrfs_max_extent_size(enum btrfs_metadata_reserve_type reserve_type)
 {
         if (reserve_type == BTRFS_RESERVE_NORMAL)
                 return BTRFS_MAX_EXTENT_SIZE;
+        else if (reserve_type == BTRFS_RESERVE_COMPRESS)
+                return SZ_128K;
 
         ASSERT(0);
         return BTRFS_MAX_EXTENT_SIZE;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index e55843f536bc..25d1c302dd47 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -596,7 +596,7 @@ int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
         btrfs_debug_check_extent_io_range(tree, start, end);
 
         if (bits & EXTENT_DELALLOC)
-                bits |= EXTENT_NORESERVE;
+                bits |= EXTENT_NORESERVE | EXTENT_COMPRESS;
 
         if (delete)
                 bits |= ~EXTENT_CTLBITS;
@@ -1489,6 +1489,7 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
         u64 cur_start = *start;
         u64 found = 0;
         u64 total_bytes = 0;
+        unsigned int pre_state;
 
         spin_lock(&tree->lock);
 
@@ -1506,7 +1507,8 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
         while (1) {
                 state = rb_entry(node, struct extent_state, rb_node);
                 if (found && (state->start != cur_start ||
-                              (state->state & EXTENT_BOUNDARY))) {
+                              (state->state & EXTENT_BOUNDARY) ||
+                              (state->state ^ pre_state) & EXTENT_COMPRESS)) {
                         goto out;
                 }
                 if (!(state->state & EXTENT_DELALLOC)) {
@@ -1522,6 +1524,7 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
                 found++;
                 *end = state->end;
                 cur_start = state->end + 1;
+                pre_state = state->state;
                 node = rb_next(node);
                 total_bytes += state->end - state->start + 1;
                 if (total_bytes >= max_bytes)
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 0bfd4aeb822d..4eabbbaa17e9 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -25,6 +25,7 @@
 #define EXTENT_QGROUP_RESERVED  (1U << 16)
 #define EXTENT_CLEAR_DATA_RESV  (1U << 17)
 #define EXTENT_DELALLOC_NEW     (1U << 18)
+#define EXTENT_COMPRESS         (1U << 19)
 #define EXTENT_IOBITS           (EXTENT_LOCKED | EXTENT_WRITEBACK)
 #define EXTENT_DO_ACCOUNTING    (EXTENT_CLEAR_META_RESV | \
                                  EXTENT_CLEAR_DATA_RESV)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 65aa8662b03b..b503b255b65b 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1598,6 +1598,9 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
         if (!pages)
                 return -ENOMEM;
 
+        if (inode_need_compress(inode, -1, 0))
+                reserve_type = BTRFS_RESERVE_COMPRESS;
+
         while (iov_iter_count(i) > 0) {
                 size_t offset = pos & (PAGE_SIZE - 1);
                 size_t sector_offset;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9c5c64bd4bb9..d58d984fc3af 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -392,7 +392,7 @@ static noinline int add_async_extent(struct async_cow *cow,
         return 0;
 }
 
-static inline int inode_need_compress(struct inode *inode, u64 start, u64 end)
+int inode_need_compress(struct inode *inode, u64 start, u64 end)
 {
         struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 
@@ -1179,7 +1179,8 @@ static noinline void async_cow_free(struct btrfs_work *work)
 static int cow_file_range_async(struct inode *inode, struct page *locked_page,
                                 u64 start, u64 end, int *page_started,
                                 unsigned long *nr_written,
-                                unsigned int write_flags)
+                                unsigned int write_flags,
+                                enum btrfs_metadata_reserve_type reserve_type)
 {
         struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
         struct async_cow *async_cow;
@@ -1198,10 +1199,8 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
                 async_cow->start = start;
                 async_cow->write_flags = write_flags;
 
-                if (BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS &&
-                    !btrfs_test_opt(fs_info, FORCE_COMPRESS))
-                        cur_end = end;
-                else
+                cur_end = end;
+                if (reserve_type == BTRFS_RESERVE_COMPRESS)
                         cur_end = min(end, start + SZ_512K - 1);
 
                 async_cow->end = cur_end;
@@ -1610,6 +1609,14 @@ static int run_delalloc_range(void *private_data, struct page *locked_page,
         int ret;
         int force_cow = need_force_cow(inode, start, end);
         unsigned int write_flags = wbc_to_write_flags(wbc);
+        struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+        int need_compress;
+        enum btrfs_metadata_reserve_type reserve_type = BTRFS_RESERVE_NORMAL;
+
+        need_compress = test_range_bit(io_tree, start, end,
+                                       EXTENT_COMPRESS, 1, NULL);
+        if (need_compress)
+                reserve_type = BTRFS_RESERVE_COMPRESS;
 
         if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW && !force_cow) {
                 ret = run_delalloc_nocow(inode, locked_page, start, end,
@@ -1617,7 +1624,7 @@ static int run_delalloc_range(void *private_data, struct page *locked_page,
         } else if (BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC && !force_cow) {
                 ret = run_delalloc_nocow(inode, locked_page, start, end,
                                          page_started, 0, nr_written);
-        } else if (!inode_need_compress(inode, start, end)) {
+        } else if (!need_compress) {
                 ret = cow_file_range(inode, locked_page, start, end, end,
                                       page_started, nr_written, 1, NULL);
         } else {
@@ -1625,7 +1632,7 @@ static int run_delalloc_range(void *private_data, struct page *locked_page,
                         &BTRFS_I(inode)->runtime_flags);
                 ret = cow_file_range_async(inode, locked_page, start, end,
                                            page_started, nr_written,
-                                           write_flags);
+                                           write_flags, reserve_type);
         }
         if (ret)
                 btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
@@ -1644,6 +1651,9 @@ static void btrfs_split_extent_hook(void *private_data,
         if (!(orig->state & EXTENT_DELALLOC))
                 return;
 
+        if (orig->state & EXTENT_COMPRESS)
+                reserve_type = BTRFS_RESERVE_COMPRESS;
+
         max_extent_size = btrfs_max_extent_size(reserve_type);
         size = orig->end - orig->start + 1;
 
@@ -1688,6 +1698,9 @@ static void btrfs_merge_extent_hook(void *private_data,
         if (!(other->state & EXTENT_DELALLOC))
                 return;
 
+        if (other->state & EXTENT_COMPRESS)
+                reserve_type = BTRFS_RESERVE_COMPRESS;
+
         max_extent_size = btrfs_max_extent_size(reserve_type);
 
         if (new->start > other->start)
@@ -1813,6 +1826,8 @@ static void btrfs_set_bit_hook(void *private_data,
                         BTRFS_RESERVE_NORMAL;
                 bool do_list = !btrfs_is_free_space_inode(BTRFS_I(inode));
 
+                if (*bits & EXTENT_COMPRESS)
+                        reserve_type = BTRFS_RESERVE_COMPRESS;
                 max_extent_size = btrfs_max_extent_size(reserve_type);
                 num_extents = count_max_extents(len, max_extent_size);
 
@@ -1874,6 +1889,8 @@ static void btrfs_clear_bit_hook(void *private_data,
                 struct btrfs_root *root = inode->root;
                 bool do_list = !btrfs_is_free_space_inode(inode);
 
+                if (state->state & EXTENT_COMPRESS)
+                        reserve_type = BTRFS_RESERVE_COMPRESS;
                 max_extent_size = btrfs_max_extent_size(reserve_type);
                 num_extents = count_max_extents(len, max_extent_size);
 
@@ -2096,14 +2113,31 @@ static noinline int add_pending_csums(struct btrfs_trans_handle *trans,
         return 0;
 }
 
+/*
+ * Normally flag should be 0, but if a data range will go through compress path,
+ * set flag to 1. Note: here we should ensure enum btrfs_metadata_reserve_type
+ * and flag's values are consistent.
+ */
 int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
                               unsigned int extra_bits,
                               struct extent_state **cached_state,
                               enum btrfs_metadata_reserve_type reserve_type)
 {
+        int ret;
+        unsigned int bits;
+
+        /* compression path */
+        if (reserve_type == BTRFS_RESERVE_COMPRESS)
+                bits = EXTENT_DELALLOC | EXTENT_COMPRESS | EXTENT_UPTODATE |
+                        extra_bits;
+        else
+                bits = EXTENT_DELALLOC | EXTENT_UPTODATE | extra_bits;
+
         WARN_ON((end & (PAGE_SIZE - 1)) == 0);
-        return set_extent_delalloc(&BTRFS_I(inode)->io_tree, start, end,
-                                   extra_bits, cached_state);
+        ret = set_extent_bit(&BTRFS_I(inode)->io_tree, start, end,
+                             bits, NULL, cached_state, GFP_NOFS);
+
+        return ret;
 }
 
@@ -2111,9 +2145,20 @@ int btrfs_set_extent_defrag(struct inode *inode, u64 start, u64 end,
                             struct extent_state **cached_state,
                             enum btrfs_metadata_reserve_type reserve_type)
 {
+        int ret;
+        unsigned int bits;
+
         WARN_ON((end & (PAGE_SIZE - 1)) == 0);
-        return set_extent_defrag(&BTRFS_I(inode)->io_tree, start, end,
-                                 cached_state);
+        if (reserve_type == BTRFS_RESERVE_COMPRESS)
+                bits = EXTENT_DELALLOC | EXTENT_UPTODATE | EXTENT_DEFRAG |
+                        EXTENT_COMPRESS;
+        else
+                bits = EXTENT_DELALLOC | EXTENT_UPTODATE | EXTENT_DEFRAG;
+
+        ret = set_extent_bit(&BTRFS_I(inode)->io_tree, start, end,
+                             bits, NULL, cached_state, GFP_NOFS);
+
+        return ret;
 }
 
 /* see btrfs_writepage_start_hook for details on why this is required */
@@ -2166,6 +2211,8 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
                 goto again;
         }
 
+        if (inode_need_compress(inode, page_start, page_end))
+                reserve_type = BTRFS_RESERVE_COMPRESS;
         ret = btrfs_delalloc_reserve_space(inode, &data_reserved, page_start,
                                            PAGE_SIZE, reserve_type);
         if (ret) {
@@ -3085,8 +3132,11 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 
         trans->block_rsv = &BTRFS_I(inode)->block_rsv;
 
-        if (test_bit(BTRFS_ORDERED_COMPRESSED, &ordered_extent->flags))
+        if (test_bit(BTRFS_ORDERED_COMPRESSED, &ordered_extent->flags)) {
                 compress_type = ordered_extent->compress_type;
+                reserve_type = BTRFS_RESERVE_COMPRESS;
+        }
+
         if (test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags)) {
                 BUG_ON(compress_type);
                 btrfs_qgroup_free_data(inode, NULL, ordered_extent->file_offset,
@@ -4917,6 +4967,9 @@ int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len,
         u64 block_end;
         enum btrfs_metadata_reserve_type reserve_type = BTRFS_RESERVE_NORMAL;
 
+        if (inode_need_compress(inode, -1, 0))
+                reserve_type = BTRFS_RESERVE_COMPRESS;
+
         if (IS_ALIGNED(offset, blocksize) &&
             (!len || IS_ALIGNED(len, blocksize)))
                 goto out;
@@ -8943,6 +8996,8 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
         page_end = page_start + PAGE_SIZE - 1;
         end = page_end;
 
+        if (inode_need_compress(inode, page_start, page_end))
+                reserve_type = BTRFS_RESERVE_COMPRESS;
         /*
          * Reserving delalloc space after obtaining the page lock can lead to
          * deadlock. For example, if a dirty page is locked by this function
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 20a7aa36fe6a..fd0329065c4b 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1234,6 +1234,8 @@ static int cluster_pages_for_defrag(struct inode *inode,
         page_cnt = min_t(u64, (u64)num_pages,
                          (u64)file_end - start_index + 1);
 
+        if (inode_need_compress(inode, -1, 0))
+                reserve_type = BTRFS_RESERVE_COMPRESS;
         ret = btrfs_delalloc_reserve_space(inode, &data_reserved,
                         start_index << PAGE_SHIFT,
                         page_cnt << PAGE_SHIFT, reserve_type);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 95a9b80c0110..85b872278a71 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3195,6 +3195,9 @@ static int relocate_file_extent_cluster(struct inode *inode,
         if (!cluster->nr)
                 return 0;
 
+        if (inode_need_compress(inode, -1, 0))
+                reserve_type = BTRFS_RESERVE_COMPRESS;
+
         ra = kzalloc(sizeof(*ra), GFP_NOFS);
         if (!ra)
                 return -ENOMEM;
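
Context only, not part of the patch: a rough user-space sketch of why
capping the maximum extent size at 128K for compressed ranges keeps the
outstanding/reserved extent counts in step. count_extents_for() is a
hypothetical stand-in that only mirrors the round-up division assumed
to be done by count_max_extents(); the 128M BTRFS_MAX_EXTENT_SIZE value
is taken from the normal (uncompressed) case.

#include <stdio.h>
#include <inttypes.h>

#define SZ_128K         (128ULL * 1024)
#define MAX_EXTENT_SIZE (128ULL * 1024 * 1024) /* BTRFS_MAX_EXTENT_SIZE */

/* Hypothetical stand-in for count_max_extents(): round-up division. */
static uint64_t count_extents_for(uint64_t size, uint64_t max_extent_size)
{
        return (size + max_extent_size - 1) / max_extent_size;
}

int main(void)
{
        uint64_t range = 128ULL * 1024 * 1024;  /* the 128M write from the changelog */

        printf("normal reserve type:   %" PRIu64 " outstanding extent(s)\n",
               count_extents_for(range, MAX_EXTENT_SIZE));
        printf("compress reserve type: %" PRIu64 " outstanding extents\n",
               count_extents_for(range, SZ_128K));
        return 0;
}

With the 128K cap, the 128M delalloc range counts as 1024 outstanding
extents, matching the 1024 reserved extents, so the difference used for
the metadata reservation stays small.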