From patchwork Thu Feb 12 16:16:51 2015 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Josef Bacik X-Patchwork-Id: 5820551 Return-Path: X-Original-To: patchwork-linux-btrfs@patchwork.kernel.org Delivered-To: patchwork-parsemail@patchwork2.web.kernel.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.136]) by patchwork2.web.kernel.org (Postfix) with ESMTP id F1DD2BF440 for ; Thu, 12 Feb 2015 16:17:13 +0000 (UTC) Received: from mail.kernel.org (localhost [127.0.0.1]) by mail.kernel.org (Postfix) with ESMTP id 280D420219 for ; Thu, 12 Feb 2015 16:17:11 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id C41A32015E for ; Thu, 12 Feb 2015 16:17:05 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755753AbbBLQQz (ORCPT ); Thu, 12 Feb 2015 11:16:55 -0500 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:52670 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755292AbbBLQQz (ORCPT ); Thu, 12 Feb 2015 11:16:55 -0500 Received: from pps.filterd (m0044012 [127.0.0.1]) by mx0a-00082601.pphosted.com (8.14.5/8.14.5) with SMTP id t1CG6vPd028239 for ; Thu, 12 Feb 2015 08:16:54 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=wit.ai; h=from : to : subject : date : message-id : in-reply-to : references : mime-version : content-type; s=mx2; bh=GlL5/uxCfYVRqQk4vDm6BLQCNXP8HSuzg0s7BX8uIQw=; b=h5dQ9hYAd7pNSQ25EQiHD5WQv0JOzAZnut4lfw9WVWGYVCFqY7Gjlm2fEEMMheIEniP+ jQGHjfy3+shv93nxR/5uhNZaZC7r3jPhy9aDW/bVAYZNqhULFhPYcVdxMf8qSx9qQtTz LmtIcuKThCF4BIMHG8yeXa8O6SGOWf1VJEA7dpBkl/nllWqaTUJ0KGn5I8XJHzGMjJVA zsqTVhSeBJqIRxtWdX1mN28dPbBotVu4K+rPTOJw18RSxkcDSRmmd4itARzERmOndnD1 fmkuJvenEzQSFiD2f+3QdWagvoT1qCXGPsGAXoFG47Zv4UAORvQKccJONC7NzYSqqcTV Bw== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=fb.com; h=from : to : subject : date : message-id : in-reply-to : references : mime-version : content-type; s=facebook; bh=GlL5/uxCfYVRqQk4vDm6BLQCNXP8HSuzg0s7BX8uIQw=; b=CnlH9kcRXTTShwELlD3aPyIGKVyalIhzE0rReY0FPpV2VljVS2bSNxCn68qkCadp9PHn X7vQe3xyzrKqaM41fz+yaQ2l96edR5kzLGXOyuV/03LI7V45zFlqHGAnUwQOtTE1Qrpt VJQ65MI3B74D7xVgZ6Ik3JbJGzt+uq3lvsk= Received: from mail.thefacebook.com ([199.201.64.23]) by mx0a-00082601.pphosted.com with ESMTP id 1sgv85rbug-3 (version=TLSv1/SSLv3 cipher=AES128-SHA bits=128 verify=NOT) for ; Thu, 12 Feb 2015 08:16:54 -0800 Received: from localhost (192.168.57.29) by mail.thefacebook.com (192.168.16.19) with Microsoft SMTP Server (TLS) id 14.3.195.1; Thu, 12 Feb 2015 08:16:53 -0800 From: Josef Bacik To: Subject: [PATCH 3/3 v2] Btrfs: account for large extents with enospc Date: Thu, 12 Feb 2015 11:16:51 -0500 Message-ID: <1423757811-2511-1-git-send-email-jbacik@fb.com> X-Mailer: git-send-email 1.8.3.1 In-Reply-To: <20150212043653.GB2416@localhost.localdomain> References: <20150212043653.GB2416@localhost.localdomain> MIME-Version: 1.0 X-Originating-IP: [192.168.57.29] X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:5.13.68, 1.0.33, 0.0.0000 definitions=2015-02-12_05:2015-02-12, 2015-02-12, 1970-01-01 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 spamscore=0 suspectscore=3 phishscore=0 adultscore=0 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=7.0.1-1402240000 definitions=main-1502120159 X-FB-Internal: deliver Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org X-Spam-Status: No, score=-6.8 required=5.0 tests=BAYES_00,DKIM_SIGNED, RCVD_IN_DNSWL_HI,T_DKIM_INVALID,T_RP_MATCHES_RCVD,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP On our gluster boxes we stream large tar balls of backups onto our fses. With 160gb of ram this means we get really large contiguous ranges of dirty data, but the way our ENOSPC stuff works is that as long as it's contiguous we only hold metadata reservation for one extent. The problem is we limit our extents to 128mb, so we'll end up with at least 800 extents so our enospc accounting is quite a bit lower than what we need. To keep track of this make sure we increase outstanding_extents for every multiple of the max extent size so we can be sure to have enough reserved metadata space. Thanks, Signed-off-by: Josef Bacik Reviewed-by: Liu Bo --- V1->V2: -use bit shift instead of div64_u64. -skip math if num_bytes < max_size in drop_outstanding_extents fs/btrfs/ctree.h | 8 ++++++++ fs/btrfs/extent-tree.c | 16 ++++++++++----- fs/btrfs/extent_io.c | 2 +- fs/btrfs/inode.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++- 4 files changed, 73 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 0b4683f..afdae42 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -198,6 +198,14 @@ static int btrfs_csum_sizes[] = { 4, 0 }; #define BTRFS_DIRTY_METADATA_THRESH (32 * 1024 * 1024) +#define BTRFS_MAX_EXTENT_SIZE (128 * 1024 * 1024) +#define BTRFS_MAX_EXTENT_BIT 27 + +static inline u64 btrfs_num_extents(u64 num_bytes) +{ + return (num_bytes + BTRFS_MAX_EXTENT_SIZE - 1) >> BTRFS_MAX_EXTENT_BIT; +} + /* * The key defines the order in the tree, and so it also defines (optimal) * block layout. diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index a346487..18e8a5a 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4966,19 +4966,25 @@ void btrfs_subvolume_release_metadata(struct btrfs_root *root, /** * drop_outstanding_extent - drop an outstanding extent * @inode: the inode we're dropping the extent for + * @num_bytes: the number of bytes we're relaseing. * * This is called when we are freeing up an outstanding extent, either called * after an error or after an extent is written. This will return the number of * reserved extents that need to be freed. This must be called with * BTRFS_I(inode)->lock held. */ -static unsigned drop_outstanding_extent(struct inode *inode) +static unsigned drop_outstanding_extent(struct inode *inode, u64 num_bytes) { unsigned drop_inode_space = 0; unsigned dropped_extents = 0; + unsigned num_extents = 0; - BUG_ON(!BTRFS_I(inode)->outstanding_extents); - BTRFS_I(inode)->outstanding_extents--; + num_extents = (num_bytes > BTRFS_MAX_EXTENT_SIZE) ? + (unsigned)btrfs_num_extents(num_bytes) : 1; + + ASSERT(num_extents); + ASSERT(BTRFS_I(inode)->outstanding_extents >= num_extents); + BTRFS_I(inode)->outstanding_extents -= num_extents; if (BTRFS_I(inode)->outstanding_extents == 0 && test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED, @@ -5149,7 +5155,7 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes) out_fail: spin_lock(&BTRFS_I(inode)->lock); - dropped = drop_outstanding_extent(inode); + dropped = drop_outstanding_extent(inode, num_bytes); /* * If the inodes csum_bytes is the same as the original * csum_bytes then we know we haven't raced with any free()ers @@ -5228,7 +5234,7 @@ void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes) num_bytes = ALIGN(num_bytes, root->sectorsize); spin_lock(&BTRFS_I(inode)->lock); - dropped = drop_outstanding_extent(inode); + dropped = drop_outstanding_extent(inode, num_bytes); if (num_bytes) to_free = calc_csum_metadata_size(inode, num_bytes, 0); diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 4ebabd2..3fbc177 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3239,7 +3239,7 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode, page, &delalloc_start, &delalloc_end, - 128 * 1024 * 1024); + BTRFS_MAX_EXTENT_SIZE); if (nr_delalloc == 0) { delalloc_start = delalloc_end + 1; continue; diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index fb16fd3..324cb6d 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1530,10 +1530,41 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page, static void btrfs_split_extent_hook(struct inode *inode, struct extent_state *orig, u64 split) { + u64 size; + /* not delalloc, ignore it */ if (!(orig->state & EXTENT_DELALLOC)) return; + size = orig->end - orig->start + 1; + if (size > BTRFS_MAX_EXTENT_SIZE) { + u64 new_size; + + /* + * We need the largest size of the remaining extent to see if we + * need to add a new outstanding extent. Think of the following + * case + * + * [MEAX_EXTENT_SIZEx2 - 4k][4k] + * + * The new_size would just be 4k and we'd think we had enough + * outstanding extents for this if we only took one side of the + * split, same goes for the other direction. We need to see if + * the larger size still is the same amount of extents as the + * original size, because if it is we need to add a new + * outstanding extent. But if we split up and the larger size + * is less than the original then we are good to go since we've + * already accounted for the extra extent in our original + * accounting. + */ + new_size = orig->end - split + 1; + if ((split - orig->start) > new_size) + new_size = split - orig->start; + + if (btrfs_num_extents(size) > btrfs_num_extents(new_size)) + return; + } + spin_lock(&BTRFS_I(inode)->lock); BTRFS_I(inode)->outstanding_extents++; spin_unlock(&BTRFS_I(inode)->lock); @@ -1549,10 +1580,30 @@ static void btrfs_merge_extent_hook(struct inode *inode, struct extent_state *new, struct extent_state *other) { + u64 new_size, old_size; + /* not delalloc, ignore it */ if (!(other->state & EXTENT_DELALLOC)) return; + old_size = other->end - other->start + 1; + new_size = old_size + (new->end - new->start + 1); + + /* we're not bigger than the max, unreserve the space and go */ + if (new_size <= BTRFS_MAX_EXTENT_SIZE) { + spin_lock(&BTRFS_I(inode)->lock); + BTRFS_I(inode)->outstanding_extents--; + spin_unlock(&BTRFS_I(inode)->lock); + return; + } + + /* + * If we grew by another max_extent, just return, we want to keep that + * reserved amount. + */ + if (btrfs_num_extents(new_size) > btrfs_num_extents(old_size)) + return; + spin_lock(&BTRFS_I(inode)->lock); BTRFS_I(inode)->outstanding_extents--; spin_unlock(&BTRFS_I(inode)->lock); @@ -1648,6 +1699,7 @@ static void btrfs_clear_bit_hook(struct inode *inode, unsigned long *bits) { u64 len = state->end + 1 - state->start; + u64 num_extents = btrfs_num_extents(len); spin_lock(&BTRFS_I(inode)->lock); if ((state->state & EXTENT_DEFRAG) && (*bits & EXTENT_DEFRAG)) @@ -1667,7 +1719,7 @@ static void btrfs_clear_bit_hook(struct inode *inode, *bits &= ~EXTENT_FIRST_DELALLOC; } else if (!(*bits & EXTENT_DO_ACCOUNTING)) { spin_lock(&BTRFS_I(inode)->lock); - BTRFS_I(inode)->outstanding_extents--; + BTRFS_I(inode)->outstanding_extents -= num_extents; spin_unlock(&BTRFS_I(inode)->lock); }