From patchwork Thu Jun 29 22:09:25 2017
X-Patchwork-Submitter: Omar Sandoval
X-Patchwork-Id: 9818059
Date: Thu, 29 Jun 2017 15:09:25 -0700
From: Omar Sandoval
To: Jeff Mahoney
Cc: linux-btrfs@vger.kernel.org, Nikolay Borisov
Subject: Re: [PATCH v2 1/2] btrfs: account for pinned bytes in should_alloc_chunk
Message-ID: <20170629220925.GC20229@vader.DHCP.thefacebook.com>
References: <20170622135148.16730-1-jeffm@suse.com>
 <20170629192141.GA20229@vader.DHCP.thefacebook.com>
X-Mailing-List: linux-btrfs@vger.kernel.org

On Thu, Jun 29, 2017 at 03:49:05PM -0400, Jeff Mahoney wrote:
> On 6/29/17 3:21 PM, Omar Sandoval wrote:
> > On Thu, Jun 22, 2017 at 09:51:47AM -0400, jeffm@suse.com wrote:
> >> From: Jeff Mahoney
> >>
> >> In a heavy write scenario, we can end up with a large number of pinned
> >> bytes. This can translate into (very) premature ENOSPC because pinned
> >> bytes must be accounted for when allowing a reservation but aren't
> >> accounted for when deciding whether to create a new chunk.
> >>
> >> This patch adds the accounting to should_alloc_chunk so that we can
> >> create the chunk.
> >
> > Hey, Jeff,
>
> Hi Omar -
>
> > Does this fix your ENOSPC problem on a fresh filesystem? I just tracked
>
> No, it didn't. It helped somewhat, but we were still hitting it
> frequently. What did help was reverting "Btrfs: skip commit transaction
> if we don't have enough pinned bytes" (not upstream yet, on the list).
>
> > down an ENOSPC issue someone here reported when doing a btrfs send to a
> > fresh filesystem, and it sounds a lot like your issue: metadata
> > bytes_may_use shoots up but we don't allocate any chunks for it. I'm not
> > seeing how including bytes_pinned will help for this case. We won't have
> > any pinned bytes when populating a new fs, right?
>
> Our test environment is just installing the OS. That means lots of
> creates, writes, and then renames, so there's a fair amount of metadata
> churn that results in elevated pinned_bytes. Rsync can cause the same
> workload pretty easily too. Nikolay was going to look into coming up
> with a configuration for fsstress that would emulate it.
>
> > I don't have a good solution. Allocating chunks based on bytes_may_use
> > is going to way over-allocate because of our worst-case estimations. I'm
> > double-checking now that the flusher is doing the right thing and not
> > missing anything. I'll keep digging, just wanted to know if you had any
> > thoughts.
>
> My suspicion is that it all just happens to work, and that there are
> several bugs working together that approximate a correct result. My
> reasoning is that the patch I referenced above is correct. The logic in
> may_commit_transaction is inverted and causes a ton of additional
> transaction commits. I think those additional commits free pinned bytes
> more quickly, so things mostly work and pinned bytes don't play as much
> of a role. But once the transaction count comes down, the pinned bytes
> count gets elevated and becomes more important. I think it should be
> taken into account to determine whether committing a transaction early
> will release enough space to honor the reservation without allocating a
> new chunk. If the answer is yes, flush it. If no, there's no point in
> flushing it now, so just allocate the chunk and move on.
>
> The big question is where this 80% number comes into play.
>
> There is a caveat here: almost all of our testing has been on 4.4 with a
> bunch of these patches backported. I believe the same issue will still
> be there on mainline, but we're in release crunch mode and I haven't had
> a chance to test more fully.
>
> -Jeff
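
For reference, the ~80% figure mentioned above is the threshold in
should_alloc_chunk() that decides when a new chunk gets allocated. Below
is a small standalone model of that decision, not the kernel code; the
function, its parameters, and the numbers are invented here to show how
counting pinned bytes can flip the answer:

#include <stdio.h>
#include <stdint.h>

/*
 * Standalone model of the ~80% check discussed above -- NOT the kernel
 * source.  The idea: a new chunk is only requested once the space already
 * counted as allocated crosses roughly 80% of the space_info's total, so
 * bytes that sit pinned but are not counted as "allocated" can postpone
 * chunk allocation long enough to hit ENOSPC.
 */
static int should_alloc_chunk_model(uint64_t total, uint64_t used,
                                    uint64_t reserved, uint64_t pinned,
                                    int count_pinned)
{
        uint64_t allocated = used + reserved + (count_pinned ? pinned : 0);

        /* allocate a new chunk once ~80% of the existing space is consumed */
        return allocated + (2ULL << 20) >= total * 8 / 10;
}

int main(void)
{
        uint64_t mib = 1ULL << 20;
        uint64_t total = 1024 * mib;            /* one 1 GiB metadata chunk */
        uint64_t used = 500 * mib;
        uint64_t reserved = 100 * mib;
        uint64_t pinned = 300 * mib;

        printf("new chunk without counting pinned: %d\n",
               should_alloc_chunk_model(total, used, reserved, pinned, 0));
        printf("new chunk counting pinned:         %d\n",
               should_alloc_chunk_model(total, used, reserved, pinned, 1));
        return 0;
}

With these example numbers the allocator says no until pinned bytes are
included in the sum, which is the behavior the patch description argues
for.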
Jeff, can you try this and see if it helps? In my test case, it looks
like what's happening is that most of the metadata reservation we have
comes from delalloc extents. When someone comes along who isn't allowed
to overcommit anymore, they queue up their ticket and kick the flusher.
Then the flusher comes along, flushes a little bit of delalloc, and sees
"oh, we can overcommit now, we're good", but it still hasn't freed
enough to actually fulfill the ticket, so the waiter still gets an
ENOSPC.

This fixes it for my reproducer, but I need to put together a smaller
test case.

---

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 33d979e9ea2a..83eecd33ad96 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4776,10 +4776,6 @@ static void shrink_delalloc(struct btrfs_root *root, u64 to_reclaim, u64 orig,
                 else
                         flush = BTRFS_RESERVE_NO_FLUSH;
                 spin_lock(&space_info->lock);
-                if (can_overcommit(root, space_info, orig, flush)) {
-                        spin_unlock(&space_info->lock);
-                        break;
-                }
                 if (list_empty(&space_info->tickets) &&
                     list_empty(&space_info->priority_tickets)) {
                         spin_unlock(&space_info->lock);
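
To make the sequence described above concrete, here is a small standalone
model, not the kernel code, of the flush loop before and after this
change; the struct, helpers, and numbers are invented for illustration.
The point is that breaking out as soon as overcommit looks possible again
can leave a queued ticket unsatisfied, while keying the loop off the
ticket lists keeps flushing until the waiter can actually be served:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Standalone model of the failure mode described above -- NOT the kernel
 * source.  A waiter that can no longer overcommit queues a ticket for 64M
 * and kicks the flusher; the flusher frees delalloc 16M per pass.  Before
 * the patch, the loop stops as soon as overcommit looks possible again,
 * even though the queued ticket is still short, so the waiter ends up
 * with ENOSPC.
 */
struct model {
        uint64_t freed;         /* bytes the flusher has made available */
        uint64_t ticket;        /* bytes the queued waiter actually needs */
};

static void flush_some_delalloc(struct model *m)
{
        m->freed += 16ULL << 20;        /* each pass frees a little delalloc */
}

static bool can_overcommit_again(const struct model *m)
{
        return m->freed >= (32ULL << 20);  /* "a little bit" already suffices */
}

static bool run_flush_loop(struct model *m, bool stop_on_overcommit)
{
        for (int loops = 0; loops < 8; loops++) {
                flush_some_delalloc(m);
                if (stop_on_overcommit && can_overcommit_again(m))
                        break;          /* the check the diff above removes */
                if (m->freed >= m->ticket)
                        break;          /* ticket can now be satisfied */
        }
        return m->freed >= m->ticket;   /* false => the waiter gets ENOSPC */
}

int main(void)
{
        struct model before = { .freed = 0, .ticket = 64ULL << 20 };
        struct model after = { .freed = 0, .ticket = 64ULL << 20 };

        printf("ticket satisfied before the change: %d\n",
               run_flush_loop(&before, true));
        printf("ticket satisfied after the change:  %d\n",
               run_flush_loop(&after, false));
        return 0;
}

Run as-is, the first line prints 0 (the waiter would see ENOSPC) and the
second prints 1, mirroring the behavior change the diff is aiming for.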