From patchwork Tue Jun 5 00:34:22 2018
X-Patchwork-Submitter: Kent Overstreet
X-Patchwork-Id: 10447513
Date: Mon, 4 Jun 2018 20:34:22 -0400
From: Kent Overstreet
To: NeilBrown
Cc: Jens Axboe, Linus Torvalds, linux-block, Arnd Bergmann, Shaohua Li
Subject: Re: [GIT PULL] Block changes for 4.18-rc
Message-ID: <20180605003422.GD30325@kmo-pixel>
In-Reply-To: <8736y2p3kf.fsf@notabene.neil.brown.name>

On Tue, Jun 05, 2018 at 08:52:32AM +1000, NeilBrown wrote:
> On Mon, Jun 04 2018, Kent Overstreet wrote:
>
> > On Tue, Jun 05, 2018 at 07:16:51AM +1000, NeilBrown wrote:
> >> I really should get back to BIOSET_NEED_RESCUER and see if I can
> >> discard it completely.
> >> Kent - the only remaining user is bcache, and the main reason I haven't
> >> removed it is that I cannot follow the code (or at least, I couldn't
> >> last time I tried). Are you able to see if it is still needed?
> >
> > Oh man, rescuers make my head hurt. I remember you changed something so
> > that they weren't needed most of the time, but I can't remember what you
> > did - can you remind me what that was and whatever else you can remember
> > offhand?
>
> (and closures make my head hurt :-)

Fair enough :)

> There were two closely related changes.
> One is to make sure that generic_make_request processes bios strictly
> in depth-first order. The other is to ensure that all (stacking) block
> device drivers never block in their make_request_fn waiting for
> something that they have themselves submitted.
>
> Together these mean that a thread in generic_make_request will never
> block waiting for a bio that might be queued in current->bio_list.
> The bio chosen by generic_make_request will be a leaf and won't depend
> on anything else in the list, and the handler for that request won't
> block after submitting a lower-level request.
>
> The second property is the important one for drivers and the one we
> need to ensure that bcache follows.
> A common bad pattern is something like:
>
>	while (too_big(bio)) {
>		split = bio_split(bio, ...);
>		process(split);
>	}
>	process(bio);
>
> That needs to be more like:
>
>	if (too_big(bio)) {
>		split = bio_split(bio, ...);
>		generic_make_request(bio);
>		bio = split;
>	}
>	process(bio);
>
> so that generic_make_request() gets to do that while-loop
> and can control the order in which bios are handled.
>
> I cannot convince myself that bcache doesn't have something similar to
> that while loop (structured with closures).

Oh yeah, it's all coming back to me... So the problem isn't really closures;
it's that we don't know where we have to split until after we've already
allocated all our state for the request and done almost all the work for it -
and the state we allocate is per _incoming_ bio, not per split we submit.
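To make your second pattern concrete, this is roughly the shape it takes in a
stacking driver's make_request_fn. Purely illustrative - example_dev, too_big()
and process() are made up, this is not bcache code:

	/*
	 * Illustrative sketch only. The point is the shape: split at most
	 * once per call, resubmit the remainder via generic_make_request()
	 * (which queues it on current->bio_list and processes it after we
	 * return), and only process the leading split ourselves.
	 */
	static blk_qc_t example_make_request(struct request_queue *q,
					     struct bio *bio)
	{
		struct example_dev *d = q->queuedata;

		if (too_big(d, bio)) {
			struct bio *split = bio_split(bio, d->max_sectors,
						      GFP_NOIO, &d->bio_split);

			bio_chain(split, bio);		/* propagate completion */
			generic_make_request(bio);	/* resubmit the tail */
			bio = split;
		}

		process(d, bio);	/* must not block on bios we just submitted */
		return BLK_QC_T_NONE;
	}

The trouble is that bcache's control flow doesn't decompose into a single
"split once, resubmit the rest" step like that: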
So for the read side, it looks like:

 - allocate a struct search (man, so much of the naming in bcache is crap;
   it's painful going back from bcachefs)
 - traverse the btree to the start of this request
 - iterate over extents, splitting and submitting for each extent the
   original bio overlaps with

So converting it to your second pattern would mean changing that code so that
when we walk the btree and find an extent we overlap with, if the extent
doesn't cover the whole bio we split, submit the split, resubmit the original
bio - and then return and _restart the entire btree traversal_, for each
split. Ouch.

The write path is different, but the problem is roughly the same, in that we
don't know where we're going to need to split until we're deep inside the
write path, allocating space for where the write will go - we split if the
bucket we're currently writing to doesn't have enough space. But again, we
don't find that out until well after allocating data structures for that
write, taking locks for space allocation, etc.

So the approach of just resubmitting the rest of the bio flat out won't work:
we've already allocated our data structures and they're tied to that bio. It
would require massive changes and deep surgery on how the entire control flow
works. Also, unconditionally doing this for every bio split (bailing out and
restarting the btree traversal/write path) would be pretty crappy performance
wise - e.g. when doing big sequential reads of data that was written with all
random 4k writes.

But... the problem we're trying to avoid is just blocking on a memory
allocation while we've got bios blocked on current->bio_list, right? So
another approach would be to check, when we're allocating a bio split,
whether that allocation might deadlock - if so, do it with GFP_NOWAIT, and if
it fails, just punt the read or write request to workqueue.

This is something the code can actually do the way it's structured now...
making use of closures :) cache_lookup() calls bch_btree_map_keys() to walk
the extents the request covers, and restarts itself if cache_lookup_fn()
returns -EAGAIN - this is because the btree lookup might require reading in
btree nodes, so we can't block on a btree node read while running under
generic_make_request().

It's a bit more complicated for the write path - right now, we split after
allocating space (bch_data_insert_start() -> bch_alloc_sectors()). To be able
to punt to workqueue if the split fails, bch_alloc_sectors() needs to be
broken up into a bch_alloc_sectors_start() that picks and locks a write point
and allocates a bucket if necessary, so that bch_data_insert_start() can find
out how much space is available and attempt the split before actually
consuming that space. That's how the sector allocator already works in
bcachefs though, so it's a straightforward change...
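Sketched out, the write path would end up with roughly this shape. The
_start()/_abort()/_commit() allocator calls are hypothetical names for the
bcachefs-style interface described above - none of them exist in bcache yet:

	/*
	 * Sketch only: hypothetical split of bch_alloc_sectors() into a
	 * start/abort/commit sequence so the split size is known before any
	 * space is consumed.
	 */
	static void bch_data_insert_start(struct closure *cl)
	{
		struct data_insert_op *op =
			container_of(cl, struct data_insert_op, cl);
		struct bio *bio = op->bio, *n;

		do {
			/* Pick and lock a write point, allocating a bucket if
			 * necessary, but don't consume any space yet - just
			 * find out how much is available: */
			unsigned sectors =
				bch_alloc_sectors_start(op->c, op->write_point);

			/* Split size is now known up front, so attempt the
			 * split nonblockingly if we're running under
			 * generic_make_request(): */
			n = bio_next_split(bio, sectors,
					   !current->bio_list ||
					   bio_list_empty(current->bio_list)
					   ? GFP_NOIO : GFP_NOWAIT,
					   &op->c->bio_split);
			if (!n) {
				bch_alloc_sectors_abort(op->c, op->write_point);
				/* Punt the whole write to workqueue context
				 * and retry from the top: */
				continue_at(cl, bch_data_insert_start, op->wq);
				return;
			}

			/* Split succeeded - now actually consume the space: */
			bch_alloc_sectors_commit(op->c, op->write_point,
						 bio_sectors(n));
			/* ...init the key and submit n, as today... */
		} while (n != bio);

		continue_at(cl, bch_data_insert_keys, op->wq);
	}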
So this (untested!) patch should be sufficient for the read side:

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index ae67f5fa80..9aed34d050 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -538,6 +538,14 @@ static int cache_lookup_fn(struct btree_op *op, struct btree *b, struct bkey *k)
 	if (!KEY_SIZE(k))
 		return MAP_CONTINUE;
 
+	n = bio_next_split(bio, min_t(uint64_t, INT_MAX,
+				      KEY_OFFSET(k) - bio->bi_iter.bi_sector),
+			   !current->bio_list || bio_list_empty(current->bio_list)
+			   ? GFP_NOIO : GFP_NOWAIT,
+			   &s->d->bio_split);
+	if (!n)
+		return -EAGAIN;
+
 	/* XXX: figure out best pointer - for multiple cache devices */
 	ptr = 0;
 
@@ -546,10 +554,6 @@ static int cache_lookup_fn(struct btree_op *op, struct btree *b, struct bkey *k)
 	if (KEY_DIRTY(k))
 		s->read_dirty_data = true;
 
-	n = bio_next_split(bio, min_t(uint64_t, INT_MAX,
-				      KEY_OFFSET(k) - bio->bi_iter.bi_sector),
-			   GFP_NOIO, &s->d->bio_split);
-
 	bio_key = &container_of(n, struct bbio, bio)->key;
 	bch_bkey_copy_single_ptr(bio_key, k, ptr);
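For context on why returning -EAGAIN there is enough: cache_lookup() already
restarts the whole traversal from workqueue context when the map returns
-EAGAIN. Roughly - paraphrasing the existing code from memory, not quoting it
exactly:

	static void cache_lookup(struct closure *cl)
	{
		struct search *s = container_of(cl, struct search, iop.cl);
		struct bio *bio = &s->bio.bio;
		int ret;

		bch_btree_op_init(&s->op, -1);

		ret = bch_btree_map_keys(&s->op, s->iop.c,
					 &KEY(s->iop.inode,
					      bio->bi_iter.bi_sector, 0),
					 cache_lookup_fn, MAP_END_KEY);
		if (ret == -EAGAIN) {
			/* Reschedule on a workqueue, outside
			 * generic_make_request(), where blocking with
			 * GFP_NOIO is safe, and retry from the top: */
			continue_at(cl, cache_lookup, bcache_wq);
			return;
		}

		closure_return(cl);
	}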