From patchwork Tue Jun 5 00:34:22 2018
X-Patchwork-Submitter: Kent Overstreet
X-Patchwork-Id: 10447513
Date: Mon, 4 Jun 2018 20:34:22 -0400
From: Kent Overstreet
To: NeilBrown
Cc: Jens Axboe, Linus Torvalds, linux-block, Arnd Bergmann, Shaohua Li
Subject: Re: [GIT PULL] Block changes for 4.18-rc
Message-ID: <20180605003422.GD30325@kmo-pixel>
In-Reply-To: <8736y2p3kf.fsf@notabene.neil.brown.name>

On Tue, Jun 05, 2018 at 08:52:32AM +1000, NeilBrown wrote:
> On Mon, Jun 04 2018, Kent Overstreet wrote:
>
> > On Tue, Jun 05, 2018 at 07:16:51AM +1000, NeilBrown wrote:
> >> I really should get back to BIOSET_NEED_RESCUER and see if I can
> >> discard it completely.
> >> Kent - the only remaining user is bcache, and the main reason I haven't
> >> removed it is that I cannot follow the code (or at least, I couldn't
> >> last time I tried). Are you able to see if it is still needed?
> >
> > Oh man, rescuers make my head hurt. I remember you changed something so
> > that they weren't needed most of the time, but I can't remember what you
> > did - can you remind me what that was and whatever else you can remember
> > offhand?
>
> (and closures make my head hurt :-)

Fair enough :)

> There were two closely related changes.
> One is to make sure that generic_make_request processes bios strictly
> in depth-first order. The other is to ensure that all (stacking) block
> device drivers never block in their make_request_fn waiting for
> something that they have themselves submitted.
>
> Together these mean that a thread in generic_make_request will never
> block waiting for a bio that might be queued in current->bio_list.
> The bio chosen by generic_make_request will be a leaf and won't depend
> on anything else in the list, and the handler for that request won't
> block after submitting a lower-level request.
>
> The second property is the important one for drivers and the one we
> need to ensure that bcache follows.
> A common bad pattern is something like:
>
>	while (too_big(bio)) {
>		split = bio_split(bio, ...);
>		process(split);
>	}
>	process(bio);
>
> That needs to be more like:
>
>	if (too_big(bio)) {
>		split = bio_split(bio, ...);
>		generic_make_request(bio);
>		bio = split;
>	}
>	process(bio);
>
> so that generic_make_request() gets to do that while-loop
> and can control the order in which bios are handled.
>
> I cannot convince myself that bcache doesn't have something similar to
> that while loop (structured with closures).

Oh yeah, it's all coming back to me... So the problem isn't really closures;
it's that we don't know where we have to split until after we've already
allocated all our state for the request and done almost all the work for it -
and the state we allocate is per _incoming_ bio, not per split we submit.
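To make your second pattern concrete, this is roughly the shape it takes in a
stacking driver's make_request_fn. Purely illustrative - example_dev, too_big()
and process() are made up, this is not bcache code:

	/*
	 * Illustrative sketch only. The point is the shape: split at most
	 * once per call, resubmit the remainder via generic_make_request()
	 * (which queues it on current->bio_list and processes it after we
	 * return), and only process the leading split ourselves.
	 */
	static blk_qc_t example_make_request(struct request_queue *q,
					     struct bio *bio)
	{
		struct example_dev *d = q->queuedata;

		if (too_big(d, bio)) {
			struct bio *split = bio_split(bio, d->max_sectors,
						      GFP_NOIO, &d->bio_split);

			bio_chain(split, bio);		/* propagate completion */
			generic_make_request(bio);	/* resubmit the tail */
			bio = split;
		}

		process(d, bio);	/* must not block on bios we just submitted */
		return BLK_QC_T_NONE;
	}

The trouble is that bcache's control flow doesn't decompose into a single
"split once, resubmit the rest" step like that: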
So for the read side, it looks like:

 - allocate a struct search (man, so much of the naming in bcache is crap;
   it's painful going back from bcachefs)
 - traverse the btree to the start of this request
 - iterate over extents, splitting and submitting for each extent the
   original bio overlaps with

So converting it to your second pattern would mean changing that code so that
when we walk the btree and find an extent we overlap with, if the extent
doesn't cover the whole bio we split, submit the split, resubmit the original
bio - and then return and _restart the entire btree traversal_, for each
split. Ouch.

The write path is different, but the problem is roughly the same, in that we
don't know where we're going to need to split until we're deep inside the
write path, allocating space for where the write will go - we split if the
bucket we're currently writing to doesn't have enough space. But again, we
don't find that out until well after allocating data structures for that
write, taking locks for space allocation, etc.

So the approach of just resubmitting the rest of the bio flat out won't work:
we've already allocated our data structures and they're tied to that bio. It
would require massive changes and deep surgery on how the entire control flow
works. Also, unconditionally doing this for every bio split (bailing out and
restarting the btree traversal/write path) would be pretty crappy performance
wise - e.g. when doing big sequential reads of data that was written with all
random 4k writes.

But... the problem we're trying to avoid is just blocking on a memory
allocation while we've got bios blocked on current->bio_list, right? So
another approach would be to check, when we're allocating a bio split,
whether that allocation might deadlock - if so, do it with GFP_NOWAIT, and if
it fails, just punt the read or write request to workqueue.

This is something the code can actually do the way it's structured now...
making use of closures :) cache_lookup() calls bch_btree_map_keys() to walk
the extents the request covers, and restarts itself if cache_lookup_fn()
returns -EAGAIN - this is because the btree lookup might require reading in
btree nodes, so we can't block on a btree node read while running under
generic_make_request().

It's a bit more complicated for the write path - right now, we split after
allocating space (bch_data_insert_start() -> bch_alloc_sectors()). To be able
to punt to workqueue if the split fails, bch_alloc_sectors() needs to be
broken up into a bch_alloc_sectors_start() that picks and locks a write point
and allocates a bucket if necessary, so that bch_data_insert_start() can find
out how much space is available and attempt the split before actually
consuming that space. That's how the sector allocator already works in
bcachefs though, so it's a straightforward change...
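Sketched out, the write path would end up with roughly this shape. The
_start()/_abort()/_commit() allocator calls are hypothetical names for the
bcachefs-style interface described above - none of them exist in bcache yet:

	/*
	 * Sketch only: hypothetical split of bch_alloc_sectors() into a
	 * start/abort/commit sequence so the split size is known before any
	 * space is consumed.
	 */
	static void bch_data_insert_start(struct closure *cl)
	{
		struct data_insert_op *op =
			container_of(cl, struct data_insert_op, cl);
		struct bio *bio = op->bio, *n;

		do {
			/* Pick and lock a write point, allocating a bucket if
			 * necessary, but don't consume any space yet - just
			 * find out how much is available: */
			unsigned sectors =
				bch_alloc_sectors_start(op->c, op->write_point);

			/* Split size is now known up front, so attempt the
			 * split nonblockingly if we're running under
			 * generic_make_request(): */
			n = bio_next_split(bio, sectors,
					   !current->bio_list ||
					   bio_list_empty(current->bio_list)
					   ? GFP_NOIO : GFP_NOWAIT,
					   &op->c->bio_split);
			if (!n) {
				bch_alloc_sectors_abort(op->c, op->write_point);
				/* Punt the whole write to workqueue context
				 * and retry from the top: */
				continue_at(cl, bch_data_insert_start, op->wq);
				return;
			}

			/* Split succeeded - now actually consume the space: */
			bch_alloc_sectors_commit(op->c, op->write_point,
						 bio_sectors(n));
			/* ...init the key and submit n, as today... */
		} while (n != bio);

		continue_at(cl, bch_data_insert_keys, op->wq);
	}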
So this (untested!) patch should be sufficient for the read side:

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index ae67f5fa80..9aed34d050 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -538,6 +538,14 @@ static int cache_lookup_fn(struct btree_op *op, struct btree *b, struct bkey *k)
 	if (!KEY_SIZE(k))
 		return MAP_CONTINUE;
 
+	n = bio_next_split(bio, min_t(uint64_t, INT_MAX,
+				      KEY_OFFSET(k) - bio->bi_iter.bi_sector),
+			   !current->bio_list || bio_list_empty(current->bio_list)
+			   ? GFP_NOIO : GFP_NOWAIT,
+			   &s->d->bio_split);
+	if (!n)
+		return -EAGAIN;
+
 	/* XXX: figure out best pointer - for multiple cache devices */
 	ptr = 0;
 
@@ -546,10 +554,6 @@ static int cache_lookup_fn(struct btree_op *op, struct btree *b, struct bkey *k)
 	if (KEY_DIRTY(k))
 		s->read_dirty_data = true;
 
-	n = bio_next_split(bio, min_t(uint64_t, INT_MAX,
-				      KEY_OFFSET(k) - bio->bi_iter.bi_sector),
-			   GFP_NOIO, &s->d->bio_split);
-
 	bio_key = &container_of(n, struct bbio, bio)->key;
 	bch_bkey_copy_single_ptr(bio_key, k, ptr);
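For context on why returning -EAGAIN there is enough: cache_lookup() already
restarts the whole traversal from workqueue context when the map returns
-EAGAIN. Roughly - paraphrasing the existing code from memory, not quoting it
exactly:

	static void cache_lookup(struct closure *cl)
	{
		struct search *s = container_of(cl, struct search, iop.cl);
		struct bio *bio = &s->bio.bio;
		int ret;

		bch_btree_op_init(&s->op, -1);

		ret = bch_btree_map_keys(&s->op, s->iop.c,
					 &KEY(s->iop.inode,
					      bio->bi_iter.bi_sector, 0),
					 cache_lookup_fn, MAP_END_KEY);
		if (ret == -EAGAIN) {
			/* Reschedule on a workqueue, outside
			 * generic_make_request(), where blocking with
			 * GFP_NOIO is safe, and retry from the top: */
			continue_at(cl, cache_lookup, bcache_wq);
			return;
		}

		closure_return(cl);
	}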