From patchwork Fri Jan  6 23:01:07 2017
X-Patchwork-Submitter: NeilBrown
X-Patchwork-Id: 9502457
From: NeilBrown
To: Mike Snitzer, Mikulas Patocka
Date: Sat, 07 Jan 2017 10:01:07 +1100
Cc: Jack Wang, Lars Ellenberg, Jens Axboe, linux-raid, Michael Wang,
 Peter Zijlstra, Jiri Kosina, Ming Lei, linux-kernel@vger.kernel.org,
 Zheng Liu, linux-block@vger.kernel.org, Takashi Iwai,
 linux-bcache@vger.kernel.org, Ingo Molnar, Alasdair Kergon,
 "Martin K. Petersen", Keith Busch, device-mapper development,
 Shaohua Li, Kent Overstreet, "Kirill A. Shutemov", Roland Kammerer
Shutemov" , Roland Kammerer Subject: Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion In-Reply-To: <20170106195216.GA15583@redhat.com> References: <1467990243-3531-1-git-send-email-lars.ellenberg@linbit.com> <1467990243-3531-2-git-send-email-lars.ellenberg@linbit.com> <20160711141042.GY13335@soda.linbit> <76d9bf14-d848-4405-8358-3771c0a93d39@profitbricks.com> <20161223114553.GP4138@soda.linbit> <878tqrmmqx.fsf@notabene.neil.brown.name> <20170104185046.GA982@redhat.com> <20170106195216.GA15583@redhat.com> User-Agent: Notmuch/0.22.1 (http://notmuchmail.org) Emacs/24.5.1 (x86_64-suse-linux-gnu) Message-ID: <877f67lrnw.fsf@notabene.neil.brown.name> MIME-Version: 1.0 Sender: linux-block-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP On Sat, Jan 07 2017, Mike Snitzer wrote: > On Fri, Jan 06 2017 at 12:34pm -0500, > Mikulas Patocka wrote: > >> >> >> On Fri, 6 Jan 2017, Mikulas Patocka wrote: >> >> > >> > >> > On Wed, 4 Jan 2017, Mike Snitzer wrote: >> > >> > > On Wed, Jan 04 2017 at 12:12am -0500, >> > > NeilBrown wrote: >> > > >> > > > > Suggested-by: NeilBrown >> > > > > Signed-off-by: Jack Wang >> > > > > --- >> > > > > block/blk-core.c | 20 ++++++++++++++++++++ >> > > > > 1 file changed, 20 insertions(+) >> > > > > >> > > > > diff --git a/block/blk-core.c b/block/blk-core.c >> > > > > index 9e3ac56..47ef373 100644 >> > > > > --- a/block/blk-core.c >> > > > > +++ b/block/blk-core.c >> > > > > @@ -2138,10 +2138,30 @@ blk_qc_t generic_make_request(struct bio *bio) >> > > > > struct request_queue *q = bdev_get_queue(bio->bi_bdev); >> > > > > >> > > > > if (likely(blk_queue_enter(q, __GFP_DIRECT_RECLAIM) == 0)) { >> > > > > + struct bio_list lower, same, hold; >> > > > > + >> > > > > + /* Create a fresh bio_list for all subordinate requests */ >> > > > > + bio_list_init(&hold); >> > > > > + bio_list_merge(&hold, &bio_list_on_stack); >> > > > > + bio_list_init(&bio_list_on_stack); >> > > > > >> > > > > ret = q->make_request_fn(q, bio); >> > > > > >> > > > > blk_queue_exit(q); >> > > > > + /* sort new bios into those for a lower level >> > > > > + * and those for the same level >> > > > > + */ >> > > > > + bio_list_init(&lower); >> > > > > + bio_list_init(&same); >> > > > > + while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL) >> > > > > + if (q == bdev_get_queue(bio->bi_bdev)) >> > > > > + bio_list_add(&same, bio); >> > > > > + else >> > > > > + bio_list_add(&lower, bio); >> > > > > + /* now assemble so we handle the lowest level first */ >> > > > > + bio_list_merge(&bio_list_on_stack, &lower); >> > > > > + bio_list_merge(&bio_list_on_stack, &same); >> > > > > + bio_list_merge(&bio_list_on_stack, &hold); >> > > > > >> > > > > bio = bio_list_pop(current->bio_list); >> > > > > } else { >> > > > > -- >> > > > > 2.7.4 >> > > >> > > Mikulas, would you be willing to try the below patch with the >> > > dm-snapshot deadlock scenario and report back on whether it fixes that? >> > > >> > > Patch below looks to be the same as here: >> > > https://marc.info/?l=linux-raid&m=148232453107685&q=p3 >> > > >> > > Neil and/or others if that isn't the patch that should be tested please >> > > provide a pointer to the latest. >> > > >> > > Thanks, >> > > Mike >> > >> > The bad news is that this doesn't fix the snapshot deadlock. >> > >> > I created a test program for the snapshot deadlock bug (it was originally >> > created years ago to test for a different bug, so it contains some cruft). 
>> > You also need to insert "if (ci->sector_count) msleep(100);" at the end of
>> > __split_and_process_non_flush to make the kernel sleep when splitting the
>> > bio.
>> >
>> > And with the above patch, the snapshot deadlock bug still happens.
>
> That is really unfortunate.  Would be useful to dig in and understand
> why.  Because ordering of the IO in generic_make_request() really should
> take care of it.

I *think* you might be able to resolve this by changing
__split_and_process_bio() to only ever perform a single split.  No
looping.

i.e. if the bio is too big to handle directly, then split off the front
to a new bio, which you bio_chain to the original.  The original then
has bio_advance() called to step over the front, and is passed to
generic_make_request() so it is queued.
Then the code proceeds to __clone_and_map_data_bio() on the front that
got split off.
When that completes it *doesn't* loop round, but returns into
generic_make_request() which does the looping and makes sure to handle
the lowest-level bio next.

Something vaguely like this, though I haven't tested it, and the change
(if it works) should probably be more fully integrated into the
surrounding code.  You probably don't want to use "fs_bio_set" either -
a target-local pool would be best.

NeilBrown

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3086da5664f3..06ee0960e415 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct clone_info *ci)
 
 	len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);
 
+	if (len < ci->sector_count) {
+		struct bio *split = bio_split(bio, len, GFP_NOIO, fs_bio_set);
+		bio_chain(split, bio);
+		generic_make_request(bio);
+		bio = split;
+		ci->sector_count = len;
+	}
+
 	r = __clone_and_map_data_bio(ci, ti, ci->sector, &len);
 	if (r < 0)
 		return r;
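
[Editorial note: for readers following the thread, here is the same
"split once, requeue the remainder" pattern in isolation - a minimal,
untested sketch of a make_request_fn, not part of the patch above.
my_max_sectors() and my_map_bio() are hypothetical stand-ins for a
driver's own split limit and per-piece mapping logic; they are not real
kernel functions.]

#include <linux/bio.h>
#include <linux/blkdev.h>

static blk_qc_t my_make_request(struct request_queue *q, struct bio *bio)
{
	unsigned int max = my_max_sectors(q, bio);	/* hypothetical limit */

	if (bio_sectors(bio) > max) {
		/* Carve off the front piece that fits within the limit;
		 * bio_split() advances the original bio past it. */
		struct bio *front = bio_split(bio, max, GFP_NOIO, fs_bio_set);

		/* Tie completions together: the parent (the remainder)
		 * completes only after the front piece completes too. */
		bio_chain(front, bio);

		/* Requeue the remainder.  Because current->bio_list is
		 * active here, generic_make_request() just parks it on
		 * that list and submits it after this call unwinds, so
		 * recursion depth stays bounded. */
		generic_make_request(bio);
		bio = front;
	}

	return my_map_bio(q, bio);	/* hypothetical per-piece mapping */
}

The deadlock-relevant property is the ordering: the remainder is parked
rather than processed recursively, so the front piece reaches the
lower-level queue before any same-level sibling is started, which is the
ordering the generic_make_request() sorting patch quoted above tries to
guarantee.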