From patchwork Mon Apr 11 04:50:22 2011
X-Patchwork-Submitter: NeilBrown
X-Patchwork-Id: 697521
Date: Mon, 11 Apr 2011 14:50:22 +1000
From: NeilBrown
To: Mike Snitzer, Jens Axboe
Message-ID: <20110411145022.710c30e9@notabene.brown>
In-Reply-To: <20110405130541.6c2b5f86@notabene.brown>
References: <1295659049-2688-1-git-send-email-jaxboe@fusionio.com>
 <1295659049-2688-6-git-send-email-jaxboe@fusionio.com>
 <20110303221353.GA10366@redhat.com>
 <4D761E0D.8050200@fusionio.com>
 <20110308202100.GA31744@redhat.com>
 <4D76912C.9040705@fusionio.com>
 <20110308220526.GA393@redhat.com>
 <20110310005810.GA17911@redhat.com>
 <20110405130541.6c2b5f86@notabene.brown>
Mime-Version: 1.0
Cc: "hch@infradead.org", linux-raid@vger.kernel.org, dm-devel@redhat.com,
 "linux-kernel@vger.kernel.org"
Subject: Re: [dm-devel] [PATCH 05/10] block: remove per-queue plugging
List-Id: device-mapper development

On Tue, 5 Apr 2011 13:05:41 +1000 NeilBrown wrote:

> On Wed, 9 Mar 2011 19:58:10 -0500 Mike Snitzer wrote:
> >
> > Also, in your MD changes, you removed all calls to md_unplug() but
> > didn't remove md_unplug().  Seems it should be removed along with the
> > 'plug' member of 'struct mddev_t'?  Neil?
>
> I've been distracted by other things and only just managed to have a look
> at this.
>
> The new plugging code seems to completely ignore the needs of stacked
> devices - or at least my needs in md.
>
> For RAID1 with a write-intent-bitmap, I queue all write requests and then
> on an unplug I update the write-intent-bitmap to mark all the relevant
> blocks and then release the writes.
>
> With the new code there is no way for an unplug event to wake up the
> raid1d thread to start the writeout - I haven't tested it but I suspect
> it will just hang.
>
> Similarly for RAID5 I gather write bios (long before they become 'struct
> request', which is what the plugging code understands) and on an unplug
> event I release the writes - hopefully with enough bios per stripe so
> that we don't need to pre-read.
>
> Possibly the simplest fix would be to have a second list_head in 'struct
> blk_plug' which contained callbacks (a function pointer and a list_head
> in a struct which is passed as an arg to the function!).
> blk_finish_plug could then walk the list and call the call-backs.
> It would be quite easy to hook into that.

I've implemented this and it seems to work.
Jens: could you please review and hopefully ack the patch below, and let
me know whether you will submit it, or should I?

My testing of this, combined with some other patches which cause various
md personalities to use it, shows up a bug somewhere.  The symptoms are
crashes in various places in blk-core and sometimes elevator.c.
list_sort appears fairly often in the stack traces, but not always.

This patch:

Signed-off-by: NeilBrown

diff --git a/block/blk-core.c b/block/blk-core.c
index 273d60b..903ce8d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2674,19 +2674,23 @@ static void flush_plug_list(struct blk_plug *plug)
 	struct request_queue *q;
 	unsigned long flags;
 	struct request *rq;
+	struct list_head head;
 
 	BUG_ON(plug->magic != PLUG_MAGIC);
 
 	if (list_empty(&plug->list))
 		return;
+	list_add(&head, &plug->list);
+	list_del_init(&plug->list);
 
 	if (plug->should_sort)
-		list_sort(NULL, &plug->list, plug_rq_cmp);
+		list_sort(NULL, &head, plug_rq_cmp);
+	plug->should_sort = 0;
 
 	q = NULL;
 	local_irq_save(flags);
-	while (!list_empty(&plug->list)) {
-		rq = list_entry_rq(plug->list.next);
+	while (!list_empty(&head)) {
+		rq = list_entry_rq(head.next);
 		list_del_init(&rq->queuelist);
 		BUG_ON(!(rq->cmd_flags & REQ_ON_PLUG));
 		BUG_ON(!rq->q);

makes the symptom go away.  It simply moves the plug list onto a separate
list head before sorting and processing it.
My test was simply writing to a RAID1 with dd:

  while true; do dd if=/dev/zero of=/dev/md0 bs=4k; done

Obviously all writes go to two devices so the plug list will always need
sorting.
The only explanation I can come up with is that very occasionally
schedule on 2 separate cpus calls blk_flush_plug for the same task.  I
don't understand the scheduler nearly well enough to know if or how that
can happen.
However, with this patch in place I can write to a RAID1 constantly for
half an hour; without it, the writing rarely lasts for 3 minutes.

If you want to reproduce my experiment, you can pull from

   git://neil.brown.name/md plug-test

to get my patches for plugging in md (which are not quite ready for
submission, but seem to work), create a RAID1 using e.g.

   mdadm -C /dev/md0 --level=1 --raid-disks=2 /dev/device1 /dev/device2

and then run

   while true; do dd if=/dev/zero of=/dev/md0 bs=4K ; done

Thanks,
NeilBrown

From 687b189c02276887dd7d5b87a817da9f67ed3c2c Mon Sep 17 00:00:00 2001
From: NeilBrown
Date: Thu, 7 Apr 2011 13:16:59 +1000
Subject: [PATCH] Enhance new plugging support to support general callbacks.

md/raid requires an unplug callback, but as it does not use requests
the current code cannot provide one.

So allow arbitrary callbacks to be attached to the blk_plug.
Cc: Jens Axboe
Signed-off-by: NeilBrown
---
 block/blk-core.c       |   13 +++++++++++++
 include/linux/blkdev.h |    7 ++++++-
 2 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 725091d..273d60b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2644,6 +2644,7 @@ void blk_start_plug(struct blk_plug *plug)
 
 	plug->magic = PLUG_MAGIC;
 	INIT_LIST_HEAD(&plug->list);
+	INIT_LIST_HEAD(&plug->cb_list);
 	plug->should_sort = 0;
 
 	/*
@@ -2717,9 +2718,21 @@ static void flush_plug_list(struct blk_plug *plug)
 	local_irq_restore(flags);
 }
 
+static void flush_plug_callbacks(struct blk_plug *plug)
+{
+	while (!list_empty(&plug->cb_list)) {
+		struct blk_plug_cb *cb = list_first_entry(&plug->cb_list,
+							  struct blk_plug_cb,
+							  list);
+		list_del(&cb->list);
+		cb->callback(cb);
+	}
+}
+
 static void __blk_finish_plug(struct task_struct *tsk, struct blk_plug *plug)
 {
 	flush_plug_list(plug);
+	flush_plug_callbacks(plug);
 
 	if (plug == tsk->plug)
 		tsk->plug = NULL;

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 32176cc..3e5e604 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -857,8 +857,13 @@ extern void blk_put_queue(struct request_queue *);
 struct blk_plug {
 	unsigned long magic;
 	struct list_head list;
+	struct list_head cb_list;
 	unsigned int should_sort;
 };
+struct blk_plug_cb {
+	struct list_head list;
+	void (*callback)(struct blk_plug_cb *);
+};
 
 extern void blk_start_plug(struct blk_plug *);
 extern void blk_finish_plug(struct blk_plug *);
@@ -876,7 +881,7 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk)
 {
 	struct blk_plug *plug = tsk->plug;
 
-	return plug && !list_empty(&plug->list);
+	return plug && (!list_empty(&plug->list) || !list_empty(&plug->cb_list));
 }
 
 /*