From patchwork Wed May 3 17:07:43 2017
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 9710203
Subject: Re: [PATCH 0/4] blk-mq: support to use hw tag for scheduling
To: Ming Lei
References: <20170428151539.25514-1-ming.lei@redhat.com>
 <839682da-f375-8eab-d6f5-fcf1457150f1@fb.com>
 <20170503040303.GA20187@ming.t460p>
 <370fbeb6-d832-968a-2759-47f16b866551@kernel.dk>
 <20170503150351.GA7927@ming.t460p>
 <31bb973e-d9cf-9454-58fd-4893701088c5@kernel.dk>
 <20170503153808.GB7927@ming.t460p>
 <20170503165201.GB9706@ming.t460p>
 <20170503170315.GD9706@ming.t460p>
Cc: linux-block@vger.kernel.org, Christoph Hellwig, Omar Sandoval,
 Bart Van Assche
From: Jens Axboe
Message-ID: <24ff7ca6-73d6-f8a5-b7d9-e92d0bfdb4b0@kernel.dk>
Date: Wed, 3 May 2017 11:07:43 -0600
In-Reply-To: <20170503170315.GD9706@ming.t460p>
X-Mailing-List: linux-block@vger.kernel.org

On 05/03/2017 11:03 AM, Ming Lei wrote:
> On Thu, May 04, 2017 at 12:52:06AM +0800, Ming Lei wrote:
>> On Wed, May 03, 2017 at 11:38:09PM +0800, Ming Lei wrote:
>>> On Wed, May 03, 2017 at 09:08:34AM -0600, Jens Axboe wrote:
>>>> On 05/03/2017 09:03 AM, Ming Lei wrote:
>>>>> On Wed, May 03, 2017 at 08:10:58AM -0600, Jens Axboe wrote:
>>>>>> On 05/03/2017 08:08 AM, Jens Axboe wrote:
>>>>>>> On 05/02/2017 10:03 PM, Ming Lei wrote:
>>>>>>>> On Fri, Apr 28, 2017 at 02:29:18PM -0600, Jens Axboe wrote:
>>>>>>>>> On 04/28/2017 09:15 AM, Ming Lei wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> This patchset introduces the flag BLK_MQ_F_SCHED_USE_HW_TAG, which
>>>>>>>>>> allows the hardware tag to be used directly for I/O scheduling if
>>>>>>>>>> the queue's depth is big enough. This way we avoid allocating extra
>>>>>>>>>> tags and a request pool for I/O scheduling, and the scheduler tag
>>>>>>>>>> allocation/release is saved in the I/O submit path.
>>>>>>>>>
>>>>>>>>> Ming, I like this approach, it's pretty clean. It'd be nice to have a
>>>>>>>>> bit of performance data to back up that it's useful to add this code,
>>>>>>>>> though. Have you run anything on eg kyber on nvme that shows a
>>>>>>>>> reduction in overhead when getting rid of separate scheduler tags?
>>>>>>>>
>>>>>>>> I can observe a small improvement in the following tests:
>>>>>>>>
>>>>>>>> 1) fio script
>>>>>>>> # io scheduler: kyber
>>>>>>>>
>>>>>>>> RWS="randread read randwrite write"
>>>>>>>> for RW in $RWS; do
>>>>>>>>     echo "Running test $RW"
>>>>>>>>     sudo echo 3 > /proc/sys/vm/drop_caches
>>>>>>>>     sudo fio --direct=1 --size=128G --bsrange=4k-4k --runtime=20 \
>>>>>>>>         --numjobs=1 --ioengine=libaio --iodepth=10240 \
>>>>>>>>         --group_reporting=1 --filename=$DISK \
>>>>>>>>         --name=$DISK-test-$RW --rw=$RW --output-format=json
>>>>>>>> done
>>>>>>>>
>>>>>>>> 2) results
>>>>>>>>
>>>>>>>> ----------------------------------------------------------------
>>>>>>>>           | sched tag (iops/lat) | use hw tag to sched (iops/lat)
>>>>>>>> ----------------------------------------------------------------
>>>>>>>> randread  | 188940/54107         | 193865/52734
>>>>>>>> read      | 192646/53069         | 199738/51188
>>>>>>>> randwrite | 171048/59777         | 179038/57112
>>>>>>>> write     | 171886/59492         | 181029/56491
>>>>>>>> ----------------------------------------------------------------
>>>>>>>>
>>>>>>>> I guess it may be a bit more obvious when running the test on one slow
>>>>>>>> NVMe device, and will try to find one and run the test again.
>>>>>>>
>>>>>>> Thanks for running that. As I said in my original reply, I think this
>>>>>>> is a good optimization, and the implementation is clean. I'm fine with
>>>>>>> the current limitations of when to enable it, and it's not like we
>>>>>>> can't extend this later, if we want.
>>>>>>>
>>>>>>> I do agree with Bart that patch 1+4 should be combined. I'll do that.
>>>>>>
>>>>>> Actually, can you do that when reposting? Looks like you needed to
>>>>>> do that anyway.
>>>>>
>>>>> Yeah, I will do that in V1.
>>>>
>>>> V2? :-)
>>>>
>>>> Sounds good.
>>>> I just wanted to check the numbers here, but the series applied on
>>>> top of for-linus crashes when switching to kyber.
>>>
>>> Yeah, I saw that too, it has been fixed in my local tree, :-)
>>>
>>>> A few hunks threw fuzz, but it looked fine to me. But I bet I fat
>>>> fingered something. So it'd be great if you could respin against my
>>>> for-linus branch.
>>>
>>> Actually, that is exactly what I am doing, :-)
>>
>> Looks like v4.11 plus your for-linus often triggers the following hang
>> during boot, and it seems caused by the change in ("blk-mq: unify hctx
>> delayed_run_work and run_work"):
>>
>> BUG: scheduling while atomic: kworker/0:1H/704/0x00000002
>> Modules linked in:
>> Preemption disabled at:
>> [] virtio_queue_rq+0xdb/0x350
>> CPU: 0 PID: 704 Comm: kworker/0:1H Not tainted 4.11.0-04508-ga1f35f46164b #132
>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.9.3-1.fc25 04/01/2014
>> Workqueue: kblockd blk_mq_run_work_fn
>> Call Trace:
>>  dump_stack+0x65/0x8f
>>  ? virtio_queue_rq+0xdb/0x350
>>  __schedule_bug+0x76/0xc0
>>  __schedule+0x610/0x820
>>  ? new_slab+0x2c9/0x590
>>  schedule+0x40/0x90
>>  schedule_timeout+0x273/0x320
>>  ? ___slab_alloc+0x3cb/0x4f0
>>  wait_for_completion+0x97/0x100
>>  ? wait_for_completion+0x97/0x100
>>  ? wake_up_q+0x80/0x80
>>  flush_work+0x104/0x1a0
>>  ? flush_workqueue_prep_pwqs+0x130/0x130
>>  __cancel_work_timer+0xeb/0x160
>>  ? vp_notify+0x16/0x20
>>  ? virtqueue_add_sgs+0x23c/0x4a0
>>  cancel_delayed_work_sync+0x13/0x20
>>  blk_mq_stop_hw_queue+0x16/0x20
>>  virtio_queue_rq+0x316/0x350
>>  blk_mq_dispatch_rq_list+0x194/0x350
>>  blk_mq_sched_dispatch_requests+0x118/0x170
>>  ? finish_task_switch+0x80/0x1e0
>>  __blk_mq_run_hw_queue+0xa3/0xc0
>>  blk_mq_run_work_fn+0x2c/0x30
>>  process_one_work+0x1e0/0x400
>>  worker_thread+0x48/0x3f0
>>  kthread+0x109/0x140
>>  ? process_one_work+0x400/0x400
>>  ? kthread_create_on_node+0x40/0x40
>>  ret_from_fork+0x2c/0x40
>
> Looks like we can't call cancel_delayed_work_sync() here.
>
> Last time, I asked why the _sync is required, but it looks like I
> didn't get an answer, or I might have forgotten it.
>
> Bart, could you explain a bit why the _sync version of the work
> cancel is required?

Yeah, that patch was buggy, we definitely can't use the sync variants
from a driver's ->queue_rq(). How about something like the below? For
the important internal callers, we should be able to pass _sync just
fine.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index bf90684a007a..1e230a864129 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -41,6 +41,7 @@ static LIST_HEAD(all_q_list);
 
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
+static void __blk_mq_stop_hw_queues(struct request_queue *q, bool sync);
 
 static int blk_mq_poll_stats_bkt(const struct request *rq)
 {
@@ -166,7 +167,7 @@ void blk_mq_quiesce_queue(struct request_queue *q)
 	unsigned int i;
 	bool rcu = false;
 
-	blk_mq_stop_hw_queues(q);
+	__blk_mq_stop_hw_queues(q, true);
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (hctx->flags & BLK_MQ_F_BLOCKING)
@@ -1218,20 +1219,34 @@ bool blk_mq_queue_stopped(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_mq_queue_stopped);
 
-void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx)
+static void __blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx, bool sync)
 {
-	cancel_delayed_work_sync(&hctx->run_work);
+	if (sync)
+		cancel_delayed_work_sync(&hctx->run_work);
+	else
+		cancel_delayed_work(&hctx->run_work);
+
 	set_bit(BLK_MQ_S_STOPPED, &hctx->state);
 }
+
+void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx)
+{
+	__blk_mq_stop_hw_queue(hctx, false);
+}
 EXPORT_SYMBOL(blk_mq_stop_hw_queue);
 
-void blk_mq_stop_hw_queues(struct request_queue *q)
+void __blk_mq_stop_hw_queues(struct request_queue *q, bool sync)
 {
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
 	queue_for_each_hw_ctx(q, hctx, i)
-		blk_mq_stop_hw_queue(hctx);
+		__blk_mq_stop_hw_queue(hctx, sync);
+}
+
+void blk_mq_stop_hw_queues(struct request_queue *q)
+{
+	__blk_mq_stop_hw_queues(q, false);
+}
 EXPORT_SYMBOL(blk_mq_stop_hw_queues);