From patchwork Mon Aug 8 11:39:08 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Roman Pen
X-Patchwork-Id: 9268199
From: Roman Pen
Cc: Roman Pen, Akinobu Mita, Tejun Heo, Jens Axboe, Christoph Hellwig,
 linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v2 1/1] blk-mq: fix hang caused by freeze/unfreeze sequence
Date: Mon, 8 Aug 2016 13:39:08 +0200
Message-Id: <20160808113908.5445-1-roman.penyaev@profitbricks.com>
X-Mailer: git-send-email 2.9.0
Sender: linux-block-owner@vger.kernel.org
X-Mailing-List: linux-block@vger.kernel.org

A long time ago a similar fix was proposed by Akinobu Mita[1], but it
seems that at the time everyone decided to fix this subtle race in
percpu-refcount instead, and Tejun Heo[2] made an attempt (as far as I
can see, that patchset was not applied).

The following describes a hang in blk_mq_freeze_queue_wait(): the same
fix, but the bug seen from another angle.

The hang happens on an attempt to freeze a queue while another task
unfreezes it.

The root cause is that the order of percpu_ref_reinit() and
percpu_ref_kill() is not enforced, and as a result the two can be
swapped:

       CPU#0                         CPU#1
 ----------------                 -----------------
 percpu_ref_kill()

                                  percpu_ref_kill()  << atomic reference does
 percpu_ref_reinit()                                 << not guarantee the order

                                  blk_mq_freeze_queue_wait() << HANG HERE

                                  percpu_ref_reinit()

Firstly, this wrong sequence raises two kernel warnings:

  1st. WARNING at lib/percpu-refcount.c:309
       percpu_ref_kill_and_confirm called more than once
  2nd. WARNING at lib/percpu-refcount.c:331

But the most unpleasant effect is a hang in blk_mq_freeze_queue_wait(),
which waits for q_usage_counter to reach zero. That never happens,
because the percpu-ref was reinited (instead of being killed) and stays
in PERCPU state forever.

The simplified sequence above can be reproduced on shared tags, when
queue A is going to die while another queue B is in its init state and
is trying to freeze queue A, which shares the same tag set:

          CPU#0                                CPU#1
 -------------------------------  ------------------------------------
 q1 = blk_mq_init_queue(shared_tags)

                                  q2 = blk_mq_init_queue(shared_tags):
                                    blk_mq_add_queue_tag_set(shared_tags):
                                      blk_mq_update_tag_set_depth(shared_tags):
                                        blk_mq_freeze_queue(q1)
 blk_cleanup_queue(q1)                   ...
   blk_mq_freeze_queue(q1)   <<<->>>    blk_mq_unfreeze_queue(q1)

[1] Message id: 1443287365-4244-7-git-send-email-akinobu.mita@gmail.com
[2] Message id: 1443563240-29306-6-git-send-email-tj@kernel.org

Signed-off-by: Roman Pen
Cc: Akinobu Mita
Cc: Tejun Heo
Cc: Jens Axboe
Cc: Christoph Hellwig
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Acked-by: Tejun Heo
---
v2:
  - forgotten hunk from local repo
  - minor tweaks in the commit message

 block/blk-core.c       |  3 ++-
 block/blk-mq.c         | 22 +++++++++++-----------
 include/linux/blkdev.h |  7 ++++++-
 3 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ef78848..4fd27e9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -658,7 +658,7 @@ int blk_queue_enter(struct request_queue *q, gfp_t gfp)
 			return -EBUSY;
 
 		ret = wait_event_interruptible(q->mq_freeze_wq,
-				!atomic_read(&q->mq_freeze_depth) ||
+				!q->mq_freeze_depth ||
 				blk_queue_dying(q));
 		if (blk_queue_dying(q))
 			return -ENODEV;
@@ -740,6 +740,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	__set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);
 
 	init_waitqueue_head(&q->mq_freeze_wq);
+	mutex_init(&q->mq_freeze_lock);
 
 	/*
 	 * Init percpu_ref in atomic mode so that it's faster to shutdown.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6d6f8fe..1f3e81b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -80,13 +80,13 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
 
 void blk_mq_freeze_queue_start(struct request_queue *q)
 {
-	int freeze_depth;
-
-	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
-	if (freeze_depth == 1) {
+	mutex_lock(&q->mq_freeze_lock);
+	if (++q->mq_freeze_depth == 1) {
 		percpu_ref_kill(&q->q_usage_counter);
+		mutex_unlock(&q->mq_freeze_lock);
 		blk_mq_run_hw_queues(q, false);
-	}
+	} else
+		mutex_unlock(&q->mq_freeze_lock);
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_start);
 
@@ -124,14 +124,14 @@ EXPORT_SYMBOL_GPL(blk_mq_freeze_queue);
 
 void blk_mq_unfreeze_queue(struct request_queue *q)
 {
-	int freeze_depth;
-
-	freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
-	WARN_ON_ONCE(freeze_depth < 0);
-	if (!freeze_depth) {
+	mutex_lock(&q->mq_freeze_lock);
+	q->mq_freeze_depth--;
+	WARN_ON_ONCE(q->mq_freeze_depth < 0);
+	if (!q->mq_freeze_depth) {
 		percpu_ref_reinit(&q->q_usage_counter);
 		wake_up_all(&q->mq_freeze_wq);
 	}
+	mutex_unlock(&q->mq_freeze_lock);
 }
 EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
 
@@ -2105,7 +2105,7 @@ void blk_mq_free_queue(struct request_queue *q)
 static void blk_mq_queue_reinit(struct request_queue *q,
 				const struct cpumask *online_mask)
 {
-	WARN_ON_ONCE(!atomic_read(&q->mq_freeze_depth));
+	WARN_ON_ONCE(!q->mq_freeze_depth);
 
 	blk_mq_sysfs_unregister(q);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f6ff9d1..d692c16 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -445,7 +445,7 @@ struct request_queue {
 	struct mutex		sysfs_lock;
 
 	int			bypass_depth;
-	atomic_t		mq_freeze_depth;
+	int			mq_freeze_depth;
 
 #if defined(CONFIG_BLK_DEV_BSG)
 	bsg_job_fn		*bsg_job_fn;
@@ -459,6 +459,11 @@ struct request_queue {
 #endif
 	struct rcu_head		rcu_head;
 	wait_queue_head_t	mq_freeze_wq;
+	/*
+	 * Protect concurrent access to q_usage_counter by
+	 * percpu_ref_kill() and percpu_ref_reinit().
+	 */
+	struct mutex		mq_freeze_lock;
 	struct percpu_ref	q_usage_counter;
 	struct list_head	all_q_node;
 
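
Illustration only, not part of the patch: a minimal userspace sketch of the locking pattern, for readers who want to see why the depth update and the kill/reinit transition have to sit in one critical section. Everything in it is an assumption made for the example: a pthread mutex stands in for mq_freeze_lock, a plain boolean stands in for the state of q_usage_counter, and struct model_queue, model_queue_init(), model_freeze_start(), model_unfreeze() and the bad_transitions counter are hypothetical names that do not exist in the kernel.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the relevant request_queue fields. */
struct model_queue {
	pthread_mutex_t freeze_lock;	/* plays the role of mq_freeze_lock */
	int freeze_depth;		/* plays the role of mq_freeze_depth */
	bool ref_killed;		/* models whether q_usage_counter is killed */
	int bad_transitions;		/* kill-while-killed or reinit-while-live */
};

void model_queue_init(struct model_queue *q)
{
	pthread_mutex_init(&q->freeze_lock, NULL);
	q->freeze_depth = 0;
	q->ref_killed = false;
	q->bad_transitions = 0;
}

/* Models blk_mq_freeze_queue_start(): only the first freezer kills the ref. */
void model_freeze_start(struct model_queue *q)
{
	pthread_mutex_lock(&q->freeze_lock);
	if (++q->freeze_depth == 1) {
		/* A second kill here would correspond to the first warning. */
		if (q->ref_killed)
			q->bad_transitions++;
		q->ref_killed = true;
	}
	pthread_mutex_unlock(&q->freeze_lock);
}

/* Models blk_mq_unfreeze_queue(): only the last unfreezer reinits the ref. */
void model_unfreeze(struct model_queue *q)
{
	pthread_mutex_lock(&q->freeze_lock);
	q->freeze_depth--;
	assert(q->freeze_depth >= 0);	/* mirrors the WARN_ON_ONCE() check */
	if (!q->freeze_depth) {
		/* A reinit of a live ref would correspond to the second warning. */
		if (!q->ref_killed)
			q->bad_transitions++;
		q->ref_killed = false;
	}
	pthread_mutex_unlock(&q->freeze_lock);
}
```

Because the depth change and the state change happen under the same lock, a caller that sees the depth go 0 -> 1 always performs its kill before any other caller can see 1 -> 0 and perform a reinit. With the atomic counter alone, the counter updates were ordered but the kill and reinit calls that followed them were not. Two nested model_freeze_start() calls followed by two model_unfreeze() calls should leave ref_killed false and bad_transitions at zero.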