From patchwork Sat Feb 18 01:05:06 2017
X-Patchwork-Submitter: Omar Sandoval
X-Patchwork-Id: 9580925
From: Omar Sandoval
To: Jens Axboe, linux-block@vger.kernel.org
Cc: kernel-team@fb.com
Subject: [PATCH 1/2] blk-mq: use sbq wait queues instead of restart for driver tags
Date: Fri, 17 Feb 2017 17:05:06 -0800
Message-Id: 
X-Mailer: git-send-email 2.11.1
List-ID: 
X-Mailing-List: linux-block@vger.kernel.org

From: Omar Sandoval

Commit 50e1dab86aa2 ("blk-mq-sched: fix starvation for multiple hardware
queues and shared tags") fixed one starvation issue for shared tags.
However, we can still get into a situation where we fail to allocate a
tag because all tags are allocated but we don't have any pending
requests on any hardware queue. One solution for this would be to
restart all queues that share a tag map, but that really sucks.
Ideally, we could just block and wait for a tag, but that isn't always
possible from blk_mq_dispatch_rq_list(). However, we can still use the
struct sbitmap_queue wait queues with a custom callback instead of
blocking. This has a few benefits:

1. It avoids iterating over all hardware queues when completing an I/O,
   which the current restart code has to do.
2. It benefits from the existing rolling wakeup code.
3. It avoids punting to another thread just to have it block.

Signed-off-by: Omar Sandoval
---
 block/blk-mq.c         | 60 ++++++++++++++++++++++++++++++++++++++++++++------
 include/linux/blk-mq.h |  2 ++
 2 files changed, 55 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5564a9d103ca..0dacb743d4d7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -904,6 +904,44 @@ static bool reorder_tags_to_front(struct list_head *list)
 	return first != NULL;
 }
 
+static int blk_mq_dispatch_wake(wait_queue_t *wait, unsigned mode, int flags,
+				void *key)
+{
+	struct blk_mq_hw_ctx *hctx;
+
+	hctx = container_of(wait, struct blk_mq_hw_ctx, dispatch_wait);
+
+	list_del(&wait->task_list);
+	clear_bit_unlock(BLK_MQ_S_TAG_WAITING, &hctx->state);
+	blk_mq_run_hw_queue(hctx, true);
+	return 1;
+}
+
+static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx)
+{
+	struct sbq_wait_state *ws;
+
+	/*
+	 * The TAG_WAITING bit serves as a lock protecting hctx->dispatch_wait.
+	 * The thread which wins the race to grab this bit adds the hardware
+	 * queue to the wait queue.
+	 */
+	if (test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state) ||
+	    test_and_set_bit_lock(BLK_MQ_S_TAG_WAITING, &hctx->state))
+		return false;
+
+	init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake);
+	ws = bt_wait_ptr(&hctx->tags->bitmap_tags, hctx);
+
+	/*
+	 * As soon as this returns, it's no longer safe to fiddle with
+	 * hctx->dispatch_wait, since a completion can wake up the wait queue
+	 * and unlock the bit.
+	 */
+	add_wait_queue(&ws->wait, &hctx->dispatch_wait);
+	return true;
+}
+
 bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 {
 	struct request_queue *q = hctx->queue;
@@ -926,20 +964,27 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 		struct blk_mq_queue_data bd;
 
 		rq = list_first_entry(list, struct request, queuelist);
-		if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
+		if (!blk_mq_get_driver_tag(rq, NULL, false)) {
 			if (!queued && reorder_tags_to_front(list))
 				continue;
 
 			/*
-			 * We failed getting a driver tag. Mark the queue(s)
-			 * as needing a restart. Retry getting a tag again,
-			 * in case the needed IO completed right before we
-			 * marked the queue as needing a restart.
+			 * The initial allocation attempt failed, so we need to
+			 * rerun the hardware queue when a tag is freed.
 			 */
-			blk_mq_sched_mark_restart(hctx);
-			if (!blk_mq_get_driver_tag(rq, &hctx, false))
+			if (blk_mq_dispatch_wait_add(hctx)) {
+				/*
+				 * It's possible that a tag was freed in the
+				 * window between the allocation failure and
+				 * adding the hardware queue to the wait queue.
+				 */
+				if (!blk_mq_get_driver_tag(rq, NULL, false))
+					break;
+			} else {
 				break;
+			}
 		}
+
 		list_del_init(&rq->queuelist);
 
 		bd.rq = rq;
@@ -1051,6 +1096,7 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 {
 	if (unlikely(blk_mq_hctx_stopped(hctx) ||
+		     test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state) ||
 		     !blk_mq_hw_queue_mapped(hctx)))
 		return;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8e4df3d6c8cd..001d30d727c5 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -33,6 +33,7 @@ struct blk_mq_hw_ctx {
 	struct blk_mq_ctx	**ctxs;
 	unsigned int		nr_ctx;
 
+	wait_queue_t		dispatch_wait;
 	atomic_t		wait_index;
 
 	struct blk_mq_tags	*tags;
@@ -160,6 +161,7 @@ enum {
 	BLK_MQ_S_STOPPED	= 0,
 	BLK_MQ_S_TAG_ACTIVE	= 1,
 	BLK_MQ_S_SCHED_RESTART	= 2,
+	BLK_MQ_S_TAG_WAITING	= 3,
 
 	BLK_MQ_MAX_DEPTH	= 10240,