From patchwork Tue Feb 21 21:29:56 2017
X-Patchwork-Submitter: Omar Sandoval
X-Patchwork-Id: 9585755
From: Omar Sandoval
To: Jens Axboe , linux-block@vger.kernel.org
Cc: kernel-team@fb.com
Subject: [PATCH v2 1/2] blk-mq: use sbq wait queues instead of restart for driver tags
Date: Tue, 21 Feb 2017 13:29:56 -0800
Message-Id: <19bf336f1d329df8d12fbdc5ab8842f81880c71d.1487712413.git.osandov@fb.com>
X-Mailer: git-send-email 2.11.1
X-Mailing-List: linux-block@vger.kernel.org

From: Omar Sandoval

Commit 50e1dab86aa2 ("blk-mq-sched: fix starvation for multiple hardware
queues and shared tags") fixed one starvation issue for shared tags.
However, we can still get into a situation where we fail to allocate a
tag because all tags are allocated but we don't have any pending
requests on any hardware queue.
One solution for this would be to restart all queues that share a tag
map, but that really sucks. Ideally, we could just block and wait for a
tag, but that isn't always possible from blk_mq_dispatch_rq_list().

However, we can still use the struct sbitmap_queue wait queues with a
custom callback instead of blocking. This has a few benefits:

1. It avoids iterating over all hardware queues when completing an I/O,
   which the current restart code has to do.
2. It benefits from the existing rolling wakeup code.
3. It avoids punting to another thread just to have it block.

Signed-off-by: Omar Sandoval
Signed-off-by: Jens Axboe
---
Changed from v1:
- Call blk_mq_get_driver_tag() with &hctx, not NULL, to make sure we
  handle the case of a request from a different hardware queue than the
  one we're running (e.g., with nvme+mq-deadline)

 block/blk-mq.c         | 58 ++++++++++++++++++++++++++++++++++++++++++++------
 include/linux/blk-mq.h |  2 ++
 2 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b29e7dc7b309..bcb9f8742f72 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -904,6 +904,44 @@ static bool reorder_tags_to_front(struct list_head *list)
 	return first != NULL;
 }
 
+static int blk_mq_dispatch_wake(wait_queue_t *wait, unsigned mode, int flags,
+				void *key)
+{
+	struct blk_mq_hw_ctx *hctx;
+
+	hctx = container_of(wait, struct blk_mq_hw_ctx, dispatch_wait);
+
+	list_del(&wait->task_list);
+	clear_bit_unlock(BLK_MQ_S_TAG_WAITING, &hctx->state);
+	blk_mq_run_hw_queue(hctx, true);
+	return 1;
+}
+
+static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx)
+{
+	struct sbq_wait_state *ws;
+
+	/*
+	 * The TAG_WAITING bit serves as a lock protecting hctx->dispatch_wait.
+	 * The thread which wins the race to grab this bit adds the hardware
+	 * queue to the wait queue.
+	 */
+	if (test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state) ||
+	    test_and_set_bit_lock(BLK_MQ_S_TAG_WAITING, &hctx->state))
+		return false;
+
+	init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake);
+	ws = bt_wait_ptr(&hctx->tags->bitmap_tags, hctx);
+
+	/*
+	 * As soon as this returns, it's no longer safe to fiddle with
+	 * hctx->dispatch_wait, since a completion can wake up the wait queue
+	 * and unlock the bit.
+	 */
+	add_wait_queue(&ws->wait, &hctx->dispatch_wait);
+	return true;
+}
+
 bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 {
 	struct request_queue *q = hctx->queue;
@@ -931,15 +969,22 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 			continue;
 
 		/*
-		 * We failed getting a driver tag. Mark the queue(s)
-		 * as needing a restart. Retry getting a tag again,
-		 * in case the needed IO completed right before we
-		 * marked the queue as needing a restart.
+		 * The initial allocation attempt failed, so we need to
+		 * rerun the hardware queue when a tag is freed.
 		 */
-		blk_mq_sched_mark_restart(hctx);
-		if (!blk_mq_get_driver_tag(rq, &hctx, false))
+		if (blk_mq_dispatch_wait_add(hctx)) {
+			/*
+			 * It's possible that a tag was freed in the
+			 * window between the allocation failure and
+			 * adding the hardware queue to the wait queue.
+			 */
+			if (!blk_mq_get_driver_tag(rq, &hctx, false))
+				break;
+		} else {
 			break;
+		}
 	}
+
 	list_del_init(&rq->queuelist);
 	bd.rq = rq;
 
@@ -1051,6 +1096,7 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 {
 	if (unlikely(blk_mq_hctx_stopped(hctx) ||
+		     test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state) ||
 		     !blk_mq_hw_queue_mapped(hctx)))
 		return;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8e4df3d6c8cd..001d30d727c5 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -33,6 +33,7 @@ struct blk_mq_hw_ctx {
 	struct blk_mq_ctx	**ctxs;
 	unsigned int		nr_ctx;
 
+	wait_queue_t		dispatch_wait;
 	atomic_t		wait_index;
 
 	struct blk_mq_tags	*tags;
@@ -160,6 +161,7 @@ enum {
 	BLK_MQ_S_STOPPED	= 0,
 	BLK_MQ_S_TAG_ACTIVE	= 1,
 	BLK_MQ_S_SCHED_RESTART	= 2,
+	BLK_MQ_S_TAG_WAITING	= 3,
 
 	BLK_MQ_MAX_DEPTH	= 10240,