Message ID: 20180723155038.22062-1-bart.vanassche@wdc.com (mailing list archive)
State: New, archived
Series: blk-mq: Avoid that a request queue stalls when restarting a shared hctx
On 07/23/2018 11:50 PM, Bart Van Assche wrote:
> The patch below fixes queue stalling when a shared hctx is marked for restart
> (BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays zero. The
> root cause is that hctxs are shared between queues, but 'shared_hctx_restart'

The blk_mq_hw_ctx structure is also per request_queue.
Please refer to blk_mq_init_allocated_queue -> blk_mq_realloc_hw_ctxs.

> belongs to the particular queue, which in fact may not need to be restarted,
> thus we return from blk_mq_sched_restart() and leave the shared hctx of another
> queue never restarted.
>
> The fix is to make the shared_hctx_restart counter belong not to the queue, but
> to tags, so that the counter reflects the real number of shared hctxs that need
> to be restarted.
>
> During tests 1 hctx (set->nr_hw_queues) was used and all stalled requests
> were noticed in dd->fifo_list of the mq-deadline scheduler.
>
> Possible sequence of events:
>
> 1. Request A of queue A is inserted into dd->fifo_list of the scheduler.
>
> 2. Request B of queue A bypasses the scheduler and goes directly to
>    hctx->dispatch.
>
> 3. Request C of queue B is inserted.
>
> 4. blk_mq_sched_dispatch_requests() is invoked; since hctx->dispatch is not
>    empty (request B is in the list), the hctx is only marked for the next
>    restart and request A is left in the list (see the comment "So it's best
>    to leave them there for as long as we can. Mark the hw queue as needing
>    a restart in that case." in blk-mq-sched.c).
>
> 5. Eventually request B is completed/freed and blk_mq_sched_restart() is
>    called, but by chance the hctx from queue B is chosen for restart and
>    request C gets a chance to be dispatched.

Request C is just inserted into queue B. If there is no restart mark there,
it will not be chosen:

blk_mq_sched_restart_hctx
    if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
        return false;

If blk_mq_sched_restart chooses queue B, one of its hctxs must have the
SCHED_RESTART flag, and q->shared_hctx_restart must not be zero.

> 6. Eventually request C is completed/freed and blk_mq_sched_restart() is
>    called, but shared_hctx_restart for queue B is zero and we return without
>    an attempt to restart the hctx from queue A, thus request A is stuck forever.
>
> But a stalling queue is not the only problem with blk_mq_sched_restart().
> My tests show that those loops through all queues and hctxs can be very costly,
> even with the shared_hctx_restart counter, which aims to fix the performance
> issue.

Currently, SCHED_RESTART is always set when there are requests on the
hctx->dispatch list in blk_mq_sched_dispatch_requests, and running out of
driver tags is the main reason for hctx->dispatch being non-empty when an
I/O scheduler is attached. Therefore, most of the time, blk_mq_sched_restart
is invoked for the no-driver-tag case.

For the non-shared-tag case, it will wake up the hctx. But for the shared-tag
case it is unnecessary, because the sbitmap_queue wakeup hook will do the work
and hctx_may_queue() avoids starvation of the other queues. Therefore, the
costly loop through the queues and hctxs is unnecessary most of the time.

Thanks
Jianchao
On Tue, 2018-07-24 at 16:10 +0800, jianchao.wang wrote:
> The blk_mq_hw_ctx structure is also per request_queue
> Please refer to blk_mq_init_allocated_queue -> blk_mq_realloc_hw_ctxs

Hello Jianchao,

I think that Ming Lei's patch "blk-mq: remove synchronize_rcu() from
blk_mq_del_queue_tag_set()" makes Roman's patch superfluous. When I reposted
Roman's patch I had not yet noticed that Ming's patch is already in Jens'
tree. See also
https://www.mail-archive.com/linux-block@vger.kernel.org/msg22956.html.

Bart.
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 56c493c6cd90..d863b1b32b07 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -60,10 +60,10 @@ static void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
 		return;
 
 	if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
-		struct request_queue *q = hctx->queue;
+		struct blk_mq_tags *tags = hctx->tags;
 
 		if (!test_and_set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
-			atomic_inc(&q->shared_hctx_restart);
+			atomic_inc(&tags->shared_hctx_restart);
 	} else
 		set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 }
@@ -74,10 +74,8 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
 		return false;
 
 	if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
-		struct request_queue *q = hctx->queue;
-
 		if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
-			atomic_dec(&q->shared_hctx_restart);
+			atomic_dec(&hctx->tags->shared_hctx_restart);
 	} else
 		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 
@@ -415,7 +413,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx)
 	 * If this is 0, then we know that no hardware queues
 	 * have RESTART marked. We're done.
 	 */
-	if (!atomic_read(&queue->shared_hctx_restart))
+	if (!atomic_read(&tags->shared_hctx_restart))
 		return;
 
 	rcu_read_lock();
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 09b2ee6694fb..82cd73631adc 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -379,6 +379,7 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 	tags->nr_tags = total_tags;
 	tags->nr_reserved_tags = reserved_tags;
+	atomic_set(&tags->shared_hctx_restart, 0);
 
 	return blk_mq_init_bitmap_tags(tags, node, alloc_policy);
 }
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 61deab0b5a5a..477a9d67fb3d 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -12,6 +12,7 @@ struct blk_mq_tags {
 	unsigned int nr_reserved_tags;
 
 	atomic_t active_queues;
+	atomic_t shared_hctx_restart;
 
 	struct sbitmap_queue bitmap_tags;
 	struct sbitmap_queue breserved_tags;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d394cdd8d8c6..a0fdf80db8fd 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2335,11 +2335,11 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (shared) {
 			if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
-				atomic_inc(&q->shared_hctx_restart);
+				atomic_inc(&hctx->tags->shared_hctx_restart);
 			hctx->flags |= BLK_MQ_F_TAG_SHARED;
 		} else {
 			if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
-				atomic_dec(&q->shared_hctx_restart);
+				atomic_dec(&hctx->tags->shared_hctx_restart);
 			hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
 		}
 	}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 79226ca8f80f..62b20da653ca 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -442,8 +442,6 @@ struct request_queue {
 	int			nr_rqs[2];	/* # allocated [a]sync rqs */
 	int			nr_rqs_elvpriv;	/* # allocated rqs w/ elvpriv */
 
-	atomic_t		shared_hctx_restart;
-
 	struct blk_queue_stats	*stats;
 	struct rq_wb		*rq_wb;