From patchwork Thu Apr 11 03:44:16 2019
X-Patchwork-Submitter: Dongsheng Yang
X-Patchwork-Id: 10895051
From: Dongsheng Yang <dongsheng.yang@easystack.cn>
To: idryomov@gmail.com, sage@redhat.com, elder@kernel.org, jdillama@redhat.com
Cc: ceph-devel@vger.kernel.org, Dongsheng Yang
Subject: [PATCH] rbd: support .get_budget and .put_budget in rbd_mq_ops
Date: Wed, 10 Apr 2019 23:44:16 -0400
Message-Id: <1554954256-15605-1-git-send-email-dongsheng.yang@easystack.cn>
X-Mailer: git-send-email 1.8.3.1
X-Mailing-List: ceph-devel@vger.kernel.org

To improve sequential IO on an rbd device, we want as many chances to
merge IO as possible. But when blk_mq_make_request() tries to merge a
bio with the IOs in the software queue (SQ) via blk_mq_sched_bio_merge(),
it finds the software queue empty in most cases. The reason is that when
blk_mq calls blk_mq_run_hw_queue() to dispatch requests from the SQ to
the driver, it calls the .queue_rq() implemented in the rbd driver, and
rbd_mq_ops->queue_rq() only queues a work item, with no possibility of
returning BLK_STS_DEV_RESOURCE or BLK_STS_RESOURCE. So the requests in
the SQ are always dequeued, which means the software queue is usually
empty by the time the next request tries to do an IO merge.

To improve this, blk_mq provides .get_budget and .put_budget in
blk_mq_ops: a request must first get a budget before it is dequeued from
the software queue, and if .get_budget returns false, the request
remains in the software queue and waits for the next dispatch.

So this commit introduces a busy_threshold option in the rbd driver and
implements .get_budget and .put_budget in rbd_mq_ops. We increase
budget_reserved in .get_budget and decrease it in .put_budget. When
budget_reserved is larger than busy_threshold, .get_budget returns false
to indicate that the rbd device is busy.

Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
---
In my testing, we get about a 60% improvement in sequential reads, and a
similar level in random reads, with -o busy_threshold=64 (bs=4K
iodepth=128 numjobs=1).

Note: this option decreases the concurrency of in-flight IO. Although
osd_client currently places no limit (SOFT_LIMIT) on in-flight IO, there
is in practice a maximum amount of concurrent IO that the OSD servers
can handle (PHY_LIMIT). So as long as busy_threshold is not smaller than
PHY_LIMIT, we still get similar random IO performance with
busy_threshold set.
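For reference, here is a simplified sketch (not part of this patch) of
how the blk_mq dispatch path consults these callbacks before draining a
software queue. It is modeled on blk_mq_do_dispatch_ctx() in
block/blk-mq-sched.c; the restart logic and the fair round-robin across
ctxs are omitted, so treat it as illustrative only:

	/*
	 * Sketch: dispatch from a software queue (SQ), asking the driver
	 * for a budget before each dequeue.  If .get_budget refuses, the
	 * request stays in the SQ, where later bios can still be merged
	 * into it.
	 */
	static void sketch_dispatch_from_sq(struct blk_mq_hw_ctx *hctx,
					    struct blk_mq_ctx *ctx)
	{
		LIST_HEAD(rq_list);
		struct request *rq;

		do {
			/* calls rbd_get_budget() below via .get_budget */
			if (!blk_mq_get_dispatch_budget(hctx))
				break;

			rq = blk_mq_dequeue_from_ctx(hctx, ctx);
			if (!rq) {
				/* SQ is empty: hand the budget back */
				blk_mq_put_dispatch_budget(hctx);
				break;
			}

			list_add(&rq->queuelist, &rq_list);
		/* blk_mq_dispatch_rq_list() ends up calling .queue_rq() */
		} while (blk_mq_dispatch_rq_list(hctx->queue, &rq_list, true));
	}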
 drivers/block/rbd.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 63f73e8..f008b8e 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -376,6 +376,8 @@ struct rbd_device {
 	atomic_t		parent_ref;
 	struct rbd_device	*parent;
 
+	atomic_t		budget_reserved;
+
 	/* Block layer tags. */
 	struct blk_mq_tag_set	tag_set;
 
@@ -734,6 +736,7 @@ static struct rbd_client *rbd_client_find(struct ceph_options *ceph_opts)
 enum {
 	Opt_queue_depth,
 	Opt_lock_timeout,
+	Opt_busy_threshold,
 	Opt_last_int,
 	/* int args above */
 	Opt_pool_ns,
@@ -750,6 +753,7 @@ enum {
 static match_table_t rbd_opts_tokens = {
 	{Opt_queue_depth, "queue_depth=%d"},
 	{Opt_lock_timeout, "lock_timeout=%d"},
+	{Opt_busy_threshold, "busy_threshold=%d"},
 	/* int args above */
 	{Opt_pool_ns, "_pool_ns=%s"},
 	/* string args above */
@@ -765,6 +769,7 @@ enum {
 
 struct rbd_options {
 	int		queue_depth;
+	unsigned long	busy_threshold;
 	unsigned long	lock_timeout;
 	bool		read_only;
 	bool		lock_on_read;
@@ -773,6 +778,7 @@ struct rbd_options {
 };
 
 #define RBD_QUEUE_DEPTH_DEFAULT	BLKDEV_MAX_RQ
+#define RBD_BUSY_THRESHOLD_DEFAULT	0	/* never be busy */
 #define RBD_LOCK_TIMEOUT_DEFAULT	0	/* no timeout */
 #define RBD_READ_ONLY_DEFAULT	false
 #define RBD_LOCK_ON_READ_DEFAULT	false
@@ -820,6 +826,16 @@ static int parse_rbd_opts_token(char *c, void *private)
 		}
 		pctx->opts->lock_timeout = msecs_to_jiffies(intval * 1000);
 		break;
+	case Opt_busy_threshold:
+		/* 0 means .get_budget in blk_mq_ops will always return true.
+		 * Then the request will be dequeued from the SQ immediately.
+		 */
+		if (intval < 0) {
+			pr_err("busy_threshold out of range\n");
+			return -EINVAL;
+		}
+		pctx->opts->busy_threshold = intval;
+		break;
 	case Opt_pool_ns:
 		kfree(pctx->spec->pool_ns);
 		pctx->spec->pool_ns = match_strdup(argstr);
@@ -2590,6 +2606,7 @@ static void rbd_img_end_request(struct rbd_img_request *img_req)
 		 img_req->xferred == blk_rq_bytes(img_req->rq)) ||
 		(img_req->result < 0 && !img_req->xferred));
 
+	atomic_dec(&img_req->rbd_dev->budget_reserved);
 	blk_mq_end_request(img_req->rq, errno_to_blk_status(img_req->result));
 	rbd_img_request_put(img_req);
@@ -3747,6 +3764,7 @@ static void rbd_queue_workfn(struct work_struct *work)
 		 obj_op_name(op_type), length, offset, result);
 	ceph_put_snap_context(snapc);
 err:
+	atomic_dec(&rbd_dev->budget_reserved);
 	blk_mq_end_request(rq, errno_to_blk_status(result));
 }
 
@@ -3955,9 +3973,41 @@ static int rbd_init_request(struct blk_mq_tag_set *set, struct request *rq,
 	return 0;
 }
 
+static bool rbd_get_budget(struct blk_mq_hw_ctx *hctx)
+{
+	struct request_queue *q = hctx->queue;
+	struct rbd_device *rbd_dev = q->queuedata;
+	unsigned long busy_threshold = rbd_dev->opts->busy_threshold;
+
+	if (!busy_threshold)
+		return true;
+
+	if (atomic_inc_return(&rbd_dev->budget_reserved) <= busy_threshold)
+		return true;
+
+	atomic_dec(&rbd_dev->budget_reserved);
+	/* 3 ms, the same as BLK_MQ_RESOURCE_DELAY */
+	blk_mq_delay_run_hw_queue(hctx, 3);
+	return false;
+}
+
+static void rbd_put_budget(struct blk_mq_hw_ctx *hctx)
+{
+	struct request_queue *q = hctx->queue;
+	struct rbd_device *rbd_dev = q->queuedata;
+	unsigned long busy_threshold = rbd_dev->opts->busy_threshold;
+
+	if (!busy_threshold)
+		return;
+
+	atomic_dec(&rbd_dev->budget_reserved);
+}
+
 static const struct blk_mq_ops rbd_mq_ops = {
 	.queue_rq	= rbd_queue_rq,
 	.init_request	= rbd_init_request,
+	.get_budget	= rbd_get_budget,
+	.put_budget	= rbd_put_budget,
 };
 
 static int rbd_init_disk(struct rbd_device *rbd_dev)
@@ -5390,6 +5440,7 @@ static int rbd_add_parse_args(const char *buf,
 	pctx.opts->read_only = RBD_READ_ONLY_DEFAULT;
 	pctx.opts->queue_depth = RBD_QUEUE_DEPTH_DEFAULT;
 	pctx.opts->lock_timeout = RBD_LOCK_TIMEOUT_DEFAULT;
+	pctx.opts->busy_threshold = RBD_BUSY_THRESHOLD_DEFAULT;
 	pctx.opts->lock_on_read = RBD_LOCK_ON_READ_DEFAULT;
 	pctx.opts->exclusive = RBD_EXCLUSIVE_DEFAULT;
 	pctx.opts->trim = RBD_TRIM_DEFAULT;
@@ -5630,6 +5681,7 @@ static int rbd_dev_probe_parent(struct rbd_device *rbd_dev, int depth)
 
 	rbd_dev->parent = parent;
 	atomic_set(&rbd_dev->parent_ref, 1);
+	atomic_set(&rbd_dev->budget_reserved, 0);
 
 	return 0;
 out_err:
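As a usage note (the pool and image names below are placeholders), the
busy_threshold option is passed at map time, e.g.

	$ rbd map -o busy_threshold=64 rbd/test-image

With the default busy_threshold=0, rbd_get_budget() returns true
unconditionally, so behavior is unchanged for existing users.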