
rbd: support .get_budget and .put_budget in rbd_mq_ops

Message ID 1554954256-15605-1-git-send-email-dongsheng.yang@easystack.cn (mailing list archive)
State New, archived
Series rbd: support .get_budget and .put_budget in rbd_mq_ops

Commit Message

Dongsheng Yang April 11, 2019, 3:44 a.m. UTC
To improve sequential IO on rbd devices, we want as many
chances to merge IOs as possible.

But when blk_mq_make_request() tries to merge with IOs in the
software queue via blk_mq_sched_bio_merge(), it finds the software
queue empty in most cases.

The reason is that when blk_mq calls blk_mq_run_hw_queue() to dispatch
requests from the software queue (SQ) to the driver, it calls the
.queue_rq() implemented in the rbd driver. We only queue a work item in
rbd_mq_ops->queue_rq(), with no possibility of returning
BLK_STS_DEV_RESOURCE or BLK_STS_RESOURCE, so the requests in the SQ are
always dequeued. That means in most cases the software queue is empty
when the next request wants to attempt an IO merge.

To improve this, blk_mq provides .get_budget and .put_budget in
blk_mq_ops. A request then needs to get a budget before it is dequeued
from the software queue; if .get_budget returns false, the request
remains in the software queue and waits for the next dispatch.

So this commit introduces a busy_threshold option in the rbd driver and
implements .get_budget and .put_budget in rbd_mq_ops. We increase
budget_reserved in .get_budget and decrease it in .put_budget. When
budget_reserved is larger than busy_threshold, .get_budget returns false
to indicate that the rbd device is busy.

Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
---
	In my testing, we get about a 60% improvement in sequential reads and
a similar level in random reads, with -o busy_threshold=64. (bs=4K iodepth=128 numjobs=1)

Note:
	This option would decrease the concurrency of inflight IO. Although
there is currently no soft limit (SOFT_LIMIT) on inflight IO in osd_client,
there is a physical maximum (PHY_LIMIT) to the amount of concurrent IO the
OSD servers can handle. So as long as busy_threshold is not smaller than
PHY_LIMIT, we get similar random IO performance with busy_threshold set.


 drivers/block/rbd.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

Comments

Ilya Dryomov April 11, 2019, 9:37 a.m. UTC | #1
On Thu, Apr 11, 2019 at 5:46 AM Dongsheng Yang
<dongsheng.yang@easystack.cn> wrote:
>
> To improve sequential IO on rbd devices, we want as many
> chances to merge IOs as possible.
>
> But when blk_mq_make_request() tries to merge with IOs in the
> software queue via blk_mq_sched_bio_merge(), it finds the software
> queue empty in most cases.
>
> The reason is that when blk_mq calls blk_mq_run_hw_queue() to dispatch
> requests from the software queue (SQ) to the driver, it calls the
> .queue_rq() implemented in the rbd driver. We only queue a work item in
> rbd_mq_ops->queue_rq(), with no possibility of returning
> BLK_STS_DEV_RESOURCE or BLK_STS_RESOURCE, so the requests in the SQ are
> always dequeued. That means in most cases the software queue is empty
> when the next request wants to attempt an IO merge.
>
> To improve this, blk_mq provides .get_budget and .put_budget in
> blk_mq_ops. A request then needs to get a budget before it is dequeued
> from the software queue; if .get_budget returns false, the request
> remains in the software queue and waits for the next dispatch.
>
> So this commit introduces a busy_threshold option in the rbd driver and
> implements .get_budget and .put_budget in rbd_mq_ops. We increase
> budget_reserved in .get_budget and decrease it in .put_budget. When
> budget_reserved is larger than busy_threshold, .get_budget returns false
> to indicate that the rbd device is busy.
>
> Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
> ---
>         In my testing, we get about a 60% improvement in sequential reads and
> a similar level in random reads, with -o busy_threshold=64. (bs=4K iodepth=128 numjobs=1)

Hi Dongsheng,

Have you tried running the same test with -o queue_depth=64?  I thought
get_budget was supposed to be a SCSI-specific workaround for a very big
tagset shared between many individual devices capable of handling only
a couple of requests at the same time.  For rbd, -o queue_depth controls
the size of the tagset for each individual device, so an extra layer of
budgeting shouldn't be needed?

Also, small block sequential I/O like that is a definite use case for
fancy striping.  Have you tried experimenting with different striping
patterns here?

Thanks,

                Ilya
Dongsheng Yang April 12, 2019, 4:59 a.m. UTC | #2
On 04/11/2019 05:37 PM, Ilya Dryomov wrote:
> On Thu, Apr 11, 2019 at 5:46 AM Dongsheng Yang
> <dongsheng.yang@easystack.cn> wrote:
>> To improve sequential IO on rbd devices, we want as many
>> chances to merge IOs as possible.
>>
>> But when blk_mq_make_request() tries to merge with IOs in the
>> software queue via blk_mq_sched_bio_merge(), it finds the software
>> queue empty in most cases.
>>
>> The reason is that when blk_mq calls blk_mq_run_hw_queue() to dispatch
>> requests from the software queue (SQ) to the driver, it calls the
>> .queue_rq() implemented in the rbd driver. We only queue a work item in
>> rbd_mq_ops->queue_rq(), with no possibility of returning
>> BLK_STS_DEV_RESOURCE or BLK_STS_RESOURCE, so the requests in the SQ are
>> always dequeued. That means in most cases the software queue is empty
>> when the next request wants to attempt an IO merge.
>>
>> To improve this, blk_mq provides .get_budget and .put_budget in
>> blk_mq_ops. A request then needs to get a budget before it is dequeued
>> from the software queue; if .get_budget returns false, the request
>> remains in the software queue and waits for the next dispatch.
>>
>> So this commit introduces a busy_threshold option in the rbd driver and
>> implements .get_budget and .put_budget in rbd_mq_ops. We increase
>> budget_reserved in .get_budget and decrease it in .put_budget. When
>> budget_reserved is larger than busy_threshold, .get_budget returns false
>> to indicate that the rbd device is busy.
>>
>> Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
>> ---
>>          In my testing, we get about a 60% improvement in sequential reads and
>> a similar level in random reads, with -o busy_threshold=64. (bs=4K iodepth=128 numjobs=1)
> Hi Dongsheng,
>
> Have you tried running the same test with -o queue_depth=64?  I thought
> get_budget was supposed to be a SCSI-specific workaround for a very big
> tagset shared between many individual devices capable of handling only
> a couple of requests at the same time.  For rbd, -o queue_depth controls
> the size of the tagset for each individual device, so an extra layer of
> budgeting shouldn't be needed?
Hi Ilya,
     The reason I introduced busy_threshold is that I found nr_requests
is the same as queue_depth in my case. So even if we use queue_depth=64,
there is no chance to merge, because newer requests can't be submitted
into blk_mq.

But I think I was wrong: I found that's true only with the none
scheduler, which is what my use case uses. Any other scheduler will
overwrite nr_requests to double the queue_depth:

q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
                           BLKDEV_MAX_RQ);

So when we use queue_depth=64, nr_requests would be 128. Then when we
test with iodepth=128 numjobs=1, there are 64 IOs inflight in the rbd
driver and the later 64 get a chance to be merged.
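
(For reference, the nr_requests doubling quoted above appears to live in
blk_mq_init_sched() in block/blk-mq-sched.c; the surrounding code below
is recalled from memory, so treat the exact context as an approximation:)

/* In blk_mq_init_sched(): when an elevator is attached, the scheduler
 * queue depth defaults to double the smaller of the hw queue_depth and
 * BLKDEV_MAX_RQ (128), so extra requests can queue up above the driver
 * tags and get a chance to merge before dispatch.
 */
q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
			   BLKDEV_MAX_RQ);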

Test result:
(1) rbd map IMAGE -o queue_depth=64 (mq-deadline, fio bs=4K iodepth=128 numjobs=1)
    iops=17k, merged IOs: 85555
(2) rbd map IMAGE -o busy_threshold=64 (mq-deadline, fio bs=4K iodepth=128 numjobs=1)
    iops=16.9k, merged IOs: 103374

The test results match what we analyzed: IO gets merged with
queue_depth=64, but the number of merged IOs is smaller than with
busy_threshold=64. That's reasonable: nr_requests in case (2) is 256,
so we have (256 - 64) = 192 pending IOs that can get a chance to be
merged, while in case (1) nr_requests is 128, so only 64 IOs can get a
chance to merge. On the other hand, the busy_threshold case introduces
.get_budget and .put_budget, which add some overhead; that is why its
iops is lower than in the queue_depth=64 case. So queue_depth=64 is
better than busy_threshold=64 with the mq-deadline scheduler.

The only scenario where busy_threshold could be considered useful is
with the none scheduler:
      a) rbd map IMAGE -o queue_depth=64 (none, fio bs=4K iodepth=128 numjobs=1)
          iops=9464, merged IOs: 0
      b) rbd map IMAGE -o busy_threshold=64 (none, fio bs=4K iodepth=128 numjobs=1)
          iops=18314, merged IOs: 100099

Hmmm, no merging under the none scheduler doesn't sound like a real
problem. So it looks like it's really not worth introducing
busy_threshold upstream at all.
>
> Also, small block sequential I/O like that is a definite use case for
> fancy striping.  Have you tried experimenting with different striping
> patterns here?

That helps. In theory, striping lets sequential IO be handled by
different objects in parallel, but the IO size is still 4K. So even
with striping, the performance is not as good as with merged IO, which
performs much better than random IO.

Thanx
>
> Thanks,
>
>                  Ilya
>
Ilya Dryomov April 12, 2019, 2:45 p.m. UTC | #3
On Fri, Apr 12, 2019 at 7:01 AM Dongsheng Yang
<dongsheng.yang@easystack.cn> wrote:
>
>
>
> On 04/11/2019 05:37 PM, Ilya Dryomov wrote:
> > On Thu, Apr 11, 2019 at 5:46 AM Dongsheng Yang
> > <dongsheng.yang@easystack.cn> wrote:
> >> To improve sequential IO on rbd devices, we want as many
> >> chances to merge IOs as possible.
> >>
> >> But when blk_mq_make_request() tries to merge with IOs in the
> >> software queue via blk_mq_sched_bio_merge(), it finds the software
> >> queue empty in most cases.
> >>
> >> The reason is that when blk_mq calls blk_mq_run_hw_queue() to dispatch
> >> requests from the software queue (SQ) to the driver, it calls the
> >> .queue_rq() implemented in the rbd driver. We only queue a work item in
> >> rbd_mq_ops->queue_rq(), with no possibility of returning
> >> BLK_STS_DEV_RESOURCE or BLK_STS_RESOURCE, so the requests in the SQ are
> >> always dequeued. That means in most cases the software queue is empty
> >> when the next request wants to attempt an IO merge.
> >>
> >> To improve this, blk_mq provides .get_budget and .put_budget in
> >> blk_mq_ops. A request then needs to get a budget before it is dequeued
> >> from the software queue; if .get_budget returns false, the request
> >> remains in the software queue and waits for the next dispatch.
> >>
> >> So this commit introduces a busy_threshold option in the rbd driver and
> >> implements .get_budget and .put_budget in rbd_mq_ops. We increase
> >> budget_reserved in .get_budget and decrease it in .put_budget. When
> >> budget_reserved is larger than busy_threshold, .get_budget returns false
> >> to indicate that the rbd device is busy.
> >>
> >> Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
> >> ---
> >>          In my testing, we get about a 60% improvement in sequential reads and
> >> a similar level in random reads, with -o busy_threshold=64. (bs=4K iodepth=128 numjobs=1)
> > Hi Dongsheng,
> >
> > Have you tried running the same test with -o queue_depth=64?  I thought
> > get_budget was supposed to be a SCSI-specific workaround for a very big
> > tagset shared between many individual devices capable of handling only
> > a couple of requests at the same time.  For rbd, -o queue_depth controls
> > the size of the tagset for each individual device, so an extra layer of
> > budgeting shouldn't be needed?
> Hi Ilya,
>      The reason I introduced busy_threshold is that I found nr_requests
> is the same as queue_depth in my case. So even if we use queue_depth=64,
> there is no chance to merge, because newer requests can't be submitted
> into blk_mq.
>
> But I think I was wrong: I found that's true only with the none
> scheduler, which is what my use case uses. Any other scheduler will
> overwrite nr_requests to double the queue_depth:
>
> q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
>                            BLKDEV_MAX_RQ);
>
> So when we use queue_depth=64, nr_requests would be 128. Then when we
> test with iodepth=128 numjobs=1, there are 64 IOs inflight in the rbd
> driver and the later 64 get a chance to be merged.
>
> Test result:
> (1) rbd map IMAGE -o queue_depth=64 (mq-deadline, fio bs=4K iodepth=128 numjobs=1)
>     iops=17k, merged IOs: 85555
> (2) rbd map IMAGE -o busy_threshold=64 (mq-deadline, fio bs=4K iodepth=128 numjobs=1)
>     iops=16.9k, merged IOs: 103374
>
> The test results match what we analyzed: IO gets merged with
> queue_depth=64, but the number of merged IOs is smaller than with
> busy_threshold=64. That's reasonable: nr_requests in case (2) is 256,
> so we have (256 - 64) = 192 pending IOs that can get a chance to be
> merged, while in case (1) nr_requests is 128, so only 64 IOs can get a
> chance to merge. On the other hand, the busy_threshold case introduces
> .get_budget and .put_budget, which add some overhead; that is why its
> iops is lower than in the queue_depth=64 case. So queue_depth=64 is
> better than busy_threshold=64 with the mq-deadline scheduler.
>
> The only scenario where busy_threshold could be considered useful is
> with the none scheduler:
>       a) rbd map IMAGE -o queue_depth=64 (none, fio bs=4K iodepth=128 numjobs=1)
>           iops=9464, merged IOs: 0
>       b) rbd map IMAGE -o busy_threshold=64 (none, fio bs=4K iodepth=128 numjobs=1)
>           iops=18314, merged IOs: 100099
>
> Hmmm, no merging under the none scheduler doesn't sound like a real
> problem. So it looks like it's really not worth introducing
> busy_threshold upstream at all.

Right, get_budget is really just a workaround for SCSI.

Note that scheduler = none doesn't mean no merges.  With explicit
plugging merges will still happen even without a scheduler.

By default fio will submit/complete one I/O at a time, which runs
foul of per-task plugging.  If you tell it to submit/complete I/O in
big enough batches, you should see object size I/Os:

$ sudo rbd map foo
$ echo none | sudo tee /sys/block/rbd0/queue/scheduler
none

$ fio --name=seq4k --filename=/dev/rbd0 --ioengine=libaio \
      --iodepth=1024 --rw=read --bs=4k --direct=1 --numjobs=1
...
Run status group 0 (all jobs):
   READ: bw=25.5MiB/s (26.8MB/s), 25.5MiB/s-25.5MiB/s (26.8MB/s-26.8MB/s), io=1024MiB (1074MB), run=40137-40137msec

Disk stats (read/write):
  rbd0: ios=262118/0, merge=0/0, ticks=5023012/0, in_queue=4892398, util=44.23%

$ fio --name=seq4k --filename=/dev/rbd0 --ioengine=libaio \
      --iodepth=1024 --iodepth_batch_submit=1024 --iodepth_batch_complete=1024 \
      --rw=read --bs=4k --direct=1 --numjobs=1
...
Run status group 0 (all jobs):
   READ: bw=83.1MiB/s (87.2MB/s), 83.1MiB/s-83.1MiB/s (87.2MB/s-87.2MB/s), io=1024MiB (1074MB), run=12317-12317msec

Disk stats (read/write):
  rbd0: ios=252/0, merge=257796/0, ticks=11002/0, in_queue=10873, util=18.48%

No merges in the first run and literally everything is merged (4M I/Os)
in the second run.
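
For reference, per-task plugging looks roughly like this in kernel code
(a minimal sketch; submit_batch() is a made-up helper, only there to
illustrate the blk_start_plug()/blk_finish_plug() API):

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * While the plug is held, bios submitted by this task accumulate on a
 * per-task plug list and can be merged with one another before
 * blk_finish_plug() flushes them down to the driver; that is why merges
 * still happen with scheduler = none, provided I/O is submitted in
 * batches.
 */
static void submit_batch(struct bio **bios, int nr)
{
	struct blk_plug plug;
	int i;

	blk_start_plug(&plug);
	for (i = 0; i < nr; i++)
		submit_bio(bios[i]);
	blk_finish_plug(&plug);
}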

Thanks,

                Ilya

Patch

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 63f73e8..f008b8e 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -376,6 +376,8 @@  struct rbd_device {
 	atomic_t		parent_ref;
 	struct rbd_device	*parent;
 
+	atomic_t		budget_reserved;
+
 	/* Block layer tags. */
 	struct blk_mq_tag_set	tag_set;
 
@@ -734,6 +736,7 @@  static struct rbd_client *rbd_client_find(struct ceph_options *ceph_opts)
 enum {
 	Opt_queue_depth,
 	Opt_lock_timeout,
+	Opt_busy_threshold,
 	Opt_last_int,
 	/* int args above */
 	Opt_pool_ns,
@@ -750,6 +753,7 @@  enum {
 static match_table_t rbd_opts_tokens = {
 	{Opt_queue_depth, "queue_depth=%d"},
 	{Opt_lock_timeout, "lock_timeout=%d"},
+	{Opt_busy_threshold, "busy_threshold=%d"},
 	/* int args above */
 	{Opt_pool_ns, "_pool_ns=%s"},
 	/* string args above */
@@ -765,6 +769,7 @@  enum {
 
 struct rbd_options {
 	int	queue_depth;
+	unsigned long	busy_threshold;
 	unsigned long	lock_timeout;
 	bool	read_only;
 	bool	lock_on_read;
@@ -773,6 +778,7 @@  struct rbd_options {
 };
 
 #define RBD_QUEUE_DEPTH_DEFAULT	BLKDEV_MAX_RQ
+#define RBD_BUSY_THRESHOLD_DEFAULT	0 /* never busy */
 #define RBD_LOCK_TIMEOUT_DEFAULT 0  /* no timeout */
 #define RBD_READ_ONLY_DEFAULT	false
 #define RBD_LOCK_ON_READ_DEFAULT false
@@ -820,6 +826,16 @@  static int parse_rbd_opts_token(char *c, void *private)
 		}
 		pctx->opts->lock_timeout = msecs_to_jiffies(intval * 1000);
 		break;
+	case Opt_busy_threshold:
+		/* 0 means .get_budget in blk_mq_ops will always return true,
+		 * i.e. the request is dequeued from the SQ immediately.
+		 */
+		if (intval < 0) {
+			pr_err("busy_threshold out of range\n");
+			return -EINVAL;
+		}
+		pctx->opts->busy_threshold = intval;
+		break;
 	case Opt_pool_ns:
 		kfree(pctx->spec->pool_ns);
 		pctx->spec->pool_ns = match_strdup(argstr);
@@ -2590,6 +2606,7 @@  static void rbd_img_end_request(struct rbd_img_request *img_req)
 		    img_req->xferred == blk_rq_bytes(img_req->rq)) ||
 		   (img_req->result < 0 && !img_req->xferred));
 
+	atomic_dec(&img_req->rbd_dev->budget_reserved);
 	blk_mq_end_request(img_req->rq,
 			   errno_to_blk_status(img_req->result));
 	rbd_img_request_put(img_req);
@@ -3747,6 +3764,7 @@  static void rbd_queue_workfn(struct work_struct *work)
 			 obj_op_name(op_type), length, offset, result);
 	ceph_put_snap_context(snapc);
 err:
+	atomic_dec(&rbd_dev->budget_reserved);
 	blk_mq_end_request(rq, errno_to_blk_status(result));
 }
 
@@ -3955,9 +3973,41 @@  static int rbd_init_request(struct blk_mq_tag_set *set, struct request *rq,
 	return 0;
 }
 
+static bool rbd_get_budget(struct blk_mq_hw_ctx *hctx)
+{
+	struct request_queue *q = hctx->queue;
+	struct rbd_device *rbd_dev = q->queuedata;
+	unsigned long busy_threshold = rbd_dev->opts->busy_threshold;
+
+	if (!busy_threshold)
+		return true;
+
+	if (atomic_inc_return(&rbd_dev->budget_reserved) <= busy_threshold)
+		return true;
+
+	atomic_dec(&rbd_dev->budget_reserved);
+	/* use 3 ms, the same as BLK_MQ_RESOURCE_DELAY */
+	blk_mq_delay_run_hw_queue(hctx, 3);
+	return false;
+}
+
+static void rbd_put_budget(struct blk_mq_hw_ctx *hctx)
+{
+	struct request_queue *q = hctx->queue;
+	struct rbd_device *rbd_dev = q->queuedata;
+	unsigned long busy_threshold = rbd_dev->opts->busy_threshold;
+
+	if (!busy_threshold)
+		return;
+
+	atomic_dec(&rbd_dev->budget_reserved);
+}
+
 static const struct blk_mq_ops rbd_mq_ops = {
 	.queue_rq	= rbd_queue_rq,
 	.init_request	= rbd_init_request,
+	.get_budget	= rbd_get_budget,
+	.put_budget	= rbd_put_budget,
 };
 
 static int rbd_init_disk(struct rbd_device *rbd_dev)
@@ -5390,6 +5440,7 @@  static int rbd_add_parse_args(const char *buf,
 	pctx.opts->read_only = RBD_READ_ONLY_DEFAULT;
 	pctx.opts->queue_depth = RBD_QUEUE_DEPTH_DEFAULT;
 	pctx.opts->lock_timeout = RBD_LOCK_TIMEOUT_DEFAULT;
+	pctx.opts->busy_threshold = RBD_BUSY_THRESHOLD_DEFAULT;
 	pctx.opts->lock_on_read = RBD_LOCK_ON_READ_DEFAULT;
 	pctx.opts->exclusive = RBD_EXCLUSIVE_DEFAULT;
 	pctx.opts->trim = RBD_TRIM_DEFAULT;
@@ -5630,6 +5681,7 @@  static int rbd_dev_probe_parent(struct rbd_device *rbd_dev, int depth)
 
 	rbd_dev->parent = parent;
 	atomic_set(&rbd_dev->parent_ref, 1);
+	atomic_set(&rbd_dev->budget_reserved, 0);
 	return 0;
 
 out_err: