
fio rbd completions (Was: fio rbd hang for block sizes > 1M)

Message ID 544FCDBD.7070106@kernel.dk (mailing list archive)
State New, archived

Commit Message

Jens Axboe Oct. 28, 2014, 5:09 p.m. UTC
On 2014-10-28 09:49, Ketor D wrote:
> I cannot get the newly committed code from GitHub right now.
> When I get the newest code, I will test it.

So here's another idea, which applies on top of current -git. Basically it 
makes rbd wait for the oldest event, not just the first one in the array 
of all ios. This is the saner thing to do, as the oldest event will 
hopefully be the one to complete first. At least it has a much higher 
chance of being the right thing to do than just waiting on a random event.

Completely untested, so you might have to fiddle a bit with it to ensure 
that it actually works...
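
Reduced to a sketch, the idea is: collect the io_us that have not completed
yet, order them by age, and only then block. The snippet below is
illustration only (the names mirror the full patch at the bottom of this
page), not the patch itself:

	/* Wait on the oldest pending IO first - it is the one most likely
	 * to finish next - rather than on whichever io_u happens to sit
	 * first in td->io_u_all. */
	qsort(rbd_data->sort_events, sort_idx, sizeof(struct io_u *),
	      rbd_io_u_cmp);

	for (i = 0; i < sort_idx; i++) {
		struct io_u *io_u = rbd_data->sort_events[i];
		struct fio_rbd_iou *fri = io_u->engine_data;

		if (!fri_check_complete(rbd_data, io_u, events))
			rbd_aio_wait_for_complete(fri->completion);
	}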

Comments

Ketor D Oct. 28, 2014, 6:43 p.m. UTC | #1
Yeah, the new wait strategy looks better.

I will test the patch soon.

Ketor D Oct. 29, 2014, 7:15 a.m. UTC | #2
Hi Jens,

There is a cmdline parsing bug in the fio rbd test.

I have fixed it and created a pull request on GitHub.

Please review.

After fixing the bugs, the fio test can run.


Jens Axboe Oct. 29, 2014, 2:31 p.m. UTC | #3

I merged your two pull requests (thanks!) and committed a polished 
variant of the sort patch. Ketor and Mark, would you mind both running a 
quick benchmark on the current -git head?
Ketor D Oct. 30, 2014, 2:50 a.m. UTC | #4
Hi Jens,

The current code runs well!

Test conf: jobs=1 iodepth=1 bs=4k

With busy_poll=1, IOPS is 38989.
With busy_poll=0, IOPS is 33031.

And the busy_poll=0 result looks no different from the old code that did
not have the sorted-events wait.
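
For anyone wanting to reproduce this, a job file along these lines should
match that configuration. The pool, image and client names below are
placeholders, and randwrite is an assumption since the test direction
isn't stated:

	[rbd-4k-qd1]
	ioengine=rbd
	clientname=admin
	pool=rbd
	rbdname=fio_test
	rw=randwrite
	bs=4k
	iodepth=1
	numjobs=1
	runtime=60
	time_based
	# set busy_poll=0 for the second run
	busy_poll=1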



Jens Axboe Oct. 30, 2014, 2:55 a.m. UTC | #5

Good to hear! I think we can safely say that we've pushed rbd as far as 
we can, at least on the fio side. There still appear to be some 
suboptimal parts of the librbd API, and the kernel rbd driver could 
definitely be improved a lot as well. Using busy_poll on the user side 
gets rid of a sleep/wakeup cycle at that end, but the kernel driver still 
punts and offloads every work item to a work queue...
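
As an aside, the two completion-reaping strategies being compared boil
down to roughly the following against the librbd C API (a sketch only,
not fio's actual code):

	#include <rbd/librbd.h>

	static void reap_one(rbd_completion_t comp, int busy_poll)
	{
		if (busy_poll) {
			/* Spin until the aio is done: burns a CPU, but
			 * avoids the sleep/wakeup round trip for the
			 * waiting thread. */
			while (!rbd_aio_is_complete(comp))
				;
		} else {
			/* Block inside librbd until the completion
			 * callback has run. */
			rbd_aio_wait_for_complete(comp);
		}

		rbd_aio_release(comp);
	}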

Thanks for all your testing through this, really appreciated!
Ketor D Oct. 30, 2014, 5:29 a.m. UTC | #6
Thanks, I feel it a great honour to have finished these tests and I am
happy to have helped improve fio.
I am also trying to decrease the librbd latency, and I have made some
small progress.

Mark Kirkwood Oct. 30, 2014, 7:44 a.m. UTC | #7

Better late than never (sorry), comparing with the 'original' fio code 
containing the usleep(100):

block size (KiB) | head IOPS | orig IOPS
-----------------+-----------+-----------
4                |   11114   |   11516
128              |    4551   |    6550
1024             |    1195   |    1248

So we do pretty much the same except in the middle blocksize range (I 
checked again with the old binary just to rule out any other changes on 
the ceph end...).

Regards

Mark
Ketor D Oct. 30, 2014, 8:04 a.m. UTC | #8
Hi Mark,
Could you run a fio test in your environment with busy_poll=1?
I am very interested in the busy_poll result. Thanks!

Mark Kirkwood Oct. 31, 2014, 8:54 a.m. UTC | #9

Sure, here are the numbers with busy_poll=1:

block size (KiB) | head IOPS | head IOPS (busy_poll=1)
-----------------+-----------+-------------------------
4                |   11114   |   12608
128              |    4551   |    6422
1024             |    1195   |    1175
4096             |     320   |     316

So it looks like busy_poll=1 improves performance for the small and
mid-range block sizes, but is a little slower at the larger end.

However, there are a lot of variables here - I'm using iodepth=32, for
instance, and altering that may change the pattern I'm seeing. Also, a
system with more OSDs may bring out different behaviour, as it could run
the fio client out of available CPU power at the smaller block sizes.

Regards

Mark



Patch

diff --git a/engines/rbd.c b/engines/rbd.c
index cf7be0acd1e3..f3129044c430 100644
--- a/engines/rbd.c
+++ b/engines/rbd.c
@@ -20,6 +20,7 @@  struct rbd_data {
 	rados_ioctx_t io_ctx;
 	rbd_image_t image;
 	struct io_u **aio_events;
+	struct io_u **sort_events;
 };
 
 struct rbd_options {
@@ -80,20 +81,19 @@  static int _fio_setup_rbd_data(struct thread_data *td,
 	if (td->io_ops->data)
 		return 0;
 
-	rbd_data = malloc(sizeof(struct rbd_data));
+	rbd_data = calloc(1, sizeof(struct rbd_data));
 	if (!rbd_data)
 		goto failed;
 
-	memset(rbd_data, 0, sizeof(struct rbd_data));
-
-	rbd_data->aio_events = malloc(td->o.iodepth * sizeof(struct io_u *));
+	rbd_data->aio_events = calloc(td->o.iodepth, sizeof(struct io_u *));
 	if (!rbd_data->aio_events)
 		goto failed;
 
-	memset(rbd_data->aio_events, 0, td->o.iodepth * sizeof(struct io_u *));
+	rbd_data->sort_events = calloc(td->o.iodepth, sizeof(struct io_u *));
+	if (!rbd_data->sort_events)
+		goto failed;
 
 	*rbd_data_ptr = rbd_data;
-
 	return 0;
 
 failed:
@@ -218,14 +218,32 @@  static inline int fri_check_complete(struct rbd_data *rbd_data,
 	return 0;
 }
 
+static int rbd_io_u_cmp(const void *p1, const void *p2)
+{
+	const struct io_u **a = (const struct io_u **) p1;
+	const struct io_u **b = (const struct io_u **) p2;
+	uint64_t at, bt;
+
+	at = utime_since_now(&(*a)->start_time);
+	bt = utime_since_now(&(*b)->start_time);
+
+	if (at < bt)
+		return -1;
+	else if (at == bt)
+		return 0;
+	else
+		return 1;
+}
+
 static int rbd_iter_events(struct thread_data *td, unsigned int *events,
 			   unsigned int min_evts, int wait)
 {
 	struct rbd_data *rbd_data = td->io_ops->data;
 	unsigned int this_events = 0;
 	struct io_u *io_u;
-	int i;
+	int i, sort_idx;
 
+	sort_idx = 0;
 	io_u_qiter(&td->io_u_all, io_u, i) {
 		struct fio_rbd_iou *fri = io_u->engine_data;
 
@@ -236,16 +254,39 @@  static int rbd_iter_events(struct thread_data *td, unsigned int *events,
 
 		if (fri_check_complete(rbd_data, io_u, events))
 			this_events++;
-		else if (wait) {
-			rbd_aio_wait_for_complete(fri->completion);
+		else if (wait)
+			rbd_data->sort_events[sort_idx++] = io_u;
 
-			if (fri_check_complete(rbd_data, io_u, events))
-				this_events++;
-		}
 		if (*events >= min_evts)
 			break;
 	}
 
+	if (!wait || !sort_idx)
+		return this_events;
+
+	qsort(rbd_data->sort_events, sort_idx, sizeof(struct io_u *), rbd_io_u_cmp);
+	for (i = 0; i < sort_idx; i++) {
+		struct fio_rbd_iou *fri;
+
+		io_u = rbd_data->sort_events[i];
+		fri = io_u->engine_data;
+
+		if (fri_check_complete(rbd_data, io_u, events)) {
+			this_events++;
+			continue;
+		}
+		if (!wait)
+			continue;
+
+		rbd_aio_wait_for_complete(fri->completion);
+
+		if (fri_check_complete(rbd_data, io_u, events))
+			this_events++;
+
+		if (wait && *events >= min_evts)
+			wait = 0;
+	}
+
 	return this_events;
 }
 
@@ -359,6 +400,7 @@  static void fio_rbd_cleanup(struct thread_data *td)
 	if (rbd_data) {
 		_fio_rbd_disconnect(rbd_data);
 		free(rbd_data->aio_events);
+		free(rbd_data->sort_events);
 		free(rbd_data);
 	}