
[8/9] io_uring: utilize the io_batch infrastructure for more efficient polled IO

Message ID 20211013165416.985696-9-axboe@kernel.dk (mailing list archive)
State New, archived
Series Batched completions

Commit Message

Jens Axboe Oct. 13, 2021, 4:54 p.m. UTC
Wire up using an io_batch for f_op->iopoll(). If the lower stack supports
it, we can handle high rates of polled IO more efficiently.

This raises the single-core efficiency on my system from ~6.1M IOPS to
~6.6M IOPS running a random read workload at depth 128 on two gen2
Optane drives.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)
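
A rough sketch of what "if the lower stack supports it" amounts to on the
driver side, using only what this patch implies about the io_batch (a
req_list of completed requests plus a single complete callback): instead of
finishing each request inside ->iopoll(), a driver links it onto the batch
and records the handler, and io_do_iopoll() below then runs that handler
once for the whole batch. All type and function names here are
placeholders, not the series' actual block or NVMe code.

struct kiocb;

struct example_req {
	struct example_req *next;	/* links completed requests in a batch */
	/* driver and completion state would live here */
};

struct example_io_batch {
	struct example_req *req_list;	/* completed but not yet finished requests */
	void (*complete)(struct example_io_batch *iob);
};

/* Placeholders for driver-specific polling and per-request teardown. */
struct example_req *example_poll_one(struct kiocb *kiocb);
void example_finish_one(struct example_req *req);

static void example_complete_batch(struct example_io_batch *iob)
{
	struct example_req *req = iob->req_list;

	/* Finish every deferred completion in a single pass. */
	while (req) {
		struct example_req *next = req->next;

		example_finish_one(req);
		req = next;
	}
	iob->req_list = NULL;
}

static int example_iopoll(struct kiocb *kiocb, struct example_io_batch *iob,
			  unsigned int poll_flags)
{
	struct example_req *req;
	int found = 0;

	/* poll_flags is ignored in this sketch */
	while ((req = example_poll_one(kiocb)) != NULL) {
		/* Defer: link onto the batch instead of completing inline. */
		req->next = iob->req_list;
		iob->req_list = req;
		iob->complete = example_complete_batch;
		found++;
	}
	return found;
}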

Comments

Christoph Hellwig Oct. 14, 2021, 8:03 a.m. UTC | #1
On Wed, Oct 13, 2021 at 10:54:15AM -0600, Jens Axboe wrote:
> @@ -2404,6 +2406,11 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
>  		struct kiocb *kiocb = &req->rw.kiocb;
>  		int ret;
>  
> +		if (!file)
> +			file = kiocb->ki_filp;
> +		else if (file != kiocb->ki_filp)
> +			break;
> +

Can you explain why we now can only poll for a single file (independent
of the fact that batching is used)?

> +	if (!pos && !iob.req_list)
>  		return 0;
> +	if (iob.req_list)
> +		iob.complete(&iob);

Why not:

	if (iob.req_list)
		iob.complete(&iob);
	else if (!pos)
		return 0;

?
Jens Axboe Oct. 14, 2021, 3:45 p.m. UTC | #2
On 10/14/21 2:03 AM, Christoph Hellwig wrote:
> On Wed, Oct 13, 2021 at 10:54:15AM -0600, Jens Axboe wrote:
>> @@ -2404,6 +2406,11 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
>>  		struct kiocb *kiocb = &req->rw.kiocb;
>>  		int ret;
>>  
>> +		if (!file)
>> +			file = kiocb->ki_filp;
>> +		else if (file != kiocb->ki_filp)
>> +			break;
>> +
> 
> Can you explain why we now can only poll for a single file (independent
> of the fact that batching is used)?

A different file may be on a different backend; it's just playing it
safe and splitting it up. In practice it should not matter.

>> +	if (!pos && !iob.req_list)
>>  		return 0;
>> +	if (iob.req_list)
>> +		iob.complete(&iob);
> 
> Why not:
> 
> 	if (iob.req_list)
> 		iob.complete(&iob);
> 	else if (!pos)
> 		return 0;

That is cleaner, will adapt. Thanks!
Christoph Hellwig Oct. 14, 2021, 4:08 p.m. UTC | #3
On Thu, Oct 14, 2021 at 09:45:38AM -0600, Jens Axboe wrote:
> On 10/14/21 2:03 AM, Christoph Hellwig wrote:
> > On Wed, Oct 13, 2021 at 10:54:15AM -0600, Jens Axboe wrote:
> >> @@ -2404,6 +2406,11 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
> >>  		struct kiocb *kiocb = &req->rw.kiocb;
> >>  		int ret;
> >>  
> >> +		if (!file)
> >> +			file = kiocb->ki_filp;
> >> +		else if (file != kiocb->ki_filp)
> >> +			break;
> >> +
> > 
> > Can you explain why we now can only poll for a single file (independent
> > of the fact that batching is used)?
> 
> A different file may be on a different backend; it's just playing it
> safe and splitting it up. In practice it should not matter.

Well, with file systems even the same file can land on different
devices.  Maybe we need a cookie?

Either way this should be commented, as right now it looks pretty
arbitrary.
Jens Axboe Oct. 14, 2021, 6:14 p.m. UTC | #4
On 10/14/21 10:08 AM, Christoph Hellwig wrote:
> On Thu, Oct 14, 2021 at 09:45:38AM -0600, Jens Axboe wrote:
>> On 10/14/21 2:03 AM, Christoph Hellwig wrote:
>>> On Wed, Oct 13, 2021 at 10:54:15AM -0600, Jens Axboe wrote:
>>>> @@ -2404,6 +2406,11 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
>>>>  		struct kiocb *kiocb = &req->rw.kiocb;
>>>>  		int ret;
>>>>  
>>>> +		if (!file)
>>>> +			file = kiocb->ki_filp;
>>>> +		else if (file != kiocb->ki_filp)
>>>> +			break;
>>>> +
>>>
>>> Can you explain why we now can only poll for a single file (independent
>>> of the fact that batching is used)?
>>
>> A different file may be on a different backend; it's just playing it
>> safe and splitting it up. In practice it should not matter.
> 
> Well, with file systems even the same file can land on different
> devices.  Maybe we need a cookie?
> 
> Either way this should be commented, as right now it looks pretty
> arbitrary.

I got rid of this and added a dev_id kind of cookie that gets matched
on batched addition.
Christoph Hellwig Oct. 16, 2021, 4:29 a.m. UTC | #5
On Thu, Oct 14, 2021 at 12:14:19PM -0600, Jens Axboe wrote:
> > Either way this should be commented, as right now it looks pretty
> > arbitrary.
> 
> I got rid of this and added a dev_id kind of cookie that gets matched
> on batched addition.

Thinking about this a bit more.  I don't think we need to differentiate
devices, just complete handlers.  That is, everything using the same
->complete_batch handler could go onto the same list.  Which means
different nvme devices can be polled together.  So I think we can
just check the complete/complete_batch handler and be done with it.
The big benefit besides saving a field is that this is pretty much
self-explanatory.
Jens Axboe Oct. 16, 2021, 2:33 p.m. UTC | #6
On 10/15/21 10:29 PM, Christoph Hellwig wrote:
> On Thu, Oct 14, 2021 at 12:14:19PM -0600, Jens Axboe wrote:
>>> Either way this should be commented as right now it looks pretty
>>> arbitrary.
>>
>> I got rid of this and added a dev_id kind of cookie that gets matched
>> on batched addition.
> 
> Thinking about this a bit more.  I don't think we need to differentiate
> devices, just complete handlers.  That is, everything using the same
> ->complete_batch handler could go onto the same list.  Which means
> different nvme devices can be polled together.  So I think we can
> just check the complete/complete_batch handler and be done with it.
> The big benefit besides saving a field is that this is pretty much
> self-explanatory.

Good idea, and cleaner than a dev_id cookie too. I'll be sending out the
revised series soonish, but I've already incorporated that change.
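
A minimal sketch of that handler check, reusing the placeholder
example_io_batch/example_req types from the sketch near the top; this is
illustrative only, not the eventual block-layer helper. A request joins the
batch only if the batch is empty or already uses the same completion
handler, so requests from different NVMe devices can share one list while
anything with a different handler falls back to individual completion.

static bool example_add_to_batch(struct example_io_batch *iob,
				 struct example_req *req,
				 void (*complete)(struct example_io_batch *))
{
	/*
	 * Only batch requests that will be finished by the same handler;
	 * on a mismatch the caller completes this request the normal,
	 * non-batched way.
	 */
	if (iob->complete && iob->complete != complete)
		return false;

	iob->complete = complete;
	req->next = iob->req_list;
	iob->req_list = req;
	return true;
}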

Patch

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 082ff64c1bcb..5c031ab8f77f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2390,6 +2390,8 @@  static int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 {
 	struct io_wq_work_node *pos, *start, *prev;
 	unsigned int poll_flags = BLK_POLL_NOSLEEP;
+	struct file *file = NULL;
+	DEFINE_IO_BATCH(iob);
 	int nr_events = 0;
 
 	/*
@@ -2404,6 +2406,11 @@  static int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 		struct kiocb *kiocb = &req->rw.kiocb;
 		int ret;
 
+		if (!file)
+			file = kiocb->ki_filp;
+		else if (file != kiocb->ki_filp)
+			break;
+
 		/*
 		 * Move completed and retryable entries to our local lists.
 		 * If we find a request that requires polling, break out
@@ -2412,19 +2419,21 @@  static int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 		if (READ_ONCE(req->iopoll_completed))
 			break;
 
-		ret = kiocb->ki_filp->f_op->iopoll(kiocb, NULL, poll_flags);
+		ret = kiocb->ki_filp->f_op->iopoll(kiocb, &iob, poll_flags);
 		if (unlikely(ret < 0))
 			return ret;
 		else if (ret)
 			poll_flags |= BLK_POLL_ONESHOT;
 
 		/* iopoll may have completed current req */
-		if (READ_ONCE(req->iopoll_completed))
+		if (iob.req_list || READ_ONCE(req->iopoll_completed))
 			break;
 	}
 
-	if (!pos)
+	if (!pos && !iob.req_list)
 		return 0;
+	if (iob.req_list)
+		iob.complete(&iob);
 
 	prev = start;
 	wq_list_for_each_resume(pos, prev) {