diff mbox series

[for-next,v3,4/7] io_uring: add IORING_SETUP_DEFER_TASKRUN

Message ID 20220819121946.676065-5-dylany@fb.com (mailing list archive)
State New
Headers show
Series io_uring: defer task work to when it is needed | expand

Commit Message

Dylan Yudaken Aug. 19, 2022, 12:19 p.m. UTC
Allow deferring async tasks until the user calls io_uring_enter(2) with
the IORING_ENTER_GETEVENTS flag. Enable this mode with a flag at
io_uring_setup time. This functionality requires that the later
io_uring_enter will be called from the same submission task, and therefore
restrict this flag to work only when IORING_SETUP_SINGLE_ISSUER is also
set.

Being able to hand pick when tasks are run prevents the problem where
there is current work to be done, however task work runs anyway.

For example, a common workload would obtain a batch of CQEs, and process
each one. Interrupting this to additional taskwork would add latency but
not gain anything. If instead task work is deferred to just before more
CQEs are obtained then no additional latency is added.

The way this is implemented is by trying to keep task work local to a
io_ring_ctx, rather than to the submission task. This is required, as the
application will want to wake up only a single io_ring_ctx at a time to
process work, and so the lists of work have to be kept separate.

This has some other benefits like not having to check the task continually
in handle_tw_list (and potentially unlocking/locking those), and reducing
locks in the submit & process completions path.

There are networking cases where using this option can reduce request
latency by 50%. For example a contrived example using [1] where the client
sends 2k data and receives the same data back while doing some system
calls (to trigger task work) shows this reduction. The reason ends up
being that if sending responses is delayed by processing task work, then
the client side sits idle. Whereas reordering the sends first means that
the client runs it's workload in parallel with the local task work.

[1]:
Using https://github.com/DylanZA/netbench/tree/defer_run
Client:
./netbench  --client_only 1 --control_port 10000 --host <host> --tx "epoll --threads 16 --per_thread 1 --size 2048 --resp 2048 --workload 1000"
Server:
./netbench  --server_only 1 --control_port 10000  --rx "io_uring --defer_taskrun 0 --workload 100"   --rx "io_uring  --defer_taskrun 1 --workload 100"

Signed-off-by: Dylan Yudaken <dylany@fb.com>
---
 include/linux/io_uring_types.h |   2 +
 include/uapi/linux/io_uring.h  |   7 ++
 io_uring/cancel.c              |   2 +-
 io_uring/io_uring.c            | 158 ++++++++++++++++++++++++++++++---
 io_uring/io_uring.h            |  29 +++++-
 io_uring/rsrc.c                |   2 +-
 6 files changed, 184 insertions(+), 16 deletions(-)

Comments

Pavel Begunkov Aug. 22, 2022, 11:34 a.m. UTC | #1
On 8/19/22 13:19, Dylan Yudaken wrote:
> Allow deferring async tasks until the user calls io_uring_enter(2) with
> the IORING_ENTER_GETEVENTS flag. Enable this mode with a flag at
> io_uring_setup time. This functionality requires that the later
> io_uring_enter will be called from the same submission task, and therefore
> restrict this flag to work only when IORING_SETUP_SINGLE_ISSUER is also
> set.

Looks ok, a couple of small comments below, but I don't see anything
blocking it.

> Being able to hand pick when tasks are run prevents the problem where
> there is current work to be done, however task work runs anyway.
> 
> For example, a common workload would obtain a batch of CQEs, and process
> each one. Interrupting this to additional taskwork would add latency but
> not gain anything. If instead task work is deferred to just before more
> CQEs are obtained then no additional latency is added.
> 
> The way this is implemented is by trying to keep task work local to a
> io_ring_ctx, rather than to the submission task. This is required, as the
> application will want to wake up only a single io_ring_ctx at a time to
> process work, and so the lists of work have to be kept separate.
> 
> This has some other benefits like not having to check the task continually
> in handle_tw_list (and potentially unlocking/locking those), and reducing
> locks in the submit & process completions path.
> 
> There are networking cases where using this option can reduce request
> latency by 50%. For example a contrived example using [1] where the client
> sends 2k data and receives the same data back while doing some system
> calls (to trigger task work) shows this reduction. The reason ends up
> being that if sending responses is delayed by processing task work, then
> the client side sits idle. Whereas reordering the sends first means that
> the client runs it's workload in parallel with the local task work.

Quite contrived, for some it may cut latency in half but for others
as easily increate it twofold. In any case, it's not a critique of the
feature as it's optional, but rather raises a question whether we
need to add some fairness / scheduling here.

> [1]:
> Using https://github.com/DylanZA/netbench/tree/defer_run
> Client:
> ./netbench  --client_only 1 --control_port 10000 --host <host> --tx "epoll --threads 16 --per_thread 1 --size 2048 --resp 2048 --workload 1000"
> Server:
> ./netbench  --server_only 1 --control_port 10000  --rx "io_uring --defer_taskrun 0 --workload 100"   --rx "io_uring  --defer_taskrun 1 --workload 100"
> 
> Signed-off-by: Dylan Yudaken <dylany@fb.com>
> ---

> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 53696dd90626..6572d2276750 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
[...]

> +int io_run_local_work(struct io_ring_ctx *ctx, bool locked)
> +{
> +	struct llist_node *node;
> +	struct llist_node fake;
> +	struct llist_node *current_final = NULL;
> +	int ret;
> +
> +	if (unlikely(ctx->submitter_task != current)) {
> +		if (locked)
> +			mutex_unlock(&ctx->uring_lock);
> +
> +		/* maybe this is before any submissions */
> +		if (!ctx->submitter_task)
> +			return 0;
> +
> +		return -EEXIST;
> +	}
> +
> +	if (!locked)
> +		locked = mutex_trylock(&ctx->uring_lock);
> +
> +	node = io_llist_xchg(&ctx->work_llist, &fake);
> +	ret = 0;
> +again:
> +	while (node != current_final) {
> +		struct llist_node *next = node->next;
> +		struct io_kiocb *req = container_of(node, struct io_kiocb,
> +						    io_task_work.node);
> +		prefetch(container_of(next, struct io_kiocb, io_task_work.node));
> +		req->io_task_work.func(req, &locked);
> +		ret++;
> +		node = next;
> +	}
> +
> +	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
> +		atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
> +
> +	node = io_llist_cmpxchg(&ctx->work_llist, &fake, NULL);
> +	if (node != &fake) {
> +		current_final = &fake;
> +		node = io_llist_xchg(&ctx->work_llist, &fake);
> +		goto again;
> +	}
> +
> +	if (locked) {
> +		io_submit_flush_completions(ctx);
> +		mutex_unlock(&ctx->uring_lock);
> +	}
> +	return ret;
> +}

I was thinking about:

int io_run_local_work(struct io_ring_ctx *ctx, bool *locked)
{
	locked = try_lock();
}

bool locked = false;
io_run_local_work(ctx, *locked);
if (locked)
	unlock();

// or just as below when already holding it
bool locked = true;
io_run_local_work(ctx, *locked);

Which would replace

if (DEFER) {
	// we're assuming that it'll unlock
	io_run_local_work(true);
} else {
	unlock();
}

with

if (DEFER) {
	bool locked = true;
	io_run_local_work(&locked);
}
unlock();

But anyway, it can be mulled later.


> -int io_run_task_work_sig(void)
> +int io_run_task_work_sig(struct io_ring_ctx *ctx)
>   {
> -	if (io_run_task_work())
> +	if (io_run_task_work_ctx(ctx))
>   		return 1;
>   	if (task_sigpending(current))
>   		return -EINTR;
> @@ -2196,7 +2294,7 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
>   	unsigned long check_cq;
>   
>   	/* make sure we run task_work before checking for signals */
> -	ret = io_run_task_work_sig();
> +	ret = io_run_task_work_sig(ctx);
>   	if (ret || io_should_wake(iowq))
>   		return ret;
>   
> @@ -2230,7 +2328,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
>   		io_cqring_overflow_flush(ctx);
>   		if (io_cqring_events(ctx) >= min_events)
>   			return 0;
> -		if (!io_run_task_work())
> +		if (!io_run_task_work_ctx(ctx))
>   			break;
>   	} while (1);
>   
> @@ -2573,6 +2671,9 @@ static __cold void io_ring_exit_work(struct work_struct *work)
>   	 * as nobody else will be looking for them.
>   	 */
>   	do {
> +		if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
> +			io_move_task_work_from_local(ctx);
> +
>   		while (io_uring_try_cancel_requests(ctx, NULL, true))
>   			cond_resched();
>   
> @@ -2768,6 +2869,8 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
>   		}
>   	}
>   
> +	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
> +		ret |= io_run_local_work(ctx, false) > 0;
>   	ret |= io_cancel_defer_files(ctx, task, cancel_all);
>   	mutex_lock(&ctx->uring_lock);
>   	ret |= io_poll_remove_all(ctx, task, cancel_all);
> @@ -3057,10 +3160,20 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
>   		}
>   		if ((flags & IORING_ENTER_GETEVENTS) && ctx->syscall_iopoll)
>   			goto iopoll_locked;
> +		if ((flags & IORING_ENTER_GETEVENTS) &&
> +			(ctx->flags & IORING_SETUP_DEFER_TASKRUN)) {
> +			int ret2 = io_run_local_work(ctx, true);
> +
> +			if (unlikely(ret2 < 0))
> +				goto out;

It's an optimisation and we don't have to handle errors here,
let's ignore them and make it looking a bit better.

> +			goto getevents_ran_local;
> +		}
>   		mutex_unlock(&ctx->uring_lock);
>   	}
> +
>   	if (flags & IORING_ENTER_GETEVENTS) {
>   		int ret2;
> +
>   		if (ctx->syscall_iopoll) {
>   			/*
>   			 * We disallow the app entering submit/complete with
> @@ -3081,6 +3194,12 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
>   			const sigset_t __user *sig;
>   			struct __kernel_timespec __user *ts;
>   
> +			if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {

I think it should be in io_cqring_wait(), which calls it anyway
in the beginning. Instead of

	do {
		io_cqring_overflow_flush(ctx);
		if (io_cqring_events(ctx) >= min_events)
			return 0;
		if (!io_run_task_work())
			break;
	} while (1);

Let's have

	do {
		ret = io_run_task_work_ctx();
		// handle ret
		io_cqring_overflow_flush(ctx);
		if (io_cqring_events(ctx) >= min_events)
			return 0;
	} while (1);

> +				ret2 = io_run_local_work(ctx, false);
> +				if (unlikely(ret2 < 0))
> +					goto getevents_out;
> +			}
> +getevents_ran_local:
>   			ret2 = io_get_ext_arg(flags, argp, &argsz, &ts, &sig);
>   			if (likely(!ret2)) {
>   				min_complete = min(min_complete,
> @@ -3090,6 +3209,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
>   			}
>   		}
>   
> +getevents_out:
>   		if (!ret) {
>   			ret = ret2;
>   
> @@ -3289,17 +3409,29 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
>   	if (ctx->flags & IORING_SETUP_SQPOLL) {
>   		/* IPI related flags don't make sense with SQPOLL */
>   		if (ctx->flags & (IORING_SETUP_COOP_TASKRUN |
> -				  IORING_SETUP_TASKRUN_FLAG))
> +				  IORING_SETUP_TASKRUN_FLAG |
> +				  IORING_SETUP_DEFER_TASKRUN))

Sounds like we should also fail if SQPOLL is set, especially with
the task check on the waiting side.

>   			goto err;
>   		ctx->notify_method = TWA_SIGNAL_NO_IPI;
>   	} else if (ctx->flags & IORING_SETUP_COOP_TASKRUN) {
>   		ctx->notify_method = TWA_SIGNAL_NO_IPI;
[...]
>   	mutex_lock(&ctx->uring_lock);
>   	ret = __io_uring_register(ctx, opcode, arg, nr_args);
> diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
> index 2f73f83af960..a9fb115234af 100644
> --- a/io_uring/io_uring.h
> +++ b/io_uring/io_uring.h
> @@ -26,7 +26,8 @@ enum {
[...]
> +static inline int io_run_task_work_unlock_ctx(struct io_ring_ctx *ctx)
> +{
> +	int ret;
> +
> +	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
> +		ret = io_run_local_work(ctx, true);
> +	} else {
> +		mutex_unlock(&ctx->uring_lock);
> +		ret = (int)io_run_task_work();

Why do we need a cast? let's keep the return type same
Hao Xu Aug. 29, 2022, 6:32 a.m. UTC | #2
On 8/22/22 19:34, Pavel Begunkov wrote:
> On 8/19/22 13:19, Dylan Yudaken wrote:
>> Allow deferring async tasks until the user calls io_uring_enter(2) with
>> the IORING_ENTER_GETEVENTS flag. Enable this mode with a flag at
>> io_uring_setup time. This functionality requires that the later
>> io_uring_enter will be called from the same submission task, and 
>> therefore
>> restrict this flag to work only when IORING_SETUP_SINGLE_ISSUER is also
>> set.
> 
> Looks ok, a couple of small comments below, but I don't see anything
> blocking it.
> 
>> Being able to hand pick when tasks are run prevents the problem where
>> there is current work to be done, however task work runs anyway.
>>
>> For example, a common workload would obtain a batch of CQEs, and process
>> each one. Interrupting this to additional taskwork would add latency but
>> not gain anything. If instead task work is deferred to just before more
>> CQEs are obtained then no additional latency is added.
>>
>> The way this is implemented is by trying to keep task work local to a
>> io_ring_ctx, rather than to the submission task. This is required, as the
>> application will want to wake up only a single io_ring_ctx at a time to
>> process work, and so the lists of work have to be kept separate.
>>
>> This has some other benefits like not having to check the task 
>> continually
>> in handle_tw_list (and potentially unlocking/locking those), and reducing
>> locks in the submit & process completions path.
>>
>> There are networking cases where using this option can reduce request
>> latency by 50%. For example a contrived example using [1] where the 
>> client
>> sends 2k data and receives the same data back while doing some system
>> calls (to trigger task work) shows this reduction. The reason ends up
>> being that if sending responses is delayed by processing task work, then
>> the client side sits idle. Whereas reordering the sends first means that
>> the client runs it's workload in parallel with the local task work.
> 
> Quite contrived, for some it may cut latency in half but for others
> as easily increate it twofold. In any case, it's not a critique of the
> feature as it's optional, but rather raises a question whether we
> need to add some fairness / scheduling here.
> 
>> [1]:
>> Using https://github.com/DylanZA/netbench/tree/defer_run
>> Client:
>> ./netbench  --client_only 1 --control_port 10000 --host <host> --tx 
>> "epoll --threads 16 --per_thread 1 --size 2048 --resp 2048 --workload 
>> 1000"
>> Server:
>> ./netbench  --server_only 1 --control_port 10000  --rx "io_uring 
>> --defer_taskrun 0 --workload 100"   --rx "io_uring  --defer_taskrun 1 
>> --workload 100"
>>
>> Signed-off-by: Dylan Yudaken <dylany@fb.com>
>> ---
> 
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index 53696dd90626..6572d2276750 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
> [...]
> 
>> +int io_run_local_work(struct io_ring_ctx *ctx, bool locked)
>> +{
>> +    struct llist_node *node;
>> +    struct llist_node fake;
>> +    struct llist_node *current_final = NULL;
>> +    int ret;
>> +
>> +    if (unlikely(ctx->submitter_task != current)) {
>> +        if (locked)
>> +            mutex_unlock(&ctx->uring_lock);
>> +
>> +        /* maybe this is before any submissions */
>> +        if (!ctx->submitter_task)
>> +            return 0;
>> +
>> +        return -EEXIST;
>> +    }
>> +
>> +    if (!locked)
>> +        locked = mutex_trylock(&ctx->uring_lock);
>> +
>> +    node = io_llist_xchg(&ctx->work_llist, &fake);
>> +    ret = 0;
>> +again:
>> +    while (node != current_final) {
>> +        struct llist_node *next = node->next;
>> +        struct io_kiocb *req = container_of(node, struct io_kiocb,
>> +                            io_task_work.node);
>> +        prefetch(container_of(next, struct io_kiocb, 
>> io_task_work.node));
>> +        req->io_task_work.func(req, &locked);
>> +        ret++;
>> +        node = next;
>> +    }
>> +
>> +    if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
>> +        atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
>> +
>> +    node = io_llist_cmpxchg(&ctx->work_llist, &fake, NULL);
>> +    if (node != &fake) {
>> +        current_final = &fake;
>> +        node = io_llist_xchg(&ctx->work_llist, &fake);
>> +        goto again;
>> +    }
>> +
>> +    if (locked) {
>> +        io_submit_flush_completions(ctx);
>> +        mutex_unlock(&ctx->uring_lock);
>> +    }
>> +    return ret;
>> +}
> 
> I was thinking about:
> 
> int io_run_local_work(struct io_ring_ctx *ctx, bool *locked)
> {
>      locked = try_lock();
> }
> 
> bool locked = false;
> io_run_local_work(ctx, *locked);
> if (locked)
>      unlock();
> 
> // or just as below when already holding it
> bool locked = true;
> io_run_local_work(ctx, *locked);
> 
> Which would replace
> 
> if (DEFER) {
>      // we're assuming that it'll unlock
>      io_run_local_work(true);
> } else {
>      unlock();
> }
> 
> with
> 
> if (DEFER) {
>      bool locked = true;
>      io_run_local_work(&locked);
> }
> unlock();
> 
> But anyway, it can be mulled later.
> 
> 
>> -int io_run_task_work_sig(void)
>> +int io_run_task_work_sig(struct io_ring_ctx *ctx)
>>   {
>> -    if (io_run_task_work())
>> +    if (io_run_task_work_ctx(ctx))
>>           return 1;
>>       if (task_sigpending(current))
>>           return -EINTR;
>> @@ -2196,7 +2294,7 @@ static inline int io_cqring_wait_schedule(struct 
>> io_ring_ctx *ctx,
>>       unsigned long check_cq;
>>       /* make sure we run task_work before checking for signals */
>> -    ret = io_run_task_work_sig();
>> +    ret = io_run_task_work_sig(ctx);
>>       if (ret || io_should_wake(iowq))
>>           return ret;
>> @@ -2230,7 +2328,7 @@ static int io_cqring_wait(struct io_ring_ctx 
>> *ctx, int min_events,
>>           io_cqring_overflow_flush(ctx);
>>           if (io_cqring_events(ctx) >= min_events)
>>               return 0;
>> -        if (!io_run_task_work())
>> +        if (!io_run_task_work_ctx(ctx))
>>               break;
>>       } while (1);
>> @@ -2573,6 +2671,9 @@ static __cold void io_ring_exit_work(struct 
>> work_struct *work)
>>        * as nobody else will be looking for them.
>>        */
>>       do {
>> +        if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
>> +            io_move_task_work_from_local(ctx);
>> +
>>           while (io_uring_try_cancel_requests(ctx, NULL, true))
>>               cond_resched();
>> @@ -2768,6 +2869,8 @@ static __cold bool 
>> io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
>>           }
>>       }
>> +    if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
>> +        ret |= io_run_local_work(ctx, false) > 0;
>>       ret |= io_cancel_defer_files(ctx, task, cancel_all);
>>       mutex_lock(&ctx->uring_lock);
>>       ret |= io_poll_remove_all(ctx, task, cancel_all);
>> @@ -3057,10 +3160,20 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, 
>> fd, u32, to_submit,
>>           }
>>           if ((flags & IORING_ENTER_GETEVENTS) && ctx->syscall_iopoll)
>>               goto iopoll_locked;
>> +        if ((flags & IORING_ENTER_GETEVENTS) &&
>> +            (ctx->flags & IORING_SETUP_DEFER_TASKRUN)) {
>> +            int ret2 = io_run_local_work(ctx, true);
>> +
>> +            if (unlikely(ret2 < 0))
>> +                goto out;
> 
> It's an optimisation and we don't have to handle errors here,
> let's ignore them and make it looking a bit better.
> 
>> +            goto getevents_ran_local;
>> +        }
>>           mutex_unlock(&ctx->uring_lock);
>>       }
>> +
>>       if (flags & IORING_ENTER_GETEVENTS) {
>>           int ret2;
>> +
>>           if (ctx->syscall_iopoll) {
>>               /*
>>                * We disallow the app entering submit/complete with
>> @@ -3081,6 +3194,12 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, 
>> fd, u32, to_submit,
>>               const sigset_t __user *sig;
>>               struct __kernel_timespec __user *ts;
>> +            if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
> 
> I think it should be in io_cqring_wait(), which calls it anyway
> in the beginning. Instead of
> 
>      do {
>          io_cqring_overflow_flush(ctx);
>          if (io_cqring_events(ctx) >= min_events)
>              return 0;
>          if (!io_run_task_work())
>              break;
>      } while (1);
> 
> Let's have
> 
>      do {
>          ret = io_run_task_work_ctx();
>          // handle ret
>          io_cqring_overflow_flush(ctx);
>          if (io_cqring_events(ctx) >= min_events)
>              return 0;
>      } while (1);
> 
>> +                ret2 = io_run_local_work(ctx, false);
>> +                if (unlikely(ret2 < 0))
>> +                    goto getevents_out;
>> +            }
>> +getevents_ran_local:
>>               ret2 = io_get_ext_arg(flags, argp, &argsz, &ts, &sig);
>>               if (likely(!ret2)) {
>>                   min_complete = min(min_complete,
>> @@ -3090,6 +3209,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, 
>> fd, u32, to_submit,
>>               }
>>           }
>> +getevents_out:
>>           if (!ret) {
>>               ret = ret2;
>> @@ -3289,17 +3409,29 @@ static __cold int io_uring_create(unsigned 
>> entries, struct io_uring_params *p,
>>       if (ctx->flags & IORING_SETUP_SQPOLL) {
>>           /* IPI related flags don't make sense with SQPOLL */
>>           if (ctx->flags & (IORING_SETUP_COOP_TASKRUN |
>> -                  IORING_SETUP_TASKRUN_FLAG))
>> +                  IORING_SETUP_TASKRUN_FLAG |
>> +                  IORING_SETUP_DEFER_TASKRUN))
> 
> Sounds like we should also fail if SQPOLL is set, especially with
> the task check on the waiting side.

sqpoll as a natural single issuer case, shouldn't we support this
feature for it? And surely, in that case, don't do local task work check
in cqring wait time and be careful in other places like
io_uring_register

> 
>>               goto err;
>>           ctx->notify_method = TWA_SIGNAL_NO_IPI;
>>       } else if (ctx->flags & IORING_SETUP_COOP_TASKRUN) {
>>           ctx->notify_method = TWA_SIGNAL_NO_IPI;
> [...]
>>       mutex_lock(&ctx->uring_lock);
>>       ret = __io_uring_register(ctx, opcode, arg, nr_args);
>> diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
>> index 2f73f83af960..a9fb115234af 100644
>> --- a/io_uring/io_uring.h
>> +++ b/io_uring/io_uring.h
>> @@ -26,7 +26,8 @@ enum {
> [...]
>> +static inline int io_run_task_work_unlock_ctx(struct io_ring_ctx *ctx)
>> +{
>> +    int ret;
>> +
>> +    if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
>> +        ret = io_run_local_work(ctx, true);
>> +    } else {
>> +        mutex_unlock(&ctx->uring_lock);
>> +        ret = (int)io_run_task_work();
> 
> Why do we need a cast? let's keep the return type same
> 
>
Dylan Yudaken Aug. 30, 2022, 7:23 a.m. UTC | #3
On Mon, 2022-08-29 at 14:32 +0800, Hao Xu wrote:
> > > @@ -3289,17 +3409,29 @@ static __cold int
> > > io_uring_create(unsigned 
> > > entries, struct io_uring_params *p,
> > >       if (ctx->flags & IORING_SETUP_SQPOLL) {
> > >           /* IPI related flags don't make sense with SQPOLL */
> > >           if (ctx->flags & (IORING_SETUP_COOP_TASKRUN |
> > > -                  IORING_SETUP_TASKRUN_FLAG))
> > > +                  IORING_SETUP_TASKRUN_FLAG |
> > > +                  IORING_SETUP_DEFER_TASKRUN))
> > 
> > Sounds like we should also fail if SQPOLL is set, especially with
> > the task check on the waiting side.
> 
> sqpoll as a natural single issuer case, shouldn't we support this
> feature for it? And surely, in that case, don't do local task work
> check
> in cqring wait time and be careful in other places like
> io_uring_register

I think there is definitely scope for that - but it's less obvious how
to do it.
i.e. in it's current form DEFER_TASKRUN requires the GETEVENTS to be
submitted on the same task as the initial submission, but with SQPOLL
the submission task is a kernel thread so would have to have some
difference in the API.

As an idea for a later patch set - perhaps the semantics should be to
keep task work local but only run it once submissions have been
processed for a ctx. I suspect that will require some care to ensure
the wakeup flag is set correctly and that it cleans up properly.

Dylan
Hao Xu Aug. 30, 2022, 7:54 a.m. UTC | #4
On 8/30/22 15:23, Dylan Yudaken wrote:
> On Mon, 2022-08-29 at 14:32 +0800, Hao Xu wrote:
>>>> @@ -3289,17 +3409,29 @@ static __cold int
>>>> io_uring_create(unsigned
>>>> entries, struct io_uring_params *p,
>>>>        if (ctx->flags & IORING_SETUP_SQPOLL) {
>>>>            /* IPI related flags don't make sense with SQPOLL */
>>>>            if (ctx->flags & (IORING_SETUP_COOP_TASKRUN |
>>>> -                  IORING_SETUP_TASKRUN_FLAG))
>>>> +                  IORING_SETUP_TASKRUN_FLAG |
>>>> +                  IORING_SETUP_DEFER_TASKRUN))
>>>
>>> Sounds like we should also fail if SQPOLL is set, especially with
>>> the task check on the waiting side.
>>
>> sqpoll as a natural single issuer case, shouldn't we support this
>> feature for it? And surely, in that case, don't do local task work
>> check
>> in cqring wait time and be careful in other places like
>> io_uring_register
> 
> I think there is definitely scope for that - but it's less obvious how
> to do it.
> i.e. in it's current form DEFER_TASKRUN requires the GETEVENTS to be
> submitted on the same task as the initial submission, but with SQPOLL
> the submission task is a kernel thread so would have to have some
> difference in the API.

Yea, just like what I said, in sqpoll mode, we shouldn't do the tw
handle in the io_uring_enter.

> 
> As an idea for a later patch set - perhaps the semantics should be to
> keep task work local but only run it once submissions have been
> processed for a ctx. I suspect that will require some care to ensure
> the wakeup flag is set correctly and that it cleans up properly.
> 

Yea, it should be a separate follow-up patchset, currently there is a
io_run_task_work() after submitted sqes for all ctxes and before going
to sleep, that may be a good place for it. I haven't think about it in
detail, but there should be a viable way.

> Dylan
>
Dylan Yudaken Aug. 30, 2022, 9:54 a.m. UTC | #5
On Mon, 2022-08-22 at 12:34 +0100, Pavel Begunkov wrote:
> On 8/19/22 13:19, Dylan Yudaken wrote:
> > Allow deferring async tasks until the user calls io_uring_enter(2)
> > with
> > the IORING_ENTER_GETEVENTS flag. Enable this mode with a flag at
> > io_uring_setup time. This functionality requires that the later
> > io_uring_enter will be called from the same submission task, and
> > therefore
> > restrict this flag to work only when IORING_SETUP_SINGLE_ISSUER is
> > also
> > set.
> 
> Looks ok, a couple of small comments below, but I don't see anything
> blocking it.
> 
> > Being able to hand pick when tasks are run prevents the problem
> > where
> > there is current work to be done, however task work runs anyway.
> > 
> > For example, a common workload would obtain a batch of CQEs, and
> > process
> > each one. Interrupting this to additional taskwork would add
> > latency but
> > not gain anything. If instead task work is deferred to just before
> > more
> > CQEs are obtained then no additional latency is added.
> > 
> > The way this is implemented is by trying to keep task work local to
> > a
> > io_ring_ctx, rather than to the submission task. This is required,
> > as the
> > application will want to wake up only a single io_ring_ctx at a
> > time to
> > process work, and so the lists of work have to be kept separate.
> > 
> > This has some other benefits like not having to check the task
> > continually
> > in handle_tw_list (and potentially unlocking/locking those), and
> > reducing
> > locks in the submit & process completions path.
> > 
> > There are networking cases where using this option can reduce
> > request
> > latency by 50%. For example a contrived example using [1] where the
> > client
> > sends 2k data and receives the same data back while doing some
> > system
> > calls (to trigger task work) shows this reduction. The reason ends
> > up
> > being that if sending responses is delayed by processing task work,
> > then
> > the client side sits idle. Whereas reordering the sends first means
> > that
> > the client runs it's workload in parallel with the local task work.
> 
> Quite contrived, for some it may cut latency in half but for others
> as easily increate it twofold. In any case, it's not a critique of
> the
> feature as it's optional, but rather raises a question whether we
> need to add some fairness / scheduling here.
> 
> > [1]:
> > Using https://github.com/DylanZA/netbench/tree/defer_run
> > Client:
> > ./netbench  --client_only 1 --control_port 10000 --host <host> --tx
> > "epoll --threads 16 --per_thread 1 --size 2048 --resp 2048 --
> > workload 1000"
> > Server:
> > ./netbench  --server_only 1 --control_port 10000  --rx "io_uring --
> > defer_taskrun 0 --workload 100"   --rx "io_uring  --defer_taskrun 1
> > --workload 100"
> > 
> > Signed-off-by: Dylan Yudaken <dylany@fb.com>
> > ---
> 
> > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> > index 53696dd90626..6572d2276750 100644
> > --- a/io_uring/io_uring.c
> > +++ b/io_uring/io_uring.c
> [...]
> 
> > +int io_run_local_work(struct io_ring_ctx *ctx, bool locked)
> > +{
> > +       struct llist_node *node;
> > +       struct llist_node fake;
> > +       struct llist_node *current_final = NULL;
> > +       int ret;
> > +
> > +       if (unlikely(ctx->submitter_task != current)) {
> > +               if (locked)
> > +                       mutex_unlock(&ctx->uring_lock);
> > +
> > +               /* maybe this is before any submissions */
> > +               if (!ctx->submitter_task)
> > +                       return 0;
> > +
> > +               return -EEXIST;
> > +       }
> > +
> > +       if (!locked)
> > +               locked = mutex_trylock(&ctx->uring_lock);
> > +
> > +       node = io_llist_xchg(&ctx->work_llist, &fake);
> > +       ret = 0;
> > +again:
> > +       while (node != current_final) {
> > +               struct llist_node *next = node->next;
> > +               struct io_kiocb *req = container_of(node, struct
> > io_kiocb,
> > +                                                  
> > io_task_work.node);
> > +               prefetch(container_of(next, struct io_kiocb,
> > io_task_work.node));
> > +               req->io_task_work.func(req, &locked);
> > +               ret++;
> > +               node = next;
> > +       }
> > +
> > +       if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
> > +               atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings-
> > >sq_flags);
> > +
> > +       node = io_llist_cmpxchg(&ctx->work_llist, &fake, NULL);
> > +       if (node != &fake) {
> > +               current_final = &fake;
> > +               node = io_llist_xchg(&ctx->work_llist, &fake);
> > +               goto again;
> > +       }
> > +
> > +       if (locked) {
> > +               io_submit_flush_completions(ctx);
> > +               mutex_unlock(&ctx->uring_lock);
> > +       }
> > +       return ret;
> > +}
> 
> I was thinking about:
> 
> int io_run_local_work(struct io_ring_ctx *ctx, bool *locked)
> {
>         locked = try_lock();
> }
> 
> bool locked = false;
> io_run_local_work(ctx, *locked);
> if (locked)
>         unlock();
> 
> // or just as below when already holding it
> bool locked = true;
> io_run_local_work(ctx, *locked);
> 
> Which would replace
> 
> if (DEFER) {
>         // we're assuming that it'll unlock
>         io_run_local_work(true);
> } else {
>         unlock();
> }
> 
> with
> 
> if (DEFER) {
>         bool locked = true;
>         io_run_local_work(&locked);
> }
> unlock();
> 
> But anyway, it can be mulled later.

I think there is an easier way to clean it up if we allow an extra
unlock/lock in io_uring_enter (see below). Will do that in v4

> 
> 
> > -int io_run_task_work_sig(void)
> > +int io_run_task_work_sig(struct io_ring_ctx *ctx)
> >   {
> > -       if (io_run_task_work())
> > +       if (io_run_task_work_ctx(ctx))
> >                 return 1;
> >         if (task_sigpending(current))
> >                 return -EINTR;
> > @@ -2196,7 +2294,7 @@ static inline int
> > io_cqring_wait_schedule(struct io_ring_ctx *ctx,
> >         unsigned long check_cq;
> >   
> >         /* make sure we run task_work before checking for signals
> > */
> > -       ret = io_run_task_work_sig();
> > +       ret = io_run_task_work_sig(ctx);
> >         if (ret || io_should_wake(iowq))
> >                 return ret;
> >   
> > @@ -2230,7 +2328,7 @@ static int io_cqring_wait(struct io_ring_ctx
> > *ctx, int min_events,
> >                 io_cqring_overflow_flush(ctx);
> >                 if (io_cqring_events(ctx) >= min_events)
> >                         return 0;
> > -               if (!io_run_task_work())
> > +               if (!io_run_task_work_ctx(ctx))
> >                         break;
> >         } while (1);
> >   
> > @@ -2573,6 +2671,9 @@ static __cold void io_ring_exit_work(struct
> > work_struct *work)
> >          * as nobody else will be looking for them.
> >          */
> >         do {
> > +               if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
> > +                       io_move_task_work_from_local(ctx);
> > +
> >                 while (io_uring_try_cancel_requests(ctx, NULL,
> > true))
> >                         cond_resched();
> >   
> > @@ -2768,6 +2869,8 @@ static __cold bool
> > io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
> >                 }
> >         }
> >   
> > +       if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
> > +               ret |= io_run_local_work(ctx, false) > 0;
> >         ret |= io_cancel_defer_files(ctx, task, cancel_all);
> >         mutex_lock(&ctx->uring_lock);
> >         ret |= io_poll_remove_all(ctx, task, cancel_all);
> > @@ -3057,10 +3160,20 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned
> > int, fd, u32, to_submit,
> >                 }
> >                 if ((flags & IORING_ENTER_GETEVENTS) && ctx-
> > >syscall_iopoll)
> >                         goto iopoll_locked;
> > +               if ((flags & IORING_ENTER_GETEVENTS) &&
> > +                       (ctx->flags & IORING_SETUP_DEFER_TASKRUN))
> > {
> > +                       int ret2 = io_run_local_work(ctx, true);
> > +
> > +                       if (unlikely(ret2 < 0))
> > +                               goto out;
> 
> It's an optimisation and we don't have to handle errors here,
> let's ignore them and make it looking a bit better.

I'm not convinced about that - as then there is no way the application
will know it is trying to complete events on the wrong thread. Work
will just silently pile up instead.
That being said - with the changes below I can just get rid of this
code I think.

> 
> > +                       goto getevents_ran_local;
> > +               }
> >                 mutex_unlock(&ctx->uring_lock);
> >         }
> > +
> >         if (flags & IORING_ENTER_GETEVENTS) {
> >                 int ret2;
> > +
> >                 if (ctx->syscall_iopoll) {
> >                         /*
> >                          * We disallow the app entering
> > submit/complete with
> > @@ -3081,6 +3194,12 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned
> > int, fd, u32, to_submit,
> >                         const sigset_t __user *sig;
> >                         struct __kernel_timespec __user *ts;
> >   
> > +                       if (ctx->flags &
> > IORING_SETUP_DEFER_TASKRUN) {
> 
> I think it should be in io_cqring_wait(), which calls it anyway
> in the beginning. Instead of
> 
>         do {
>                 io_cqring_overflow_flush(ctx);
>                 if (io_cqring_events(ctx) >= min_events)
>                         return 0;
>                 if (!io_run_task_work())
>                         break;
>         } while (1);
> 
> Let's have
> 
>         do {
>                 ret = io_run_task_work_ctx();
>                 // handle ret
>                 io_cqring_overflow_flush(ctx);
>                 if (io_cqring_events(ctx) >= min_events)
>                         return 0;
>         } while (1);

I think that is ok.
The downside is that it adds an extra lock/unlock of the ctx in some
cases. I assume that will be neglegible?

> 
> > +                               ret2 = io_run_local_work(ctx,
> > false);
> > +                               if (unlikely(ret2 < 0))
> > +                                       goto getevents_out;
> > +                       }
> > +getevents_ran_local:
> >                         ret2 = io_get_ext_arg(flags, argp, &argsz,
> > &ts, &sig);
> >                         if (likely(!ret2)) {
> >                                 min_complete = min(min_complete,
> > @@ -3090,6 +3209,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int,
> > fd, u32, to_submit,
> >                         }
> >                 }
> >   
> > +getevents_out:
> >                 if (!ret) {
> >                         ret = ret2;
> >   
> > @@ -3289,17 +3409,29 @@ static __cold int io_uring_create(unsigned
> > entries, struct io_uring_params *p,
> >         if (ctx->flags & IORING_SETUP_SQPOLL) {
> >                 /* IPI related flags don't make sense with SQPOLL
> > */
> >                 if (ctx->flags & (IORING_SETUP_COOP_TASKRUN |
> > -                                 IORING_SETUP_TASKRUN_FLAG))
> > +                                 IORING_SETUP_TASKRUN_FLAG |
> > +                                 IORING_SETUP_DEFER_TASKRUN))
> 
> Sounds like we should also fail if SQPOLL is set, especially with
> the task check on the waiting side.
> 

That is what this code is doing I think? Did I miss something?


> >                         goto err;
> >                 ctx->notify_method = TWA_SIGNAL_NO_IPI;
> >         } else if (ctx->flags & IORING_SETUP_COOP_TASKRUN) {
> >                 ctx->notify_method = TWA_SIGNAL_NO_IPI;
> [...]
> >         mutex_lock(&ctx->uring_lock);
> >         ret = __io_uring_register(ctx, opcode, arg, nr_args);
> > diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
> > index 2f73f83af960..a9fb115234af 100644
> > --- a/io_uring/io_uring.h
> > +++ b/io_uring/io_uring.h
> > @@ -26,7 +26,8 @@ enum {
> [...]
> > +static inline int io_run_task_work_unlock_ctx(struct io_ring_ctx
> > *ctx)
> > +{
> > +       int ret;
> > +
> > +       if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
> > +               ret = io_run_local_work(ctx, true);
> > +       } else {
> > +               mutex_unlock(&ctx->uring_lock);
> > +               ret = (int)io_run_task_work();
> 
> Why do we need a cast? let's keep the return type same

Ok I'll update the return types here


Dylan
Pavel Begunkov Aug. 30, 2022, 10:29 a.m. UTC | #6
On 8/30/22 10:54, Dylan Yudaken wrote:
> On Mon, 2022-08-22 at 12:34 +0100, Pavel Begunkov wrote:
[...]
>>> +
>>> +       node = io_llist_cmpxchg(&ctx->work_llist, &fake, NULL);
>>> +       if (node != &fake) {
>>> +               current_final = &fake;
>>> +               node = io_llist_xchg(&ctx->work_llist, &fake);
>>> +               goto again;
>>> +       }
>>> +
>>> +       if (locked) {
>>> +               io_submit_flush_completions(ctx);
>>> +               mutex_unlock(&ctx->uring_lock);
>>> +       }
>>> +       return ret;
>>> +}
>>
>> I was thinking about:
>>
>> int io_run_local_work(struct io_ring_ctx *ctx, bool *locked)
>> {
>>          locked = try_lock();
>> }
>>
>> bool locked = false;
>> io_run_local_work(ctx, *locked);
>> if (locked)
>>          unlock();
>>
>> // or just as below when already holding it
>> bool locked = true;
>> io_run_local_work(ctx, *locked);
>>
>> Which would replace
>>
>> if (DEFER) {
>>          // we're assuming that it'll unlock
>>          io_run_local_work(true);
>> } else {
>>          unlock();
>> }
>>
>> with
>>
>> if (DEFER) {
>>          bool locked = true;
>>          io_run_local_work(&locked);
>> }
>> unlock();
>>
>> But anyway, it can be mulled later.
> 
> I think there is an easier way to clean it up if we allow an extra
> unlock/lock in io_uring_enter (see below). Will do that in v4

fwiw, I'm fine with the current code, the rest can
be cleaned up later if you'd prefer so.

[...]
>>> @@ -3057,10 +3160,20 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned
>>> int, fd, u32, to_submit,
>>>                  }
>>>                  if ((flags & IORING_ENTER_GETEVENTS) && ctx-
>>>> syscall_iopoll)
>>>                          goto iopoll_locked;
>>> +               if ((flags & IORING_ENTER_GETEVENTS) &&
>>> +                       (ctx->flags & IORING_SETUP_DEFER_TASKRUN))
>>> {
>>> +                       int ret2 = io_run_local_work(ctx, true);
>>> +
>>> +                       if (unlikely(ret2 < 0))
>>> +                               goto out;
>>
>> It's an optimisation and we don't have to handle errors here,
>> let's ignore them and make it looking a bit better.
> 
> I'm not convinced about that - as then there is no way the application
> will know it is trying to complete events on the wrong thread. Work
> will just silently pile up instead.

by optimisation I mean exactly this chunk right after submsission.
If it's a wrong thread this will be ignored, then control flow will
fall into cq_wait and then fail there returning an error. So, the
userspace should get an error in the end but the handling would be
consolidated in cq_wait.

> That being said - with the changes below I can just get rid of this
> code I think.
> 
>>
>>> +                       goto getevents_ran_local;
>>> +               }
>>>                  mutex_unlock(&ctx->uring_lock);
>>>          }
>>> +
>>>          if (flags & IORING_ENTER_GETEVENTS) {
>>>                  int ret2;
>>> +
>>>                  if (ctx->syscall_iopoll) {
>>>                          /*
>>>                           * We disallow the app entering
>>> submit/complete with
>>> @@ -3081,6 +3194,12 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned
>>> int, fd, u32, to_submit,
>>>                          const sigset_t __user *sig;
>>>                          struct __kernel_timespec __user *ts;
>>>    
>>> +                       if (ctx->flags &
>>> IORING_SETUP_DEFER_TASKRUN) {
>>
>> I think it should be in io_cqring_wait(), which calls it anyway
>> in the beginning. Instead of
>>
>>          do {
>>                  io_cqring_overflow_flush(ctx);
>>                  if (io_cqring_events(ctx) >= min_events)
>>                          return 0;
>>                  if (!io_run_task_work())
>>                          break;
>>          } while (1);
>>
>> Let's have
>>
>>          do {
>>                  ret = io_run_task_work_ctx();
>>                  // handle ret
>>                  io_cqring_overflow_flush(ctx);
>>                  if (io_cqring_events(ctx) >= min_events)
>>                          return 0;
>>          } while (1);
> 
> I think that is ok.
> The downside is that it adds an extra lock/unlock of the ctx in some
> cases. I assume that will be neglegible?

Not sure there will be any extra locking. IIRC, it was about replacing

// io_uring_enter() -> GETEVENTS path
run_tw();
// io_cqring_wait()
while (cqes_ready() < needed)
	run_tw();

With:

// io_uring_enter()
do {
	run_tw();
} while(cqes_ready() < needed);


>>> +                               ret2 = io_run_local_work(ctx,
>>> false);
>>> +                               if (unlikely(ret2 < 0))
>>> +                                       goto getevents_out;
>>> +                       }
>>> +getevents_ran_local:
>>>                          ret2 = io_get_ext_arg(flags, argp, &argsz,
>>> &ts, &sig);
>>>                          if (likely(!ret2)) {
>>>                                  min_complete = min(min_complete,
>>> @@ -3090,6 +3209,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int,
>>> fd, u32, to_submit,
>>>                          }
>>>                  }
>>>    
>>> +getevents_out:
>>>                  if (!ret) {
>>>                          ret = ret2;
>>>    
>>> @@ -3289,17 +3409,29 @@ static __cold int io_uring_create(unsigned
>>> entries, struct io_uring_params *p,
>>>          if (ctx->flags & IORING_SETUP_SQPOLL) {
>>>                  /* IPI related flags don't make sense with SQPOLL
>>> */
>>>                  if (ctx->flags & (IORING_SETUP_COOP_TASKRUN |
>>> -                                 IORING_SETUP_TASKRUN_FLAG))
>>> +                                 IORING_SETUP_TASKRUN_FLAG |
>>> +                                 IORING_SETUP_DEFER_TASKRUN))
>>
>> Sounds like we should also fail if SQPOLL is set, especially with
>> the task check on the waiting side.
>>
> 
> That is what this code is doing I think? Did I miss something?

Ok, great then
Hao Xu Aug. 30, 2022, 1:19 p.m. UTC | #7
On 8/19/22 20:19, Dylan Yudaken wrote:
> Allow deferring async tasks until the user calls io_uring_enter(2) with
> the IORING_ENTER_GETEVENTS flag. Enable this mode with a flag at
> io_uring_setup time. This functionality requires that the later
> io_uring_enter will be called from the same submission task, and therefore
> restrict this flag to work only when IORING_SETUP_SINGLE_ISSUER is also
> set.
> 
> Being able to hand pick when tasks are run prevents the problem where
> there is current work to be done, however task work runs anyway.
> 
> For example, a common workload would obtain a batch of CQEs, and process
> each one. Interrupting this to additional taskwork would add latency but
> not gain anything. If instead task work is deferred to just before more
> CQEs are obtained then no additional latency is added.
> 
> The way this is implemented is by trying to keep task work local to a
> io_ring_ctx, rather than to the submission task. This is required, as the
> application will want to wake up only a single io_ring_ctx at a time to
> process work, and so the lists of work have to be kept separate.
> 
> This has some other benefits like not having to check the task continually
> in handle_tw_list (and potentially unlocking/locking those), and reducing
> locks in the submit & process completions path.
> 
> There are networking cases where using this option can reduce request
> latency by 50%. For example a contrived example using [1] where the client
> sends 2k data and receives the same data back while doing some system
> calls (to trigger task work) shows this reduction. The reason ends up
> being that if sending responses is delayed by processing task work, then
> the client side sits idle. Whereas reordering the sends first means that
> the client runs it's workload in parallel with the local task work.
> 

Sorry, seems I misunderstood the purpose of this patchset. Allow me to
ask a question: "we always first submit sqes then handle task work
(in IORING_SETUP_COOP_TASKRUN mode), how could the sending be
interrupted by task works?"
Dylan Yudaken Aug. 30, 2022, 1:34 p.m. UTC | #8
On Tue, 2022-08-30 at 21:19 +0800, Hao Xu wrote:
> On 8/19/22 20:19, Dylan Yudaken wrote:
> > Allow deferring async tasks until the user calls io_uring_enter(2)
> > with
> > the IORING_ENTER_GETEVENTS flag. Enable this mode with a flag at
> > io_uring_setup time. This functionality requires that the later
> > io_uring_enter will be called from the same submission task, and
> > therefore
> > restrict this flag to work only when IORING_SETUP_SINGLE_ISSUER is
> > also
> > set.
> > 
> > Being able to hand pick when tasks are run prevents the problem
> > where
> > there is current work to be done, however task work runs anyway.
> > 
> > For example, a common workload would obtain a batch of CQEs, and
> > process
> > each one. Interrupting this to additional taskwork would add
> > latency but
> > not gain anything. If instead task work is deferred to just before
> > more
> > CQEs are obtained then no additional latency is added.
> > 
> > The way this is implemented is by trying to keep task work local to
> > a
> > io_ring_ctx, rather than to the submission task. This is required,
> > as the
> > application will want to wake up only a single io_ring_ctx at a
> > time to
> > process work, and so the lists of work have to be kept separate.
> > 
> > This has some other benefits like not having to check the task
> > continually
> > in handle_tw_list (and potentially unlocking/locking those), and
> > reducing
> > locks in the submit & process completions path.
> > 
> > There are networking cases where using this option can reduce
> > request
> > latency by 50%. For example a contrived example using [1] where the
> > client
> > sends 2k data and receives the same data back while doing some
> > system
> > calls (to trigger task work) shows this reduction. The reason ends
> > up
> > being that if sending responses is delayed by processing task work,
> > then
> > the client side sits idle. Whereas reordering the sends first means
> > that
> > the client runs it's workload in parallel with the local task work.
> > 
> 
> Sorry, seems I misunderstood the purpose of this patchset. Allow me
> to
> ask a question: "we always first submit sqes then handle task work
> (in IORING_SETUP_COOP_TASKRUN mode), how could the sending be
> interrupted by task works?"

IORING_SETUP_COOP_TASKRUN causes the task to not be interrupted simply
for task work, however it will still be run on every system call even
if completions are not about to be processed. 

IoUring task work (unlike say epoll wakeups) can take a non-trivial
amount of time, and so running them closer to when they are used can
reduce latency of other unrelated operations by not unnecessarily
stalling them.
Hao Xu Aug. 30, 2022, 2:04 p.m. UTC | #9
On 8/30/22 21:34, Dylan Yudaken wrote:
> On Tue, 2022-08-30 at 21:19 +0800, Hao Xu wrote:
>> On 8/19/22 20:19, Dylan Yudaken wrote:
>>> Allow deferring async tasks until the user calls io_uring_enter(2)
>>> with
>>> the IORING_ENTER_GETEVENTS flag. Enable this mode with a flag at
>>> io_uring_setup time. This functionality requires that the later
>>> io_uring_enter will be called from the same submission task, and
>>> therefore
>>> restrict this flag to work only when IORING_SETUP_SINGLE_ISSUER is
>>> also
>>> set.
>>>
>>> Being able to hand pick when tasks are run prevents the problem
>>> where
>>> there is current work to be done, however task work runs anyway.
>>>
>>> For example, a common workload would obtain a batch of CQEs, and
>>> process
>>> each one. Interrupting this to additional taskwork would add
>>> latency but
>>> not gain anything. If instead task work is deferred to just before
>>> more
>>> CQEs are obtained then no additional latency is added.
>>>
>>> The way this is implemented is by trying to keep task work local to
>>> a
>>> io_ring_ctx, rather than to the submission task. This is required,
>>> as the
>>> application will want to wake up only a single io_ring_ctx at a
>>> time to
>>> process work, and so the lists of work have to be kept separate.
>>>
>>> This has some other benefits like not having to check the task
>>> continually
>>> in handle_tw_list (and potentially unlocking/locking those), and
>>> reducing
>>> locks in the submit & process completions path.
>>>
>>> There are networking cases where using this option can reduce
>>> request
>>> latency by 50%. For example a contrived example using [1] where the
>>> client
>>> sends 2k data and receives the same data back while doing some
>>> system
>>> calls (to trigger task work) shows this reduction. The reason ends
>>> up
>>> being that if sending responses is delayed by processing task work,
>>> then
>>> the client side sits idle. Whereas reordering the sends first means
>>> that
>>> the client runs it's workload in parallel with the local task work.
>>>
>>
>> Sorry, seems I misunderstood the purpose of this patchset. Allow me
>> to
>> ask a question: "we always first submit sqes then handle task work
>> (in IORING_SETUP_COOP_TASKRUN mode), how could the sending be
>> interrupted by task works?"
> 
> IORING_SETUP_COOP_TASKRUN causes the task to not be interrupted simply
> for task work, however it will still be run on every system call even
> if completions are not about to be processed.
> 

gotcha, then sqpoll may not be a tenant of this feature..


> IoUring task work (unlike say epoll wakeups) can take a non-trivial
> amount of time, and so running them closer to when they are used can
> reduce latency of other unrelated operations by not unnecessarily
> stalling them.
>
diff mbox series

Patch

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 677a25d44d7f..d56ff2185168 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -301,6 +301,8 @@  struct io_ring_ctx {
 		struct io_hash_table	cancel_table;
 		bool			poll_multi_queue;
 
+		struct llist_head	work_llist;
+
 		struct list_head	io_buffers_comp;
 	} ____cacheline_aligned_in_smp;
 
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 1463cfecb56b..be8d1801bf4a 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -153,6 +153,13 @@  enum {
  */
 #define IORING_SETUP_SINGLE_ISSUER	(1U << 12)
 
+/*
+ * Defer running task work to get events.
+ * Rather than running bits of task work whenever the task transitions
+ * try to do it just before it is needed.
+ */
+#define IORING_SETUP_DEFER_TASKRUN	(1U << 13)
+
 enum io_uring_op {
 	IORING_OP_NOP,
 	IORING_OP_READV,
diff --git a/io_uring/cancel.c b/io_uring/cancel.c
index e4e1dc0325f0..db6180b62e41 100644
--- a/io_uring/cancel.c
+++ b/io_uring/cancel.c
@@ -292,7 +292,7 @@  int io_sync_cancel(struct io_ring_ctx *ctx, void __user *arg)
 			break;
 
 		mutex_unlock(&ctx->uring_lock);
-		ret = io_run_task_work_sig();
+		ret = io_run_task_work_sig(ctx);
 		if (ret < 0) {
 			mutex_lock(&ctx->uring_lock);
 			break;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 53696dd90626..6572d2276750 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -142,7 +142,7 @@  static bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 static void io_dismantle_req(struct io_kiocb *req);
 static void io_clean_op(struct io_kiocb *req);
 static void io_queue_sqe(struct io_kiocb *req);
-
+static void io_move_task_work_from_local(struct io_ring_ctx *ctx);
 static void __io_submit_flush_completions(struct io_ring_ctx *ctx);
 
 static struct kmem_cache *req_cachep;
@@ -316,6 +316,7 @@  static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_LIST_HEAD(&ctx->rsrc_ref_list);
 	INIT_DELAYED_WORK(&ctx->rsrc_put_work, io_rsrc_put_work);
 	init_llist_head(&ctx->rsrc_put_llist);
+	init_llist_head(&ctx->work_llist);
 	INIT_LIST_HEAD(&ctx->tctx_list);
 	ctx->submit_state.free_list.next = NULL;
 	INIT_WQ_LIST(&ctx->locked_free_list);
@@ -1047,12 +1048,36 @@  void tctx_task_work(struct callback_head *cb)
 	trace_io_uring_task_work_run(tctx, count, loops);
 }
 
-void io_req_task_work_add(struct io_kiocb *req)
+static void io_req_local_work_add(struct io_kiocb *req)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+
+	if (!llist_add(&req->io_task_work.node, &ctx->work_llist))
+		return;
+
+	if (unlikely(atomic_read(&req->task->io_uring->in_idle))) {
+		io_move_task_work_from_local(ctx);
+		return;
+	}
+
+	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
+		atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
+
+	io_cqring_wake(ctx);
+
+}
+
+static inline void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
 {
 	struct io_uring_task *tctx = req->task->io_uring;
 	struct io_ring_ctx *ctx = req->ctx;
 	struct llist_node *node;
 
+	if (allow_local && ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
+		io_req_local_work_add(req);
+		return;
+	}
+
 	/* task_work already pending, we're done */
 	if (!llist_add(&req->io_task_work.node, &tctx->task_list))
 		return;
@@ -1074,6 +1099,76 @@  void io_req_task_work_add(struct io_kiocb *req)
 	}
 }
 
+void io_req_task_work_add(struct io_kiocb *req)
+{
+	__io_req_task_work_add(req, true);
+}
+
+static void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
+{
+	struct llist_node *node;
+
+	node = llist_del_all(&ctx->work_llist);
+	while (node) {
+		struct io_kiocb *req = container_of(node, struct io_kiocb,
+						    io_task_work.node);
+
+		node = node->next;
+		__io_req_task_work_add(req, false);
+	}
+}
+
+int io_run_local_work(struct io_ring_ctx *ctx, bool locked)
+{
+	struct llist_node *node;
+	struct llist_node fake;
+	struct llist_node *current_final = NULL;
+	int ret;
+
+	if (unlikely(ctx->submitter_task != current)) {
+		if (locked)
+			mutex_unlock(&ctx->uring_lock);
+
+		/* maybe this is before any submissions */
+		if (!ctx->submitter_task)
+			return 0;
+
+		return -EEXIST;
+	}
+
+	if (!locked)
+		locked = mutex_trylock(&ctx->uring_lock);
+
+	node = io_llist_xchg(&ctx->work_llist, &fake);
+	ret = 0;
+again:
+	while (node != current_final) {
+		struct llist_node *next = node->next;
+		struct io_kiocb *req = container_of(node, struct io_kiocb,
+						    io_task_work.node);
+		prefetch(container_of(next, struct io_kiocb, io_task_work.node));
+		req->io_task_work.func(req, &locked);
+		ret++;
+		node = next;
+	}
+
+	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
+		atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
+
+	node = io_llist_cmpxchg(&ctx->work_llist, &fake, NULL);
+	if (node != &fake) {
+		current_final = &fake;
+		node = io_llist_xchg(&ctx->work_llist, &fake);
+		goto again;
+	}
+
+	if (locked) {
+		io_submit_flush_completions(ctx);
+		mutex_unlock(&ctx->uring_lock);
+	}
+	return ret;
+}
+
 static void io_req_tw_post(struct io_kiocb *req, bool *locked)
 {
 	io_req_complete_post(req);
@@ -1284,9 +1379,10 @@  static int io_iopoll_check(struct io_ring_ctx *ctx, long min)
 		if (wq_list_empty(&ctx->iopoll_list)) {
 			u32 tail = ctx->cached_cq_tail;
 
-			mutex_unlock(&ctx->uring_lock);
-			io_run_task_work();
+			ret = io_run_task_work_unlock_ctx(ctx);
 			mutex_lock(&ctx->uring_lock);
+			if (ret < 0)
+				break;
 
 			/* some requests don't go through iopoll_list */
 			if (tail != ctx->cached_cq_tail ||
@@ -2146,7 +2242,9 @@  struct io_wait_queue {
 
 static inline bool io_has_work(struct io_ring_ctx *ctx)
 {
-	return test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq);
+	return test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq) ||
+	       ((ctx->flags & IORING_SETUP_DEFER_TASKRUN) &&
+		!llist_empty(&ctx->work_llist));
 }
 
 static inline bool io_should_wake(struct io_wait_queue *iowq)
@@ -2178,9 +2276,9 @@  static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
 	return -1;
 }
 
-int io_run_task_work_sig(void)
+int io_run_task_work_sig(struct io_ring_ctx *ctx)
 {
-	if (io_run_task_work())
+	if (io_run_task_work_ctx(ctx))
 		return 1;
 	if (task_sigpending(current))
 		return -EINTR;
@@ -2196,7 +2294,7 @@  static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 	unsigned long check_cq;
 
 	/* make sure we run task_work before checking for signals */
-	ret = io_run_task_work_sig();
+	ret = io_run_task_work_sig(ctx);
 	if (ret || io_should_wake(iowq))
 		return ret;
 
@@ -2230,7 +2328,7 @@  static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 		io_cqring_overflow_flush(ctx);
 		if (io_cqring_events(ctx) >= min_events)
 			return 0;
-		if (!io_run_task_work())
+		if (!io_run_task_work_ctx(ctx))
 			break;
 	} while (1);
 
@@ -2573,6 +2671,9 @@  static __cold void io_ring_exit_work(struct work_struct *work)
 	 * as nobody else will be looking for them.
 	 */
 	do {
+		if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
+			io_move_task_work_from_local(ctx);
+
 		while (io_uring_try_cancel_requests(ctx, NULL, true))
 			cond_resched();
 
@@ -2768,6 +2869,8 @@  static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 		}
 	}
 
+	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
+		ret |= io_run_local_work(ctx, false) > 0;
 	ret |= io_cancel_defer_files(ctx, task, cancel_all);
 	mutex_lock(&ctx->uring_lock);
 	ret |= io_poll_remove_all(ctx, task, cancel_all);
@@ -3057,10 +3160,20 @@  SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 		}
 		if ((flags & IORING_ENTER_GETEVENTS) && ctx->syscall_iopoll)
 			goto iopoll_locked;
+		if ((flags & IORING_ENTER_GETEVENTS) &&
+			(ctx->flags & IORING_SETUP_DEFER_TASKRUN)) {
+			int ret2 = io_run_local_work(ctx, true);
+
+			if (unlikely(ret2 < 0))
+				goto out;
+			goto getevents_ran_local;
+		}
 		mutex_unlock(&ctx->uring_lock);
 	}
+
 	if (flags & IORING_ENTER_GETEVENTS) {
 		int ret2;
+
 		if (ctx->syscall_iopoll) {
 			/*
 			 * We disallow the app entering submit/complete with
@@ -3081,6 +3194,12 @@  SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 			const sigset_t __user *sig;
 			struct __kernel_timespec __user *ts;
 
+			if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
+				ret2 = io_run_local_work(ctx, false);
+				if (unlikely(ret2 < 0))
+					goto getevents_out;
+			}
+getevents_ran_local:
 			ret2 = io_get_ext_arg(flags, argp, &argsz, &ts, &sig);
 			if (likely(!ret2)) {
 				min_complete = min(min_complete,
@@ -3090,6 +3209,7 @@  SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 			}
 		}
 
+getevents_out:
 		if (!ret) {
 			ret = ret2;
 
@@ -3289,17 +3409,29 @@  static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 	if (ctx->flags & IORING_SETUP_SQPOLL) {
 		/* IPI related flags don't make sense with SQPOLL */
 		if (ctx->flags & (IORING_SETUP_COOP_TASKRUN |
-				  IORING_SETUP_TASKRUN_FLAG))
+				  IORING_SETUP_TASKRUN_FLAG |
+				  IORING_SETUP_DEFER_TASKRUN))
 			goto err;
 		ctx->notify_method = TWA_SIGNAL_NO_IPI;
 	} else if (ctx->flags & IORING_SETUP_COOP_TASKRUN) {
 		ctx->notify_method = TWA_SIGNAL_NO_IPI;
 	} else {
-		if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
+		if (ctx->flags & IORING_SETUP_TASKRUN_FLAG &&
+		    !(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
 			goto err;
 		ctx->notify_method = TWA_SIGNAL;
 	}
 
+	/*
+	 * For DEFER_TASKRUN we require the completion task to be the same as the
+	 * submission task. This implies that there is only one submitter, so enforce
+	 * that.
+	 */
+	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN &&
+	    !(ctx->flags & IORING_SETUP_SINGLE_ISSUER)) {
+		goto err;
+	}
+
 	/*
 	 * This is just grabbed for accounting purposes. When a process exits,
 	 * the mm is exited and dropped before the files, hence we need to hang
@@ -3400,7 +3532,7 @@  static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 			IORING_SETUP_R_DISABLED | IORING_SETUP_SUBMIT_ALL |
 			IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG |
 			IORING_SETUP_SQE128 | IORING_SETUP_CQE32 |
-			IORING_SETUP_SINGLE_ISSUER))
+			IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN))
 		return -EINVAL;
 
 	return io_uring_create(entries, &p, params);
@@ -3872,7 +4004,7 @@  SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
 
 	ctx = f.file->private_data;
 
-	io_run_task_work();
+	io_run_task_work_ctx(ctx);
 
 	mutex_lock(&ctx->uring_lock);
 	ret = __io_uring_register(ctx, opcode, arg, nr_args);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 2f73f83af960..a9fb115234af 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -26,7 +26,8 @@  enum {
 
 struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx);
 bool io_req_cqe_overflow(struct io_kiocb *req);
-int io_run_task_work_sig(void);
+int io_run_task_work_sig(struct io_ring_ctx *ctx);
+int io_run_local_work(struct io_ring_ctx *ctx, bool locked);
 void io_req_complete_failed(struct io_kiocb *req, s32 res);
 void __io_req_complete(struct io_kiocb *req, unsigned issue_flags);
 void io_req_complete_post(struct io_kiocb *req);
@@ -234,6 +235,32 @@  static inline bool io_run_task_work(void)
 	return false;
 }
 
+static inline bool io_run_task_work_ctx(struct io_ring_ctx *ctx)
+{
+	bool ret = false;
+
+	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
+		ret = io_run_local_work(ctx, false) > 0;
+
+	/* want to run this after in case more is added */
+	ret  |= io_run_task_work();
+	return ret;
+}
+
+static inline int io_run_task_work_unlock_ctx(struct io_ring_ctx *ctx)
+{
+	int ret;
+
+	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
+		ret = io_run_local_work(ctx, true);
+	} else {
+		mutex_unlock(&ctx->uring_lock);
+		ret = (int)io_run_task_work();
+	}
+
+	return ret;
+}
+
 static inline void io_tw_lock(struct io_ring_ctx *ctx, bool *locked)
 {
 	if (!*locked) {
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 71359a4d0bd4..80cda6e2067f 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -343,7 +343,7 @@  __cold static int io_rsrc_ref_quiesce(struct io_rsrc_data *data,
 		flush_delayed_work(&ctx->rsrc_put_work);
 		reinit_completion(&data->done);
 
-		ret = io_run_task_work_sig();
+		ret = io_run_task_work_sig(ctx);
 		mutex_lock(&ctx->uring_lock);
 	} while (ret >= 0);
 	data->quiesce = false;