
[v1] io_uring: reserve word at cqring tail+4 for the user

Message ID 20190917091358.3652-1-avi@scylladb.com (mailing list archive)
State New, archived

Commit Message

Avi Kivity Sept. 17, 2019, 9:13 a.m. UTC
In some applications, a thread waits for I/O events generated by
the kernel, and also events generated by other threads in the same
application. Typically events from other threads are passed using
in-memory queues that are not known to the kernel. As long as the
thread is active, it polls for both kernel completions and
inter-thread completions; when it is idle, it tells the other threads
to use an I/O event to wake it up (e.g. an eventfd or a pipe) and
then enters the kernel, waiting for such an event or an ordinary
I/O completion.

When such a thread goes idle, it typically spins for a while to
avoid the kernel entry/exit cost in case an event is forthcoming
shortly. While it spins it polls both I/O completions and
inter-thread queues.

The x86 instruction pair UMONITOR/UMWAIT allows waiting for a cache
line to be written to. This can be used with io_uring to wait for a
wakeup without spinning (and wasting power and slowing down the other
hyperthread). Other threads can also wake up the waiter by doing a
safe write to the tail word (which triggers the wakeup), but safe
writes are slow as they require an atomic instruction. To speed up
those wakeups, reserve a word after the tail for user writes.

A thread consuming an io_uring completion queue can then use the
following sequences:

  - while busy:
    - pick up work from the completion queue and from other threads,
      and process it

  - while idle:
    - use UMONITOR/UMWAIT to wait on completions and notifications
      from other threads for a short period
    - if no work is picked up, let other threads know you will need
      a kernel wakeup, and use io_uring_enter to wait indefinitely

Signed-off-by: Avi Kivity <avi@scylladb.com>
---
 fs/io_uring.c                 | 5 +++--
 include/uapi/linux/io_uring.h | 4 ++++
 2 files changed, 7 insertions(+), 2 deletions(-)
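
For illustration, a minimal userspace sketch of the idle path described
above (not part of the patch): it assumes a CPU with WAITPKG, a build
flag such as -mwaitpkg, that cq_tail points at the mmap'ed CQ tail word,
and a hypothetical have_local_work() hook that checks the application's
in-memory queues.

#include <stdint.h>
#include <x86intrin.h>		/* _umonitor/_umwait/__rdtsc, needs -mwaitpkg */

/* Hypothetical application hook: checks the in-memory inter-thread queues. */
extern int have_local_work(void);

/*
 * Waiter: short idle-path wait.  Returns 1 if the caller should re-check
 * for work, 0 if it should fall back to io_uring_enter().
 */
static int short_wait(uint32_t *cq_tail, uint32_t seen_tail)
{
	/* Arm the monitor on the cache line holding tail and tail+4. */
	_umonitor(cq_tail);

	/* Re-check after arming to close the lost-wakeup window. */
	if (__atomic_load_n(cq_tail, __ATOMIC_ACQUIRE) != seen_tail ||
	    have_local_work())
		return 1;

	/*
	 * Sleep in C0.2 until the line is written or a time limit hits;
	 * _umwait() returns nonzero on a time limit (this deadline or the
	 * OS cap on a single UMWAIT), so a real implementation may loop.
	 */
	return !_umwait(0, __rdtsc() + 200000);
}

/*
 * Waker: any store that dirties the monitored line wakes the waiter;
 * writing the reserved word at tail+4 needs no atomic RMW on the tail.
 */
static void poke(uint32_t *cq_tail)
{
	__atomic_store_n(cq_tail + 1, 1, __ATOMIC_RELEASE);
}

The re-check between _umonitor and _umwait is the usual monitor/mwait
pattern: it closes the race against a write that lands after the last
poll but before the monitor is armed.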

Comments

Jens Axboe Sept. 17, 2019, 2:54 p.m. UTC | #1
On 9/17/19 3:13 AM, Avi Kivity wrote:
> In some applications, a thread waits for I/O events generated by
> the kernel, and also events generated by other threads in the same
> application. Typically events from other threads are passed using
> in-memory queues that are not known to the kernel. As long as the
> thread is active, it polls for both kernel completions and
> inter-thread completions; when it is idle, it tells the other threads
> to use an I/O event to wake it up (e.g. an eventfd or a pipe) and
> then enters the kernel, waiting for such an event or an ordinary
> I/O completion.
> 
> When such a thread goes idle, it typically spins for a while to
> avoid the kernel entry/exit cost in case an event is forthcoming
> shortly. While it spins it polls both I/O completions and
> inter-thread queues.
> 
> The x86 instruction pair UMONITOR/UMWAIT allows waiting for a cache
> line to be written to. This can be used with io_uring to wait for a
> wakeup without spinning (and wasting power and slowing down the other
> hyperthread). Other threads can also wake up the waiter by doing a
> safe write to the tail word (which triggers the wakeup), but safe
> writes are slow as they require an atomic instruction. To speed up
> those wakeups, reserve a word after the tail for user writes.
> 
> A thread consuming an io_uring completion queue can then use the
> following sequences:
> 
>    - while busy:
>      - pick up work from the completion queue and from other threads,
>        and process it
> 
>    - while idle:
>      - use UMONITOR/UMWAIT to wait on completions and notifications
>        from other threads for a short period
>      - if no work is picked up, let other threads know you will need
>        a kernel wakeup, and use io_uring_enter to wait indefinitely

This is cool, I like it. A few comments:

> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index cfb48bd088e1..4bd7905cee1d 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -77,12 +77,13 @@
>   
>   #define IORING_MAX_ENTRIES	4096
>   #define IORING_MAX_FIXED_FILES	1024
>   
>   struct io_uring {
> -	u32 head ____cacheline_aligned_in_smp;
> -	u32 tail ____cacheline_aligned_in_smp;
> +	u32 head ____cacheline_aligned;
> +	u32 tail ____cacheline_aligned;
> +	u32 reserved_for_user; // for cq ring and UMONITOR/UMWAIT (or similar) wakeups
>   };

Since we have that full cacheline, maybe name this one a bit more
appropriately as we can add others if we need it. Not a big deal.
But definitely use /* */ style comments :-)

> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 1e1652f25cc1..1a6a826a66f3 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -103,10 +103,14 @@ struct io_sqring_offsets {
>    */
>   #define IORING_SQ_NEED_WAKEUP	(1U << 0) /* needs io_uring_enter wakeup */
>   
>   struct io_cqring_offsets {
>   	__u32 head;
> +	// tail is guaranteed to be aligned on a cache line, and to have the
> +	// following __u32 free for user use. This allows using e.g.
> +	// UMONITOR/UMWAIT to wait on both writes to head and writes from
> +	// other threads to the following word.
>   	__u32 tail;
>   	__u32 ring_mask;
>   	__u32 ring_entries;
>   	__u32 overflow;
>   	__u32 cqes;

Ditto on the comments here.

Would be ideal if we could pair this with an example for liburing; a basic
test case would be fine. Something that shows how to use it, and verifies
that it works.

Also, this patch is against master; it should be against for-5.4/io_uring, as
it won't apply there right now.

Avi Kivity Sept. 17, 2019, 3:12 p.m. UTC | #2
On 17/09/2019 17.54, Jens Axboe wrote:
> On 9/17/19 3:13 AM, Avi Kivity wrote:
>> In some applications, a thread waits for I/O events generated by
>> the kernel, and also events generated by other threads in the same
>> application. Typically events from other threads are passed using
>> in-memory queues that are not known to the kernel. As long as the
>> thread is active, it polls for both kernel completions and
>> inter-thread completions; when it is idle, it tells the other threads
>> to use an I/O event to wake it up (e.g. an eventfd or a pipe) and
>> then enters the kernel, waiting for such an event or an ordinary
>> I/O completion.
>>
>> When such a thread goes idle, it typically spins for a while to
>> avoid the kernel entry/exit cost in case an event is forthcoming
>> shortly. While it spins it polls both I/O completions and
>> inter-thread queues.
>>
>> The x86 instruction pair UMONITOR/UMWAIT allows waiting for a cache
>> line to be written to. This can be used with io_uring to wait for a
>> wakeup without spinning (and wasting power and slowing down the other
>> hyperthread). Other threads can also wake up the waiter by doing a
>> safe write to the tail word (which triggers the wakeup), but safe
>> writes are slow as they require an atomic instruction. To speed up
>> those wakeups, reserve a word after the tail for user writes.
>>
>> A thread consuming an io_uring completion queue can then use the
>> following sequences:
>>
>>     - while busy:
>>       - pick up work from the completion queue and from other threads,
>>         and process it
>>
>>     - while idle:
>>       - use UMONITOR/UMWAIT to wait on completions and notifications
>>         from other threads for a short period
>>       - if no work is picked up, let other threads know you will need
>>         a kernel wakeup, and use io_uring_enter to wait indefinitely
> This is cool, I like it. A few comments:
>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index cfb48bd088e1..4bd7905cee1d 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -77,12 +77,13 @@
>>    
>>    #define IORING_MAX_ENTRIES	4096
>>    #define IORING_MAX_FIXED_FILES	1024
>>    
>>    struct io_uring {
>> -	u32 head ____cacheline_aligned_in_smp;
>> -	u32 tail ____cacheline_aligned_in_smp;
>> +	u32 head ____cacheline_aligned;
>> +	u32 tail ____cacheline_aligned;
>> +	u32 reserved_for_user; // for cq ring and UMONITOR/UMWAIT (or similar) wakeups
>>    };
> Since we have that full cacheline, maybe name this one a bit more
> appropriately as we can add others if we need it. Not a big deal.


You mean, name it for its intended purpose of serving as a write target 
for umonitor/umwait wakes?


Note that the user won't see the name, and that it's only accurate for 
an io_uring that's used for completions.


> But definitely use /* */ style comments :-)


Sorry, in C++-land for a while. You're lucky I didn't turn the whole 
thing into a virtual template something.


>
>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>> index 1e1652f25cc1..1a6a826a66f3 100644
>> --- a/include/uapi/linux/io_uring.h
>> +++ b/include/uapi/linux/io_uring.h
>> @@ -103,10 +103,14 @@ struct io_sqring_offsets {
>>     */
>>    #define IORING_SQ_NEED_WAKEUP	(1U << 0) /* needs io_uring_enter wakeup */
>>    
>>    struct io_cqring_offsets {
>>    	__u32 head;
>> +	// tail is guaranteed to be aligned on a cache line, and to have the
>> +	// following __u32 free for user use. This allows using e.g.
>> +	// UMONITOR/UMWAIT to wait on both writes to head and writes from
>> +	// other threads to the following word.
>>    	__u32 tail;
>>    	__u32 ring_mask;
>>    	__u32 ring_entries;
>>    	__u32 overflow;
>>    	__u32 cqes;
> Ditto on the comments here.


Sure.


> Would be ideal if we could pair this with an example for liburing; a basic
> test case would be fine. Something that shows how to use it, and verifies
> that it works.


I'll have to look for a machine with waitpkg for that.


> Also, this patch is against master; it should be against for-5.4/io_uring, as
> it won't apply there right now.


Sure, will rebase.
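
For reference, a rough sketch of the kind of liburing test case discussed
above (untested here, as it needs a WAITPKG machine): it pokes the
reserved word from a second thread through the ring's internal cq.ktail
pointer, since liburing has no accessor for the new word, and it must be
built with -mwaitpkg.

#include <cpuid.h>
#include <liburing.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <x86intrin.h>

static unsigned *user_word;	/* reserved __u32 at CQ tail + 4 */

static void *poker(void *arg)
{
	usleep(10 * 1000);	/* give the main thread time to start waiting */
	__atomic_store_n(user_word, 1, __ATOMIC_RELEASE);
	return NULL;
}

int main(void)
{
	unsigned a, b, c, d;
	struct io_uring ring;
	pthread_t t;

	/* WAITPKG is CPUID.(EAX=7,ECX=0):ECX bit 5. */
	if (!__get_cpuid_count(7, 0, &a, &b, &c, &d) || !(c & (1u << 5)))
		return 0;	/* skip on CPUs without UMONITOR/UMWAIT */

	if (io_uring_queue_init(8, &ring, 0))
		return 1;
	user_word = ring.cq.ktail + 1;

	pthread_create(&t, NULL, poker, NULL);

	/*
	 * A single UMWAIT may be cut short by the OS cap, so loop until
	 * the word changes or an overall deadline passes.
	 */
	unsigned long long deadline = __rdtsc() + (1ull << 32);
	while (!__atomic_load_n(user_word, __ATOMIC_ACQUIRE) &&
	       __rdtsc() < deadline) {
		_umonitor(ring.cq.ktail);
		if (__atomic_load_n(user_word, __ATOMIC_ACQUIRE))
			break;
		_umwait(0, deadline);
	}

	int woken = __atomic_load_n(user_word, __ATOMIC_ACQUIRE) != 0;
	pthread_join(t, NULL);
	io_uring_queue_exit(&ring);

	printf("%s\n", woken ? "woken by reserved-word write"
			     : "umwait deadline expired");
	return !woken;
}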

Patch

diff --git a/fs/io_uring.c b/fs/io_uring.c
index cfb48bd088e1..4bd7905cee1d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -77,12 +77,13 @@ 
 
 #define IORING_MAX_ENTRIES	4096
 #define IORING_MAX_FIXED_FILES	1024
 
 struct io_uring {
-	u32 head ____cacheline_aligned_in_smp;
-	u32 tail ____cacheline_aligned_in_smp;
+	u32 head ____cacheline_aligned;
+	u32 tail ____cacheline_aligned;
+	u32 reserved_for_user; // for cq ring and UMONITOR/UMWAIT (or similar) wakeups
 };
 
 /*
  * This data is shared with the application through the mmap at offset
  * IORING_OFF_SQ_RING.
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 1e1652f25cc1..1a6a826a66f3 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -103,10 +103,14 @@  struct io_sqring_offsets {
  */
 #define IORING_SQ_NEED_WAKEUP	(1U << 0) /* needs io_uring_enter wakeup */
 
 struct io_cqring_offsets {
 	__u32 head;
+	// tail is guaranteed to be aligned on a cache line, and to have the
+	// following __u32 free for user use. This allows using e.g.
+	// UMONITOR/UMWAIT to wait on both writes to head and writes from
+	// other threads to the following word.
 	__u32 tail;
 	__u32 ring_mask;
 	__u32 ring_entries;
 	__u32 overflow;
 	__u32 cqes;
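
For completeness, a hedged sketch of how userspace would find the
reserved word from the offsets above: set up a ring, mmap the CQ ring as
usual, and take the __u32 right after the tail. It assumes kernel headers
that define __NR_io_uring_setup and IORING_OFF_CQ_RING; error handling is
minimal.

#include <linux/io_uring.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	struct io_uring_params p;
	memset(&p, 0, sizeof(p));

	int fd = syscall(__NR_io_uring_setup, 16, &p);
	if (fd < 0)
		return 1;

	/* Map the CQ ring using the offsets the kernel filled in. */
	size_t cq_sz = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);
	void *cq = mmap(NULL, cq_sz, PROT_READ | PROT_WRITE,
			MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_CQ_RING);
	if (cq == MAP_FAILED)
		return 1;

	uint32_t *tail = (uint32_t *)((char *)cq + p.cq_off.tail);
	uint32_t *user_word = tail + 1;	/* the reserved __u32 at tail + 4 */

	/*
	 * tail is still read with acquire semantics as before; user_word
	 * is free for plain stores from other threads, e.g. to wake a
	 * UMWAIT that monitors this cache line.
	 */
	__atomic_store_n(user_word, 1, __ATOMIC_RELEASE);
	return 0;
}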