Message ID | 20190917091358.3652-1-avi@scylladb.com (mailing list archive) |
---|---|
State | New, archived |
Series | [v1] io_uring: reserve word at cqring tail+4 for the user |
On 9/17/19 3:13 AM, Avi Kivity wrote:
> In some applications, a thread waits for I/O events generated by
> the kernel, and also events generated by other threads in the same
> application. Typically events from other threads are passed using
> in-memory queues that are not known to the kernel. As long as the
> thread is active, it polls for both kernel completions and
> inter-thread completions; when it is idle, it tells the other threads
> to use an I/O event to wake it up (e.g. an eventfd or a pipe) and
> then enters the kernel, waiting for such an event or an ordinary
> I/O completion.
>
> When such a thread goes idle, it typically spins for a while to
> avoid the kernel entry/exit cost in case an event is forthcoming
> shortly. While it spins it polls both I/O completions and
> inter-thread queues.
>
> The x86 instruction pair UMONITOR/UMWAIT allows waiting for a cache
> line to be written to. This can be used with io_uring to wait for a
> wakeup without spinning (and wasting power and slowing down the other
> hyperthread). Other threads can also wake up the waiter by doing a
> safe write to the tail word (which triggers the wakeup), but safe
> writes are slow as they require an atomic instruction. To speed up
> those wakeups, reserve a word after the tail for user writes.
>
> A thread consuming an io_uring completion queue can then use the
> following sequences:
>
>   - while busy:
>     - pick up work from the completion queue and from other threads,
>       and process it
>
>   - while idle:
>     - use UMONITOR/UMWAIT to wait on completions and notifications
>       from other threads for a short period
>     - if no work is picked up, let other threads know you will need
>       a kernel wakeup, and use io_uring_enter to wait indefinitely

This is cool, I like it. A few comments:

> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index cfb48bd088e1..4bd7905cee1d 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -77,12 +77,13 @@
>
>  #define IORING_MAX_ENTRIES	4096
>  #define IORING_MAX_FIXED_FILES	1024
>
>  struct io_uring {
> -	u32 head ____cacheline_aligned_in_smp;
> -	u32 tail ____cacheline_aligned_in_smp;
> +	u32 head ____cacheline_aligned;
> +	u32 tail ____cacheline_aligned;
> +	u32 reserved_for_user; // for cq ring and UMONITOR/UMWAIT (or similar) wakeups
>  };

Since we have that full cacheline, maybe name this one a bit more
appropriately as we can add others if we need it. Not a big deal.
But definitely use /* */ style comments :-)

> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 1e1652f25cc1..1a6a826a66f3 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -103,10 +103,14 @@ struct io_sqring_offsets {
>   */
>  #define IORING_SQ_NEED_WAKEUP	(1U << 0) /* needs io_uring_enter wakeup */
>
>  struct io_cqring_offsets {
>  	__u32 head;
> +	// tail is guaranteed to be aligned on a cache line, and to have the
> +	// following __u32 free for user use. This allows using e.g.
> +	// UMONITOR/UMWAIT to wait on both writes to head and writes from
> +	// other threads to the following word.
>  	__u32 tail;
>  	__u32 ring_mask;
>  	__u32 ring_entries;
>  	__u32 overflow;
>  	__u32 cqes;

Ditto on the comments here.

Would be ideal if we could pair this with an example for liburing, a basic
test case would be fine. Something that shows how to use it, and verifies
that it works.

Also, this patch is against master, it should be against for-5.4/io_uring as
it won't apply there right now.
On 17/09/2019 17.54, Jens Axboe wrote:
> On 9/17/19 3:13 AM, Avi Kivity wrote:
>> In some applications, a thread waits for I/O events generated by
>> the kernel, and also events generated by other threads in the same
>> application. Typically events from other threads are passed using
>> in-memory queues that are not known to the kernel. As long as the
>> thread is active, it polls for both kernel completions and
>> inter-thread completions; when it is idle, it tells the other threads
>> to use an I/O event to wake it up (e.g. an eventfd or a pipe) and
>> then enters the kernel, waiting for such an event or an ordinary
>> I/O completion.
>>
>> When such a thread goes idle, it typically spins for a while to
>> avoid the kernel entry/exit cost in case an event is forthcoming
>> shortly. While it spins it polls both I/O completions and
>> inter-thread queues.
>>
>> The x86 instruction pair UMONITOR/UMWAIT allows waiting for a cache
>> line to be written to. This can be used with io_uring to wait for a
>> wakeup without spinning (and wasting power and slowing down the other
>> hyperthread). Other threads can also wake up the waiter by doing a
>> safe write to the tail word (which triggers the wakeup), but safe
>> writes are slow as they require an atomic instruction. To speed up
>> those wakeups, reserve a word after the tail for user writes.
>>
>> A thread consuming an io_uring completion queue can then use the
>> following sequences:
>>
>>   - while busy:
>>     - pick up work from the completion queue and from other threads,
>>       and process it
>>
>>   - while idle:
>>     - use UMONITOR/UMWAIT to wait on completions and notifications
>>       from other threads for a short period
>>     - if no work is picked up, let other threads know you will need
>>       a kernel wakeup, and use io_uring_enter to wait indefinitely
> This is cool, I like it. A few comments:
>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index cfb48bd088e1..4bd7905cee1d 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -77,12 +77,13 @@
>>
>>  #define IORING_MAX_ENTRIES	4096
>>  #define IORING_MAX_FIXED_FILES	1024
>>
>>  struct io_uring {
>> -	u32 head ____cacheline_aligned_in_smp;
>> -	u32 tail ____cacheline_aligned_in_smp;
>> +	u32 head ____cacheline_aligned;
>> +	u32 tail ____cacheline_aligned;
>> +	u32 reserved_for_user; // for cq ring and UMONITOR/UMWAIT (or similar) wakeups
>>  };
> Since we have that full cacheline, maybe name this one a bit more
> appropriately as we can add others if we need it. Not a big deal.

You mean, name it for its intended purpose of serving as a write target
for umonitor/umwait wakes? Note that the user won't see the name, and
that it's only accurate for an io_uring that's used for completions.

> But definitely use /* */ style comments :-)

Sorry, in C++-land for a while. You're lucky I didn't turn the whole
thing into a virtual template something.

>
>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>> index 1e1652f25cc1..1a6a826a66f3 100644
>> --- a/include/uapi/linux/io_uring.h
>> +++ b/include/uapi/linux/io_uring.h
>> @@ -103,10 +103,14 @@ struct io_sqring_offsets {
>>   */
>>  #define IORING_SQ_NEED_WAKEUP	(1U << 0) /* needs io_uring_enter wakeup */
>>
>>  struct io_cqring_offsets {
>>  	__u32 head;
>> +	// tail is guaranteed to be aligned on a cache line, and to have the
>> +	// following __u32 free for user use. This allows using e.g.
>> +	// UMONITOR/UMWAIT to wait on both writes to head and writes from
>> +	// other threads to the following word.
>>  	__u32 tail;
>>  	__u32 ring_mask;
>>  	__u32 ring_entries;
>>  	__u32 overflow;
>>  	__u32 cqes;
> Ditto on the comments here.

Sure.

> Would be ideal if we could pair this with an example for liburing, a basic
> test case would be fine. Something that shows how to use it, and verifies
> that it works.

I'll have to look for a machine with waitpkg for that.

> Also, this patch is against master, it should be against for-5.4/io_uring as
> it won't apply there right now.

Sure, will rebase.
diff --git a/fs/io_uring.c b/fs/io_uring.c
index cfb48bd088e1..4bd7905cee1d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -77,12 +77,13 @@
 
 #define IORING_MAX_ENTRIES	4096
 #define IORING_MAX_FIXED_FILES	1024
 
 struct io_uring {
-	u32 head ____cacheline_aligned_in_smp;
-	u32 tail ____cacheline_aligned_in_smp;
+	u32 head ____cacheline_aligned;
+	u32 tail ____cacheline_aligned;
+	u32 reserved_for_user; // for cq ring and UMONITOR/UMWAIT (or similar) wakeups
 };
 
 /*
  * This data is shared with the application through the mmap at offset
  * IORING_OFF_SQ_RING.
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 1e1652f25cc1..1a6a826a66f3 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -103,10 +103,14 @@ struct io_sqring_offsets {
  */
 #define IORING_SQ_NEED_WAKEUP	(1U << 0) /* needs io_uring_enter wakeup */
 
 struct io_cqring_offsets {
 	__u32 head;
+	// tail is guaranteed to be aligned on a cache line, and to have the
+	// following __u32 free for user use. This allows using e.g.
+	// UMONITOR/UMWAIT to wait on both writes to head and writes from
+	// other threads to the following word.
 	__u32 tail;
 	__u32 ring_mask;
 	__u32 ring_entries;
 	__u32 overflow;
 	__u32 cqes;
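
For illustration only, here is a minimal userspace sketch of the layout the diff above is aiming for, and of how an application might locate the reserved word. It is not part of the patch. The names io_uring_sketch, SKETCH_CACHELINE, cq_user_word and cq_ring_ptr are hypothetical, and a 64-byte cache line is assumed.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CACHELINE 64 /* assumed cache line size, for illustration */

/* Hypothetical userspace mirror of the patched struct io_uring. */
struct io_uring_sketch {
	alignas(SKETCH_CACHELINE) uint32_t head;
	alignas(SKETCH_CACHELINE) uint32_t tail;
	uint32_t reserved_for_user;
};

/* The reserved word is expected to sit directly after tail. */
static_assert(offsetof(struct io_uring_sketch, reserved_for_user) ==
	      offsetof(struct io_uring_sketch, tail) + sizeof(uint32_t),
	      "user word must sit at tail+4");

/*
 * cq_ring_ptr: base of the mmap'ed CQ ring (hypothetical parameter name).
 * tail_off:    the value reported in io_cqring_offsets.tail.
 * Returns the address of the word at tail+4.
 */
static inline uint32_t *cq_user_word(void *cq_ring_ptr, uint32_t tail_off)
{
	return (uint32_t *)((char *)cq_ring_ptr + tail_off + sizeof(uint32_t));
}
```

Because tail starts a cache line and the reserved word follows it immediately, a single monitored line covers both the kernel's tail updates and writes from other threads.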
In some applications, a thread waits for I/O events generated by
the kernel, and also events generated by other threads in the same
application. Typically events from other threads are passed using
in-memory queues that are not known to the kernel. As long as the
thread is active, it polls for both kernel completions and
inter-thread completions; when it is idle, it tells the other threads
to use an I/O event to wake it up (e.g. an eventfd or a pipe) and
then enters the kernel, waiting for such an event or an ordinary
I/O completion.

When such a thread goes idle, it typically spins for a while to
avoid the kernel entry/exit cost in case an event is forthcoming
shortly. While it spins it polls both I/O completions and
inter-thread queues.

The x86 instruction pair UMONITOR/UMWAIT allows waiting for a cache
line to be written to. This can be used with io_uring to wait for a
wakeup without spinning (and wasting power and slowing down the other
hyperthread). Other threads can also wake up the waiter by doing a
safe write to the tail word (which triggers the wakeup), but safe
writes are slow as they require an atomic instruction. To speed up
those wakeups, reserve a word after the tail for user writes.

A thread consuming an io_uring completion queue can then use the
following sequences:

  - while busy:
    - pick up work from the completion queue and from other threads,
      and process it

  - while idle:
    - use UMONITOR/UMWAIT to wait on completions and notifications
      from other threads for a short period
    - if no work is picked up, let other threads know you will need
      a kernel wakeup, and use io_uring_enter to wait indefinitely

Signed-off-by: Avi Kivity <avi@scylladb.com>
---
 fs/io_uring.c                 | 5 +++--
 include/uapi/linux/io_uring.h | 4 ++++
 2 files changed, 7 insertions(+), 2 deletions(-)
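
The idle sequence described in the commit message could look roughly like the sketch below. It is not taken from the patch or from liburing. The names cq_wait_line, notify_waiter, poll_once and wait_short are hypothetical, and the struct merely stands in for the tail cache line of a real mmap'ed CQ ring. The UMONITOR/UMWAIT path is compiled only when the compiler defines __WAITPKG__ (e.g. with -mwaitpkg), and it assumes the CPU and OS allow those instructions; otherwise the loop degrades to a bounded spin. The deadline of 100000 TSC ticks is an arbitrary placeholder.

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>
#if defined(__WAITPKG__)
#include <x86intrin.h>
#endif

/* Stand-in for the CQ ring's tail cache line (hypothetical type). */
struct cq_wait_line {
	alignas(64) _Atomic uint32_t tail; /* advanced by the kernel on completion */
	_Atomic uint32_t user_word;        /* word at tail+4, written by peer threads */
};

enum wake_reason { WAKE_NONE = 0, WAKE_CQ = 1, WAKE_PEER = 2 };

/* Peer thread: a plain release store, no atomic read-modify-write. */
static void notify_waiter(struct cq_wait_line *l)
{
	atomic_store_explicit(&l->user_word, 1, memory_order_release);
}

/* Report which sources have work pending; does not clear anything. */
static int poll_once(struct cq_wait_line *l, uint32_t seen_tail)
{
	int r = WAKE_NONE;

	if (atomic_load_explicit(&l->tail, memory_order_acquire) != seen_tail)
		r |= WAKE_CQ;
	if (atomic_load_explicit(&l->user_word, memory_order_acquire) != 0)
		r |= WAKE_PEER;
	return r;
}

/*
 * Wait for a short, bounded period. A WAKE_NONE result means the caller
 * should tell its peers it needs a kernel wakeup and block in
 * io_uring_enter instead.
 */
static int wait_short(struct cq_wait_line *l, uint32_t seen_tail,
		      unsigned max_rounds)
{
	for (unsigned i = 0; i < max_rounds; i++) {
#if defined(__WAITPKG__)
		_umonitor(l); /* arm monitoring of the tail cache line */
#endif
		/* Re-check after arming so a write in between is not missed. */
		int r = poll_once(l, seen_tail);
		if (r)
			return r;
#if defined(__WAITPKG__)
		/* Sleep until the line is written or the TSC deadline passes. */
		(void)_umwait(0, __rdtsc() + 100000);
#endif
	}
	return WAKE_NONE;
}
```

How user_word is cleared again, and how peers learn that the waiter has gone to sleep in the kernel, is left to the application's own inter-thread protocol and is not shown here.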