Message ID | 20250128133927.3989681-1-max.kellermann@ionos.com (mailing list archive) |
---|---|
Series | Various io_uring micro-optimizations (reducing lock contention) |
On 1/28/25 6:39 AM, Max Kellermann wrote:
> While optimizing my io_uring-based web server, I found that the kernel
> spends 35% of the CPU time waiting for `io_wq_acct.lock`. This patch
> set reduces contention on this lock, though I believe much more should
> be done in order to allow more worker concurrency.
>
> I measured these patches with my HTTP server (serving static files and
> running a tiny PHP script) and with a micro-benchmark that submits
> millions of `IORING_OP_NOP` entries (with `IOSQE_ASYNC` to force
> offloading the operation to a worker, so this offload overhead can be
> measured).
>
> Some of the optimizations eliminate memory accesses, e.g. by passing
> values that are already known to (inlined) functions and by caching
> values in local variables. These are useful optimizations, but they
> are too small to measure in a benchmark (too much noise).
>
> Some of the patches have a measurable effect, and they contain
> benchmark numbers that I could reproduce in repeated runs despite the
> noise.
>
> I'm not confident about the correctness of the last patch ("io_uring:
> skip redundant poll wakeups"). This seemed like low-hanging fruit, so
> low that it seemed suspicious to me. If this is a useful optimization,
> the idea could probably be ported to other wait_queue users, or even
> into the wait_queue library. What I'm not confident about is whether
> the optimization is valid or whether it may miss wakeups, leading to
> stalls. Please advise!

That last patch is the only one that needs a bit more checking, so I'd
suggest we just ignore that one for now. We're in the middle of the
merge window, so all of this would have to wait for the 6.15 merge
window anyway - iow, plenty of time.

The other patches look pretty straightforward to me. Only thing that
has me puzzled a bit is why you have so much io-wq activity with your
application; in general I'd expect 0 activity there. But then I saw the
forced ASYNC flag, and it makes sense. In general, forcing that isn't a
great idea, but for a benchmark for io-wq it certainly makes sense.

I'll apply 1-7 once 6.14-rc1 is out and I can kick off a
for-6.15/io_uring branch. Thanks!
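For illustration, a minimal liburing sketch of the kind of micro-benchmark the cover letter describes - this is not the author's code, and the queue depth and request count are arbitrary assumptions:

```c
/* Minimal sketch (not the author's benchmark): submit IORING_OP_NOP
 * requests with IOSQE_ASYNC so each one is punted to an io-wq worker,
 * which makes the offload overhead visible. Requires liburing. */
#include <liburing.h>
#include <stdio.h>

#define QUEUE_DEPTH 256        /* arbitrary assumption */
#define TOTAL_NOPS  (1 << 20)  /* arbitrary assumption */

int main(void)
{
    struct io_uring ring;
    unsigned done = 0;

    if (io_uring_queue_init(QUEUE_DEPTH, &ring, 0) < 0)
        return 1;

    while (done < TOTAL_NOPS) {
        struct io_uring_sqe *sqe;
        unsigned batch = 0;

        /* fill the SQ ring */
        while ((sqe = io_uring_get_sqe(&ring)) != NULL) {
            io_uring_prep_nop(sqe);
            io_uring_sqe_set_flags(sqe, IOSQE_ASYNC); /* force io-wq offload */
            batch++;
        }

        io_uring_submit(&ring);

        /* drain the completions of this batch */
        for (unsigned i = 0; i < batch; i++) {
            struct io_uring_cqe *cqe;

            if (io_uring_wait_cqe(&ring, &cqe) < 0)
                return 1;
            io_uring_cqe_seen(&ring, cqe);
        }
        done += batch;
    }

    io_uring_queue_exit(&ring);
    printf("completed %u NOP requests\n", done);
    return 0;
}
```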
On Wed, Jan 29, 2025 at 6:19 PM Jens Axboe <axboe@kernel.dk> wrote:
> The other patches look pretty straightforward to me. Only thing that
> has me puzzled a bit is why you have so much io-wq activity with your
> application; in general I'd expect 0 activity there. But then I saw the
> forced ASYNC flag, and it makes sense. In general, forcing that isn't a
> great idea, but for a benchmark for io-wq it certainly makes sense.

I was experimenting with io_uring and wanted to see how much
performance I can squeeze out of my web server running single-threaded.
The overhead of io_uring_submit() grew very large, because the "send"
operation would do a lot of synchronous work in the kernel. I tried
SQPOLL, but it was actually a big performance regression; this just
shifted my CPU usage to epoll_wait(). Forcing ASYNC gave me large
throughput improvements (moving the submission overhead to iowq), but
then the iowq lock contention was the next limit, thus this patch
series.

I'm still experimenting, and I will certainly revisit SQPOLL to learn
more about why it didn't help and how to fix it.

> I'll apply 1-7 once 6.14-rc1 is out and I can kick off a
> for-6.15/io_uring branch. Thanks!

Thanks Jens, and please let me know when you're ready to discuss the
last patch. It's a big improvement for those who combine io_uring with
epoll, so it's worth it.

Max
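As an aside, the two approaches Max mentions - an SQPOLL ring and forcing sends onto io-wq with IOSQE_ASYNC - look roughly like this with liburing; the helper names and parameters here are illustrative, not taken from his server:

```c
#include <liburing.h>
#include <string.h>

/* Variant 1: an SQPOLL ring - a kernel thread polls the SQ ring, so
 * io_uring_submit() can usually avoid the submission syscall. */
static int setup_sqpoll_ring(struct io_uring *ring)
{
    struct io_uring_params p;

    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_SQPOLL;
    p.sq_thread_idle = 2000;    /* ms before the SQ thread idles; illustrative */

    return io_uring_queue_init_params(256, ring, &p);
}

/* Variant 2: force a send onto io-wq with IOSQE_ASYNC so the submitting
 * thread does not perform the synchronous part of the send inline. */
static void queue_async_send(struct io_uring *ring, int fd,
                             const void *buf, size_t len)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    if (!sqe)
        return;  /* SQ ring full; a real server would submit and retry */

    io_uring_prep_send(sqe, fd, buf, len, 0);
    io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);
}
```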
On 1/29/25 10:39 AM, Max Kellermann wrote:
> On Wed, Jan 29, 2025 at 6:19 PM Jens Axboe <axboe@kernel.dk> wrote:
>> The other patches look pretty straightforward to me. Only thing that
>> has me puzzled a bit is why you have so much io-wq activity with your
>> application; in general I'd expect 0 activity there. But then I saw the
>> forced ASYNC flag, and it makes sense. In general, forcing that isn't a
>> great idea, but for a benchmark for io-wq it certainly makes sense.
>
> I was experimenting with io_uring and wanted to see how much
> performance I can squeeze out of my web server running single-threaded.
> The overhead of io_uring_submit() grew very large, because the "send"
> operation would do a lot of synchronous work in the kernel. I tried
> SQPOLL, but it was actually a big performance regression; this just
> shifted my CPU usage to epoll_wait(). Forcing ASYNC gave me large
> throughput improvements (moving the submission overhead to iowq), but
> then the iowq lock contention was the next limit, thus this patch
> series.
>
> I'm still experimenting, and I will certainly revisit SQPOLL to learn
> more about why it didn't help and how to fix it.

Why are you combining it with epoll in the first place? It's a lot more
efficient to wait on one or multiple events in io_uring_enter() rather
than go back to a serialized one-event-per-notification model by using
epoll to wait on completions on the io_uring side.
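The batched-wait pattern Jens refers to could look something like the following liburing sketch; the batch size and helper name are assumptions:

```c
#include <liburing.h>

/* Sketch of the pattern Jens describes: block once in io_uring_enter()
 * (via io_uring_submit_and_wait) and then reap a whole batch of CQEs,
 * instead of taking one readiness notification at a time from epoll. */
static void submit_and_reap(struct io_uring *ring)
{
    struct io_uring_cqe *cqes[64];   /* batch size is an illustrative choice */
    unsigned count;

    /* submit pending SQEs and wait until at least one completion arrives */
    io_uring_submit_and_wait(ring, 1);

    /* reap everything already posted, with no further syscalls */
    count = io_uring_peek_batch_cqe(ring, cqes, 64);
    for (unsigned i = 0; i < count; i++) {
        /* handle_completion(cqes[i]); -- application-specific */
    }
    io_uring_cq_advance(ring, count);
}
```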
On Wed, Jan 29, 2025 at 6:45 PM Jens Axboe <axboe@kernel.dk> wrote:
> Why are you combining it with epoll in the first place? It's a lot more
> efficient to wait on one or multiple events in io_uring_enter() rather
> than go back to a serialized one-event-per-notification model by using
> epoll to wait on completions on the io_uring side.

Yes, I wish I could do that, but that only works if everything is
io_uring - all or nothing. Most of the code is built around an
epoll-based loop and will not be ported to io_uring so quickly. Maybe
what's missing is epoll_wait as an io_uring opcode; then I could wrap
it the other way around. Or am I supposed to use io_uring
poll_add_multishot for that?

Max
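One hedged sketch of "wrapping it the other way": arm a multishot poll on the epoll file descriptor itself and drain it with a zero-timeout epoll_wait() whenever the ring reports it readable. The helper names are hypothetical and this is not something proposed in the thread:

```c
#include <liburing.h>
#include <sys/epoll.h>
#include <poll.h>

/* Drive an existing epoll set from io_uring: a multishot poll on the
 * epoll fd keeps generating CQEs whenever the epoll set has events. */
static void arm_epoll_poll(struct io_uring *ring, int epfd)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    if (!sqe)
        return;  /* SQ ring full in this sketch; real code would retry */

    io_uring_prep_poll_multishot(sqe, epfd, POLLIN);
    io_uring_submit(ring);
}

/* Called when a CQE for the epoll fd arrives. */
static void on_epoll_ready(struct io_uring *ring, struct io_uring_cqe *cqe,
                           int epfd)
{
    struct epoll_event events[64];
    int n = epoll_wait(epfd, events, 64, 0);  /* non-blocking drain */

    for (int i = 0; i < n; i++) {
        /* dispatch to the existing epoll-based handlers */
    }

    /* if the kernel dropped the multishot poll, re-arm it */
    if (!(cqe->flags & IORING_CQE_F_MORE))
        arm_epoll_poll(ring, epfd);
}
```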
On 1/29/25 17:39, Max Kellermann wrote:
> On Wed, Jan 29, 2025 at 6:19 PM Jens Axboe <axboe@kernel.dk> wrote:
>> The other patches look pretty straightforward to me. Only thing that
>> has me puzzled a bit is why you have so much io-wq activity with your
>> application; in general I'd expect 0 activity there. But then I saw the
>> forced ASYNC flag, and it makes sense. In general, forcing that isn't a
>> great idea, but for a benchmark for io-wq it certainly makes sense.
>
> I was experimenting with io_uring and wanted to see how much
> performance I can squeeze out of my web server running single-threaded.
> The overhead of io_uring_submit() grew very large, because the "send"
> operation would do a lot of synchronous work in the kernel. I tried
> SQPOLL, but it was actually a big performance regression; this just
> shifted my CPU usage to epoll_wait(). Forcing ASYNC gave me large
> throughput improvements (moving the submission overhead to iowq), but
> then the iowq lock contention was the next limit, thus this patch
> series.

It's great to see iowq getting some optimisations, but note that it
wouldn't be fair to compare it to single-threaded peers when you have a
lot of iowq activity, as it might be occupying multiple CPUs. A curious
open question is whether it'd be more performant to have several user
threads with their own private rings.

> I'm still experimenting, and I will certainly revisit SQPOLL to learn
> more about why it didn't help and how to fix it.

It's wasteful unless you saturate it close to 100%, and then you
usually have SQPOLL on a separate CPU from the user task submitting
requests, and so it'd take some cache bouncing. It's not a silver
bullet.
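Pavel's open question about private per-thread rings could be prototyped along these lines; whether to also share the io-wq backend via IORING_SETUP_ATTACH_WQ (as sketched here) is a trade-off, since sharing keeps a single worker pool but also keeps its locks shared. Names and sizes are illustrative:

```c
#include <liburing.h>
#include <string.h>

/* Sketch: each user thread gets its own ring; optionally attach to the
 * first ring's io-wq so the worker pool is shared instead of duplicated.
 * Illustrative only, not a recommendation from the thread. */
static int init_thread_ring(struct io_uring *ring,
                            const struct io_uring *shared)
{
    struct io_uring_params p;

    memset(&p, 0, sizeof(p));
    if (shared) {
        p.flags = IORING_SETUP_ATTACH_WQ;  /* reuse the existing io-wq */
        p.wq_fd = shared->ring_fd;
    }
    return io_uring_queue_init_params(256, ring, &p);
}
```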
On Wed, Jan 29, 2025 at 8:30 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> It's great to see iowq getting some optimisations, but note that it
> wouldn't be fair to compare it to single-threaded peers when you have a
> lot of iowq activity, as it might be occupying multiple CPUs.

True. Fully loaded with the benchmark, I see 400-600% CPU usage on my
process (30-40% of that being spinlock contention). I wanted to explore
how far I can get with a single (userspace) thread, and leave the dirty
thread-sync work to the kernel.

> It's wasteful unless you saturate it close to 100%, and then you
> usually have SQPOLL on a separate CPU from the user task submitting
> requests, and so it'd take some cache bouncing. It's not a silver
> bullet.

Of course, memory latency always bites us in the end. But this isn't
the endgame just yet; we still have a lot of potential for
optimizations.