[0/8] Various io_uring micro-optimizations (reducing lock contention)

Message ID 20250128133927.3989681-1-max.kellermann@ionos.com (mailing list archive)

Message

Max Kellermann Jan. 28, 2025, 1:39 p.m. UTC
While optimizing my io_uring-based web server, I found that the kernel
spends 35% of its CPU time waiting for `io_wq_acct.lock`.  This patch
set reduces contention on this lock, though I believe much more should
be done to allow more worker concurrency.

I measured these patches with my HTTP server (serving static files and
running a tiny PHP script) and with a micro-benchmark that submits
millions of `IORING_OP_NOP` entries (with `IOSQE_ASYNC` to force
offloading the operation to a worker, so this offload overhead can be
measured).
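
For reference, the `IORING_OP_NOP` micro-benchmark boils down to roughly
the following liburing loop (a simplified sketch, not the exact code I
ran; queue depth and iteration count are arbitrary):

    #include <liburing.h>

    int main(void)
    {
        struct io_uring ring;
        struct io_uring_cqe *cqe;
        unsigned long done = 0;

        io_uring_queue_init(256, &ring, 0);

        while (done < 1000000) {
            struct io_uring_sqe *sqe;
            unsigned submitted = 0;

            /* fill the SQ ring with NOPs, forced to io-wq via IOSQE_ASYNC */
            while ((sqe = io_uring_get_sqe(&ring)) != NULL) {
                io_uring_prep_nop(sqe);
                sqe->flags |= IOSQE_ASYNC;
                ++submitted;
            }

            io_uring_submit(&ring);

            /* reap all completions before refilling the ring */
            for (unsigned i = 0; i < submitted; ++i) {
                io_uring_wait_cqe(&ring, &cqe);
                io_uring_cqe_seen(&ring, cqe);
                ++done;
            }
        }

        io_uring_queue_exit(&ring);
        return 0;
    }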

Some of the optimizations eliminate memory accesses, e.g. by passing
values that are already known to (inlined) functions and by caching
values in local variables.  These are useful optimizations, but their
effect is too small to measure in a benchmark (too much noise).
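
To illustrate the kind of change (an invented example, not the actual
io-wq code; the real patches do this for things like `work->flags` and
the `io_wq_acct` pointer):

    #include <stdatomic.h>
    #include <stdbool.h>

    /* invented example; the real code deals with io_wq_work flags */
    struct work {
        atomic_uint flags;
    };

    #define WORK_HASHED (1u << 0)
    #define WORK_CANCEL (1u << 1)

    /* before: every test re-loads the atomic flags word from memory */
    static bool must_cancel_slow(struct work *w)
    {
        return (atomic_load(&w->flags) & WORK_HASHED) &&
               (atomic_load(&w->flags) & WORK_CANCEL);
    }

    /* after: load the word once and test the cached local copy */
    static bool must_cancel_fast(struct work *w)
    {
        unsigned int flags = atomic_load(&w->flags);

        return (flags & WORK_HASHED) && (flags & WORK_CANCEL);
    }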

Some of the patches have a measurable effect and they contain
benchmark numbers that I could reproduce in repeated runs, despite the
noise.

I'm not confident about the correctness of the last patch ("io_uring:
skip redundant poll wakeups").  This seemed like low-hanging fruit, so
low that it seemed suspicious to me.  If this is a useful
optimization, the idea could probably be ported to other wait_queue
users, or even into the wait_queue library.  What I'm not confident
about is whether the optimization is valid or whether it may miss
wakeups, leading to stalls.  Please advise!
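
The concept, very roughly (a userspace sketch with invented names, not
the kernel code; the patch applies the idea to io_uring's poll-wakeup
path):

    #include <stdatomic.h>
    #include <stdbool.h>

    /* invented names; userspace sketch of "skip redundant wakeups" */
    struct waiter_ctx {
        atomic_bool wakeup_pending; /* set by the waker, cleared by the waiter */
        /* the real structure would also contain the wait queue itself */
    };

    static void maybe_wake(struct waiter_ctx *ctx)
    {
        /* Only the first waker since the last wakeup does the real work;
         * concurrent wakers find the flag already set and return early,
         * skipping the wait-queue lock entirely.
         */
        if (atomic_exchange(&ctx->wakeup_pending, true))
            return;

        /* ... wake_up() on the wait queue would go here ... */
    }

    static void waiter_prepare(struct waiter_ctx *ctx)
    {
        /* The waiter must clear the flag *before* re-checking its wait
         * condition; otherwise a wakeup racing with that check could be
         * lost - exactly the stall scenario I'm worried about.
         */
        atomic_store(&ctx->wakeup_pending, false);
    }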

Total "perf diff" for `IORING_OP_NOP`:

    42.25%     -9.24%  [kernel.kallsyms]     [k] queued_spin_lock_slowpath
     4.79%     +2.83%  [kernel.kallsyms]     [k] io_worker_handle_work
     7.23%     -1.41%  [kernel.kallsyms]     [k] io_wq_submit_work
     6.80%     +1.23%  [kernel.kallsyms]     [k] io_wq_free_work
     3.19%     +1.10%  [kernel.kallsyms]     [k] io_req_task_complete
     2.45%     +0.94%  [kernel.kallsyms]     [k] try_to_wake_up
               +0.81%  [kernel.kallsyms]     [k] io_acct_activate_free_worker
     0.79%     +0.64%  [kernel.kallsyms]     [k] __schedule

Serving static files over HTTP (send+receive on local+TCP, splice
file->pipe->TCP):

    42.92%     -7.84%  [kernel.kallsyms]     [k] queued_spin_lock_slowpath
     1.53%     -1.51%  [kernel.kallsyms]     [k] ep_poll_callback
     1.18%     +1.49%  [kernel.kallsyms]     [k] io_wq_free_work
     0.61%     +0.60%  [kernel.kallsyms]     [k] try_to_wake_up
     0.76%     -0.43%  [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
     2.22%     -0.33%  [kernel.kallsyms]     [k] io_wq_submit_work

Running a PHP script (send+receive on local+TCP, splice pipe->TCP):

    33.01%     -4.13%  [kernel.kallsyms]     [k] queued_spin_lock_slowpath
     1.57%     -1.56%  [kernel.kallsyms]     [k] ep_poll_callback
     1.36%     +1.19%  [kernel.kallsyms]     [k] io_wq_free_work
     0.94%     -0.61%  [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
     2.56%     -0.36%  [kernel.kallsyms]     [k] io_wq_submit_work
     2.06%     +0.36%  [kernel.kallsyms]     [k] io_worker_handle_work
     1.00%     +0.35%  [kernel.kallsyms]     [k] try_to_wake_up

(The `IORING_OP_NOP` benchmark finishes after a hardcoded number of
operations; the two HTTP benchmarks run for a fixed wallclock duration,
so with the patches applied, more HTTP requests were handled in the
same time.)

Max Kellermann (8):
  io_uring/io-wq: eliminate redundant io_work_get_acct() calls
  io_uring/io-wq: add io_worker.acct pointer
  io_uring/io-wq: move worker lists to struct io_wq_acct
  io_uring/io-wq: cache work->flags in variable
  io_uring/io-wq: do not use bogus hash value
  io_uring/io-wq: pass io_wq to io_get_next_work()
  io_uring: cache io_kiocb->flags in variable
  io_uring: skip redundant poll wakeups

 include/linux/io_uring_types.h |  10 ++
 io_uring/io-wq.c               | 230 +++++++++++++++++++--------------
 io_uring/io-wq.h               |   7 +-
 io_uring/io_uring.c            |  63 +++++----
 io_uring/io_uring.h            |   2 +-
 5 files changed, 187 insertions(+), 125 deletions(-)

Comments

Jens Axboe Jan. 29, 2025, 5:18 p.m. UTC | #1
On 1/28/25 6:39 AM, Max Kellermann wrote:
> While optimizing my io_uring-based web server, I found that the kernel
> spends 35% of its CPU time waiting for `io_wq_acct.lock`.  This patch
> set reduces contention on this lock, though I believe much more should
> be done to allow more worker concurrency.
> 
> I measured these patches with my HTTP server (serving static files and
> running a tiny PHP script) and with a micro-benchmark that submits
> millions of `IORING_OP_NOP` entries (with `IOSQE_ASYNC` to force
> offloading the operation to a worker, so this offload overhead can be
> measured).
> 
> Some of the optimizations eliminate memory accesses, e.g. by passing
> values that are already known to (inlined) functions and by caching
> values in local variables.  These are useful optimizations, but their
> effect is too small to measure in a benchmark (too much noise).
> 
> Some of the patches have a measurable effect and they contain
> benchmark numbers that I could reproduce in repeated runs, despite the
> noise.
> 
> I'm not confident about the correctness of the last patch ("io_uring:
> skip redundant poll wakeups").  This seemed like low-hanging fruit, so
> low that it seemed suspicious to me.  If this is a useful
> optimization, the idea could probably be ported to other wait_queue
> users, or even into the wait_queue library.  What I'm not confident
> about is whether the optimization is valid or whether it may miss
> wakeups, leading to stalls.  Please advise!

That last patch is the only one that needs a bit more checking, so I'd
suggest we just ignore it for now. We're in the middle of the merge
window anyway, so all of this would have to wait for the 6.15 merge
window - iow, plenty of time.

The other patches look pretty straightforward to me. The only thing that
has me puzzled a bit is why you have so much io-wq activity with your
application; in general I'd expect 0 activity there. But then I saw the
forced ASYNC flag, and it makes sense. In general, forcing that isn't a
great idea, but for an io-wq benchmark it certainly makes sense.

I'll apply 1-7 once 6.14-rc1 is out and I can kick off a
for-6.15/io_uring branch. Thanks!
Max Kellermann Jan. 29, 2025, 5:39 p.m. UTC | #2
On Wed, Jan 29, 2025 at 6:19 PM Jens Axboe <axboe@kernel.dk> wrote:
> The other patches look pretty straightforward to me. The only thing that
> has me puzzled a bit is why you have so much io-wq activity with your
> application; in general I'd expect 0 activity there. But then I saw the
> forced ASYNC flag, and it makes sense. In general, forcing that isn't a
> great idea, but for an io-wq benchmark it certainly makes sense.

I was experimenting with io_uring and wanted to see how much
performance I can squeeze out of my web server running
single-threaded. The overhead of io_uring_submit() grew very large,
because the "send" operation would do a lot of synchronous work in the
kernel. I tried SQPOLL but it was actually a big performance
regression; this just shifted my CPU usage to epoll_wait(). Forcing
ASYNC gave me large throughput improvements (moving the submission
overhead to iowq), but then the iowq lock contention was the next
limit, thus this patch series.
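
For illustration, forcing the offload is just a flag on the SQE; a
liburing-style sketch, not my actual server code:

    #include <liburing.h>
    #include <sys/socket.h>

    /* queue a send and force it to io-wq instead of letting
     * io_uring_submit() try it inline first (error handling omitted)
     */
    static void queue_async_send(struct io_uring *ring, int sockfd,
                                 const void *buf, size_t len)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        io_uring_prep_send(sqe, sockfd, buf, len, MSG_NOSIGNAL);
        sqe->flags |= IOSQE_ASYNC;
    }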

I'm still experimenting, and I will certainly revisit SQPOLL to learn
more about why it didn't help and how to fix it.

> I'll apply 1-7 once 6.14-rc1 is out and I can kick off a
> for-6.15/io_uring branch. Thanks!

Thanks Jens, and please let me know when you're ready to discuss the
last patch. It's a big improvement for those who combine io_uring with
epoll; it's worth it.

Max
Jens Axboe Jan. 29, 2025, 5:45 p.m. UTC | #3
On 1/29/25 10:39 AM, Max Kellermann wrote:
> On Wed, Jan 29, 2025 at 6:19 PM Jens Axboe <axboe@kernel.dk> wrote:
>> The other patches look pretty straightforward to me. The only thing that
>> has me puzzled a bit is why you have so much io-wq activity with your
>> application; in general I'd expect 0 activity there. But then I saw the
>> forced ASYNC flag, and it makes sense. In general, forcing that isn't a
>> great idea, but for an io-wq benchmark it certainly makes sense.
> 
> I was experimenting with io_uring and wanted to see how much
> performance I can squeeze out of my web server running
> single-threaded. The overhead of io_uring_submit() grew very large,
> because the "send" operation would do a lot of synchronous work in the
> kernel. I tried SQPOLL but it was actually a big performance
> regression; this just shifted my CPU usage to epoll_wait(). Forcing
> ASYNC gave me large throughput improvements (moving the submission
> overhead to iowq), but then the iowq lock contention was the next
> limit, thus this patch series.
> 
> I'm still experimenting, and I will certainly revisit SQPOLL to learn
> more about why it didn't help and how to fix it.

Why are you combining it with epoll in the first place? It's a lot more
efficient to wait on one or more events in io_uring_enter() rather than
go back to serialized one-event-per-notification handling by using epoll
to wait on completions on the io_uring side.
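
I.e., something like this (illustrative liburing sketch, error handling
omitted):

    #include <liburing.h>

    /* wait for completions inside io_uring_enter() itself and reap them
     * in a batch, instead of epoll-ing the ring fd one notification at
     * a time
     */
    static void submit_and_reap(struct io_uring *ring)
    {
        struct io_uring_cqe *cqe;
        unsigned head, seen = 0;

        io_uring_submit_and_wait(ring, 1); /* block until at least one CQE */

        io_uring_for_each_cqe(ring, head, cqe) {
            /* handle_completion(cqe); */
            seen++;
        }

        io_uring_cq_advance(ring, seen);
    }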
Max Kellermann Jan. 29, 2025, 6:01 p.m. UTC | #4
On Wed, Jan 29, 2025 at 6:45 PM Jens Axboe <axboe@kernel.dk> wrote:
> Why are you combining it with epoll in the first place? It's a lot more
> efficient to wait on one or more events in io_uring_enter() rather than
> go back to serialized one-event-per-notification handling by using epoll
> to wait on completions on the io_uring side.

Yes, I wish I could do that, but that works only if everything is
io_uring - all or nothing. Most of the code is built around an
epoll-based loop and will not be ported to io_uring so quickly.

Maybe what's missing is epoll_wait as an io_uring opcode. Then I could
wrap it the other way. Or am I supposed to use io_uring
poll_add_multishot for that?
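
Something like this is what I have in mind for the multishot route
(untested sketch; function names are just illustrative):

    #include <liburing.h>
    #include <sys/epoll.h>
    #include <poll.h>

    /* let io_uring tell me when the epoll fd is readable... */
    static void watch_epoll_fd(struct io_uring *ring, int epfd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        io_uring_prep_poll_multishot(sqe, epfd, POLLIN);
        io_uring_submit(ring);
    }

    /* ...then drain it without blocking and dispatch to the existing
     * epoll-based loop
     */
    static void drain_epoll_fd(int epfd)
    {
        struct epoll_event events[64];
        int n = epoll_wait(epfd, events, 64, 0 /* don't block */);

        for (int i = 0; i < n; i++) {
            /* dispatch events[i] to the existing event handlers */
        }
    }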

Max
Pavel Begunkov Jan. 29, 2025, 7:30 p.m. UTC | #5
On 1/29/25 17:39, Max Kellermann wrote:
> On Wed, Jan 29, 2025 at 6:19 PM Jens Axboe <axboe@kernel.dk> wrote:
>> The other patches look pretty straightforward to me. The only thing that
>> has me puzzled a bit is why you have so much io-wq activity with your
>> application; in general I'd expect 0 activity there. But then I saw the
>> forced ASYNC flag, and it makes sense. In general, forcing that isn't a
>> great idea, but for an io-wq benchmark it certainly makes sense.
> 
> I was experimenting with io_uring and wanted to see how much
> performance I can squeeze out of my web server running
> single-threaded. The overhead of io_uring_submit() grew very large,
> because the "send" operation would do a lot of synchronous work in the
> kernel. I tried SQPOLL but it was actually a big performance
> regression; this just shifted my CPU usage to epoll_wait(). Forcing
> ASYNC gave me large throughput improvements (moving the submission
> overhead to iowq), but then the iowq lock contention was the next
> limit, thus this patch series.

It's great to see iowq getting some optimisations, but note that
it wouldn't be fair to compare it to single-threaded peers when
you have a lot of iowq activity, as it might be occupying multiple
CPUs. A curious open question is whether it'd be more performant
to have several user threads with their private rings.

> I'm still experimenting, and I will certainly revisit SQPOLL to learn
> more about why it didn't help and how to fix it.

It's wasteful unless you saturate it close to 100%, and even then
SQPOLL usually runs on a separate CPU from the user task submitting
requests, and so it'd incur some cache bouncing. It's not a silver
bullet.
Max Kellermann Jan. 29, 2025, 7:43 p.m. UTC | #6
On Wed, Jan 29, 2025 at 8:30 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> It's great to see iowq getting some optimisations, but note that
> it wouldn't be fair to compare it to single-threaded peers when
> you have a lot of iowq activity, as it might be occupying multiple
> CPUs.

True. Fully loaded with the benchmark, I see 400%-600% CPU usage on my
process (30-40% of that being spinlock contention).
I wanted to explore how far I can get with a single (userspace)
thread, and leave the dirty thread-sync work to the kernel.

> It's wasteful unless you saturate it close to 100%, and even then
> SQPOLL usually runs on a separate CPU from the user task submitting
> requests, and so it'd incur some cache bouncing. It's not a silver
> bullet.

Of course, memory latency always bites us in the end. But this isn't
the endgame just yet; we still have a lot of potential for
optimizations.