[PATCHSET,v3,0/9] Improve MSG_RING DEFER_TASKRUN performance

Message ID 20240605141933.11975-1-axboe@kernel.dk (mailing list archive)

Jens Axboe June 5, 2024, 1:51 p.m. UTC
Hi,

For v1 and replies to that and tons of perf measurements, go here:

https://lore.kernel.org/io-uring/3d553205-0fe2-482e-8d4c-a4a1ad278893@kernel.dk/T/#m12f44c0a9ee40a59b0dcc226e22a0d031903aa73

and find v2 here:

https://lore.kernel.org/io-uring/20240530152822.535791-2-axboe@kernel.dk/

and you can find the git tree here:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-msg_ring

Patches are based on top of current Linus -git, with the 6.10 and 6.11
pending io_uring changes pulled in.

tldr is that this series greatly improves the latency, overhead, and
throughput of sending messages to other rings. It's done by using the
CQE overflow framework rather than attempting to lock remote rings,
which can potentially cause spurious -EAGAIN and io-wq usage. Outside
of that, it also unifies how message posting is done, ending up with a
single method across target ring types.

Some select performance results:

Sender using 10 usec delay, sending ~100K messages per second:

Pre-patches:

Latencies for: Sender (msg=131950)
    percentiles (nsec):
     |  1.0000th=[ 1896],  5.0000th=[ 2064], 10.0000th=[ 2096],
     | 20.0000th=[ 2192], 30.0000th=[ 2352], 40.0000th=[ 2480],
     | 50.0000th=[ 2544], 60.0000th=[ 2608], 70.0000th=[ 2896],
     | 80.0000th=[ 2992], 90.0000th=[ 3376], 95.0000th=[ 3472],
     | 99.0000th=[ 3568], 99.5000th=[ 3728], 99.9000th=[ 6880],
     | 99.9500th=[14656], 99.9900th=[42752]
Latencies for: Receiver (msg=131950)
    percentiles (nsec):
     |  1.0000th=[ 1160],  5.0000th=[ 1288], 10.0000th=[ 1336],
     | 20.0000th=[ 1384], 30.0000th=[ 1448], 40.0000th=[ 1624],
     | 50.0000th=[ 1688], 60.0000th=[ 1736], 70.0000th=[ 1768],
     | 80.0000th=[ 1848], 90.0000th=[ 2256], 95.0000th=[ 2320],
     | 99.0000th=[ 2416], 99.5000th=[ 2480], 99.9000th=[ 3184],
     | 99.9500th=[14400], 99.9900th=[18304]
Expected messages: 299882

and with the patches:

Latencies for: Sender (msg=247931)
    percentiles (nsec):
     |  1.0000th=[  181],  5.0000th=[  191], 10.0000th=[  201],
     | 20.0000th=[  211], 30.0000th=[  231], 40.0000th=[  262],
     | 50.0000th=[  290], 60.0000th=[  322], 70.0000th=[  390],
     | 80.0000th=[  482], 90.0000th=[  748], 95.0000th=[  892],
     | 99.0000th=[ 1032], 99.5000th=[ 1096], 99.9000th=[ 1336],
     | 99.9500th=[ 1512], 99.9900th=[ 1992]
Latencies for: Receiver (msg=247931)
    percentiles (nsec):
     |  1.0000th=[  350],  5.0000th=[  382], 10.0000th=[  410],
     | 20.0000th=[  482], 30.0000th=[  572], 40.0000th=[  652],
     | 50.0000th=[  764], 60.0000th=[  860], 70.0000th=[ 1080],
     | 80.0000th=[ 1480], 90.0000th=[ 1768], 95.0000th=[ 1896],
     | 99.0000th=[ 2448], 99.5000th=[ 2576], 99.9000th=[ 3184],
     | 99.9500th=[ 3792], 99.9900th=[17280]
Expected messages: 299926

which is a ~8.7x improvement in 50th percentile latency for the sender,
and ~3.5x for the 99th percentile, and a ~2.2x receiver side improvement
for the 50th percentile. Higher percentiles for the receiver are pretty
similar, but note that this is accomplished with the throughput being
almost twice that of before (~248K messages over 3 seconds vs ~132K
before).
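
For anyone wanting to re-derive the ratios quoted above, here is a small
sketch that recomputes them from the percentile tables. The dictionary
layout and the speedup() helper are illustrative only and are not part
of the test tool that produced the numbers.

```python
# Selected percentiles (nsec), copied from the tables above.
pre = {"sender": {50: 2544, 99: 3568}, "receiver": {50: 1688, 99: 2416}}
post = {"sender": {50: 290, 99: 1032}, "receiver": {50: 764, 99: 2448}}


def speedup(side, pct):
    """Return pre-patch latency divided by post-patch latency."""
    return pre[side][pct] / post[side][pct]


for side, pct in (("sender", 50), ("sender", 99),
                  ("receiver", 50), ("receiver", 99)):
    print(f"{side} p{pct}: {speedup(side, pct):.2f}x")

# Message counts taken from the "msg=" fields of the two runs.
throughput_ratio = 247931 / 131950
print(f"throughput: {throughput_ratio:.2f}x")
```

A ratio below 1.0 (as for the receiver 99th percentile) means the
patched latency is marginally higher, which matches the "pretty similar"
remark for the upper receiver percentiles.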

Using a 20 usec message delay, targeting 50K messages per second,
the latency picture is close to the same as above. However, pre patches
we get ~110K messages and after we get ~142K messages. Pre patches is
~37% off the target rate, with the patches we're within 5% of the
target.
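
As a rough check of the target rate figures, the sketch below assumes
the 20 usec test also ran for 3 seconds (only the 10 usec run states
its duration), which gives a target of 150K messages. The message
counts are the rounded values from the text, and both helper functions
are hypothetical names used for illustration.

```python
RUNTIME_SEC = 3        # assumption: same runtime as the 10 usec test
TARGET_RATE = 50_000   # messages per second, from the text
target = RUNTIME_SEC * TARGET_RATE


def shortfall_pct(achieved):
    """Percent by which the achieved count falls below the target."""
    return 100.0 * (target - achieved) / target


def gap_over_achieved_pct(achieved):
    """Same gap, expressed relative to the achieved count instead."""
    return 100.0 * (target - achieved) / achieved


for label, achieved in (("pre-patches", 110_000), ("patched", 142_000)):
    print(f"{label}: {shortfall_pct(achieved):.1f}% below target, "
          f"gap is {gap_over_achieved_pct(achieved):.1f}% of achieved")
```

Under that assumption the pre-patch gap comes out near 27% of the
target, or about 36% when measured against the achieved count, so the
quoted ~37% may be using the latter convention or unrounded counts.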

One interesting use case for message passing is sending work items
between rings. For example, you can have a ring that accepts connections
and then passes them to worker threads that have their own ring. Or you
can have threads that receive data and need to pass a work item for
processing to another thread. Normally that would be done with some kind
of queue with serialization, and then a remote wakeup with e.g. epoll on
the other end and using eventfd. That isn't very efficient. With message
passing, you can simply hand over the work item rather than need to
manage both a queue and a wakeup mechanism in userspace.

 include/linux/io_uring_types.h |   8 ++
 io_uring/io_uring.c            |  33 ++----
 io_uring/io_uring.h            |  44 +++++++
 io_uring/msg_ring.c            | 211 +++++++++++++++++----------------
 io_uring/msg_ring.h            |   3 +
 5 files changed, 176 insertions(+), 123 deletions(-)

Changes since v2:
- Add wakeup batching for MSG_RING with DEFER_TASKRUN by refactoring
  the helpers that we use for local task_work.
- Drop patch splitting fd installing into a separate helper, as we just
  remove it at the end anyway when the old MSG_RING posting code is
  removed.
- Little cleanups