Message ID | 20250128133927.3989681-5-max.kellermann@ionos.com (mailing list archive) |
---|---|
State | New |
Series | Various io_uring micro-optimizations (reducing lock contention) |
On 1/28/25 13:39, Max Kellermann wrote:
> This eliminates several redundant atomic reads and therefore reduces
> the duration the surrounding spinlocks are held.

What architecture are you running? I don't get why the reads
are expensive while it's relaxed and there shouldn't even be
any contention. It doesn't even need to be atomics, we still
should be able to convert int back to plain ints.

> In several io_uring benchmarks, this reduced the CPU time spent in
> queued_spin_lock_slowpath() considerably:
>
> io_uring benchmark with a flood of `IORING_OP_NOP` and `IOSQE_ASYNC`:
>
>     38.86%  -1.49%  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>      6.75%  +0.36%  [kernel.kallsyms]  [k] io_worker_handle_work
>      2.60%  +0.19%  [kernel.kallsyms]  [k] io_nop
>      3.92%  +0.18%  [kernel.kallsyms]  [k] io_req_task_complete
>      6.34%  -0.18%  [kernel.kallsyms]  [k] io_wq_submit_work
>
> HTTP server, static file:
>
>     42.79%  -2.77%  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>      2.08%  +0.23%  [kernel.kallsyms]  [k] io_wq_submit_work
>      1.19%  +0.20%  [kernel.kallsyms]  [k] amd_iommu_iotlb_sync_map
>      1.46%  +0.15%  [kernel.kallsyms]  [k] ep_poll_callback
>      1.80%  +0.15%  [kernel.kallsyms]  [k] io_worker_handle_work
>
> HTTP server, PHP:
>
>     35.03%  -1.80%  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>      0.84%  +0.21%  [kernel.kallsyms]  [k] amd_iommu_iotlb_sync_map
>      1.39%  +0.12%  [kernel.kallsyms]  [k] _copy_to_iter
>      0.21%  +0.10%  [kernel.kallsyms]  [k] update_sd_lb_stats
>
> Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
On Wed, Jan 29, 2025 at 7:56 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> What architecture are you running? I don't get why the reads
> are expensive while it's relaxed and there shouldn't even be
> any contention. It doesn't even need to be atomics, we still
> should be able to convert int back to plain ints.

I measured on an AMD Epyc 9654P.
As you see in my numbers, around 40% of the CPU time was wasted on
spinlock contention. Dozens of io-wq threads are trampling on each
other's feet all the time.
I don't think this is about memory accesses being exceptionally
expensive; it's just about wringing every cycle from the code section
that's under the heavy-contention spinlock.
On 1/29/25 19:11, Max Kellermann wrote:
> On Wed, Jan 29, 2025 at 7:56 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>> What architecture are you running? I don't get why the reads
>> are expensive while it's relaxed and there shouldn't even be
>> any contention. It doesn't even need to be atomics, we still
>> should be able to convert int back to plain ints.
>
> I measured on an AMD Epyc 9654P.
> As you see in my numbers, around 40% of the CPU time was wasted on
> spinlock contention. Dozens of io-wq threads are trampling on each
> other's feet all the time.
> I don't think this is about memory accesses being exceptionally
> expensive; it's just about wringing every cycle from the code section
> that's under the heavy-contention spinlock.

Ok, then it's an architectural problem and needs more serious
reengineering, e.g. of how work items are stored and grabbed, and it
might even get some more use cases for io_uring. FWIW, I'm not saying
smaller optimisations shouldn't have place especially when they're
clean.
On Thu, Jan 30, 2025 at 12:41 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> Ok, then it's an architectural problem and needs more serious
> reengineering, e.g. of how work items are stored and grabbed

Rough unpolished idea: I was thinking about having multiple work
lists, each with its own spinlock (separate cache line), and each
io-wq thread only uses one of them, while the submitter round-robins
through the lists.
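As a rough userspace sketch of that idea (pthread mutexes stand in for kernel spinlocks; `wq_shard`, `wq_submit()` and `wq_pop()` are made-up names for illustration, not io-wq code), the submitter round-robins through the shards while each worker only ever locks its own shard:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_SHARDS  8
#define CACHELINE  64

struct work_item {
	struct work_item *next;
};

/* one list per shard; each lock sits on its own cache line */
struct wq_shard {
	pthread_mutex_t lock;
	struct work_item *head, *tail;
} __attribute__((aligned(CACHELINE)));

static struct wq_shard shards[NR_SHARDS];
static atomic_uint submit_seq;		/* round-robin cursor for submitters */

static void wq_init(void)
{
	for (int i = 0; i < NR_SHARDS; i++)
		pthread_mutex_init(&shards[i].lock, NULL);
}

static void wq_submit(struct work_item *work)
{
	/* submitter picks the next shard in round-robin order */
	struct wq_shard *s = &shards[atomic_fetch_add(&submit_seq, 1) % NR_SHARDS];

	work->next = NULL;
	pthread_mutex_lock(&s->lock);
	if (s->tail)
		s->tail->next = work;
	else
		s->head = work;
	s->tail = work;
	pthread_mutex_unlock(&s->lock);
}

static struct work_item *wq_pop(struct wq_shard *s)
{
	/* each worker thread only ever takes work from its own shard */
	struct work_item *work;

	pthread_mutex_lock(&s->lock);
	work = s->head;
	if (work) {
		s->head = work->next;
		if (!s->head)
			s->tail = NULL;
	}
	pthread_mutex_unlock(&s->lock);
	return work;
}
```

With this layout, contention on any one lock is limited to submitters racing against the single worker serving that shard, instead of every worker racing on one global list lock.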
On 1/29/25 4:41 PM, Pavel Begunkov wrote:
> On 1/29/25 19:11, Max Kellermann wrote:
>> On Wed, Jan 29, 2025 at 7:56 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>> What architecture are you running? I don't get why the reads
>>> are expensive while it's relaxed and there shouldn't even be
>>> any contention. It doesn't even need to be atomics, we still
>>> should be able to convert int back to plain ints.
>>
>> I measured on an AMD Epyc 9654P.
>> As you see in my numbers, around 40% of the CPU time was wasted on
>> spinlock contention. Dozens of io-wq threads are trampling on each
>> other's feet all the time.
>> I don't think this is about memory accesses being exceptionally
>> expensive; it's just about wringing every cycle from the code section
>> that's under the heavy-contention spinlock.
>
> Ok, then it's an architectural problem and needs more serious
> reengineering, e.g. of how work items are stored and grabbed, and it
> might even get some more use cases for io_uring. FWIW, I'm not saying
> smaller optimisations shouldn't have place especially when they're
> clean.

Totally agree - io-wq would need some improvements on where to queue
and pull work from to make it scale better, which may indeed be a good
idea to do and would open it up to more use cases that currently don't
make much sense. That said, also agree that the minor optimizations
still have a place, it's not like they will stand in the way of general
improvements either.
On 1/29/25 10:36 PM, Max Kellermann wrote:
> On Thu, Jan 30, 2025 at 12:41 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>> Ok, then it's an architectural problem and needs more serious
>> reengineering, e.g. of how work items are stored and grabbed
>
> Rough unpolished idea: I was thinking about having multiple work
> lists, each with its own spinlock (separate cache line), and each
> io-wq thread only uses one of them, while the submitter round-robins
> through the lists.

Pending work would certainly need better spreading than just the two
classes we have now. One thing to keep in mind is that the design of
io-wq is such that it's quite possible to have N work items pending and
just a single thread serving all of them. If the io-wq thread doesn't
go to sleep, it will keep processing work units. This is done for
efficiency reasons, and to avoid a proliferation of io-wq threads when
it's not going to be beneficial. This means that when you queue a work
item, it's not easy to pick an appropriate io-wq thread upfront, and
generally the io-wq thread itself will pick its next work item at the
perfect time - when it doesn't have anything else to do, or has
finished the existing work. This should be kept in mind for making
io-wq scale better.
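A minimal userspace sketch of the behaviour described above (pthread primitives and made-up names, not io-wq's actual code): a busy worker keeps pulling work as long as anything is pending and only sleeps once the list is empty, which is why picking a target worker at queue time is hard.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct work_item {
	struct work_item *next;
	void (*fn)(struct work_item *);
};

static pthread_mutex_t wq_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wq_cond = PTHREAD_COND_INITIALIZER;
static struct work_item *wq_head;
static bool wq_exit;

static void *worker_fn(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&wq_lock);
	while (!wq_exit) {
		struct work_item *work = wq_head;

		if (work) {
			/* grab the next item as soon as the current one is
			 * done; a busy worker never sleeps, so one worker
			 * can drain many pending items */
			wq_head = work->next;
			pthread_mutex_unlock(&wq_lock);
			work->fn(work);
			pthread_mutex_lock(&wq_lock);
			continue;
		}
		/* only an empty list puts the worker to sleep */
		pthread_cond_wait(&wq_cond, &wq_lock);
	}
	pthread_mutex_unlock(&wq_lock);
	return NULL;
}
```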
```diff
diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index dfdd45ebe4bb..ba9974e6f521 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -170,9 +170,9 @@ static inline struct io_wq_acct *io_get_acct(struct io_wq *wq, bool bound)
 }
 
 static inline struct io_wq_acct *io_work_get_acct(struct io_wq *wq,
-						  struct io_wq_work *work)
+						  unsigned int work_flags)
 {
-	return io_get_acct(wq, !(atomic_read(&work->flags) & IO_WQ_WORK_UNBOUND));
+	return io_get_acct(wq, !(work_flags & IO_WQ_WORK_UNBOUND));
 }
 
 static inline struct io_wq_acct *io_wq_get_acct(struct io_worker *worker)
@@ -457,9 +457,14 @@ static void __io_worker_idle(struct io_wq_acct *acct, struct io_worker *worker)
 	}
 }
 
+static inline unsigned int __io_get_work_hash(unsigned int work_flags)
+{
+	return work_flags >> IO_WQ_HASH_SHIFT;
+}
+
 static inline unsigned int io_get_work_hash(struct io_wq_work *work)
 {
-	return atomic_read(&work->flags) >> IO_WQ_HASH_SHIFT;
+	return __io_get_work_hash(atomic_read(&work->flags));
 }
 
 static bool io_wait_on_hash(struct io_wq *wq, unsigned int hash)
@@ -489,17 +494,19 @@ static struct io_wq_work *io_get_next_work(struct io_wq_acct *acct,
 	struct io_wq *wq = worker->wq;
 
 	wq_list_for_each(node, prev, &acct->work_list) {
+		unsigned int work_flags;
 		unsigned int hash;
 
 		work = container_of(node, struct io_wq_work, list);
 
 		/* not hashed, can run anytime */
-		if (!io_wq_is_hashed(work)) {
+		work_flags = atomic_read(&work->flags);
+		if (!__io_wq_is_hashed(work_flags)) {
 			wq_list_del(&acct->work_list, node, prev);
 			return work;
 		}
 
-		hash = io_get_work_hash(work);
+		hash = __io_get_work_hash(work_flags);
 		/* all items with this hash lie in [work, tail] */
 		tail = wq->hash_tail[hash];
 
@@ -596,12 +603,13 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
 		/* handle a whole dependent link */
 		do {
 			struct io_wq_work *next_hashed, *linked;
-			unsigned int hash = io_get_work_hash(work);
+			unsigned int work_flags = atomic_read(&work->flags);
+			unsigned int hash = __io_get_work_hash(work_flags);
 
 			next_hashed = wq_next_work(work);
 
 			if (do_kill &&
-			    (atomic_read(&work->flags) & IO_WQ_WORK_UNBOUND))
+			    (work_flags & IO_WQ_WORK_UNBOUND))
 				atomic_or(IO_WQ_WORK_CANCEL, &work->flags);
 			wq->do_work(work);
 			io_assign_current_work(worker, NULL);
@@ -917,18 +925,19 @@ static void io_run_cancel(struct io_wq_work *work, struct io_wq *wq)
 	} while (work);
 }
 
-static void io_wq_insert_work(struct io_wq *wq, struct io_wq_acct *acct, struct io_wq_work *work)
+static void io_wq_insert_work(struct io_wq *wq, struct io_wq_acct *acct,
+			      struct io_wq_work *work, unsigned int work_flags)
 {
 	unsigned int hash;
 	struct io_wq_work *tail;
 
-	if (!io_wq_is_hashed(work)) {
+	if (!__io_wq_is_hashed(work_flags)) {
 append:
 		wq_list_add_tail(&work->list, &acct->work_list);
 		return;
 	}
 
-	hash = io_get_work_hash(work);
+	hash = __io_get_work_hash(work_flags);
 	tail = wq->hash_tail[hash];
 	wq->hash_tail[hash] = work;
 	if (!tail)
@@ -944,8 +953,8 @@ static bool io_wq_work_match_item(struct io_wq_work *work, void *data)
 
 void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
 {
-	struct io_wq_acct *acct = io_work_get_acct(wq, work);
 	unsigned int work_flags = atomic_read(&work->flags);
+	struct io_wq_acct *acct = io_work_get_acct(wq, work_flags);
 	struct io_cb_cancel_data match = {
 		.fn		= io_wq_work_match_item,
 		.data		= work,
@@ -964,7 +973,7 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
 	}
 
 	raw_spin_lock(&acct->lock);
-	io_wq_insert_work(wq, acct, work);
+	io_wq_insert_work(wq, acct, work, work_flags);
 	clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);
 	raw_spin_unlock(&acct->lock);
diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
index b3b004a7b625..d4fb2940e435 100644
--- a/io_uring/io-wq.h
+++ b/io_uring/io-wq.h
@@ -54,9 +54,14 @@ int io_wq_cpu_affinity(struct io_uring_task *tctx, cpumask_var_t mask);
 int io_wq_max_workers(struct io_wq *wq, int *new_count);
 bool io_wq_worker_stopped(void);
 
+static inline bool __io_wq_is_hashed(unsigned int work_flags)
+{
+	return work_flags & IO_WQ_WORK_HASHED;
+}
+
 static inline bool io_wq_is_hashed(struct io_wq_work *work)
 {
-	return atomic_read(&work->flags) & IO_WQ_WORK_HASHED;
+	return __io_wq_is_hashed(atomic_read(&work->flags));
 }
 
 typedef bool (work_cancel_fn)(struct io_wq_work *, void *);
```
This eliminates several redundant atomic reads and therefore reduces
the duration the surrounding spinlocks are held.

In several io_uring benchmarks, this reduced the CPU time spent in
queued_spin_lock_slowpath() considerably:

io_uring benchmark with a flood of `IORING_OP_NOP` and `IOSQE_ASYNC`:

    38.86%  -1.49%  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
     6.75%  +0.36%  [kernel.kallsyms]  [k] io_worker_handle_work
     2.60%  +0.19%  [kernel.kallsyms]  [k] io_nop
     3.92%  +0.18%  [kernel.kallsyms]  [k] io_req_task_complete
     6.34%  -0.18%  [kernel.kallsyms]  [k] io_wq_submit_work

HTTP server, static file:

    42.79%  -2.77%  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
     2.08%  +0.23%  [kernel.kallsyms]  [k] io_wq_submit_work
     1.19%  +0.20%  [kernel.kallsyms]  [k] amd_iommu_iotlb_sync_map
     1.46%  +0.15%  [kernel.kallsyms]  [k] ep_poll_callback
     1.80%  +0.15%  [kernel.kallsyms]  [k] io_worker_handle_work

HTTP server, PHP:

    35.03%  -1.80%  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
     0.84%  +0.21%  [kernel.kallsyms]  [k] amd_iommu_iotlb_sync_map
     1.39%  +0.12%  [kernel.kallsyms]  [k] _copy_to_iter
     0.21%  +0.10%  [kernel.kallsyms]  [k] update_sd_lb_stats

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
 io_uring/io-wq.c | 33 +++++++++++++++++++++------------
 io_uring/io-wq.h |  7 ++++++-
 2 files changed, 27 insertions(+), 13 deletions(-)
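To make the change easier to follow outside the kernel tree, here is the same pattern reduced to a self-contained userspace sketch (C11 atomics stand in for the kernel's atomic_t, and the type and helper names are simplified stand-ins for io_wq_work, __io_wq_is_hashed() and __io_get_work_hash()): the flags word is loaded once, and the plain integer snapshot is passed down, so the helpers that would run under acct->lock do not re-read the atomic.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define WORK_HASHED	(1u << 1)
#define HASH_SHIFT	8

struct work {
	atomic_uint flags;
};

/* the double-underscore variants take an already-loaded flags snapshot */
static inline bool __work_is_hashed(unsigned int work_flags)
{
	return work_flags & WORK_HASHED;
}

static inline unsigned int __work_hash(unsigned int work_flags)
{
	return work_flags >> HASH_SHIFT;
}

static void enqueue(struct work *w)
{
	/* one relaxed load up front instead of one load per helper call */
	unsigned int work_flags = atomic_load_explicit(&w->flags,
						       memory_order_relaxed);

	if (__work_is_hashed(work_flags)) {
		unsigned int hash = __work_hash(work_flags);
		/* ... insert at the tail of the list for 'hash' ... */
		(void)hash;
	} else {
		/* ... append to the plain work list ... */
	}
}
```

The assumption, same as in the patch itself, is that nothing the enqueue path cares about can change in the flags between the single load and the helper calls.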