[4/8] io_uring/io-wq: cache work->flags in variable

Message ID 20250128133927.3989681-5-max.kellermann@ionos.com
State New
Series Various io_uring micro-optimizations (reducing lock contention)

Commit Message

Max Kellermann Jan. 28, 2025, 1:39 p.m. UTC
This eliminates several redundant atomic reads and therefore reduces
the duration the surrounding spinlocks are held.
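
The pattern, as a minimal sketch (illustration only, not one of the hunks
below; example_hashed_and_unbound() is a made-up helper): read the atomic
flags word once into a local variable and test the cached value, instead
of issuing a separate atomic_read() for every flag check while the lock
is held.

static inline bool example_hashed_and_unbound(struct io_wq_work *work)
{
	unsigned int work_flags = atomic_read(&work->flags);

	/* both checks reuse the single cached read of work->flags */
	return (work_flags & IO_WQ_WORK_HASHED) &&
	       (work_flags & IO_WQ_WORK_UNBOUND);
}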

In several io_uring benchmarks, this reduced the CPU time spent in
queued_spin_lock_slowpath() considerably:

io_uring benchmark with a flood of `IORING_OP_NOP` and `IOSQE_ASYNC`:

    38.86%     -1.49%  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
     6.75%     +0.36%  [kernel.kallsyms]  [k] io_worker_handle_work
     2.60%     +0.19%  [kernel.kallsyms]  [k] io_nop
     3.92%     +0.18%  [kernel.kallsyms]  [k] io_req_task_complete
     6.34%     -0.18%  [kernel.kallsyms]  [k] io_wq_submit_work

HTTP server, static file:

    42.79%     -2.77%  [kernel.kallsyms]     [k] queued_spin_lock_slowpath
     2.08%     +0.23%  [kernel.kallsyms]     [k] io_wq_submit_work
     1.19%     +0.20%  [kernel.kallsyms]     [k] amd_iommu_iotlb_sync_map
     1.46%     +0.15%  [kernel.kallsyms]     [k] ep_poll_callback
     1.80%     +0.15%  [kernel.kallsyms]     [k] io_worker_handle_work

HTTP server, PHP:

    35.03%     -1.80%  [kernel.kallsyms]     [k] queued_spin_lock_slowpath
     0.84%     +0.21%  [kernel.kallsyms]     [k] amd_iommu_iotlb_sync_map
     1.39%     +0.12%  [kernel.kallsyms]     [k] _copy_to_iter
     0.21%     +0.10%  [kernel.kallsyms]     [k] update_sd_lb_stats

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
 io_uring/io-wq.c | 33 +++++++++++++++++++++------------
 io_uring/io-wq.h |  7 ++++++-
 2 files changed, 27 insertions(+), 13 deletions(-)

Comments

Pavel Begunkov Jan. 29, 2025, 6:57 p.m. UTC | #1
On 1/28/25 13:39, Max Kellermann wrote:
> This eliminates several redundant atomic reads and therefore reduces
> the duration the surrounding spinlocks are held.

What architecture are you running? I don't get why the reads
are expensive while it's relaxed and there shouldn't even be
any contention. It doesn't even need to be atomics, we still
should be able to convert it back to plain ints.
  
> In several io_uring benchmarks, this reduced the CPU time spent in
> queued_spin_lock_slowpath() considerably:
> 
> io_uring benchmark with a flood of `IORING_OP_NOP` and `IOSQE_ASYNC`:
> 
>      38.86%     -1.49%  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>       6.75%     +0.36%  [kernel.kallsyms]  [k] io_worker_handle_work
>       2.60%     +0.19%  [kernel.kallsyms]  [k] io_nop
>       3.92%     +0.18%  [kernel.kallsyms]  [k] io_req_task_complete
>       6.34%     -0.18%  [kernel.kallsyms]  [k] io_wq_submit_work
> 
> HTTP server, static file:
> 
>      42.79%     -2.77%  [kernel.kallsyms]     [k] queued_spin_lock_slowpath
>       2.08%     +0.23%  [kernel.kallsyms]     [k] io_wq_submit_work
>       1.19%     +0.20%  [kernel.kallsyms]     [k] amd_iommu_iotlb_sync_map
>       1.46%     +0.15%  [kernel.kallsyms]     [k] ep_poll_callback
>       1.80%     +0.15%  [kernel.kallsyms]     [k] io_worker_handle_work
> 
> HTTP server, PHP:
> 
>      35.03%     -1.80%  [kernel.kallsyms]     [k] queued_spin_lock_slowpath
>       0.84%     +0.21%  [kernel.kallsyms]     [k] amd_iommu_iotlb_sync_map
>       1.39%     +0.12%  [kernel.kallsyms]     [k] _copy_to_iter
>       0.21%     +0.10%  [kernel.kallsyms]     [k] update_sd_lb_stats
> 
> Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Max Kellermann Jan. 29, 2025, 7:11 p.m. UTC | #2
On Wed, Jan 29, 2025 at 7:56 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> What architecture are you running? I don't get why the reads
> are expensive while it's relaxed and there shouldn't even be
> any contention. It doesn't even need to be atomics, we still
> should be able to convert it back to plain ints.

I measured on an AMD Epyc 9654P.
As you see in my numbers, around 40% of the CPU time was wasted on
spinlock contention. Dozens of io-wq threads are trampling on each
other's feet all the time.
I don't think this is about memory accesses being exceptionally
expensive; it's just about wringing every cycle from the code section
that's under the heavy-contention spinlock.
Pavel Begunkov Jan. 29, 2025, 11:41 p.m. UTC | #3
On 1/29/25 19:11, Max Kellermann wrote:
> On Wed, Jan 29, 2025 at 7:56 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>> What architecture are you running? I don't get why the reads
>> are expensive while it's relaxed and there shouldn't even be
>> any contention. It doesn't even need to be atomics, we still
>> should be able to convert it back to plain ints.
> 
> I measured on an AMD Epyc 9654P.
> As you see in my numbers, around 40% of the CPU time was wasted on
> spinlock contention. Dozens of io-wq threads are trampling on each
> other's feet all the time.
> I don't think this is about memory accesses being exceptionally
> expensive; it's just about wringing every cycle from the code section
> that's under the heavy-contention spinlock.

Ok, then it's an architectural problem and needs more serious
reengineering, e.g. of how work items are stored and grabbed, and it
might even get some more use cases for io_uring. FWIW, I'm not saying
smaller optimisations shouldn't have a place, especially when they're
clean.
Max Kellermann Jan. 30, 2025, 5:36 a.m. UTC | #4
On Thu, Jan 30, 2025 at 12:41 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> Ok, then it's an architectural problem and needs more serious
> reengineering, e.g. of how work items are stored and grabbed

Rough, unpolished idea: I was thinking about having multiple work
lists, each with its own spinlock (on a separate cache line); each
io-wq thread uses only one of them, while the submitter round-robins
through the lists.
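
As a hypothetical sketch of that idea (none of these structures exist in
io-wq today; IO_WQ_NR_SHARDS, io_wq_work_shard, io_wq_sharded and
example_enqueue() are made up for illustration), the submit side could
look roughly like this:

/* arbitrary shard count, purely for the sketch */
#define IO_WQ_NR_SHARDS	8

struct io_wq_work_shard {
	raw_spinlock_t		lock;
	struct io_wq_work_list	work_list;
} ____cacheline_aligned_in_smp;	/* keep each shard's lock on its own cache line */

struct io_wq_sharded {
	struct io_wq_work_shard	shards[IO_WQ_NR_SHARDS];
	atomic_t		next_shard;	/* round-robin cursor for submitters */
};

static void example_enqueue(struct io_wq_sharded *wq, struct io_wq_work *work)
{
	/* submitters spread work across shards round-robin */
	unsigned int idx = (unsigned int)atomic_inc_return(&wq->next_shard) %
			   IO_WQ_NR_SHARDS;
	struct io_wq_work_shard *shard = &wq->shards[idx];

	raw_spin_lock(&shard->lock);
	wq_list_add_tail(&work->list, &shard->work_list);
	raw_spin_unlock(&shard->lock);
}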
Jens Axboe Jan. 30, 2025, 2:54 p.m. UTC | #5
On 1/29/25 4:41 PM, Pavel Begunkov wrote:
> On 1/29/25 19:11, Max Kellermann wrote:
>> On Wed, Jan 29, 2025 at 7:56 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>> What architecture are you running? I don't get why the reads
>>> are expensive while it's relaxed and there shouldn't even be
>>> any contention. It doesn't even need to be atomics, we still
>>> should be able to convert it back to plain ints.
>>
>> I measured on an AMD Epyc 9654P.
>> As you see in my numbers, around 40% of the CPU time was wasted on
>> spinlock contention. Dozens of io-wq threads are trampling on each
>> other's feet all the time.
>> I don't think this is about memory accesses being exceptionally
>> expensive; it's just about wringing every cycle from the code section
>> that's under the heavy-contention spinlock.
> 
> Ok, then it's an architectural problem and needs more serious
> reengineering, e.g. of how work items are stored and grabbed, and it
> might even get some more use cases for io_uring. FWIW, I'm not saying
> smaller optimisations shouldn't have a place, especially when they're
> clean.

Totally agree - io-wq would need some improvements on where to queue
and pull work to make it scale better, which may indeed be a good idea
to do and would open it up to more use cases that currently don't make
much sense.

That said, I also agree that the minor optimizations still have a place;
it's not like they will stand in the way of general improvements.
Jens Axboe Jan. 30, 2025, 2:57 p.m. UTC | #6
On 1/29/25 10:36 PM, Max Kellermann wrote:
> On Thu, Jan 30, 2025 at 12:41 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>> Ok, then it's an architectural problem and needs more serious
>> reengineering, e.g. of how work items are stored and grabbed
> 
> Rough, unpolished idea: I was thinking about having multiple work
> lists, each with its own spinlock (on a separate cache line); each
> io-wq thread uses only one of them, while the submitter round-robins
> through the lists.

Pending work would certainly need better spreading than just the two
classes we have now.

One thing to keep in mind is that the design of io-wq is such that it's
quite possible to have N work items pending and just a single thread
serving all of them. If the io-wq thread doesn't go to sleep, it will
keep processing work units. This is done for efficiency reasons, and to
avoid a proliferation of io-wq threads when it's not going to be
beneficial. This means that when you queue a work item, it's not easy to
pick an appropriate io-wq thread upfront, and generally the io-wq thread
itself will pick its next work item at the perfect time - when it
doesn't have anything else to do, or has finished the existing work.

This should be kept in mind for making io-wq scale better.
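
A heavily simplified, hypothetical sketch of the behaviour described above
(example_worker_loop() is made up; the real loop in io_wq_worker() /
io_worker_handle_work() also deals with hashed and linked work, accounting
and exit/signal handling):

static void example_worker_loop(struct io_wq_acct *acct, struct io_worker *worker)
{
	for (;;) {
		struct io_wq_work *work;

		raw_spin_lock(&acct->lock);
		work = io_get_next_work(acct, worker);
		raw_spin_unlock(&acct->lock);

		/* nothing pending: go idle instead of spawning more workers */
		if (!work)
			break;

		/* run the item; the worker only picks its next work when free */
		worker->wq->do_work(work);
	}
}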

Patch

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index dfdd45ebe4bb..ba9974e6f521 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -170,9 +170,9 @@  static inline struct io_wq_acct *io_get_acct(struct io_wq *wq, bool bound)
 }
 
 static inline struct io_wq_acct *io_work_get_acct(struct io_wq *wq,
-						  struct io_wq_work *work)
+						  unsigned int work_flags)
 {
-	return io_get_acct(wq, !(atomic_read(&work->flags) & IO_WQ_WORK_UNBOUND));
+	return io_get_acct(wq, !(work_flags & IO_WQ_WORK_UNBOUND));
 }
 
 static inline struct io_wq_acct *io_wq_get_acct(struct io_worker *worker)
@@ -457,9 +457,14 @@  static void __io_worker_idle(struct io_wq_acct *acct, struct io_worker *worker)
 	}
 }
 
+static inline unsigned int __io_get_work_hash(unsigned int work_flags)
+{
+	return work_flags >> IO_WQ_HASH_SHIFT;
+}
+
 static inline unsigned int io_get_work_hash(struct io_wq_work *work)
 {
-	return atomic_read(&work->flags) >> IO_WQ_HASH_SHIFT;
+	return __io_get_work_hash(atomic_read(&work->flags));
 }
 
 static bool io_wait_on_hash(struct io_wq *wq, unsigned int hash)
@@ -489,17 +494,19 @@  static struct io_wq_work *io_get_next_work(struct io_wq_acct *acct,
 	struct io_wq *wq = worker->wq;
 
 	wq_list_for_each(node, prev, &acct->work_list) {
+		unsigned int work_flags;
 		unsigned int hash;
 
 		work = container_of(node, struct io_wq_work, list);
 
 		/* not hashed, can run anytime */
-		if (!io_wq_is_hashed(work)) {
+		work_flags = atomic_read(&work->flags);
+		if (!__io_wq_is_hashed(work_flags)) {
 			wq_list_del(&acct->work_list, node, prev);
 			return work;
 		}
 
-		hash = io_get_work_hash(work);
+		hash = __io_get_work_hash(work_flags);
 		/* all items with this hash lie in [work, tail] */
 		tail = wq->hash_tail[hash];
 
@@ -596,12 +603,13 @@  static void io_worker_handle_work(struct io_wq_acct *acct,
 		/* handle a whole dependent link */
 		do {
 			struct io_wq_work *next_hashed, *linked;
-			unsigned int hash = io_get_work_hash(work);
+			unsigned int work_flags = atomic_read(&work->flags);
+			unsigned int hash = __io_get_work_hash(work_flags);
 
 			next_hashed = wq_next_work(work);
 
 			if (do_kill &&
-			    (atomic_read(&work->flags) & IO_WQ_WORK_UNBOUND))
+			    (work_flags & IO_WQ_WORK_UNBOUND))
 				atomic_or(IO_WQ_WORK_CANCEL, &work->flags);
 			wq->do_work(work);
 			io_assign_current_work(worker, NULL);
@@ -917,18 +925,19 @@  static void io_run_cancel(struct io_wq_work *work, struct io_wq *wq)
 	} while (work);
 }
 
-static void io_wq_insert_work(struct io_wq *wq, struct io_wq_acct *acct, struct io_wq_work *work)
+static void io_wq_insert_work(struct io_wq *wq, struct io_wq_acct *acct,
+			      struct io_wq_work *work, unsigned int work_flags)
 {
 	unsigned int hash;
 	struct io_wq_work *tail;
 
-	if (!io_wq_is_hashed(work)) {
+	if (!__io_wq_is_hashed(work_flags)) {
 append:
 		wq_list_add_tail(&work->list, &acct->work_list);
 		return;
 	}
 
-	hash = io_get_work_hash(work);
+	hash = __io_get_work_hash(work_flags);
 	tail = wq->hash_tail[hash];
 	wq->hash_tail[hash] = work;
 	if (!tail)
@@ -944,8 +953,8 @@  static bool io_wq_work_match_item(struct io_wq_work *work, void *data)
 
 void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
 {
-	struct io_wq_acct *acct = io_work_get_acct(wq, work);
 	unsigned int work_flags = atomic_read(&work->flags);
+	struct io_wq_acct *acct = io_work_get_acct(wq, work_flags);
 	struct io_cb_cancel_data match = {
 		.fn		= io_wq_work_match_item,
 		.data		= work,
@@ -964,7 +973,7 @@  void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
 	}
 
 	raw_spin_lock(&acct->lock);
-	io_wq_insert_work(wq, acct, work);
+	io_wq_insert_work(wq, acct, work, work_flags);
 	clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);
 	raw_spin_unlock(&acct->lock);
 
diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
index b3b004a7b625..d4fb2940e435 100644
--- a/io_uring/io-wq.h
+++ b/io_uring/io-wq.h
@@ -54,9 +54,14 @@  int io_wq_cpu_affinity(struct io_uring_task *tctx, cpumask_var_t mask);
 int io_wq_max_workers(struct io_wq *wq, int *new_count);
 bool io_wq_worker_stopped(void);
 
+static inline bool __io_wq_is_hashed(unsigned int work_flags)
+{
+	return work_flags & IO_WQ_WORK_HASHED;
+}
+
 static inline bool io_wq_is_hashed(struct io_wq_work *work)
 {
-	return atomic_read(&work->flags) & IO_WQ_WORK_HASHED;
+	return __io_wq_is_hashed(atomic_read(&work->flags));
 }
 
 typedef bool (work_cancel_fn)(struct io_wq_work *, void *);