| Message ID | 20230518211751.3492982-2-shr@devkernel.io (mailing list archive) |
|---|---|
| State | New |
| Series | io_uring: add napi busy polling support |
On Thu, 18 May 2023 14:17:45 -0700 Stefan Roesch wrote:
>  	unsigned long start_time = loop_end ? busy_loop_current_time() : 0;
> -	int (*napi_poll)(struct napi_struct *napi, int budget);
> -	void *have_poll_lock = NULL;
> -	struct napi_struct *napi;
> +	struct napi_busy_poll_ctx ctx = {};
>
>  restart:

Can you refactor this further? I think the only state that's kept
across "restarts" is the start_time right? So this version is
effectively a loop around what ends up being napi_busy_loop_rcu(), no?
On Wed, 31 May 2023 12:16:50 -0700 Stefan Roesch wrote:
> > This will conflict with:
> >
> > https://git.kernel.org/netdev/net-next/c/c857946a4e26
> >
> > :( Not sure what to do about it..
> >
> > Maybe we can merge a simpler version to unblock io-uring (just add
> > need_resched() to your loop_end callback and you'll get the same
> > behavior). Refactor in net-next in parallel. Then once trees converge
> > do a simple cleanup and call the _rcu version?
>
> Jakub, I can certainly call need_resched() in the loop_end callback, but
> isn't there a potential race? need_resched() in the loop_end callback
> might not return true, but the need_resched() call in napi_busy_poll
> does?

need_resched() is best effort. It gets added to potentially long
execution paths and loops. Extra single round thru the loop won't
make a difference.
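Concretely, the stopgap Jakub describes amounts to folding a
need_resched() check into the io_uring loop_end callback that is passed
to napi_busy_loop(). A minimal sketch of that idea, assuming a
hypothetical callback name, context struct, and timeout field (none of
these are the actual code in this series):

#include <linux/jiffies.h>	/* time_after() */
#include <linux/sched.h>	/* need_resched() */
#include <net/busy_poll.h>	/* busy_loop_current_time() */

/* Hypothetical io_uring-side loop_end callback (placeholder names).
 * It ends the busy loop either when the caller's busy-poll window has
 * elapsed or when a reschedule is pending, so napi_busy_loop() exits
 * through loop_end() rather than only through its internal
 * need_resched() branch.
 */
struct io_napi_poll_data {
	/* busy-poll window, in the same units as busy_loop_current_time() */
	unsigned long busy_poll_to;
};

static bool io_napi_loop_should_end(void *data, unsigned long start_time)
{
	struct io_napi_poll_data *pd = data;

	if (need_resched())
		return true;

	return time_after(busy_loop_current_time(),
			  start_time + pd->busy_poll_to);
}

As the follow-up messages show, this only duplicates the check; it does
not remove the need_resched() branch inside napi_busy_loop() itself,
which is where the race discussed next comes from.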
Jakub Kicinski <kuba@kernel.org> writes:

> On Wed, 31 May 2023 12:16:50 -0700 Stefan Roesch wrote:
>> > This will conflict with:
>> >
>> > https://git.kernel.org/netdev/net-next/c/c857946a4e26
>> >
>> > :( Not sure what to do about it..
>> >
>> > Maybe we can merge a simpler version to unblock io-uring (just add
>> > need_resched() to your loop_end callback and you'll get the same
>> > behavior). Refactor in net-next in parallel. Then once trees converge
>> > do a simple cleanup and call the _rcu version?
>>
>> Jakub, I can certainly call need_resched() in the loop_end callback, but
>> isn't there a potential race? need_resched() in the loop_end callback
>> might not return true, but the need_resched() call in napi_busy_poll
>> does?
>
> need_resched() is best effort. It gets added to potentially long
> execution paths and loops. Extra single round thru the loop won't
> make a difference.

I might be missing something, however what can happen at a high level is:

io_napi_blocking_busy_loop()
  rcu_read_lock()
  __io_napi_do_busy_loop()
  rcu_read_unlock()

In __io_napi_do_busy_loop() we do:

__io_napi_do_busy_loop()
  list_foreach_entry_rcu()
    napi_busy_loop()

and in napi_busy_loop():

napi_busy_loop()
  rcu_read_lock()
  __napi_busy_poll()
  loop_end()
  if (need_resched) {
    rcu_read_unlock()
    schedule()
  }

The problem with checking need_resched in loop_end is that need_resched
can be false in loop_end, however the check for need_resched in
napi_busy_loop succeeds. This means that we unlock the rcu read lock and
call schedule. However the code in io_napi_blocking_busy_loop still
believes we hold the read lock.
On Thu, 01 Jun 2023 21:12:10 -0700 Stefan Roesch wrote:
> The problem with checking need_resched in loop_end is that need_resched
> can be false in loop_end, however the check for need_resched in
> napi_busy_loop succeeds. This means that we unlock the rcu read lock and
> call schedule. However the code in io_napi_blocking_busy_loop still
> believes we hold the read lock.

Ah, yes, now it makes sense. Yeah, that's a race, scratch the workaround
idea then. Let's work on revising the patches, we'll see where we are in
the development process when they're ready.
Jakub Kicinski <kuba@kernel.org> writes:

> On Thu, 18 May 2023 14:17:45 -0700 Stefan Roesch wrote:
>>  	unsigned long start_time = loop_end ? busy_loop_current_time() : 0;
>> -	int (*napi_poll)(struct napi_struct *napi, int budget);
>> -	void *have_poll_lock = NULL;
>> -	struct napi_struct *napi;
>> +	struct napi_busy_poll_ctx ctx = {};
>>
>>  restart:
>
> Can you refactor this further? I think the only state that's kept
> across "restarts" is the start_time right? So this version is
> effectively a loop around what ends up being napi_busy_loop_rcu(), no?

I'm not sure I understand this correctly. Do you want the start time to
be a parameter of the function napi_busy_poll_rcu?
On Mon, 05 Jun 2023 10:47:40 -0700 Stefan Roesch wrote:
> > Can you refactor this further? I think the only state that's kept
> > across "restarts" is the start_time right? So this version is
> > effectively a loop around what ends up being napi_busy_loop_rcu(), no?
>
> I'm not sure I understand this correctly. Do you want the start time to
> be a parameter of the function napi_busy_poll_rcu?

The RCU and non-RCU version of this function end up looking very
similar, what I'm asking is to share more code.
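The kind of sharing being asked for would end up with both entry points
wrapping one common core. A rough sketch of that shape is below; only
the name napi_busy_loop_rcu() comes from this thread, while the
__napi_busy_loop() helper and the may_sleep parameter are illustrative
placeholders, not code from this series or from what was eventually
merged:

/* Illustrative only: one core loop, two thin wrappers. The only
 * difference between them is who takes the RCU read lock and whether
 * the loop is allowed to drop it and reschedule.
 */
static void __napi_busy_loop(unsigned int napi_id,
			     bool (*loop_end)(void *, unsigned long),
			     void *loop_end_arg, bool prefer_busy_poll,
			     u16 budget, bool may_sleep)
{
	/* napi_by_id() lookup plus the poll/loop_end/need_resched loop
	 * from the patch below; only drop the RCU read lock and call
	 * cond_resched() when may_sleep is true.
	 */
}

void napi_busy_loop(unsigned int napi_id,
		    bool (*loop_end)(void *, unsigned long),
		    void *loop_end_arg, bool prefer_busy_poll, u16 budget)
{
	rcu_read_lock();
	__napi_busy_loop(napi_id, loop_end, loop_end_arg, prefer_busy_poll,
			 budget, true);
	rcu_read_unlock();
}

/* Variant for callers that already hold rcu_read_lock(), such as
 * io_uring iterating its NAPI list under RCU: never unlocks, never
 * sleeps.
 */
void napi_busy_loop_rcu(unsigned int napi_id,
			bool (*loop_end)(void *, unsigned long),
			void *loop_end_arg, bool prefer_busy_poll, u16 budget)
{
	__napi_busy_loop(napi_id, loop_end, loop_end_arg, prefer_busy_poll,
			 budget, false);
}

Compare this with the patch below, which keeps the whole restart loop
inside napi_busy_loop() and only factors the per-iteration poll out into
__napi_busy_poll().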
diff --git a/net/core/dev.c b/net/core/dev.c
index 253584777101..f4677aa20f84 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6166,66 +6166,79 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock, bool
 	local_bh_enable();
 }
+
+struct napi_busy_poll_ctx {
+	struct napi_struct *napi;
+	int (*napi_poll)(struct napi_struct *napi, int budget);
+	void *have_poll_lock;
+};
+
+static inline void __napi_busy_poll(struct napi_busy_poll_ctx *ctx,
+				    bool prefer_busy_poll, u16 budget)
+{
+	struct napi_struct *napi = ctx->napi;
+	int work = 0;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	local_bh_disable();
+	if (!ctx->napi_poll) {
+		unsigned long val = READ_ONCE(napi->state);
+
+		/* If multiple threads are competing for this napi,
+		 * we avoid dirtying napi->state as much as we can.
+		 */
+		if (val & (NAPIF_STATE_DISABLE | NAPIF_STATE_SCHED |
+			   NAPIF_STATE_IN_BUSY_POLL)) {
+			if (prefer_busy_poll)
+				set_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);
+			goto count;
+		}
+		if (cmpxchg(&napi->state, val,
+			    val | NAPIF_STATE_IN_BUSY_POLL |
+				  NAPIF_STATE_SCHED) != val) {
+			if (prefer_busy_poll)
+				set_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);
+			goto count;
+		}
+		ctx->have_poll_lock = netpoll_poll_lock(napi);
+		ctx->napi_poll = napi->poll;
+	}
+	work = ctx->napi_poll(napi, budget);
+	trace_napi_poll(napi, work, budget);
+	gro_normal_list(napi);
+
+count:
+	if (work > 0)
+		__NET_ADD_STATS(dev_net(napi->dev),
+				LINUX_MIB_BUSYPOLLRXPACKETS, work);
+	local_bh_enable();
+}
 
 void napi_busy_loop(unsigned int napi_id,
 		    bool (*loop_end)(void *, unsigned long),
 		    void *loop_end_arg, bool prefer_busy_poll, u16 budget)
 {
 	unsigned long start_time = loop_end ? busy_loop_current_time() : 0;
-	int (*napi_poll)(struct napi_struct *napi, int budget);
-	void *have_poll_lock = NULL;
-	struct napi_struct *napi;
+	struct napi_busy_poll_ctx ctx = {};
 
 restart:
-	napi_poll = NULL;
+	ctx.napi_poll = NULL;
 
 	rcu_read_lock();
 
-	napi = napi_by_id(napi_id);
-	if (!napi)
+	ctx.napi = napi_by_id(napi_id);
+	if (!ctx.napi)
 		goto out;
 
 	preempt_disable();
 	for (;;) {
-		int work = 0;
-
-		local_bh_disable();
-		if (!napi_poll) {
-			unsigned long val = READ_ONCE(napi->state);
-
-			/* If multiple threads are competing for this napi,
-			 * we avoid dirtying napi->state as much as we can.
-			 */
-			if (val & (NAPIF_STATE_DISABLE | NAPIF_STATE_SCHED |
-				   NAPIF_STATE_IN_BUSY_POLL)) {
-				if (prefer_busy_poll)
-					set_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);
-				goto count;
-			}
-			if (cmpxchg(&napi->state, val,
-				    val | NAPIF_STATE_IN_BUSY_POLL |
-					  NAPIF_STATE_SCHED) != val) {
-				if (prefer_busy_poll)
-					set_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);
-				goto count;
-			}
-			have_poll_lock = netpoll_poll_lock(napi);
-			napi_poll = napi->poll;
-		}
-		work = napi_poll(napi, budget);
-		trace_napi_poll(napi, work, budget);
-		gro_normal_list(napi);
-count:
-		if (work > 0)
-			__NET_ADD_STATS(dev_net(napi->dev),
-					LINUX_MIB_BUSYPOLLRXPACKETS, work);
-		local_bh_enable();
+		__napi_busy_poll(&ctx, prefer_busy_poll, budget);
 
 		if (!loop_end || loop_end(loop_end_arg, start_time))
 			break;
 
 		if (unlikely(need_resched())) {
-			if (napi_poll)
-				busy_poll_stop(napi, have_poll_lock, prefer_busy_poll, budget);
+			if (ctx.napi_poll)
+				busy_poll_stop(ctx.napi, ctx.have_poll_lock, prefer_busy_poll, budget);
 			preempt_enable();
 			rcu_read_unlock();
 			cond_resched();
@@ -6235,8 +6248,8 @@ void napi_busy_loop(unsigned int napi_id,
 		}
 		cpu_relax();
 	}
-	if (napi_poll)
-		busy_poll_stop(napi, have_poll_lock, prefer_busy_poll, budget);
+	if (ctx.napi_poll)
+		busy_poll_stop(ctx.napi, ctx.have_poll_lock, prefer_busy_poll, budget);
 	preempt_enable();
 out:
 	rcu_read_unlock();
This splits off the key part of the napi_busy_poll function into its own
function __napi_busy_poll. This is done in preparation for an additional
napi_busy_poll() function that doesn't take the rcu_read_lock(). The new
function is introduced in the next patch.

Signed-off-by: Stefan Roesch <shr@devkernel.io>
---
 net/core/dev.c | 99 ++++++++++++++++++++++++++++----------------------
 1 file changed, 56 insertions(+), 43 deletions(-)
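The follow-up patch itself is not included here. For context, the
caller-side shape it is preparing for, following the call trace in
Stefan's reply above, would look roughly like the sketch below. It
assumes the napi_busy_loop_rcu() signature from the earlier sketch, and
the struct, list, and function names are placeholders rather than the
actual io_uring code:

#include <linux/rculist.h>	/* list_for_each_entry_rcu() */
#include <linux/rcupdate.h>	/* rcu_read_lock()/unlock() */
#include <net/busy_poll.h>	/* BUSY_POLL_BUDGET */

/* Hypothetical io_uring-style caller: walk a NAPI id list under one
 * RCU read-side critical section and busy-poll each entry. This is
 * why the inner helper must not drop the RCU lock or schedule behind
 * the caller's back.
 */
struct io_napi_entry {
	unsigned int		napi_id;
	struct list_head	list;
};

static void io_napi_blocking_busy_loop_sketch(struct list_head *napi_list,
					       bool (*loop_end)(void *, unsigned long),
					       void *loop_end_arg)
{
	struct io_napi_entry *e;

	rcu_read_lock();
	list_for_each_entry_rcu(e, napi_list, list) {
		/* must not unlock RCU or sleep internally */
		napi_busy_loop_rcu(e->napi_id, loop_end, loop_end_arg,
				   false, BUSY_POLL_BUDGET);
	}
	rcu_read_unlock();
}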