Message ID | 20241016185252.3746190-9-dw@davidwei.uk (mailing list archive) |
---|---|
State | New |
Series | io_uring zero copy rx |
Hi,

On 10/16/24 20:52, David Wei wrote:
> @@ -6503,6 +6511,41 @@ void napi_busy_loop(unsigned int napi_id,
>  }
>  EXPORT_SYMBOL(napi_busy_loop);
>
> +void napi_execute(unsigned napi_id,
> +		  void (*cb)(void *), void *cb_arg)
> +{
> +	struct napi_struct *napi;
> +	void *have_poll_lock = NULL;

Minor nit: please respect the reverse x-mas tree order.

> +
> +	guard(rcu)();

Since this will land into net core code, please use the explicit RCU
read lock/unlock:

https://elixir.bootlin.com/linux/v6.12-rc3/source/Documentation/process/maintainer-netdev.rst#L387

> +	napi = napi_by_id(napi_id);
> +	if (!napi)
> +		return;
> +
> +	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> +		preempt_disable();
> +
> +	for (;;) {
> +		local_bh_disable();
> +
> +		if (napi_state_start_busy_polling(napi, 0)) {
> +			have_poll_lock = netpoll_poll_lock(napi);
> +			cb(cb_arg);
> +			local_bh_enable();
> +			busy_poll_stop(napi, have_poll_lock, 0, 1);
> +			break;
> +		}
> +
> +		local_bh_enable();
> +		if (unlikely(need_resched()))
> +			break;
> +		cpu_relax();

Don't you need a 'loop_end' condition here?

Side notes not specifically related to this patch: I likely got lost in
previous revision, but it's unclear to me which is the merge plan here,
could you please (re-)word it?

Thanks!

Paolo
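For reference, a sketch of how the function might look with both of these nits addressed: declarations in reverse xmas tree order (longest line first) and an explicit RCU read-side critical section (rcu_read_lock()/rcu_read_unlock()) instead of guard(rcu)(). The loop body is copied unchanged from the posted patch; only the declaration order and the RCU handling differ:

```c
void napi_execute(unsigned napi_id,
		  void (*cb)(void *), void *cb_arg)
{
	void *have_poll_lock = NULL;	/* longest declaration line first */
	struct napi_struct *napi;

	rcu_read_lock();		/* explicit instead of guard(rcu)() */
	napi = napi_by_id(napi_id);
	if (!napi)
		goto out;

	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
		preempt_disable();

	for (;;) {
		local_bh_disable();

		if (napi_state_start_busy_polling(napi, 0)) {
			have_poll_lock = netpoll_poll_lock(napi);
			cb(cb_arg);
			local_bh_enable();
			busy_poll_stop(napi, have_poll_lock, 0, 1);
			break;
		}

		local_bh_enable();
		if (unlikely(need_resched()))
			break;
		cpu_relax();
	}

	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
		preempt_enable();
out:
	rcu_read_unlock();
}
```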
On 10/21/24 8:25 AM, Paolo Abeni wrote:
> Side notes not specifically related to this patch: I likely got lost in
> previous revision, but it's unclear to me which is the merge plan here,
> could you please (re-)word it?

In the past when there's been dependencies like this, what we've done
is:

Someone, usually Jakub, sets up a branch with the net bits only. Both
the io_uring tree and netdev-next pulls that in.

With that, then the io_uring tree can apply the io_uring specific
patches on top. And either side can send a pull, won't impact the other
tree.

I like that way of doing it, as it keeps things separate, yet still easy
to deal with for the side that needs further work/patches on top.
On 10/21/24 16:49, Jens Axboe wrote:
> On 10/21/24 8:25 AM, Paolo Abeni wrote:
>> Side notes not specifically related to this patch: I likely got lost in
>> previous revision, but it's unclear to me which is the merge plan here,
>> could you please (re-)word it?
>
> In the past when there's been dependencies like this, what we've done
> is:
>
> Someone, usually Jakub, sets up a branch with the net bits only. Both
> the io_uring tree and netdev-next pulls that in.
>
> With that, then the io_uring tree can apply the io_uring specific
> patches on top. And either side can send a pull, won't impact the other
> tree.
>
> I like that way of doing it, as it keeps things separate, yet still easy
> to deal with for the side that needs further work/patches on top.

Yep, I outlined the same thing in one of the comments (should put it
into the cover letter). We'll get a branch with net/ patches on the
common base, so that it can be pulled into net and io-uring trees.
Then, we can deal with io_uring patches ourselves.
On 10/21/24 15:25, Paolo Abeni wrote:
> Hi,
>
> On 10/16/24 20:52, David Wei wrote:
>> @@ -6503,6 +6511,41 @@ void napi_busy_loop(unsigned int napi_id,
>>  }
>>  EXPORT_SYMBOL(napi_busy_loop);
>>
>> +void napi_execute(unsigned napi_id,
>> +		  void (*cb)(void *), void *cb_arg)
>> +{
>> +	struct napi_struct *napi;
>> +	void *have_poll_lock = NULL;
>
> Minor nit: please respect the reverse x-mas tree order.
>
>> +
>> +	guard(rcu)();
>
> Since this will land into net core code, please use the explicit RCU
> read lock/unlock:
>
> https://elixir.bootlin.com/linux/v6.12-rc3/source/Documentation/process/maintainer-netdev.rst#L387

I missed the doc update, will change it, thanks

>> +	napi = napi_by_id(napi_id);
>> +	if (!napi)
>> +		return;
>> +
>> +	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
>> +		preempt_disable();
>> +
>> +	for (;;) {
>> +		local_bh_disable();
>> +
>> +		if (napi_state_start_busy_polling(napi, 0)) {
>> +			have_poll_lock = netpoll_poll_lock(napi);
>> +			cb(cb_arg);
>> +			local_bh_enable();
>> +			busy_poll_stop(napi, have_poll_lock, 0, 1);
>> +			break;
>> +		}
>> +
>> +		local_bh_enable();
>> +		if (unlikely(need_resched()))
>> +			break;
>> +		cpu_relax();
>
> Don't you need a 'loop_end' condition here?

As you mentioned in 14/15, it can indeed spin for long and is bound only
by need_resched(). Do you think it's reasonable to wait for it without a
time limit with NAPI_STATE_PREFER_BUSY_POLL? softirq should yield napi
after it exhausts the budget, it should limit it well enough, what do
you think?

The only ugly part is that I don't want it to mess with the
NAPI_F_PREFER_BUSY_POLL in busy_poll_stop()

busy_poll_stop() {
	...
	clear_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state);
	if (flags & NAPI_F_PREFER_BUSY_POLL) {
		napi->defer_hard_irqs_count = READ_ONCE(napi->dev->napi_defer_hard_irqs);
		timeout = READ_ONCE(napi->dev->gro_flush_timeout);
		if (napi->defer_hard_irqs_count && timeout) {
			hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
			skip_schedule = true;
		}
	}
}

Is it fine to set PREFER_BUSY_POLL but do the stop call without? E.g.

set_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);
...
busy_poll_stop(napi, flags=0);
Hi,

On 10/21/24 19:16, Pavel Begunkov wrote:
> On 10/21/24 15:25, Paolo Abeni wrote:
>> On 10/16/24 20:52, David Wei wrote:

[...]

>>> +	napi = napi_by_id(napi_id);
>>> +	if (!napi)
>>> +		return;
>>> +
>>> +	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
>>> +		preempt_disable();
>>> +
>>> +	for (;;) {
>>> +		local_bh_disable();
>>> +
>>> +		if (napi_state_start_busy_polling(napi, 0)) {
>>> +			have_poll_lock = netpoll_poll_lock(napi);
>>> +			cb(cb_arg);
>>> +			local_bh_enable();
>>> +			busy_poll_stop(napi, have_poll_lock, 0, 1);
>>> +			break;
>>> +		}
>>> +
>>> +		local_bh_enable();
>>> +		if (unlikely(need_resched()))
>>> +			break;
>>> +		cpu_relax();
>>
>> Don't you need a 'loop_end' condition here?
>
> As you mentioned in 14/15, it can indeed spin for long and is bound only
> by need_resched(). Do you think it's reasonable to wait for it without a
> time limit with NAPI_STATE_PREFER_BUSY_POLL? softirq should yield napi
> after it exhausts the budget, it should limit it well enough, what do
> you think?
>
> The only ugly part is that I don't want it to mess with the
> NAPI_F_PREFER_BUSY_POLL in busy_poll_stop()
>
> busy_poll_stop() {
> 	...
> 	clear_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state);
> 	if (flags & NAPI_F_PREFER_BUSY_POLL) {
> 		napi->defer_hard_irqs_count = READ_ONCE(napi->dev->napi_defer_hard_irqs);
> 		timeout = READ_ONCE(napi->dev->gro_flush_timeout);
> 		if (napi->defer_hard_irqs_count && timeout) {
> 			hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
> 			skip_schedule = true;
> 		}
> 	}
> }

Why do you want to avoid such branch? It will do any action only when
the user-space explicitly want to leverage the hrtimer to check for
incoming packets. In such case, I think the kernel should try to respect
the user configuration.

> Is it fine to set PREFER_BUSY_POLL but do the stop call without? E.g.
>
> set_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);
> ...
> busy_poll_stop(napi, flags=0);

My preference is for using NAPI_STATE_PREFER_BUSY_POLL consistently. It
should ensure a reasonably low latency for napi_execute() and consistent
infra behavior. Unless I'm missing some dangerous side effect ;)

Thanks,

Paolo
On 10/22/24 08:47, Paolo Abeni wrote:
> Hi,
>
> On 10/21/24 19:16, Pavel Begunkov wrote:
>> On 10/21/24 15:25, Paolo Abeni wrote:
>>> On 10/16/24 20:52, David Wei wrote:
>
> [...]
>>>> +	napi = napi_by_id(napi_id);
>>>> +	if (!napi)
>>>> +		return;
>>>> +
>>>> +	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
>>>> +		preempt_disable();
>>>> +
>>>> +	for (;;) {
>>>> +		local_bh_disable();
>>>> +
>>>> +		if (napi_state_start_busy_polling(napi, 0)) {
>>>> +			have_poll_lock = netpoll_poll_lock(napi);
>>>> +			cb(cb_arg);
>>>> +			local_bh_enable();
>>>> +			busy_poll_stop(napi, have_poll_lock, 0, 1);
>>>> +			break;
>>>> +		}
>>>> +
>>>> +		local_bh_enable();
>>>> +		if (unlikely(need_resched()))
>>>> +			break;
>>>> +		cpu_relax();
>>>
>>> Don't you need a 'loop_end' condition here?
>>
>> As you mentioned in 14/15, it can indeed spin for long and is bound only
>> by need_resched(). Do you think it's reasonable to wait for it without a
>> time limit with NAPI_STATE_PREFER_BUSY_POLL? softirq should yield napi
>> after it exhausts the budget, it should limit it well enough, what do
>> you think?
>>
>> The only ugly part is that I don't want it to mess with the
>> NAPI_F_PREFER_BUSY_POLL in busy_poll_stop()
>>
>> busy_poll_stop() {
>> 	...
>> 	clear_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state);
>> 	if (flags & NAPI_F_PREFER_BUSY_POLL) {
>> 		napi->defer_hard_irqs_count = READ_ONCE(napi->dev->napi_defer_hard_irqs);
>> 		timeout = READ_ONCE(napi->dev->gro_flush_timeout);
>> 		if (napi->defer_hard_irqs_count && timeout) {
>> 			hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
>> 			skip_schedule = true;
>> 		}
>> 	}
>> }
>
> Why do you want to avoid such branch? It will do any action only when
> the user-space explicitly want to leverage the hrtimer to check for
> incoming packets. In such case, I think the kernel should try to respect
> the user configuration.

It should be fine to pass the flag here, it just doesn't feel right.
napi_execute() is not interested in polling, but IIRC this chunk delays
the moment when softirq kicks in when there are no napi pollers. I.e.
IMO ideally it shouldn't affect napi polling timings...

>> Is it fine to set PREFER_BUSY_POLL but do the stop call without? E.g.
>>
>> set_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);
>> ...
>> busy_poll_stop(napi, flags=0);
>
> My preference is for using NAPI_STATE_PREFER_BUSY_POLL consistently. It
> should ensure a reasonably low latency for napi_execute() and consistent
> infra behavior. Unless I'm missing some dangerous side effect ;)

... but let's just set it then. It only affects the zerocopy private
queue.
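A minimal sketch of the variant discussed above: NAPI_STATE_PREFER_BUSY_POLL is set before the acquire loop so the softirq poller yields the NAPI once it exhausts its budget, bounding how long the loop can spin, while busy_poll_stop() is still called with flags == 0 as in the posted patch. Whether the final version also passes NAPI_F_PREFER_BUSY_POLL to busy_poll_stop() is left open in this exchange; this is only an illustration of the proposal, not the follow-up patch itself:

```c
	/* Relevant portion of napi_execute(): hint that a (pseudo) busy
	 * poller is present, so the softirq side backs off after its
	 * budget instead of hogging this napi indefinitely.
	 */
	set_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);

	for (;;) {
		local_bh_disable();

		if (napi_state_start_busy_polling(napi, 0)) {
			have_poll_lock = netpoll_poll_lock(napi);
			cb(cb_arg);
			local_bh_enable();
			/* flags == 0: don't arm the deferral hrtimer here */
			busy_poll_stop(napi, have_poll_lock, 0, 1);
			break;
		}

		local_bh_enable();
		if (unlikely(need_resched()))
			break;
		cpu_relax();
	}
```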
diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index f03040baaefd..3fd9e65731e9 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -47,6 +47,7 @@ bool sk_busy_loop_end(void *p, unsigned long start_time);
 void napi_busy_loop(unsigned int napi_id,
 		    bool (*loop_end)(void *, unsigned long),
 		    void *loop_end_arg, bool prefer_busy_poll, u16 budget);
+void napi_execute(unsigned napi_id, void (*cb)(void *), void *cb_arg);
 
 void napi_busy_loop_rcu(unsigned int napi_id,
 			bool (*loop_end)(void *, unsigned long),
@@ -63,6 +64,11 @@ static inline bool sk_can_busy_loop(struct sock *sk)
 	return false;
 }
 
+static inline void napi_execute(unsigned napi_id,
+				void (*cb)(void *), void *cb_arg)
+{
+}
+
 #endif /* CONFIG_NET_RX_BUSY_POLL */
 
 static inline unsigned long busy_loop_current_time(void)
diff --git a/net/core/dev.c b/net/core/dev.c
index c682173a7642..f3bd5fd56286 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6347,6 +6347,30 @@ enum {
 	NAPI_F_END_ON_RESCHED	= 2,
 };
 
+static inline bool napi_state_start_busy_polling(struct napi_struct *napi,
+						 unsigned flags)
+{
+	unsigned long val = READ_ONCE(napi->state);
+
+	/* If multiple threads are competing for this napi,
+	 * we avoid dirtying napi->state as much as we can.
+	 */
+	if (val & (NAPIF_STATE_DISABLE | NAPIF_STATE_SCHED |
+		   NAPIF_STATE_IN_BUSY_POLL))
+		goto fail;
+
+	if (cmpxchg(&napi->state, val,
+		    val | NAPIF_STATE_IN_BUSY_POLL |
+			  NAPIF_STATE_SCHED) != val)
+		goto fail;
+
+	return true;
+fail:
+	if (flags & NAPI_F_PREFER_BUSY_POLL)
+		set_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);
+	return false;
+}
+
 static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
 			   unsigned flags, u16 budget)
 {
@@ -6422,24 +6446,8 @@ static void __napi_busy_loop(unsigned int napi_id,
 		local_bh_disable();
 		bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 		if (!napi_poll) {
-			unsigned long val = READ_ONCE(napi->state);
-
-			/* If multiple threads are competing for this napi,
-			 * we avoid dirtying napi->state as much as we can.
-			 */
-			if (val & (NAPIF_STATE_DISABLE | NAPIF_STATE_SCHED |
-				   NAPIF_STATE_IN_BUSY_POLL)) {
-				if (flags & NAPI_F_PREFER_BUSY_POLL)
-					set_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);
+			if (!napi_state_start_busy_polling(napi, flags))
 				goto count;
-			}
-			if (cmpxchg(&napi->state, val,
-				    val | NAPIF_STATE_IN_BUSY_POLL |
-					  NAPIF_STATE_SCHED) != val) {
-				if (flags & NAPI_F_PREFER_BUSY_POLL)
-					set_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);
-				goto count;
-			}
 			have_poll_lock = netpoll_poll_lock(napi);
 			napi_poll = napi->poll;
 		}
@@ -6503,6 +6511,41 @@ void napi_busy_loop(unsigned int napi_id,
 }
 EXPORT_SYMBOL(napi_busy_loop);
 
+void napi_execute(unsigned napi_id,
+		  void (*cb)(void *), void *cb_arg)
+{
+	struct napi_struct *napi;
+	void *have_poll_lock = NULL;
+
+	guard(rcu)();
+	napi = napi_by_id(napi_id);
+	if (!napi)
+		return;
+
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
+	for (;;) {
+		local_bh_disable();
+
+		if (napi_state_start_busy_polling(napi, 0)) {
+			have_poll_lock = netpoll_poll_lock(napi);
+			cb(cb_arg);
+			local_bh_enable();
+			busy_poll_stop(napi, have_poll_lock, 0, 1);
+			break;
+		}
+
+		local_bh_enable();
+		if (unlikely(need_resched()))
+			break;
+		cpu_relax();
+	}
+
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
+}
+
 #endif /* CONFIG_NET_RX_BUSY_POLL */
 
 static void __napi_hash_add_with_id(struct napi_struct *napi,
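For context, a hypothetical caller sketch showing how the new helper is meant to be used. All names here (struct my_ifq, my_ifq_refill(), kick_refill()) are illustrative and not taken from this series:

```c
/* Hypothetical usage of napi_execute(); none of these names come from
 * the posted patches.
 */
struct refill_work {
	struct my_ifq *ifq;	/* illustrative queue type */
};

static void refill_cb(void *arg)
{
	struct refill_work *w = arg;

	/* Runs with NAPIF_STATE_SCHED | NAPIF_STATE_IN_BUSY_POLL held and
	 * bottom halves disabled, so it should not run concurrently with
	 * the driver's napi->poll() for this queue.
	 */
	my_ifq_refill(w->ifq);
}

static void kick_refill(struct my_ifq *ifq, unsigned int napi_id)
{
	struct refill_work w = { .ifq = ifq };

	/* Silently does nothing if napi_id doesn't resolve to a napi. */
	napi_execute(napi_id, refill_cb, &w);
}
```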