diff mbox series

net: raise RCU qs after each threaded NAPI poll

Message ID Zd4DXTyCf17lcTfq@debian.debian (mailing list archive)
State Superseded
Headers show
Series net: raise RCU qs after each threaded NAPI poll | expand

Commit Message

Yan Zhai Feb. 27, 2024, 3:44 p.m. UTC
We noticed task RCUs being blocked when threaded NAPIs are very busy in
production: detaching any BPF tracing programs, i.e. removing a ftrace
trampoline, will simply block for very long in rcu_tasks_wait_gp. This
ranges from hundreds of seconds to even an hour, severely harming any
observability tools that rely on BPF tracing programs. It can be
easily reproduced locally with following setup:

ip netns add test1
ip netns add test2

ip -n test1 link add veth1 type veth peer name veth2 netns test2

ip -n test1 link set veth1 up
ip -n test1 link set lo up
ip -n test2 link set veth2 up
ip -n test2 link set lo up

ip -n test1 addr add 192.168.1.2/31 dev veth1
ip -n test1 addr add 1.1.1.1/32 dev lo
ip -n test2 addr add 192.168.1.3/31 dev veth2
ip -n test2 addr add 2.2.2.2/31 dev lo

ip -n test1 route add default via 192.168.1.3
ip -n test2 route add default via 192.168.1.2

for i in `seq 10 210`; do
 for j in `seq 10 210`; do
    ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201
 done
done

ip netns exec test2 ethtool -K veth2 gro on
ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded'
ip netns exec test1 ethtool -K veth1 tso off

Then run an iperf3 client/server and a bpftrace script can trigger it:

ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null&
ip netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null&
bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}'

Above reproduce for net-next kernel with following RCU and preempt
configuraitons:

# RCU Subsystem
CONFIG_TREE_RCU=y
CONFIG_PREEMPT_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_TASKS_RCU=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_TASKS_TRACE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
# end of RCU Subsystem
# RCU Debugging
# CONFIG_RCU_SCALE_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_REF_SCALE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=21
CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
# CONFIG_RCU_TRACE is not set
# CONFIG_RCU_EQS_DEBUG is not set
# end of RCU Debugging

CONFIG_PREEMPT_BUILD=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_COUNT=y
CONFIG_PREEMPTION=y
CONFIG_PREEMPT_DYNAMIC=y
CONFIG_PREEMPT_RCU=y
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_DEBUG_PREEMPT is not set
# CONFIG_PREEMPT_TRACER is not set
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set

An interesting observation is that, while tasks RCUs are blocked,
related NAPI thread is still being scheduled (even across cores)
regularly. Looking at the gp conditions, I am inclining to cond_resched
after each __napi_poll being the problem: cond_resched enters the
scheduler with PREEMPT bit, which does not account as a gp for tasks
RCUs. Meanwhile, since the thread has been frequently resched, the
normal scheduling point (no PREEMPT bit, accounted as a task RCU gp)
seems to have very little chance to kick in. Given the nature of "busy
polling" program, such NAPI thread won't have task->nvcsw or task->on_rq
updated (other gp conditions), the result is that such NAPI thread is
put on RCU holdouts list for indefinitely long time.

This is simply fixed by mirroring the ksoftirqd behavior: after
NAPI/softirq work, raise a RCU QS to help expedite the RCU period. No
more blocking afterwards for the same setup.

Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
Signed-off-by: Yan Zhai <yan@cloudflare.com>
---
 net/core/dev.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Eric Dumazet Feb. 27, 2024, 4:44 p.m. UTC | #1
On Tue, Feb 27, 2024 at 4:44 PM Yan Zhai <yan@cloudflare.com> wrote:
>
> We noticed task RCUs being blocked when threaded NAPIs are very busy in
> production: detaching any BPF tracing programs, i.e. removing a ftrace
> trampoline, will simply block for very long in rcu_tasks_wait_gp. This
> ranges from hundreds of seconds to even an hour, severely harming any
> observability tools that rely on BPF tracing programs. It can be
> easily reproduced locally with following setup:
>
> ip netns add test1
> ip netns add test2
>
> ip -n test1 link add veth1 type veth peer name veth2 netns test2
>
> ip -n test1 link set veth1 up
> ip -n test1 link set lo up
> ip -n test2 link set veth2 up
> ip -n test2 link set lo up
>
> ip -n test1 addr add 192.168.1.2/31 dev veth1
> ip -n test1 addr add 1.1.1.1/32 dev lo
> ip -n test2 addr add 192.168.1.3/31 dev veth2
> ip -n test2 addr add 2.2.2.2/31 dev lo
>
> ip -n test1 route add default via 192.168.1.3
> ip -n test2 route add default via 192.168.1.2
>
> for i in `seq 10 210`; do
>  for j in `seq 10 210`; do
>     ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201
>  done
> done
>
> ip netns exec test2 ethtool -K veth2 gro on
> ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded'
> ip netns exec test1 ethtool -K veth1 tso off
>
> Then run an iperf3 client/server and a bpftrace script can trigger it:
>
> ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null&
> ip netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null&
> bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}'
>
> Above reproduce for net-next kernel with following RCU and preempt
> configuraitons:
>
> # RCU Subsystem
> CONFIG_TREE_RCU=y
> CONFIG_PREEMPT_RCU=y
> # CONFIG_RCU_EXPERT is not set
> CONFIG_SRCU=y
> CONFIG_TREE_SRCU=y
> CONFIG_TASKS_RCU_GENERIC=y
> CONFIG_TASKS_RCU=y
> CONFIG_TASKS_RUDE_RCU=y
> CONFIG_TASKS_TRACE_RCU=y
> CONFIG_RCU_STALL_COMMON=y
> CONFIG_RCU_NEED_SEGCBLIST=y
> # end of RCU Subsystem
> # RCU Debugging
> # CONFIG_RCU_SCALE_TEST is not set
> # CONFIG_RCU_TORTURE_TEST is not set
> # CONFIG_RCU_REF_SCALE_TEST is not set
> CONFIG_RCU_CPU_STALL_TIMEOUT=21
> CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
> # CONFIG_RCU_TRACE is not set
> # CONFIG_RCU_EQS_DEBUG is not set
> # end of RCU Debugging
>
> CONFIG_PREEMPT_BUILD=y
> # CONFIG_PREEMPT_NONE is not set
> CONFIG_PREEMPT_VOLUNTARY=y
> # CONFIG_PREEMPT is not set
> CONFIG_PREEMPT_COUNT=y
> CONFIG_PREEMPTION=y
> CONFIG_PREEMPT_DYNAMIC=y
> CONFIG_PREEMPT_RCU=y
> CONFIG_HAVE_PREEMPT_DYNAMIC=y
> CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
> CONFIG_PREEMPT_NOTIFIERS=y
> # CONFIG_DEBUG_PREEMPT is not set
> # CONFIG_PREEMPT_TRACER is not set
> # CONFIG_PREEMPTIRQ_DELAY_TEST is not set
>
> An interesting observation is that, while tasks RCUs are blocked,
> related NAPI thread is still being scheduled (even across cores)
> regularly. Looking at the gp conditions, I am inclining to cond_resched
> after each __napi_poll being the problem: cond_resched enters the
> scheduler with PREEMPT bit, which does not account as a gp for tasks
> RCUs. Meanwhile, since the thread has been frequently resched, the
> normal scheduling point (no PREEMPT bit, accounted as a task RCU gp)
> seems to have very little chance to kick in. Given the nature of "busy
> polling" program, such NAPI thread won't have task->nvcsw or task->on_rq
> updated (other gp conditions), the result is that such NAPI thread is
> put on RCU holdouts list for indefinitely long time.
>
> This is simply fixed by mirroring the ksoftirqd behavior: after
> NAPI/softirq work, raise a RCU QS to help expedite the RCU period. No
> more blocking afterwards for the same setup.
>
> Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
> Signed-off-by: Yan Zhai <yan@cloudflare.com>
> ---
>  net/core/dev.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 275fd5259a4a..6e41263ff5d3 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6773,6 +6773,10 @@ static int napi_threaded_poll(void *data)
>                                 net_rps_action_and_irq_enable(sd);
>                         }
>                         skb_defer_free_flush(sd);
> +
> +                       if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> +                               rcu_softirq_qs();
> +
>                         local_bh_enable();
>
>                         if (!repoll)
> --
> 2.30.2
>

Hmm....
Why napi_busy_loop() does not have a similar problem ?

It is unclear why rcu_all_qs() in __cond_resched() is guarded by

#ifndef CONFIG_PREEMPT_RCU
     rcu_all_qs();
#endif


Thanks.
Paul E. McKenney Feb. 27, 2024, 6:32 p.m. UTC | #2
On Tue, Feb 27, 2024 at 05:44:17PM +0100, Eric Dumazet wrote:
> On Tue, Feb 27, 2024 at 4:44 PM Yan Zhai <yan@cloudflare.com> wrote:
> >
> > We noticed task RCUs being blocked when threaded NAPIs are very busy in
> > production: detaching any BPF tracing programs, i.e. removing a ftrace
> > trampoline, will simply block for very long in rcu_tasks_wait_gp. This
> > ranges from hundreds of seconds to even an hour, severely harming any
> > observability tools that rely on BPF tracing programs. It can be
> > easily reproduced locally with following setup:
> >
> > ip netns add test1
> > ip netns add test2
> >
> > ip -n test1 link add veth1 type veth peer name veth2 netns test2
> >
> > ip -n test1 link set veth1 up
> > ip -n test1 link set lo up
> > ip -n test2 link set veth2 up
> > ip -n test2 link set lo up
> >
> > ip -n test1 addr add 192.168.1.2/31 dev veth1
> > ip -n test1 addr add 1.1.1.1/32 dev lo
> > ip -n test2 addr add 192.168.1.3/31 dev veth2
> > ip -n test2 addr add 2.2.2.2/31 dev lo
> >
> > ip -n test1 route add default via 192.168.1.3
> > ip -n test2 route add default via 192.168.1.2
> >
> > for i in `seq 10 210`; do
> >  for j in `seq 10 210`; do
> >     ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201
> >  done
> > done
> >
> > ip netns exec test2 ethtool -K veth2 gro on
> > ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded'
> > ip netns exec test1 ethtool -K veth1 tso off
> >
> > Then run an iperf3 client/server and a bpftrace script can trigger it:
> >
> > ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null&
> > ip netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null&
> > bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}'
> >
> > Above reproduce for net-next kernel with following RCU and preempt
> > configuraitons:
> >
> > # RCU Subsystem
> > CONFIG_TREE_RCU=y
> > CONFIG_PREEMPT_RCU=y
> > # CONFIG_RCU_EXPERT is not set
> > CONFIG_SRCU=y
> > CONFIG_TREE_SRCU=y
> > CONFIG_TASKS_RCU_GENERIC=y
> > CONFIG_TASKS_RCU=y
> > CONFIG_TASKS_RUDE_RCU=y
> > CONFIG_TASKS_TRACE_RCU=y
> > CONFIG_RCU_STALL_COMMON=y
> > CONFIG_RCU_NEED_SEGCBLIST=y
> > # end of RCU Subsystem
> > # RCU Debugging
> > # CONFIG_RCU_SCALE_TEST is not set
> > # CONFIG_RCU_TORTURE_TEST is not set
> > # CONFIG_RCU_REF_SCALE_TEST is not set
> > CONFIG_RCU_CPU_STALL_TIMEOUT=21
> > CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
> > # CONFIG_RCU_TRACE is not set
> > # CONFIG_RCU_EQS_DEBUG is not set
> > # end of RCU Debugging
> >
> > CONFIG_PREEMPT_BUILD=y
> > # CONFIG_PREEMPT_NONE is not set
> > CONFIG_PREEMPT_VOLUNTARY=y
> > # CONFIG_PREEMPT is not set
> > CONFIG_PREEMPT_COUNT=y
> > CONFIG_PREEMPTION=y
> > CONFIG_PREEMPT_DYNAMIC=y
> > CONFIG_PREEMPT_RCU=y
> > CONFIG_HAVE_PREEMPT_DYNAMIC=y
> > CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
> > CONFIG_PREEMPT_NOTIFIERS=y
> > # CONFIG_DEBUG_PREEMPT is not set
> > # CONFIG_PREEMPT_TRACER is not set
> > # CONFIG_PREEMPTIRQ_DELAY_TEST is not set
> >
> > An interesting observation is that, while tasks RCUs are blocked,
> > related NAPI thread is still being scheduled (even across cores)
> > regularly. Looking at the gp conditions, I am inclining to cond_resched
> > after each __napi_poll being the problem: cond_resched enters the
> > scheduler with PREEMPT bit, which does not account as a gp for tasks
> > RCUs. Meanwhile, since the thread has been frequently resched, the
> > normal scheduling point (no PREEMPT bit, accounted as a task RCU gp)
> > seems to have very little chance to kick in. Given the nature of "busy
> > polling" program, such NAPI thread won't have task->nvcsw or task->on_rq
> > updated (other gp conditions), the result is that such NAPI thread is
> > put on RCU holdouts list for indefinitely long time.
> >
> > This is simply fixed by mirroring the ksoftirqd behavior: after
> > NAPI/softirq work, raise a RCU QS to help expedite the RCU period. No
> > more blocking afterwards for the same setup.
> >
> > Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
> > Signed-off-by: Yan Zhai <yan@cloudflare.com>
> > ---
> >  net/core/dev.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 275fd5259a4a..6e41263ff5d3 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -6773,6 +6773,10 @@ static int napi_threaded_poll(void *data)
> >                                 net_rps_action_and_irq_enable(sd);
> >                         }
> >                         skb_defer_free_flush(sd);

Please put a comment here stating that RCU readers cannot cross
this point.

I need to add lockdep to rcu_softirq_qs() to catch placing this in an
RCU read-side critical section.  And a header comment noting that from
an RCU perspective, it acts as a momentary enabling of preemption.

> > +                       if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> > +                               rcu_softirq_qs();
> > +
> >                         local_bh_enable();
> >
> >                         if (!repoll)
> > --
> > 2.30.2
> >
> 
> Hmm....
> Why napi_busy_loop() does not have a similar problem ?
> 
> It is unclear why rcu_all_qs() in __cond_resched() is guarded by
> 
> #ifndef CONFIG_PREEMPT_RCU
>      rcu_all_qs();
> #endif

The theory is that PREEMPT_RCU kernels have preemption, and get their
quiescent states that way.

The more recent practice involves things like PREEMPT_DYNAMIC and maybe
soon PREEMPT_AUTO, which might require adjustments, so thank you for
pointing this out!

Back on the patch, my main other concern is that someone somewhere might
be using something like synchronize_rcu() to wait for all in-progress
softirq handlers to complete.  But I don't know of such a thing, and if
there is, there are workarounds, including synchronize_rcu_tasks().

So something to be aware of, not (as far as I know) something to block
this commit.

With the added comment:

Acked-by: Paul E. McKenney <paulmck@kernel.org>

							Thanx, Paul
Yan Zhai Feb. 27, 2024, 9:17 p.m. UTC | #3
On Tue, Feb 27, 2024 at 10:44 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Hmm....
> Why napi_busy_loop() does not have a similar problem ?
>
That's a good question. Let me try if I can repro this on a busy loop
as well, since the structure seems very alike.

> It is unclear why rcu_all_qs() in __cond_resched() is guarded by
>
> #ifndef CONFIG_PREEMPT_RCU
>      rcu_all_qs();
> #endif
>
>
> Thanks.
Yan Zhai Feb. 27, 2024, 9:22 p.m. UTC | #4
On Tue, Feb 27, 2024 at 12:32 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Tue, Feb 27, 2024 at 05:44:17PM +0100, Eric Dumazet wrote:
> > On Tue, Feb 27, 2024 at 4:44 PM Yan Zhai <yan@cloudflare.com> wrote:
> > >
> > > We noticed task RCUs being blocked when threaded NAPIs are very busy in
> > > production: detaching any BPF tracing programs, i.e. removing a ftrace
> > > trampoline, will simply block for very long in rcu_tasks_wait_gp. This
> > > ranges from hundreds of seconds to even an hour, severely harming any
> > > observability tools that rely on BPF tracing programs. It can be
> > > easily reproduced locally with following setup:
> > >
> > > ip netns add test1
> > > ip netns add test2
> > >
> > > ip -n test1 link add veth1 type veth peer name veth2 netns test2
> > >
> > > ip -n test1 link set veth1 up
> > > ip -n test1 link set lo up
> > > ip -n test2 link set veth2 up
> > > ip -n test2 link set lo up
> > >
> > > ip -n test1 addr add 192.168.1.2/31 dev veth1
> > > ip -n test1 addr add 1.1.1.1/32 dev lo
> > > ip -n test2 addr add 192.168.1.3/31 dev veth2
> > > ip -n test2 addr add 2.2.2.2/31 dev lo
> > >
> > > ip -n test1 route add default via 192.168.1.3
> > > ip -n test2 route add default via 192.168.1.2
> > >
> > > for i in `seq 10 210`; do
> > >  for j in `seq 10 210`; do
> > >     ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201
> > >  done
> > > done
> > >
> > > ip netns exec test2 ethtool -K veth2 gro on
> > > ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded'
> > > ip netns exec test1 ethtool -K veth1 tso off
> > >
> > > Then run an iperf3 client/server and a bpftrace script can trigger it:
> > >
> > > ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null&
> > > ip netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null&
> > > bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}'
> > >
> > > Above reproduce for net-next kernel with following RCU and preempt
> > > configuraitons:
> > >
> > > # RCU Subsystem
> > > CONFIG_TREE_RCU=y
> > > CONFIG_PREEMPT_RCU=y
> > > # CONFIG_RCU_EXPERT is not set
> > > CONFIG_SRCU=y
> > > CONFIG_TREE_SRCU=y
> > > CONFIG_TASKS_RCU_GENERIC=y
> > > CONFIG_TASKS_RCU=y
> > > CONFIG_TASKS_RUDE_RCU=y
> > > CONFIG_TASKS_TRACE_RCU=y
> > > CONFIG_RCU_STALL_COMMON=y
> > > CONFIG_RCU_NEED_SEGCBLIST=y
> > > # end of RCU Subsystem
> > > # RCU Debugging
> > > # CONFIG_RCU_SCALE_TEST is not set
> > > # CONFIG_RCU_TORTURE_TEST is not set
> > > # CONFIG_RCU_REF_SCALE_TEST is not set
> > > CONFIG_RCU_CPU_STALL_TIMEOUT=21
> > > CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
> > > # CONFIG_RCU_TRACE is not set
> > > # CONFIG_RCU_EQS_DEBUG is not set
> > > # end of RCU Debugging
> > >
> > > CONFIG_PREEMPT_BUILD=y
> > > # CONFIG_PREEMPT_NONE is not set
> > > CONFIG_PREEMPT_VOLUNTARY=y
> > > # CONFIG_PREEMPT is not set
> > > CONFIG_PREEMPT_COUNT=y
> > > CONFIG_PREEMPTION=y
> > > CONFIG_PREEMPT_DYNAMIC=y
> > > CONFIG_PREEMPT_RCU=y
> > > CONFIG_HAVE_PREEMPT_DYNAMIC=y
> > > CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
> > > CONFIG_PREEMPT_NOTIFIERS=y
> > > # CONFIG_DEBUG_PREEMPT is not set
> > > # CONFIG_PREEMPT_TRACER is not set
> > > # CONFIG_PREEMPTIRQ_DELAY_TEST is not set
> > >
> > > An interesting observation is that, while tasks RCUs are blocked,
> > > related NAPI thread is still being scheduled (even across cores)
> > > regularly. Looking at the gp conditions, I am inclining to cond_resched
> > > after each __napi_poll being the problem: cond_resched enters the
> > > scheduler with PREEMPT bit, which does not account as a gp for tasks
> > > RCUs. Meanwhile, since the thread has been frequently resched, the
> > > normal scheduling point (no PREEMPT bit, accounted as a task RCU gp)
> > > seems to have very little chance to kick in. Given the nature of "busy
> > > polling" program, such NAPI thread won't have task->nvcsw or task->on_rq
> > > updated (other gp conditions), the result is that such NAPI thread is
> > > put on RCU holdouts list for indefinitely long time.
> > >
> > > This is simply fixed by mirroring the ksoftirqd behavior: after
> > > NAPI/softirq work, raise a RCU QS to help expedite the RCU period. No
> > > more blocking afterwards for the same setup.
> > >
> > > Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
> > > Signed-off-by: Yan Zhai <yan@cloudflare.com>
> > > ---
> > >  net/core/dev.c | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > >
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index 275fd5259a4a..6e41263ff5d3 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -6773,6 +6773,10 @@ static int napi_threaded_poll(void *data)
> > >                                 net_rps_action_and_irq_enable(sd);
> > >                         }
> > >                         skb_defer_free_flush(sd);
>
> Please put a comment here stating that RCU readers cannot cross
> this point.
>
> I need to add lockdep to rcu_softirq_qs() to catch placing this in an
> RCU read-side critical section.  And a header comment noting that from
> an RCU perspective, it acts as a momentary enabling of preemption.
>
Just to clarify, do you mean I should state that this polling function
can not be called from within an RCU read critical section? Or do you
mean any read critical sections need to end before raising this QS?

Yan

> > > +                       if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> > > +                               rcu_softirq_qs();
> > > +
> > >                         local_bh_enable();
> > >
> > >                         if (!repoll)
> > > --
> > > 2.30.2
> > >
> >
> > Hmm....
> > Why napi_busy_loop() does not have a similar problem ?
> >
> > It is unclear why rcu_all_qs() in __cond_resched() is guarded by
> >
> > #ifndef CONFIG_PREEMPT_RCU
> >      rcu_all_qs();
> > #endif
>
> The theory is that PREEMPT_RCU kernels have preemption, and get their
> quiescent states that way.
>
> The more recent practice involves things like PREEMPT_DYNAMIC and maybe
> soon PREEMPT_AUTO, which might require adjustments, so thank you for
> pointing this out!
>
> Back on the patch, my main other concern is that someone somewhere might
> be using something like synchronize_rcu() to wait for all in-progress
> softirq handlers to complete.  But I don't know of such a thing, and if
> there is, there are workarounds, including synchronize_rcu_tasks().
>
> So something to be aware of, not (as far as I know) something to block
> this commit.
>
> With the added comment:
>
> Acked-by: Paul E. McKenney <paulmck@kernel.org>
>
>                                                         Thanx, Paul
Paul E. McKenney Feb. 27, 2024, 10:58 p.m. UTC | #5
On Tue, Feb 27, 2024 at 03:22:57PM -0600, Yan Zhai wrote:
> On Tue, Feb 27, 2024 at 12:32 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Tue, Feb 27, 2024 at 05:44:17PM +0100, Eric Dumazet wrote:
> > > On Tue, Feb 27, 2024 at 4:44 PM Yan Zhai <yan@cloudflare.com> wrote:
> > > >
> > > > We noticed task RCUs being blocked when threaded NAPIs are very busy in
> > > > production: detaching any BPF tracing programs, i.e. removing a ftrace
> > > > trampoline, will simply block for very long in rcu_tasks_wait_gp. This
> > > > ranges from hundreds of seconds to even an hour, severely harming any
> > > > observability tools that rely on BPF tracing programs. It can be
> > > > easily reproduced locally with following setup:
> > > >
> > > > ip netns add test1
> > > > ip netns add test2
> > > >
> > > > ip -n test1 link add veth1 type veth peer name veth2 netns test2
> > > >
> > > > ip -n test1 link set veth1 up
> > > > ip -n test1 link set lo up
> > > > ip -n test2 link set veth2 up
> > > > ip -n test2 link set lo up
> > > >
> > > > ip -n test1 addr add 192.168.1.2/31 dev veth1
> > > > ip -n test1 addr add 1.1.1.1/32 dev lo
> > > > ip -n test2 addr add 192.168.1.3/31 dev veth2
> > > > ip -n test2 addr add 2.2.2.2/31 dev lo
> > > >
> > > > ip -n test1 route add default via 192.168.1.3
> > > > ip -n test2 route add default via 192.168.1.2
> > > >
> > > > for i in `seq 10 210`; do
> > > >  for j in `seq 10 210`; do
> > > >     ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201
> > > >  done
> > > > done
> > > >
> > > > ip netns exec test2 ethtool -K veth2 gro on
> > > > ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded'
> > > > ip netns exec test1 ethtool -K veth1 tso off
> > > >
> > > > Then run an iperf3 client/server and a bpftrace script can trigger it:
> > > >
> > > > ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null&
> > > > ip netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null&
> > > > bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}'
> > > >
> > > > Above reproduce for net-next kernel with following RCU and preempt
> > > > configuraitons:
> > > >
> > > > # RCU Subsystem
> > > > CONFIG_TREE_RCU=y
> > > > CONFIG_PREEMPT_RCU=y
> > > > # CONFIG_RCU_EXPERT is not set
> > > > CONFIG_SRCU=y
> > > > CONFIG_TREE_SRCU=y
> > > > CONFIG_TASKS_RCU_GENERIC=y
> > > > CONFIG_TASKS_RCU=y
> > > > CONFIG_TASKS_RUDE_RCU=y
> > > > CONFIG_TASKS_TRACE_RCU=y
> > > > CONFIG_RCU_STALL_COMMON=y
> > > > CONFIG_RCU_NEED_SEGCBLIST=y
> > > > # end of RCU Subsystem
> > > > # RCU Debugging
> > > > # CONFIG_RCU_SCALE_TEST is not set
> > > > # CONFIG_RCU_TORTURE_TEST is not set
> > > > # CONFIG_RCU_REF_SCALE_TEST is not set
> > > > CONFIG_RCU_CPU_STALL_TIMEOUT=21
> > > > CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
> > > > # CONFIG_RCU_TRACE is not set
> > > > # CONFIG_RCU_EQS_DEBUG is not set
> > > > # end of RCU Debugging
> > > >
> > > > CONFIG_PREEMPT_BUILD=y
> > > > # CONFIG_PREEMPT_NONE is not set
> > > > CONFIG_PREEMPT_VOLUNTARY=y
> > > > # CONFIG_PREEMPT is not set
> > > > CONFIG_PREEMPT_COUNT=y
> > > > CONFIG_PREEMPTION=y
> > > > CONFIG_PREEMPT_DYNAMIC=y
> > > > CONFIG_PREEMPT_RCU=y
> > > > CONFIG_HAVE_PREEMPT_DYNAMIC=y
> > > > CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
> > > > CONFIG_PREEMPT_NOTIFIERS=y
> > > > # CONFIG_DEBUG_PREEMPT is not set
> > > > # CONFIG_PREEMPT_TRACER is not set
> > > > # CONFIG_PREEMPTIRQ_DELAY_TEST is not set
> > > >
> > > > An interesting observation is that, while tasks RCUs are blocked,
> > > > related NAPI thread is still being scheduled (even across cores)
> > > > regularly. Looking at the gp conditions, I am inclining to cond_resched
> > > > after each __napi_poll being the problem: cond_resched enters the
> > > > scheduler with PREEMPT bit, which does not account as a gp for tasks
> > > > RCUs. Meanwhile, since the thread has been frequently resched, the
> > > > normal scheduling point (no PREEMPT bit, accounted as a task RCU gp)
> > > > seems to have very little chance to kick in. Given the nature of "busy
> > > > polling" program, such NAPI thread won't have task->nvcsw or task->on_rq
> > > > updated (other gp conditions), the result is that such NAPI thread is
> > > > put on RCU holdouts list for indefinitely long time.
> > > >
> > > > This is simply fixed by mirroring the ksoftirqd behavior: after
> > > > NAPI/softirq work, raise a RCU QS to help expedite the RCU period. No
> > > > more blocking afterwards for the same setup.
> > > >
> > > > Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
> > > > Signed-off-by: Yan Zhai <yan@cloudflare.com>
> > > > ---
> > > >  net/core/dev.c | 4 ++++
> > > >  1 file changed, 4 insertions(+)
> > > >
> > > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > > index 275fd5259a4a..6e41263ff5d3 100644
> > > > --- a/net/core/dev.c
> > > > +++ b/net/core/dev.c
> > > > @@ -6773,6 +6773,10 @@ static int napi_threaded_poll(void *data)
> > > >                                 net_rps_action_and_irq_enable(sd);
> > > >                         }
> > > >                         skb_defer_free_flush(sd);
> >
> > Please put a comment here stating that RCU readers cannot cross
> > this point.
> >
> > I need to add lockdep to rcu_softirq_qs() to catch placing this in an
> > RCU read-side critical section.  And a header comment noting that from
> > an RCU perspective, it acts as a momentary enabling of preemption.
> >
> Just to clarify, do you mean I should state that this polling function
> can not be called from within an RCU read critical section? Or do you
> mean any read critical sections need to end before raising this QS?

Yes to both.

I am preparing a patch to make lockdep complain if you do something
like this:

	rcu_read_lock();
	do_something();
	rcu_softirq_qs();
	do_something_else();
	rcu_read_unlock();

However, it will still be perfectly legal to do something like this:

	local_bh_disable();
	do_something();
	rcu_softirq_qs();  // A
	do_something_else();
	local_bh_enable();  // B

Which might surprise someone expection a synchronize_rcu() to wait
for execution to reach B.  Because of that rcu_softirq_qs(), that
synchronize_rcu() could instead return as soon as execution reached A.

So I am adding this example to a new kernel-doc header for
rcu_softirq_qs().

						Thanx, Paul

> Yan
> 
> > > > +                       if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> > > > +                               rcu_softirq_qs();
> > > > +
> > > >                         local_bh_enable();
> > > >
> > > >                         if (!repoll)
> > > > --
> > > > 2.30.2
> > > >
> > >
> > > Hmm....
> > > Why napi_busy_loop() does not have a similar problem ?
> > >
> > > It is unclear why rcu_all_qs() in __cond_resched() is guarded by
> > >
> > > #ifndef CONFIG_PREEMPT_RCU
> > >      rcu_all_qs();
> > > #endif
> >
> > The theory is that PREEMPT_RCU kernels have preemption, and get their
> > quiescent states that way.
> >
> > The more recent practice involves things like PREEMPT_DYNAMIC and maybe
> > soon PREEMPT_AUTO, which might require adjustments, so thank you for
> > pointing this out!
> >
> > Back on the patch, my main other concern is that someone somewhere might
> > be using something like synchronize_rcu() to wait for all in-progress
> > softirq handlers to complete.  But I don't know of such a thing, and if
> > there is, there are workarounds, including synchronize_rcu_tasks().
> >
> > So something to be aware of, not (as far as I know) something to block
> > this commit.
> >
> > With the added comment:
> >
> > Acked-by: Paul E. McKenney <paulmck@kernel.org>
> >
> >                                                         Thanx, Paul
Jakub Kicinski Feb. 28, 2024, 3:10 a.m. UTC | #6
On Tue, 27 Feb 2024 10:32:22 -0800 Paul E. McKenney wrote:
> > > +                       if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> > > +                               rcu_softirq_qs();
> > > +
> > >                         local_bh_enable();
> > >
> > >                         if (!repoll)
> >
> > Hmm....
> > Why napi_busy_loop() does not have a similar problem ?
> > 
> > It is unclear why rcu_all_qs() in __cond_resched() is guarded by
> > 
> > #ifndef CONFIG_PREEMPT_RCU
> >      rcu_all_qs();
> > #endif  
> 
> The theory is that PREEMPT_RCU kernels have preemption, and get their
> quiescent states that way.

But that doesn't work well enough?

Assuming that's the case why don't we add it with the inverse ifdef
condition next to the cond_resched() which follows a few lines down?

			skb_defer_free_flush(sd);
+
+			if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+				rcu_softirq_qs();
+
			local_bh_enable();

			if (!repoll)
				break;

			cond_resched();
		}

We won't repoll majority of the time.
Paul E. McKenney Feb. 28, 2024, 4:42 a.m. UTC | #7
On Tue, Feb 27, 2024 at 07:10:01PM -0800, Jakub Kicinski wrote:
> On Tue, 27 Feb 2024 10:32:22 -0800 Paul E. McKenney wrote:
> > > > +                       if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> > > > +                               rcu_softirq_qs();
> > > > +
> > > >                         local_bh_enable();
> > > >
> > > >                         if (!repoll)
> > >
> > > Hmm....
> > > Why napi_busy_loop() does not have a similar problem ?
> > > 
> > > It is unclear why rcu_all_qs() in __cond_resched() is guarded by
> > > 
> > > #ifndef CONFIG_PREEMPT_RCU
> > >      rcu_all_qs();
> > > #endif  
> > 
> > The theory is that PREEMPT_RCU kernels have preemption, and get their
> > quiescent states that way.
> 
> But that doesn't work well enough?
> 
> Assuming that's the case why don't we add it with the inverse ifdef
> condition next to the cond_resched() which follows a few lines down?
> 
> 			skb_defer_free_flush(sd);
> +
> +			if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> +				rcu_softirq_qs();
> +
> 			local_bh_enable();
> 
> 			if (!repoll)
> 				break;
> 
> 			cond_resched();
> 		}
> 
> We won't repoll majority of the time.

I am not completely clear on what you are proposing, but one complication
is that We need preemption disabled across calls to rcu_softirq_qs()
and we cannot have preemption disabled across calls to cond_resched().
Another complication is that although CONFIG_PREEMPT_RT kernels are
built with CONFIG_PREEMPT_RCU, the reverse is not always the case.
And if we are not repolling, don't we have a high probability of doing
a voluntary context when we reach napi_thread_wait() at the beginning
of that loop?

All in all, I suspect that I am missing your point.

							Thanx, Paul
Toke Høiland-Jørgensen Feb. 28, 2024, 11:50 a.m. UTC | #8
"Paul E. McKenney" <paulmck@kernel.org> writes:

> On Tue, Feb 27, 2024 at 05:44:17PM +0100, Eric Dumazet wrote:
>> On Tue, Feb 27, 2024 at 4:44 PM Yan Zhai <yan@cloudflare.com> wrote:
>> >
>> > We noticed task RCUs being blocked when threaded NAPIs are very busy in
>> > production: detaching any BPF tracing programs, i.e. removing a ftrace
>> > trampoline, will simply block for very long in rcu_tasks_wait_gp. This
>> > ranges from hundreds of seconds to even an hour, severely harming any
>> > observability tools that rely on BPF tracing programs. It can be
>> > easily reproduced locally with following setup:
>> >
>> > ip netns add test1
>> > ip netns add test2
>> >
>> > ip -n test1 link add veth1 type veth peer name veth2 netns test2
>> >
>> > ip -n test1 link set veth1 up
>> > ip -n test1 link set lo up
>> > ip -n test2 link set veth2 up
>> > ip -n test2 link set lo up
>> >
>> > ip -n test1 addr add 192.168.1.2/31 dev veth1
>> > ip -n test1 addr add 1.1.1.1/32 dev lo
>> > ip -n test2 addr add 192.168.1.3/31 dev veth2
>> > ip -n test2 addr add 2.2.2.2/31 dev lo
>> >
>> > ip -n test1 route add default via 192.168.1.3
>> > ip -n test2 route add default via 192.168.1.2
>> >
>> > for i in `seq 10 210`; do
>> >  for j in `seq 10 210`; do
>> >     ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201
>> >  done
>> > done
>> >
>> > ip netns exec test2 ethtool -K veth2 gro on
>> > ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded'
>> > ip netns exec test1 ethtool -K veth1 tso off
>> >
>> > Then run an iperf3 client/server and a bpftrace script can trigger it:
>> >
>> > ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null&
>> > ip netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null&
>> > bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}'
>> >
>> > Above reproduce for net-next kernel with following RCU and preempt
>> > configuraitons:
>> >
>> > # RCU Subsystem
>> > CONFIG_TREE_RCU=y
>> > CONFIG_PREEMPT_RCU=y
>> > # CONFIG_RCU_EXPERT is not set
>> > CONFIG_SRCU=y
>> > CONFIG_TREE_SRCU=y
>> > CONFIG_TASKS_RCU_GENERIC=y
>> > CONFIG_TASKS_RCU=y
>> > CONFIG_TASKS_RUDE_RCU=y
>> > CONFIG_TASKS_TRACE_RCU=y
>> > CONFIG_RCU_STALL_COMMON=y
>> > CONFIG_RCU_NEED_SEGCBLIST=y
>> > # end of RCU Subsystem
>> > # RCU Debugging
>> > # CONFIG_RCU_SCALE_TEST is not set
>> > # CONFIG_RCU_TORTURE_TEST is not set
>> > # CONFIG_RCU_REF_SCALE_TEST is not set
>> > CONFIG_RCU_CPU_STALL_TIMEOUT=21
>> > CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
>> > # CONFIG_RCU_TRACE is not set
>> > # CONFIG_RCU_EQS_DEBUG is not set
>> > # end of RCU Debugging
>> >
>> > CONFIG_PREEMPT_BUILD=y
>> > # CONFIG_PREEMPT_NONE is not set
>> > CONFIG_PREEMPT_VOLUNTARY=y
>> > # CONFIG_PREEMPT is not set
>> > CONFIG_PREEMPT_COUNT=y
>> > CONFIG_PREEMPTION=y
>> > CONFIG_PREEMPT_DYNAMIC=y
>> > CONFIG_PREEMPT_RCU=y
>> > CONFIG_HAVE_PREEMPT_DYNAMIC=y
>> > CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
>> > CONFIG_PREEMPT_NOTIFIERS=y
>> > # CONFIG_DEBUG_PREEMPT is not set
>> > # CONFIG_PREEMPT_TRACER is not set
>> > # CONFIG_PREEMPTIRQ_DELAY_TEST is not set
>> >
>> > An interesting observation is that, while tasks RCUs are blocked,
>> > related NAPI thread is still being scheduled (even across cores)
>> > regularly. Looking at the gp conditions, I am inclining to cond_resched
>> > after each __napi_poll being the problem: cond_resched enters the
>> > scheduler with PREEMPT bit, which does not account as a gp for tasks
>> > RCUs. Meanwhile, since the thread has been frequently resched, the
>> > normal scheduling point (no PREEMPT bit, accounted as a task RCU gp)
>> > seems to have very little chance to kick in. Given the nature of "busy
>> > polling" program, such NAPI thread won't have task->nvcsw or task->on_rq
>> > updated (other gp conditions), the result is that such NAPI thread is
>> > put on RCU holdouts list for indefinitely long time.
>> >
>> > This is simply fixed by mirroring the ksoftirqd behavior: after
>> > NAPI/softirq work, raise a RCU QS to help expedite the RCU period. No
>> > more blocking afterwards for the same setup.
>> >
>> > Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
>> > Signed-off-by: Yan Zhai <yan@cloudflare.com>
>> > ---
>> >  net/core/dev.c | 4 ++++
>> >  1 file changed, 4 insertions(+)
>> >
>> > diff --git a/net/core/dev.c b/net/core/dev.c
>> > index 275fd5259a4a..6e41263ff5d3 100644
>> > --- a/net/core/dev.c
>> > +++ b/net/core/dev.c
>> > @@ -6773,6 +6773,10 @@ static int napi_threaded_poll(void *data)
>> >                                 net_rps_action_and_irq_enable(sd);
>> >                         }
>> >                         skb_defer_free_flush(sd);
>
> Please put a comment here stating that RCU readers cannot cross
> this point.
>
> I need to add lockdep to rcu_softirq_qs() to catch placing this in an
> RCU read-side critical section.  And a header comment noting that from
> an RCU perspective, it acts as a momentary enabling of preemption.

OK, so one question here: for XDP, we're basically treating
local_bh_disable/enable() as the RCU critical section, cf the discussion
we had a few years ago that led to this being documented[0]. So why is
it OK to have the rcu_softirq_qs() inside the bh disable/enable pair,
but not inside an rcu_read_lock() section?

Also, looking at the patch in question:

>> > +                       if (!IS_ENABLED(CONFIG_PREEMPT_RT))
>> > +                               rcu_softirq_qs();
>> > +
>> >                         local_bh_enable();

Why does that local_bh_enable() not accomplish the same thing as the qs?

-Toke

[0] https://lore.kernel.org/bpf/20210624160609.292325-6-toke@redhat.com/
Jakub Kicinski Feb. 28, 2024, 2:43 p.m. UTC | #9
On Tue, 27 Feb 2024 20:42:24 -0800 Paul E. McKenney wrote:
> On Tue, Feb 27, 2024 at 07:10:01PM -0800, Jakub Kicinski wrote:
> > On Tue, 27 Feb 2024 10:32:22 -0800 Paul E. McKenney wrote:  
> > > The theory is that PREEMPT_RCU kernels have preemption, and get their
> > > quiescent states that way.  
> > 
> > But that doesn't work well enough?
> > 
> > Assuming that's the case why don't we add it with the inverse ifdef
> > condition next to the cond_resched() which follows a few lines down?
> > 
> > 			skb_defer_free_flush(sd);
> > +
> > +			if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> > +				rcu_softirq_qs();
> > +
> > 			local_bh_enable();
> > 
> > 			if (!repoll)
> > 				break;
> > 
> > 			cond_resched();
> > 		}
> > 
> > We won't repoll majority of the time.  
> 
> I am not completely clear on what you are proposing, but one complication
> is that We need preemption disabled across calls to rcu_softirq_qs()
> and we cannot have preemption disabled across calls to cond_resched().

I was thinking of using rcu_all_qs(), like cond_resched() does.
Not sure how it compares in terms of functionality and cost.

> Another complication is that although CONFIG_PREEMPT_RT kernels are
> built with CONFIG_PREEMPT_RCU, the reverse is not always the case.
> And if we are not repolling, don't we have a high probability of doing
> a voluntary context when we reach napi_thread_wait() at the beginning
> of that loop?

Very much so, which is why adding the cost of rcu_softirq_qs()
for every NAPI run feels like an overkill.
Paul E. McKenney Feb. 28, 2024, 3:10 p.m. UTC | #10
On Wed, Feb 28, 2024 at 12:50:53PM +0100, Toke Høiland-Jørgensen wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> 
> > On Tue, Feb 27, 2024 at 05:44:17PM +0100, Eric Dumazet wrote:
> >> On Tue, Feb 27, 2024 at 4:44 PM Yan Zhai <yan@cloudflare.com> wrote:
> >> >
> >> > We noticed task RCUs being blocked when threaded NAPIs are very busy in
> >> > production: detaching any BPF tracing programs, i.e. removing a ftrace
> >> > trampoline, will simply block for very long in rcu_tasks_wait_gp. This
> >> > ranges from hundreds of seconds to even an hour, severely harming any
> >> > observability tools that rely on BPF tracing programs. It can be
> >> > easily reproduced locally with following setup:
> >> >
> >> > ip netns add test1
> >> > ip netns add test2
> >> >
> >> > ip -n test1 link add veth1 type veth peer name veth2 netns test2
> >> >
> >> > ip -n test1 link set veth1 up
> >> > ip -n test1 link set lo up
> >> > ip -n test2 link set veth2 up
> >> > ip -n test2 link set lo up
> >> >
> >> > ip -n test1 addr add 192.168.1.2/31 dev veth1
> >> > ip -n test1 addr add 1.1.1.1/32 dev lo
> >> > ip -n test2 addr add 192.168.1.3/31 dev veth2
> >> > ip -n test2 addr add 2.2.2.2/31 dev lo
> >> >
> >> > ip -n test1 route add default via 192.168.1.3
> >> > ip -n test2 route add default via 192.168.1.2
> >> >
> >> > for i in `seq 10 210`; do
> >> >  for j in `seq 10 210`; do
> >> >     ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201
> >> >  done
> >> > done
> >> >
> >> > ip netns exec test2 ethtool -K veth2 gro on
> >> > ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded'
> >> > ip netns exec test1 ethtool -K veth1 tso off
> >> >
> >> > Then run an iperf3 client/server and a bpftrace script can trigger it:
> >> >
> >> > ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null&
> >> > ip netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null&
> >> > bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}'
> >> >
> >> > Above reproduce for net-next kernel with following RCU and preempt
> >> > configuraitons:
> >> >
> >> > # RCU Subsystem
> >> > CONFIG_TREE_RCU=y
> >> > CONFIG_PREEMPT_RCU=y
> >> > # CONFIG_RCU_EXPERT is not set
> >> > CONFIG_SRCU=y
> >> > CONFIG_TREE_SRCU=y
> >> > CONFIG_TASKS_RCU_GENERIC=y
> >> > CONFIG_TASKS_RCU=y
> >> > CONFIG_TASKS_RUDE_RCU=y
> >> > CONFIG_TASKS_TRACE_RCU=y
> >> > CONFIG_RCU_STALL_COMMON=y
> >> > CONFIG_RCU_NEED_SEGCBLIST=y
> >> > # end of RCU Subsystem
> >> > # RCU Debugging
> >> > # CONFIG_RCU_SCALE_TEST is not set
> >> > # CONFIG_RCU_TORTURE_TEST is not set
> >> > # CONFIG_RCU_REF_SCALE_TEST is not set
> >> > CONFIG_RCU_CPU_STALL_TIMEOUT=21
> >> > CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
> >> > # CONFIG_RCU_TRACE is not set
> >> > # CONFIG_RCU_EQS_DEBUG is not set
> >> > # end of RCU Debugging
> >> >
> >> > CONFIG_PREEMPT_BUILD=y
> >> > # CONFIG_PREEMPT_NONE is not set
> >> > CONFIG_PREEMPT_VOLUNTARY=y
> >> > # CONFIG_PREEMPT is not set
> >> > CONFIG_PREEMPT_COUNT=y
> >> > CONFIG_PREEMPTION=y
> >> > CONFIG_PREEMPT_DYNAMIC=y
> >> > CONFIG_PREEMPT_RCU=y
> >> > CONFIG_HAVE_PREEMPT_DYNAMIC=y
> >> > CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
> >> > CONFIG_PREEMPT_NOTIFIERS=y
> >> > # CONFIG_DEBUG_PREEMPT is not set
> >> > # CONFIG_PREEMPT_TRACER is not set
> >> > # CONFIG_PREEMPTIRQ_DELAY_TEST is not set
> >> >
> >> > An interesting observation is that, while tasks RCUs are blocked,
> >> > related NAPI thread is still being scheduled (even across cores)
> >> > regularly. Looking at the gp conditions, I am inclining to cond_resched
> >> > after each __napi_poll being the problem: cond_resched enters the
> >> > scheduler with PREEMPT bit, which does not account as a gp for tasks
> >> > RCUs. Meanwhile, since the thread has been frequently resched, the
> >> > normal scheduling point (no PREEMPT bit, accounted as a task RCU gp)
> >> > seems to have very little chance to kick in. Given the nature of "busy
> >> > polling" program, such NAPI thread won't have task->nvcsw or task->on_rq
> >> > updated (other gp conditions), the result is that such NAPI thread is
> >> > put on RCU holdouts list for indefinitely long time.
> >> >
> >> > This is simply fixed by mirroring the ksoftirqd behavior: after
> >> > NAPI/softirq work, raise a RCU QS to help expedite the RCU period. No
> >> > more blocking afterwards for the same setup.
> >> >
> >> > Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
> >> > Signed-off-by: Yan Zhai <yan@cloudflare.com>
> >> > ---
> >> >  net/core/dev.c | 4 ++++
> >> >  1 file changed, 4 insertions(+)
> >> >
> >> > diff --git a/net/core/dev.c b/net/core/dev.c
> >> > index 275fd5259a4a..6e41263ff5d3 100644
> >> > --- a/net/core/dev.c
> >> > +++ b/net/core/dev.c
> >> > @@ -6773,6 +6773,10 @@ static int napi_threaded_poll(void *data)
> >> >                                 net_rps_action_and_irq_enable(sd);
> >> >                         }
> >> >                         skb_defer_free_flush(sd);
> >
> > Please put a comment here stating that RCU readers cannot cross
> > this point.
> >
> > I need to add lockdep to rcu_softirq_qs() to catch placing this in an
> > RCU read-side critical section.  And a header comment noting that from
> > an RCU perspective, it acts as a momentary enabling of preemption.
> 
> OK, so one question here: for XDP, we're basically treating
> local_bh_disable/enable() as the RCU critical section, cf the discussion
> we had a few years ago that led to this being documented[0]. So why is
> it OK to have the rcu_softirq_qs() inside the bh disable/enable pair,
> but not inside an rcu_read_lock() section?

In general, it is not OK.  And it is not OK in this case if this happens
to be one of the local_bh_disable() regions that XDP is waiting on.
Except that that region ends right after the rcu_softirq_qs(), so that
should not be a problem.

But you are quite right, that is an accident waiting to happen, so it
would be better if the patch did something like this:

	local_bh_enable();
	if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
		preempt_disable();
		rcu_softirq_qs();
		preempt_enable();
	}

Though maybe something like this would be better:

	local_bh_enable();
	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
		rcu_softirq_qs_enable(local_bh_enable());
	else
		local_bh_enable();

A bit ugly, but it does allow exact checking of the rules and also
avoids extra overhead.

I could imagine pulling the CONFIG_PREEMPT_RT check into the body of
rcu_softirq_qs_enable().

But is there a better way?

> Also, looking at the patch in question:
> 
> >> > +                       if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> >> > +                               rcu_softirq_qs();
> >> > +
> >> >                         local_bh_enable();
> 
> Why does that local_bh_enable() not accomplish the same thing as the qs?

In this case, because it does not create the appearance of a voluntary
context switch needed by RCU Tasks.  So the wait for trampoline evacuation
could still take a very long time.

							Thanx, Paul

> -Toke
> 
> [0] https://lore.kernel.org/bpf/20210624160609.292325-6-toke@redhat.com/
>
Paul E. McKenney Feb. 28, 2024, 3:15 p.m. UTC | #11
On Wed, Feb 28, 2024 at 06:43:43AM -0800, Jakub Kicinski wrote:
> On Tue, 27 Feb 2024 20:42:24 -0800 Paul E. McKenney wrote:
> > On Tue, Feb 27, 2024 at 07:10:01PM -0800, Jakub Kicinski wrote:
> > > On Tue, 27 Feb 2024 10:32:22 -0800 Paul E. McKenney wrote:  
> > > > The theory is that PREEMPT_RCU kernels have preemption, and get their
> > > > quiescent states that way.  
> > > 
> > > But that doesn't work well enough?
> > > 
> > > Assuming that's the case why don't we add it with the inverse ifdef
> > > condition next to the cond_resched() which follows a few lines down?
> > > 
> > > 			skb_defer_free_flush(sd);
> > > +
> > > +			if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> > > +				rcu_softirq_qs();
> > > +
> > > 			local_bh_enable();
> > > 
> > > 			if (!repoll)
> > > 				break;
> > > 
> > > 			cond_resched();
> > > 		}
> > > 
> > > We won't repoll majority of the time.  
> > 
> > I am not completely clear on what you are proposing, but one complication
> > is that We need preemption disabled across calls to rcu_softirq_qs()
> > and we cannot have preemption disabled across calls to cond_resched().
> 
> I was thinking of using rcu_all_qs(), like cond_resched() does.
> Not sure how it compares in terms of functionality and cost.

It is probably a bit cheaper, but it does nothing for Tasks RCU.  And that
"_all" in the name is a holdover from when there were separate mechanisms
for bh, sched, and preempt, so maybe we should change that name.

> > Another complication is that although CONFIG_PREEMPT_RT kernels are
> > built with CONFIG_PREEMPT_RCU, the reverse is not always the case.
> > And if we are not repolling, don't we have a high probability of doing
> > a voluntary context when we reach napi_thread_wait() at the beginning
> > of that loop?
> 
> Very much so, which is why adding the cost of rcu_softirq_qs()
> for every NAPI run feels like an overkill.

Would it be better to do the rcu_softirq_qs() only once every 1000 times
or some such?  Or once every HZ jiffies?

Or is there a better way?

							Thanx, Paul
Jakub Kicinski Feb. 28, 2024, 3:35 p.m. UTC | #12
On Wed, 28 Feb 2024 07:15:42 -0800 Paul E. McKenney wrote:
> > > Another complication is that although CONFIG_PREEMPT_RT kernels are
> > > built with CONFIG_PREEMPT_RCU, the reverse is not always the case.
> > > And if we are not repolling, don't we have a high probability of doing
> > > a voluntary context when we reach napi_thread_wait() at the beginning
> > > of that loop?  
> > 
> > Very much so, which is why adding the cost of rcu_softirq_qs()
> > for every NAPI run feels like an overkill.  
> 
> Would it be better to do the rcu_softirq_qs() only once every 1000 times
> or some such?  Or once every HZ jiffies?
> 
> Or is there a better way?

Right, we can do that. Yan Zhai, have you measured the performance
impact / time spent in the call?
Joel Fernandes Feb. 28, 2024, 3:37 p.m. UTC | #13
On 2/27/2024 1:32 PM, Paul E. McKenney wrote:
> On Tue, Feb 27, 2024 at 05:44:17PM +0100, Eric Dumazet wrote:
>> On Tue, Feb 27, 2024 at 4:44 PM Yan Zhai <yan@cloudflare.com> wrote:
>>> We noticed task RCUs being blocked when threaded NAPIs are very busy in
>>> production: detaching any BPF tracing programs, i.e. removing a ftrace
>>> trampoline, will simply block for very long in rcu_tasks_wait_gp. This
>>> ranges from hundreds of seconds to even an hour, severely harming any
>>> observability tools that rely on BPF tracing programs. It can be
>>> easily reproduced locally with following setup:
>>>
>>> ip netns add test1
>>> ip netns add test2
>>>
>>> ip -n test1 link add veth1 type veth peer name veth2 netns test2
>>>
>>> ip -n test1 link set veth1 up
>>> ip -n test1 link set lo up
>>> ip -n test2 link set veth2 up
>>> ip -n test2 link set lo up
>>>
>>> ip -n test1 addr add 192.168.1.2/31 dev veth1
>>> ip -n test1 addr add 1.1.1.1/32 dev lo
>>> ip -n test2 addr add 192.168.1.3/31 dev veth2
>>> ip -n test2 addr add 2.2.2.2/31 dev lo
>>>
>>> ip -n test1 route add default via 192.168.1.3
>>> ip -n test2 route add default via 192.168.1.2
>>>
>>> for i in `seq 10 210`; do
>>>  for j in `seq 10 210`; do
>>>     ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201
>>>  done
>>> done
>>>
>>> ip netns exec test2 ethtool -K veth2 gro on
>>> ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded'
>>> ip netns exec test1 ethtool -K veth1 tso off
>>>
>>> Then run an iperf3 client/server and a bpftrace script can trigger it:
>>>
>>> ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null&
>>> ip netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null&
>>> bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}'
>>>
>>> Above reproduce for net-next kernel with following RCU and preempt
>>> configuraitons:
>>>
>>> # RCU Subsystem
>>> CONFIG_TREE_RCU=y
>>> CONFIG_PREEMPT_RCU=y
>>> # CONFIG_RCU_EXPERT is not set
>>> CONFIG_SRCU=y
>>> CONFIG_TREE_SRCU=y
>>> CONFIG_TASKS_RCU_GENERIC=y
>>> CONFIG_TASKS_RCU=y
>>> CONFIG_TASKS_RUDE_RCU=y
>>> CONFIG_TASKS_TRACE_RCU=y
>>> CONFIG_RCU_STALL_COMMON=y
>>> CONFIG_RCU_NEED_SEGCBLIST=y
>>> # end of RCU Subsystem
>>> # RCU Debugging
>>> # CONFIG_RCU_SCALE_TEST is not set
>>> # CONFIG_RCU_TORTURE_TEST is not set
>>> # CONFIG_RCU_REF_SCALE_TEST is not set
>>> CONFIG_RCU_CPU_STALL_TIMEOUT=21
>>> CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
>>> # CONFIG_RCU_TRACE is not set
>>> # CONFIG_RCU_EQS_DEBUG is not set
>>> # end of RCU Debugging
>>>
>>> CONFIG_PREEMPT_BUILD=y
>>> # CONFIG_PREEMPT_NONE is not set
>>> CONFIG_PREEMPT_VOLUNTARY=y
>>> # CONFIG_PREEMPT is not set
>>> CONFIG_PREEMPT_COUNT=y
>>> CONFIG_PREEMPTION=y
>>> CONFIG_PREEMPT_DYNAMIC=y
>>> CONFIG_PREEMPT_RCU=y
>>> CONFIG_HAVE_PREEMPT_DYNAMIC=y
>>> CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
>>> CONFIG_PREEMPT_NOTIFIERS=y
>>> # CONFIG_DEBUG_PREEMPT is not set
>>> # CONFIG_PREEMPT_TRACER is not set
>>> # CONFIG_PREEMPTIRQ_DELAY_TEST is not set
>>>
>>> An interesting observation is that, while tasks RCUs are blocked,
>>> related NAPI thread is still being scheduled (even across cores)
>>> regularly. Looking at the gp conditions, I am inclining to cond_resched
>>> after each __napi_poll being the problem: cond_resched enters the
>>> scheduler with PREEMPT bit, which does not account as a gp for tasks
>>> RCUs. Meanwhile, since the thread has been frequently resched, the
>>> normal scheduling point (no PREEMPT bit, accounted as a task RCU gp)
>>> seems to have very little chance to kick in. Given the nature of "busy
>>> polling" program, such NAPI thread won't have task->nvcsw or task->on_rq
>>> updated (other gp conditions), the result is that such NAPI thread is
>>> put on RCU holdouts list for indefinitely long time.
>>>
>>> This is simply fixed by mirroring the ksoftirqd behavior: after
>>> NAPI/softirq work, raise a RCU QS to help expedite the RCU period. No
>>> more blocking afterwards for the same setup.
>>>
>>> Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
>>> Signed-off-by: Yan Zhai <yan@cloudflare.com>
>>> ---
>>>  net/core/dev.c | 4 ++++
>>>  1 file changed, 4 insertions(+)
>>>
>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>> index 275fd5259a4a..6e41263ff5d3 100644
>>> --- a/net/core/dev.c
>>> +++ b/net/core/dev.c
>>> @@ -6773,6 +6773,10 @@ static int napi_threaded_poll(void *data)
>>>                                 net_rps_action_and_irq_enable(sd);
>>>                         }
>>>                         skb_defer_free_flush(sd);
> Please put a comment here stating that RCU readers cannot cross
> this point.
> 
> I need to add lockdep to rcu_softirq_qs() to catch placing this in an
> RCU read-side critical section.  And a header comment noting that from
> an RCU perspective, it acts as a momentary enabling of preemption.

Agreed, also does PREEMPT_RT not have similar issue? I noticed Thomas had
added this to softirq.c [1] but the changelog did not have details.

Also optionally, I wonder if calling rcu_tasks_qs() directly is better
(for documentation if anything) since the issue is Tasks RCU specific. Also
code comment above the rcu_softirq_qs() call about cond_resched() not taking
care of Tasks RCU would be great!

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>

thanks,

 - Joel
[1]
@@ -381,8 +553,10 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
                pending >>= softirq_bit;
        }

-       if (__this_cpu_read(ksoftirqd) == current)
+       if (!IS_ENABLED(CONFIG_PREEMPT_RT) &&
+           __this_cpu_read(ksoftirqd) == current)
                rcu_softirq_qs();
+
        local_irq_disable();
Yan Zhai Feb. 28, 2024, 3:48 p.m. UTC | #14
On Wed, Feb 28, 2024 at 9:10 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, Feb 28, 2024 at 12:50:53PM +0100, Toke Høiland-Jørgensen wrote:
> > "Paul E. McKenney" <paulmck@kernel.org> writes:
> >
> > > On Tue, Feb 27, 2024 at 05:44:17PM +0100, Eric Dumazet wrote:
> > >> On Tue, Feb 27, 2024 at 4:44 PM Yan Zhai <yan@cloudflare.com> wrote:
> > >> >
> > >> > We noticed task RCUs being blocked when threaded NAPIs are very busy in
> > >> > production: detaching any BPF tracing programs, i.e. removing a ftrace
> > >> > trampoline, will simply block for very long in rcu_tasks_wait_gp. This
> > >> > ranges from hundreds of seconds to even an hour, severely harming any
> > >> > observability tools that rely on BPF tracing programs. It can be
> > >> > easily reproduced locally with following setup:
> > >> >
> > >> > ip netns add test1
> > >> > ip netns add test2
> > >> >
> > >> > ip -n test1 link add veth1 type veth peer name veth2 netns test2
> > >> >
> > >> > ip -n test1 link set veth1 up
> > >> > ip -n test1 link set lo up
> > >> > ip -n test2 link set veth2 up
> > >> > ip -n test2 link set lo up
> > >> >
> > >> > ip -n test1 addr add 192.168.1.2/31 dev veth1
> > >> > ip -n test1 addr add 1.1.1.1/32 dev lo
> > >> > ip -n test2 addr add 192.168.1.3/31 dev veth2
> > >> > ip -n test2 addr add 2.2.2.2/31 dev lo
> > >> >
> > >> > ip -n test1 route add default via 192.168.1.3
> > >> > ip -n test2 route add default via 192.168.1.2
> > >> >
> > >> > for i in `seq 10 210`; do
> > >> >  for j in `seq 10 210`; do
> > >> >     ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201
> > >> >  done
> > >> > done
> > >> >
> > >> > ip netns exec test2 ethtool -K veth2 gro on
> > >> > ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded'
> > >> > ip netns exec test1 ethtool -K veth1 tso off
> > >> >
> > >> > Then run an iperf3 client/server and a bpftrace script can trigger it:
> > >> >
> > >> > ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null&
> > >> > ip netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null&
> > >> > bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}'
> > >> >
> > >> > Above reproduce for net-next kernel with following RCU and preempt
> > >> > configuraitons:
> > >> >
> > >> > # RCU Subsystem
> > >> > CONFIG_TREE_RCU=y
> > >> > CONFIG_PREEMPT_RCU=y
> > >> > # CONFIG_RCU_EXPERT is not set
> > >> > CONFIG_SRCU=y
> > >> > CONFIG_TREE_SRCU=y
> > >> > CONFIG_TASKS_RCU_GENERIC=y
> > >> > CONFIG_TASKS_RCU=y
> > >> > CONFIG_TASKS_RUDE_RCU=y
> > >> > CONFIG_TASKS_TRACE_RCU=y
> > >> > CONFIG_RCU_STALL_COMMON=y
> > >> > CONFIG_RCU_NEED_SEGCBLIST=y
> > >> > # end of RCU Subsystem
> > >> > # RCU Debugging
> > >> > # CONFIG_RCU_SCALE_TEST is not set
> > >> > # CONFIG_RCU_TORTURE_TEST is not set
> > >> > # CONFIG_RCU_REF_SCALE_TEST is not set
> > >> > CONFIG_RCU_CPU_STALL_TIMEOUT=21
> > >> > CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
> > >> > # CONFIG_RCU_TRACE is not set
> > >> > # CONFIG_RCU_EQS_DEBUG is not set
> > >> > # end of RCU Debugging
> > >> >
> > >> > CONFIG_PREEMPT_BUILD=y
> > >> > # CONFIG_PREEMPT_NONE is not set
> > >> > CONFIG_PREEMPT_VOLUNTARY=y
> > >> > # CONFIG_PREEMPT is not set
> > >> > CONFIG_PREEMPT_COUNT=y
> > >> > CONFIG_PREEMPTION=y
> > >> > CONFIG_PREEMPT_DYNAMIC=y
> > >> > CONFIG_PREEMPT_RCU=y
> > >> > CONFIG_HAVE_PREEMPT_DYNAMIC=y
> > >> > CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
> > >> > CONFIG_PREEMPT_NOTIFIERS=y
> > >> > # CONFIG_DEBUG_PREEMPT is not set
> > >> > # CONFIG_PREEMPT_TRACER is not set
> > >> > # CONFIG_PREEMPTIRQ_DELAY_TEST is not set
> > >> >
> > >> > An interesting observation is that, while tasks RCUs are blocked,
> > >> > related NAPI thread is still being scheduled (even across cores)
> > >> > regularly. Looking at the gp conditions, I am inclining to cond_resched
> > >> > after each __napi_poll being the problem: cond_resched enters the
> > >> > scheduler with PREEMPT bit, which does not account as a gp for tasks
> > >> > RCUs. Meanwhile, since the thread has been frequently resched, the
> > >> > normal scheduling point (no PREEMPT bit, accounted as a task RCU gp)
> > >> > seems to have very little chance to kick in. Given the nature of "busy
> > >> > polling" program, such NAPI thread won't have task->nvcsw or task->on_rq
> > >> > updated (other gp conditions), the result is that such NAPI thread is
> > >> > put on RCU holdouts list for indefinitely long time.
> > >> >
> > >> > This is simply fixed by mirroring the ksoftirqd behavior: after
> > >> > NAPI/softirq work, raise a RCU QS to help expedite the RCU period. No
> > >> > more blocking afterwards for the same setup.
> > >> >
> > >> > Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
> > >> > Signed-off-by: Yan Zhai <yan@cloudflare.com>
> > >> > ---
> > >> >  net/core/dev.c | 4 ++++
> > >> >  1 file changed, 4 insertions(+)
> > >> >
> > >> > diff --git a/net/core/dev.c b/net/core/dev.c
> > >> > index 275fd5259a4a..6e41263ff5d3 100644
> > >> > --- a/net/core/dev.c
> > >> > +++ b/net/core/dev.c
> > >> > @@ -6773,6 +6773,10 @@ static int napi_threaded_poll(void *data)
> > >> >                                 net_rps_action_and_irq_enable(sd);
> > >> >                         }
> > >> >                         skb_defer_free_flush(sd);
> > >
> > > Please put a comment here stating that RCU readers cannot cross
> > > this point.
> > >
> > > I need to add lockdep to rcu_softirq_qs() to catch placing this in an
> > > RCU read-side critical section.  And a header comment noting that from
> > > an RCU perspective, it acts as a momentary enabling of preemption.
> >
> > OK, so one question here: for XDP, we're basically treating
> > local_bh_disable/enable() as the RCU critical section, cf the discussion
> > we had a few years ago that led to this being documented[0]. So why is
> > it OK to have the rcu_softirq_qs() inside the bh disable/enable pair,
> > but not inside an rcu_read_lock() section?
>
> In general, it is not OK.  And it is not OK in this case if this happens
> to be one of the local_bh_disable() regions that XDP is waiting on.
> Except that that region ends right after the rcu_softirq_qs(), so that
> should not be a problem.
>
> But you are quite right, that is an accident waiting to happen, so it
> would be better if the patch did something like this:
>
>         local_bh_enable();
>         if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
>                 preempt_disable();
>                 rcu_softirq_qs();
>                 preempt_enable();
>         }
>
Yeah we need preempt for this call. When I first attempt it after
local_bh_enable, I got the bug call:
[ 1166.384279] BUG: using __this_cpu_read() in preemptible [00000000]
code: napi/veth2-66/8439
[ 1166.385337] caller is rcu_softirq_qs+0x16/0x130
[ 1166.385900] CPU: 3 PID: 8439 Comm: napi/veth2-66 Not tainted
6.7.0-rc8-g3fbf61207c66-dirty #75
[ 1166.386950] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 1166.388110] Call Trace:
[ 1166.388417]  <TASK>
[ 1166.388684]  dump_stack_lvl+0x36/0x50
[ 1166.389147]  check_preemption_disabled+0xd1/0xe0
[ 1166.389725]  rcu_softirq_qs+0x16/0x130
[ 1166.390190]  napi_threaded_poll+0x21e/0x260
[ 1166.390702]  ? __pfx_napi_threaded_poll+0x10/0x10
[ 1166.391277]  kthread+0xf7/0x130
[ 1166.391643]  ? __pfx_kthread+0x10/0x10
[ 1166.392130]  ret_from_fork+0x34/0x50
[ 1166.392574]  ? __pfx_kthread+0x10/0x10
[ 1166.393048]  ret_from_fork_asm+0x1b/0x30
[ 1166.393530]  </TASK>

Since this patch is trying to mirror what __do_softirq has, should the
similar notes/changes apply to that side as well?


> Though maybe something like this would be better:
>
>         local_bh_enable();
>         if (!IS_ENABLED(CONFIG_PREEMPT_RT))
>                 rcu_softirq_qs_enable(local_bh_enable());
>         else
>                 local_bh_enable();
>
> A bit ugly, but it does allow exact checking of the rules and also
> avoids extra overhead.
>
> I could imagine pulling the CONFIG_PREEMPT_RT check into the body of
> rcu_softirq_qs_enable().
>
> But is there a better way?
>
> > Also, looking at the patch in question:
> >
> > >> > +                       if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> > >> > +                               rcu_softirq_qs();
> > >> > +
> > >> >                         local_bh_enable();
> >
> > Why does that local_bh_enable() not accomplish the same thing as the qs?
>
> In this case, because it does not create the appearance of a voluntary
> context switch needed by RCU Tasks.  So the wait for trampoline evacuation
> could still take a very long time.
>
>                                                         Thanx, Paul
>
> > -Toke
> >
> > [0] https://lore.kernel.org/bpf/20210624160609.292325-6-toke@redhat.com/
> >
Yan Zhai Feb. 28, 2024, 3:57 p.m. UTC | #15
On Wed, Feb 28, 2024 at 9:35 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 28 Feb 2024 07:15:42 -0800 Paul E. McKenney wrote:
> > > > Another complication is that although CONFIG_PREEMPT_RT kernels are
> > > > built with CONFIG_PREEMPT_RCU, the reverse is not always the case.
> > > > And if we are not repolling, don't we have a high probability of doing
> > > > a voluntary context when we reach napi_thread_wait() at the beginning
> > > > of that loop?
> > >
> > > Very much so, which is why adding the cost of rcu_softirq_qs()
> > > for every NAPI run feels like an overkill.
> >
> > Would it be better to do the rcu_softirq_qs() only once every 1000 times
> > or some such?  Or once every HZ jiffies?
> >
> > Or is there a better way?
>
> Right, we can do that. Yan Zhai, have you measured the performance
> impact / time spent in the call?
For the case it hits the problem, the __napi_poll itself is usually
consuming much of the cycles, so I didn't notice any difference in
terms of tput. And it is in fact repolling all the time as the
customer traffic might not implement proper backoff. So using a loop
counter or jiffies  to cap the number of invocations sounds like a
decent improvement.

Let me briefly check the overhead in the normal case, too

Yan
Yan Zhai Feb. 28, 2024, 4:37 p.m. UTC | #16
On Wed, Feb 28, 2024 at 9:37 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> Also optionally, I wonder if calling rcu_tasks_qs() directly is better
> (for documentation if anything) since the issue is Tasks RCU specific. Also
> code comment above the rcu_softirq_qs() call about cond_resched() not taking
> care of Tasks RCU would be great!
>
Yes it's quite surprising to me that cond_resched does not help here,
I will include that comment. Raising just the task RCU QS seems
sufficient to the problem we encountered. But according to commit
d28139c4e967 ("rcu: Apply RCU-bh QSes to RCU-sched and RCU-preempt
when safe"), there might be additional threat factor in __do_softirq
that also applies to threaded poll.

Yan


> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>
> thanks,
>
>  - Joel
> [1]
> @@ -381,8 +553,10 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
>                 pending >>= softirq_bit;
>         }
>
> -       if (__this_cpu_read(ksoftirqd) == current)
> +       if (!IS_ENABLED(CONFIG_PREEMPT_RT) &&
> +           __this_cpu_read(ksoftirqd) == current)
>                 rcu_softirq_qs();
> +
>         local_irq_disable();
Paul E. McKenney Feb. 28, 2024, 5:18 p.m. UTC | #17
On Wed, Feb 28, 2024 at 10:37:51AM -0600, Yan Zhai wrote:
> On Wed, Feb 28, 2024 at 9:37 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > Also optionally, I wonder if calling rcu_tasks_qs() directly is better
> > (for documentation if anything) since the issue is Tasks RCU specific. Also
> > code comment above the rcu_softirq_qs() call about cond_resched() not taking
> > care of Tasks RCU would be great!
> >
> Yes it's quite surprising to me that cond_resched does not help here,

In theory, it would be possible to make cond_resched() take care of
Tasks RCU.  In practice, the lazy-preemption work is looking to get rid
of cond_resched().  But if for some reason cond_resched() needs to stay
around, doing that work might make sense.

> I will include that comment. Raising just the task RCU QS seems
> sufficient to the problem we encountered. But according to commit
> d28139c4e967 ("rcu: Apply RCU-bh QSes to RCU-sched and RCU-preempt
> when safe"), there might be additional threat factor in __do_softirq
> that also applies to threaded poll.

We did look into more focused alternatives for Tasks RCU a few days ago,
but they all had problems, for example, requiring that it be possible
to get exact information on the instruction pointers for interrupts on
any given CPU's stack.  The key point of Tasks RCU is to work out when
an old tracing trampoline may safely be freed, so a better way of doing
that would remove the need for this sort of addition to NAPI polling.

(Adding Steve and Mark for their thoughts.)

							Thanx, Paul

> Yan
> 
> 
> > Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> >
> > thanks,
> >
> >  - Joel
> > [1]
> > @@ -381,8 +553,10 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
> >                 pending >>= softirq_bit;
> >         }
> >
> > -       if (__this_cpu_read(ksoftirqd) == current)
> > +       if (!IS_ENABLED(CONFIG_PREEMPT_RT) &&
> > +           __this_cpu_read(ksoftirqd) == current)
> >                 rcu_softirq_qs();
> > +
> >         local_irq_disable();
Paul E. McKenney Feb. 28, 2024, 5:47 p.m. UTC | #18
On Wed, Feb 28, 2024 at 09:48:42AM -0600, Yan Zhai wrote:
> On Wed, Feb 28, 2024 at 9:10 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Wed, Feb 28, 2024 at 12:50:53PM +0100, Toke Høiland-Jørgensen wrote:
> > > "Paul E. McKenney" <paulmck@kernel.org> writes:
> > >
> > > > On Tue, Feb 27, 2024 at 05:44:17PM +0100, Eric Dumazet wrote:
> > > >> On Tue, Feb 27, 2024 at 4:44 PM Yan Zhai <yan@cloudflare.com> wrote:
> > > >> >
> > > >> > We noticed task RCUs being blocked when threaded NAPIs are very busy in
> > > >> > production: detaching any BPF tracing programs, i.e. removing a ftrace
> > > >> > trampoline, will simply block for very long in rcu_tasks_wait_gp. This
> > > >> > ranges from hundreds of seconds to even an hour, severely harming any
> > > >> > observability tools that rely on BPF tracing programs. It can be
> > > >> > easily reproduced locally with following setup:
> > > >> >
> > > >> > ip netns add test1
> > > >> > ip netns add test2
> > > >> >
> > > >> > ip -n test1 link add veth1 type veth peer name veth2 netns test2
> > > >> >
> > > >> > ip -n test1 link set veth1 up
> > > >> > ip -n test1 link set lo up
> > > >> > ip -n test2 link set veth2 up
> > > >> > ip -n test2 link set lo up
> > > >> >
> > > >> > ip -n test1 addr add 192.168.1.2/31 dev veth1
> > > >> > ip -n test1 addr add 1.1.1.1/32 dev lo
> > > >> > ip -n test2 addr add 192.168.1.3/31 dev veth2
> > > >> > ip -n test2 addr add 2.2.2.2/31 dev lo
> > > >> >
> > > >> > ip -n test1 route add default via 192.168.1.3
> > > >> > ip -n test2 route add default via 192.168.1.2
> > > >> >
> > > >> > for i in `seq 10 210`; do
> > > >> >  for j in `seq 10 210`; do
> > > >> >     ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201
> > > >> >  done
> > > >> > done
> > > >> >
> > > >> > ip netns exec test2 ethtool -K veth2 gro on
> > > >> > ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded'
> > > >> > ip netns exec test1 ethtool -K veth1 tso off
> > > >> >
> > > >> > Then run an iperf3 client/server and a bpftrace script can trigger it:
> > > >> >
> > > >> > ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null&
> > > >> > ip netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null&
> > > >> > bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}'
> > > >> >
> > > >> > Above reproduce for net-next kernel with following RCU and preempt
> > > >> > configuraitons:
> > > >> >
> > > >> > # RCU Subsystem
> > > >> > CONFIG_TREE_RCU=y
> > > >> > CONFIG_PREEMPT_RCU=y
> > > >> > # CONFIG_RCU_EXPERT is not set
> > > >> > CONFIG_SRCU=y
> > > >> > CONFIG_TREE_SRCU=y
> > > >> > CONFIG_TASKS_RCU_GENERIC=y
> > > >> > CONFIG_TASKS_RCU=y
> > > >> > CONFIG_TASKS_RUDE_RCU=y
> > > >> > CONFIG_TASKS_TRACE_RCU=y
> > > >> > CONFIG_RCU_STALL_COMMON=y
> > > >> > CONFIG_RCU_NEED_SEGCBLIST=y
> > > >> > # end of RCU Subsystem
> > > >> > # RCU Debugging
> > > >> > # CONFIG_RCU_SCALE_TEST is not set
> > > >> > # CONFIG_RCU_TORTURE_TEST is not set
> > > >> > # CONFIG_RCU_REF_SCALE_TEST is not set
> > > >> > CONFIG_RCU_CPU_STALL_TIMEOUT=21
> > > >> > CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
> > > >> > # CONFIG_RCU_TRACE is not set
> > > >> > # CONFIG_RCU_EQS_DEBUG is not set
> > > >> > # end of RCU Debugging
> > > >> >
> > > >> > CONFIG_PREEMPT_BUILD=y
> > > >> > # CONFIG_PREEMPT_NONE is not set
> > > >> > CONFIG_PREEMPT_VOLUNTARY=y
> > > >> > # CONFIG_PREEMPT is not set
> > > >> > CONFIG_PREEMPT_COUNT=y
> > > >> > CONFIG_PREEMPTION=y
> > > >> > CONFIG_PREEMPT_DYNAMIC=y
> > > >> > CONFIG_PREEMPT_RCU=y
> > > >> > CONFIG_HAVE_PREEMPT_DYNAMIC=y
> > > >> > CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
> > > >> > CONFIG_PREEMPT_NOTIFIERS=y
> > > >> > # CONFIG_DEBUG_PREEMPT is not set
> > > >> > # CONFIG_PREEMPT_TRACER is not set
> > > >> > # CONFIG_PREEMPTIRQ_DELAY_TEST is not set
> > > >> >
> > > >> > An interesting observation is that, while tasks RCUs are blocked,
> > > >> > related NAPI thread is still being scheduled (even across cores)
> > > >> > regularly. Looking at the gp conditions, I am inclining to cond_resched
> > > >> > after each __napi_poll being the problem: cond_resched enters the
> > > >> > scheduler with PREEMPT bit, which does not account as a gp for tasks
> > > >> > RCUs. Meanwhile, since the thread has been frequently resched, the
> > > >> > normal scheduling point (no PREEMPT bit, accounted as a task RCU gp)
> > > >> > seems to have very little chance to kick in. Given the nature of "busy
> > > >> > polling" program, such NAPI thread won't have task->nvcsw or task->on_rq
> > > >> > updated (other gp conditions), the result is that such NAPI thread is
> > > >> > put on RCU holdouts list for indefinitely long time.
> > > >> >
> > > >> > This is simply fixed by mirroring the ksoftirqd behavior: after
> > > >> > NAPI/softirq work, raise a RCU QS to help expedite the RCU period. No
> > > >> > more blocking afterwards for the same setup.
> > > >> >
> > > >> > Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
> > > >> > Signed-off-by: Yan Zhai <yan@cloudflare.com>
> > > >> > ---
> > > >> >  net/core/dev.c | 4 ++++
> > > >> >  1 file changed, 4 insertions(+)
> > > >> >
> > > >> > diff --git a/net/core/dev.c b/net/core/dev.c
> > > >> > index 275fd5259a4a..6e41263ff5d3 100644
> > > >> > --- a/net/core/dev.c
> > > >> > +++ b/net/core/dev.c
> > > >> > @@ -6773,6 +6773,10 @@ static int napi_threaded_poll(void *data)
> > > >> >                                 net_rps_action_and_irq_enable(sd);
> > > >> >                         }
> > > >> >                         skb_defer_free_flush(sd);
> > > >
> > > > Please put a comment here stating that RCU readers cannot cross
> > > > this point.
> > > >
> > > > I need to add lockdep to rcu_softirq_qs() to catch placing this in an
> > > > RCU read-side critical section.  And a header comment noting that from
> > > > an RCU perspective, it acts as a momentary enabling of preemption.
> > >
> > > OK, so one question here: for XDP, we're basically treating
> > > local_bh_disable/enable() as the RCU critical section, cf the discussion
> > > we had a few years ago that led to this being documented[0]. So why is
> > > it OK to have the rcu_softirq_qs() inside the bh disable/enable pair,
> > > but not inside an rcu_read_lock() section?
> >
> > In general, it is not OK.  And it is not OK in this case if this happens
> > to be one of the local_bh_disable() regions that XDP is waiting on.
> > Except that that region ends right after the rcu_softirq_qs(), so that
> > should not be a problem.
> >
> > But you are quite right, that is an accident waiting to happen, so it
> > would be better if the patch did something like this:
> >
> >         local_bh_enable();
> >         if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
> >                 preempt_disable();
> >                 rcu_softirq_qs();
> >                 preempt_enable();
> >         }
> >
> Yeah we need preempt for this call. When I first attempt it after
> local_bh_enable, I got the bug call:
> [ 1166.384279] BUG: using __this_cpu_read() in preemptible [00000000]
> code: napi/veth2-66/8439
> [ 1166.385337] caller is rcu_softirq_qs+0x16/0x130
> [ 1166.385900] CPU: 3 PID: 8439 Comm: napi/veth2-66 Not tainted
> 6.7.0-rc8-g3fbf61207c66-dirty #75
> [ 1166.386950] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [ 1166.388110] Call Trace:
> [ 1166.388417]  <TASK>
> [ 1166.388684]  dump_stack_lvl+0x36/0x50
> [ 1166.389147]  check_preemption_disabled+0xd1/0xe0
> [ 1166.389725]  rcu_softirq_qs+0x16/0x130
> [ 1166.390190]  napi_threaded_poll+0x21e/0x260
> [ 1166.390702]  ? __pfx_napi_threaded_poll+0x10/0x10
> [ 1166.391277]  kthread+0xf7/0x130
> [ 1166.391643]  ? __pfx_kthread+0x10/0x10
> [ 1166.392130]  ret_from_fork+0x34/0x50
> [ 1166.392574]  ? __pfx_kthread+0x10/0x10
> [ 1166.393048]  ret_from_fork_asm+0x1b/0x30
> [ 1166.393530]  </TASK>
> 
> Since this patch is trying to mirror what __do_softirq has, should the
> similar notes/changes apply to that side as well?

Up to now, the rcu_softirq_qs() was a special function strictly for
use by __do_softirq(), hence the lack of documentation.  I will let
the __do_softirq() maintainers decide what they would like to do there,
if anything.

							Thanx, Paul

> > Though maybe something like this would be better:
> >
> >         local_bh_enable();
> >         if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> >                 rcu_softirq_qs_enable(local_bh_enable());
> >         else
> >                 local_bh_enable();
> >
> > A bit ugly, but it does allow exact checking of the rules and also
> > avoids extra overhead.
> >
> > I could imagine pulling the CONFIG_PREEMPT_RT check into the body of
> > rcu_softirq_qs_enable().
> >
> > But is there a better way?
> >
> > > Also, looking at the patch in question:
> > >
> > > >> > +                       if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> > > >> > +                               rcu_softirq_qs();
> > > >> > +
> > > >> >                         local_bh_enable();
> > >
> > > Why does that local_bh_enable() not accomplish the same thing as the qs?
> >
> > In this case, because it does not create the appearance of a voluntary
> > context switch needed by RCU Tasks.  So the wait for trampoline evacuation
> > could still take a very long time.
> >
> >                                                         Thanx, Paul
> >
> > > -Toke
> > >
> > > [0] https://lore.kernel.org/bpf/20210624160609.292325-6-toke@redhat.com/
> > >
Joel Fernandes Feb. 28, 2024, 8:14 p.m. UTC | #19
On Wed, Feb 28, 2024 at 12:18 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, Feb 28, 2024 at 10:37:51AM -0600, Yan Zhai wrote:
> > On Wed, Feb 28, 2024 at 9:37 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > Also optionally, I wonder if calling rcu_tasks_qs() directly is better
> > > (for documentation if anything) since the issue is Tasks RCU specific. Also
> > > code comment above the rcu_softirq_qs() call about cond_resched() not taking
> > > care of Tasks RCU would be great!
> > >
> > Yes it's quite surprising to me that cond_resched does not help here,
>
> In theory, it would be possible to make cond_resched() take care of
> Tasks RCU.  In practice, the lazy-preemption work is looking to get rid
> of cond_resched().  But if for some reason cond_resched() needs to stay
> around, doing that work might make sense.

In my opinion, cond_resched() doing Tasks-RCU QS does not make sense
(to me), because cond_resched() is to inform the scheduler to run
something else possibly of higher priority while the current task is
still runnable. On the other hand, what's not permitted in a Tasks RCU
reader is a voluntary sleep. So IMO even though cond_resched() is a
voluntary call, it is still not a sleep but rather a preemption point.

So a Tasks RCU reader should perfectly be able to be scheduled out in
the middle of a read-side critical section (in current code) by
calling cond_resched(). It is just like involuntary preemption in the
middle of a RCU reader, in disguise, Right?

thanks,

 - Joel
Paul E. McKenney Feb. 28, 2024, 9:13 p.m. UTC | #20
On Wed, Feb 28, 2024 at 03:14:34PM -0500, Joel Fernandes wrote:
> On Wed, Feb 28, 2024 at 12:18 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Wed, Feb 28, 2024 at 10:37:51AM -0600, Yan Zhai wrote:
> > > On Wed, Feb 28, 2024 at 9:37 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > Also optionally, I wonder if calling rcu_tasks_qs() directly is better
> > > > (for documentation if anything) since the issue is Tasks RCU specific. Also
> > > > code comment above the rcu_softirq_qs() call about cond_resched() not taking
> > > > care of Tasks RCU would be great!
> > > >
> > > Yes it's quite surprising to me that cond_resched does not help here,
> >
> > In theory, it would be possible to make cond_resched() take care of
> > Tasks RCU.  In practice, the lazy-preemption work is looking to get rid
> > of cond_resched().  But if for some reason cond_resched() needs to stay
> > around, doing that work might make sense.
> 
> In my opinion, cond_resched() doing Tasks-RCU QS does not make sense
> (to me), because cond_resched() is to inform the scheduler to run
> something else possibly of higher priority while the current task is
> still runnable. On the other hand, what's not permitted in a Tasks RCU
> reader is a voluntary sleep. So IMO even though cond_resched() is a
> voluntary call, it is still not a sleep but rather a preemption point.

From the viewpoint of Task RCU's users, the point is to figure out
when it is OK to free an already-removed tracing trampoline.  The
current Task RCU implementation relies on the fact that tracing
trampolines do not do voluntary context switches.

> So a Tasks RCU reader should perfectly be able to be scheduled out in
> the middle of a read-side critical section (in current code) by
> calling cond_resched(). It is just like involuntary preemption in the
> middle of a RCU reader, in disguise, Right?

You lost me on this one.  This for example is not permitted:

	rcu_read_lock();
	cond_resched();
	rcu_read_unlock();

But in a CONFIG_PREEMPT=y kernel, that RCU reader could be preempted.

So cond_resched() looks like a voluntary context switch to me.  Recall
that vanilla non-preemptible RCU will treat them as quiescent states if
the grace period extends long enough.

What am I missing here?

							Thanx, Paul
Joel Fernandes Feb. 28, 2024, 9:27 p.m. UTC | #21
> On Feb 28, 2024, at 4:13 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> 
> On Wed, Feb 28, 2024 at 03:14:34PM -0500, Joel Fernandes wrote:
>>> On Wed, Feb 28, 2024 at 12:18 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>>> 
>>> On Wed, Feb 28, 2024 at 10:37:51AM -0600, Yan Zhai wrote:
>>>> On Wed, Feb 28, 2024 at 9:37 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>>>>> Also optionally, I wonder if calling rcu_tasks_qs() directly is better
>>>>> (for documentation if anything) since the issue is Tasks RCU specific. Also
>>>>> code comment above the rcu_softirq_qs() call about cond_resched() not taking
>>>>> care of Tasks RCU would be great!
>>>>> 
>>>> Yes it's quite surprising to me that cond_resched does not help here,
>>> 
>>> In theory, it would be possible to make cond_resched() take care of
>>> Tasks RCU.  In practice, the lazy-preemption work is looking to get rid
>>> of cond_resched().  But if for some reason cond_resched() needs to stay
>>> around, doing that work might make sense.
>> 
>> In my opinion, cond_resched() doing Tasks-RCU QS does not make sense
>> (to me), because cond_resched() is to inform the scheduler to run
>> something else possibly of higher priority while the current task is
>> still runnable. On the other hand, what's not permitted in a Tasks RCU
>> reader is a voluntary sleep. So IMO even though cond_resched() is a
>> voluntary call, it is still not a sleep but rather a preemption point.
> 
> From the viewpoint of Task RCU's users, the point is to figure out
> when it is OK to free an already-removed tracing trampoline.  The
> current Task RCU implementation relies on the fact that tracing
> trampolines do not do voluntary context switches.

Yes.

> 
>> So a Tasks RCU reader should perfectly be able to be scheduled out in
>> the middle of a read-side critical section (in current code) by
>> calling cond_resched(). It is just like involuntary preemption in the
>> middle of a RCU reader, in disguise, Right?
> 
> You lost me on this one.  This for example is not permitted:
> 
>    rcu_read_lock();
>    cond_resched();
>    rcu_read_unlock();
> 
> But in a CONFIG_PREEMPT=y kernel, that RCU reader could be preempted.
> 
> So cond_resched() looks like a voluntary context switch to me.  Recall
> that vanilla non-preemptible RCU will treat them as quiescent states if
> the grace period extends long enough.
> 
> What am I missing here?

That we are discussing Tasks-RCU read side section? Sorry I should have been more clear. I thought sleeping was not permitted in Tasks RCU reader, but non-sleep context switches (example involuntarily getting preempted were).

 - Joel



> 
>                            Thanx, Paul
Paul E. McKenney Feb. 28, 2024, 9:52 p.m. UTC | #22
On Wed, Feb 28, 2024 at 04:27:47PM -0500, Joel Fernandes wrote:
> 
> 
> > On Feb 28, 2024, at 4:13 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > 
> > On Wed, Feb 28, 2024 at 03:14:34PM -0500, Joel Fernandes wrote:
> >>> On Wed, Feb 28, 2024 at 12:18 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >>> 
> >>> On Wed, Feb 28, 2024 at 10:37:51AM -0600, Yan Zhai wrote:
> >>>> On Wed, Feb 28, 2024 at 9:37 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> >>>>> Also optionally, I wonder if calling rcu_tasks_qs() directly is better
> >>>>> (for documentation if anything) since the issue is Tasks RCU specific. Also
> >>>>> code comment above the rcu_softirq_qs() call about cond_resched() not taking
> >>>>> care of Tasks RCU would be great!
> >>>>> 
> >>>> Yes it's quite surprising to me that cond_resched does not help here,
> >>> 
> >>> In theory, it would be possible to make cond_resched() take care of
> >>> Tasks RCU.  In practice, the lazy-preemption work is looking to get rid
> >>> of cond_resched().  But if for some reason cond_resched() needs to stay
> >>> around, doing that work might make sense.
> >> 
> >> In my opinion, cond_resched() doing Tasks-RCU QS does not make sense
> >> (to me), because cond_resched() is to inform the scheduler to run
> >> something else possibly of higher priority while the current task is
> >> still runnable. On the other hand, what's not permitted in a Tasks RCU
> >> reader is a voluntary sleep. So IMO even though cond_resched() is a
> >> voluntary call, it is still not a sleep but rather a preemption point.
> > 
> > From the viewpoint of Task RCU's users, the point is to figure out
> > when it is OK to free an already-removed tracing trampoline.  The
> > current Task RCU implementation relies on the fact that tracing
> > trampolines do not do voluntary context switches.
> 
> Yes.
> 
> > 
> >> So a Tasks RCU reader should perfectly be able to be scheduled out in
> >> the middle of a read-side critical section (in current code) by
> >> calling cond_resched(). It is just like involuntary preemption in the
> >> middle of a RCU reader, in disguise, Right?
> > 
> > You lost me on this one.  This for example is not permitted:
> > 
> >    rcu_read_lock();
> >    cond_resched();
> >    rcu_read_unlock();
> > 
> > But in a CONFIG_PREEMPT=y kernel, that RCU reader could be preempted.
> > 
> > So cond_resched() looks like a voluntary context switch to me.  Recall
> > that vanilla non-preemptible RCU will treat them as quiescent states if
> > the grace period extends long enough.
> > 
> > What am I missing here?
> 
> That we are discussing Tasks-RCU read side section? Sorry I should have been more clear. I thought sleeping was not permitted in Tasks RCU reader, but non-sleep context switches (example involuntarily getting preempted were).

Well, to your initial point, cond_resched() does eventually invoke
preempt_schedule_common(), so you are quite correct that as far as
Tasks RCU is concerned, cond_resched() is not a quiescent state.

						Thanx, Paul
Joel Fernandes Feb. 28, 2024, 10:10 p.m. UTC | #23
> On Feb 28, 2024, at 4:52 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> 
> On Wed, Feb 28, 2024 at 04:27:47PM -0500, Joel Fernandes wrote:
>> 
>> 
>>>> On Feb 28, 2024, at 4:13 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
>>> 
>>> On Wed, Feb 28, 2024 at 03:14:34PM -0500, Joel Fernandes wrote:
>>>>> On Wed, Feb 28, 2024 at 12:18 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>>>>> 
>>>>> On Wed, Feb 28, 2024 at 10:37:51AM -0600, Yan Zhai wrote:
>>>>>> On Wed, Feb 28, 2024 at 9:37 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>>>>>>> Also optionally, I wonder if calling rcu_tasks_qs() directly is better
>>>>>>> (for documentation if anything) since the issue is Tasks RCU specific. Also
>>>>>>> code comment above the rcu_softirq_qs() call about cond_resched() not taking
>>>>>>> care of Tasks RCU would be great!
>>>>>>> 
>>>>>> Yes it's quite surprising to me that cond_resched does not help here,
>>>>> 
>>>>> In theory, it would be possible to make cond_resched() take care of
>>>>> Tasks RCU.  In practice, the lazy-preemption work is looking to get rid
>>>>> of cond_resched().  But if for some reason cond_resched() needs to stay
>>>>> around, doing that work might make sense.
>>>> 
>>>> In my opinion, cond_resched() doing Tasks-RCU QS does not make sense
>>>> (to me), because cond_resched() is to inform the scheduler to run
>>>> something else possibly of higher priority while the current task is
>>>> still runnable. On the other hand, what's not permitted in a Tasks RCU
>>>> reader is a voluntary sleep. So IMO even though cond_resched() is a
>>>> voluntary call, it is still not a sleep but rather a preemption point.
>>> 
>>> From the viewpoint of Task RCU's users, the point is to figure out
>>> when it is OK to free an already-removed tracing trampoline.  The
>>> current Task RCU implementation relies on the fact that tracing
>>> trampolines do not do voluntary context switches.
>> 
>> Yes.
>> 
>>> 
>>>> So a Tasks RCU reader should perfectly be able to be scheduled out in
>>>> the middle of a read-side critical section (in current code) by
>>>> calling cond_resched(). It is just like involuntary preemption in the
>>>> middle of a RCU reader, in disguise, Right?
>>> 
>>> You lost me on this one.  This for example is not permitted:
>>> 
>>>   rcu_read_lock();
>>>   cond_resched();
>>>   rcu_read_unlock();
>>> 
>>> But in a CONFIG_PREEMPT=y kernel, that RCU reader could be preempted.
>>> 
>>> So cond_resched() looks like a voluntary context switch to me.  Recall
>>> that vanilla non-preemptible RCU will treat them as quiescent states if
>>> the grace period extends long enough.
>>> 
>>> What am I missing here?
>> 
>> That we are discussing Tasks-RCU read side section? Sorry I should have been more clear. I thought sleeping was not permitted in Tasks RCU reader, but non-sleep context switches (example involuntarily getting preempted were).
> 
> Well, to your initial point, cond_resched() does eventually invoke
> preempt_schedule_common(), so you are quite correct that as far as
> Tasks RCU is concerned, cond_resched() is not a quiescent state.

 Thanks for confirming. :-)

 - Joel



> 
>                        Thanx, Paul
Paul E. McKenney Feb. 28, 2024, 10:19 p.m. UTC | #24
On Wed, Feb 28, 2024 at 05:10:43PM -0500, Joel Fernandes wrote:
> 
> 
> > On Feb 28, 2024, at 4:52 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > 
> > On Wed, Feb 28, 2024 at 04:27:47PM -0500, Joel Fernandes wrote:
> >> 
> >> 
> >>>> On Feb 28, 2024, at 4:13 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> >>> 
> >>> On Wed, Feb 28, 2024 at 03:14:34PM -0500, Joel Fernandes wrote:
> >>>>> On Wed, Feb 28, 2024 at 12:18 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >>>>> 
> >>>>> On Wed, Feb 28, 2024 at 10:37:51AM -0600, Yan Zhai wrote:
> >>>>>> On Wed, Feb 28, 2024 at 9:37 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> >>>>>>> Also optionally, I wonder if calling rcu_tasks_qs() directly is better
> >>>>>>> (for documentation if anything) since the issue is Tasks RCU specific. Also
> >>>>>>> code comment above the rcu_softirq_qs() call about cond_resched() not taking
> >>>>>>> care of Tasks RCU would be great!
> >>>>>>> 
> >>>>>> Yes it's quite surprising to me that cond_resched does not help here,
> >>>>> 
> >>>>> In theory, it would be possible to make cond_resched() take care of
> >>>>> Tasks RCU.  In practice, the lazy-preemption work is looking to get rid
> >>>>> of cond_resched().  But if for some reason cond_resched() needs to stay
> >>>>> around, doing that work might make sense.
> >>>> 
> >>>> In my opinion, cond_resched() doing Tasks-RCU QS does not make sense
> >>>> (to me), because cond_resched() is to inform the scheduler to run
> >>>> something else possibly of higher priority while the current task is
> >>>> still runnable. On the other hand, what's not permitted in a Tasks RCU
> >>>> reader is a voluntary sleep. So IMO even though cond_resched() is a
> >>>> voluntary call, it is still not a sleep but rather a preemption point.
> >>> 
> >>> From the viewpoint of Task RCU's users, the point is to figure out
> >>> when it is OK to free an already-removed tracing trampoline.  The
> >>> current Task RCU implementation relies on the fact that tracing
> >>> trampolines do not do voluntary context switches.
> >> 
> >> Yes.
> >> 
> >>> 
> >>>> So a Tasks RCU reader should perfectly be able to be scheduled out in
> >>>> the middle of a read-side critical section (in current code) by
> >>>> calling cond_resched(). It is just like involuntary preemption in the
> >>>> middle of a RCU reader, in disguise, Right?
> >>> 
> >>> You lost me on this one.  This for example is not permitted:
> >>> 
> >>>   rcu_read_lock();
> >>>   cond_resched();
> >>>   rcu_read_unlock();
> >>> 
> >>> But in a CONFIG_PREEMPT=y kernel, that RCU reader could be preempted.
> >>> 
> >>> So cond_resched() looks like a voluntary context switch to me.  Recall
> >>> that vanilla non-preemptible RCU will treat them as quiescent states if
> >>> the grace period extends long enough.
> >>> 
> >>> What am I missing here?
> >> 
> >> That we are discussing Tasks-RCU read side section? Sorry I should have been more clear. I thought sleeping was not permitted in Tasks RCU reader, but non-sleep context switches (example involuntarily getting preempted were).
> > 
> > Well, to your initial point, cond_resched() does eventually invoke
> > preempt_schedule_common(), so you are quite correct that as far as
> > Tasks RCU is concerned, cond_resched() is not a quiescent state.
> 
>  Thanks for confirming. :-)

However, given that the current Tasks RCU use cases wait for trampolines
to be evacuated, Tasks RCU could make the choice that cond_resched()
be a quiescent state, for example, by adjusting rcu_all_qs() and
.rcu_urgent_qs accordingly.

But this seems less pressing given the chance that cond_resched() might
go away in favor of lazy preemption.

							Thanx, Paul
Steven Rostedt Feb. 28, 2024, 10:33 p.m. UTC | #25
On Wed, 28 Feb 2024 14:19:11 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> > > 
> > > Well, to your initial point, cond_resched() does eventually invoke
> > > preempt_schedule_common(), so you are quite correct that as far as
> > > Tasks RCU is concerned, cond_resched() is not a quiescent state.  
> > 
> >  Thanks for confirming. :-)  
> 
> However, given that the current Tasks RCU use cases wait for trampolines
> to be evacuated, Tasks RCU could make the choice that cond_resched()
> be a quiescent state, for example, by adjusting rcu_all_qs() and
> .rcu_urgent_qs accordingly.
> 
> But this seems less pressing given the chance that cond_resched() might
> go away in favor of lazy preemption.

Although cond_resched() is technically a "preemption point" and not truly a
voluntary schedule, I would be happy to state that it's not allowed to be
called from trampolines, or their callbacks. Now the question is, does BPF
programs ever call cond_resched()? I don't think they do.

[ Added Alexei ]

-- Steve
Alexei Starovoitov Feb. 28, 2024, 10:48 p.m. UTC | #26
On Wed, Feb 28, 2024 at 2:31 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 28 Feb 2024 14:19:11 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>
> > > >
> > > > Well, to your initial point, cond_resched() does eventually invoke
> > > > preempt_schedule_common(), so you are quite correct that as far as
> > > > Tasks RCU is concerned, cond_resched() is not a quiescent state.
> > >
> > >  Thanks for confirming. :-)
> >
> > However, given that the current Tasks RCU use cases wait for trampolines
> > to be evacuated, Tasks RCU could make the choice that cond_resched()
> > be a quiescent state, for example, by adjusting rcu_all_qs() and
> > .rcu_urgent_qs accordingly.
> >
> > But this seems less pressing given the chance that cond_resched() might
> > go away in favor of lazy preemption.
>
> Although cond_resched() is technically a "preemption point" and not truly a
> voluntary schedule, I would be happy to state that it's not allowed to be
> called from trampolines, or their callbacks. Now the question is, does BPF
> programs ever call cond_resched()? I don't think they do.
>
> [ Added Alexei ]

I'm a bit lost in this thread :)
Just answering the above question.
bpf progs never call cond_resched() directly.
But there are sleepable (aka faultable) bpf progs that
can call some helper or kfunc that may call cond_resched()
in some path.
sleepable bpf progs are protected by rcu_tasks_trace.
That's a very different one vs rcu_tasks.
Paul E. McKenney Feb. 28, 2024, 10:49 p.m. UTC | #27
On Wed, Feb 28, 2024 at 05:33:07PM -0500, Steven Rostedt wrote:
> On Wed, 28 Feb 2024 14:19:11 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > > > 
> > > > Well, to your initial point, cond_resched() does eventually invoke
> > > > preempt_schedule_common(), so you are quite correct that as far as
> > > > Tasks RCU is concerned, cond_resched() is not a quiescent state.  
> > > 
> > >  Thanks for confirming. :-)  
> > 
> > However, given that the current Tasks RCU use cases wait for trampolines
> > to be evacuated, Tasks RCU could make the choice that cond_resched()
> > be a quiescent state, for example, by adjusting rcu_all_qs() and
> > .rcu_urgent_qs accordingly.
> > 
> > But this seems less pressing given the chance that cond_resched() might
> > go away in favor of lazy preemption.
> 
> Although cond_resched() is technically a "preemption point" and not truly a
> voluntary schedule, I would be happy to state that it's not allowed to be
> called from trampolines, or their callbacks. Now the question is, does BPF
> programs ever call cond_resched()? I don't think they do.

Nor do I, but I too must defer to Alexei.  ;-)

> [ Added Alexei ]

The other issue with making cond_resched() be a Tasks RCU quiescent
state is that the CONFIG_PREEMPTION=y version of cond_resched() would
need to stop being a complete no-op.  Which actually might be OK.

							Thanx, Paul
Paul E. McKenney Feb. 28, 2024, 10:58 p.m. UTC | #28
On Wed, Feb 28, 2024 at 02:48:44PM -0800, Alexei Starovoitov wrote:
> On Wed, Feb 28, 2024 at 2:31 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Wed, 28 Feb 2024 14:19:11 -0800
> > "Paul E. McKenney" <paulmck@kernel.org> wrote:
> >
> > > > >
> > > > > Well, to your initial point, cond_resched() does eventually invoke
> > > > > preempt_schedule_common(), so you are quite correct that as far as
> > > > > Tasks RCU is concerned, cond_resched() is not a quiescent state.
> > > >
> > > >  Thanks for confirming. :-)
> > >
> > > However, given that the current Tasks RCU use cases wait for trampolines
> > > to be evacuated, Tasks RCU could make the choice that cond_resched()
> > > be a quiescent state, for example, by adjusting rcu_all_qs() and
> > > .rcu_urgent_qs accordingly.
> > >
> > > But this seems less pressing given the chance that cond_resched() might
> > > go away in favor of lazy preemption.
> >
> > Although cond_resched() is technically a "preemption point" and not truly a
> > voluntary schedule, I would be happy to state that it's not allowed to be
> > called from trampolines, or their callbacks. Now the question is, does BPF
> > programs ever call cond_resched()? I don't think they do.
> >
> > [ Added Alexei ]
> 
> I'm a bit lost in this thread :)
> Just answering the above question.
> bpf progs never call cond_resched() directly.
> But there are sleepable (aka faultable) bpf progs that
> can call some helper or kfunc that may call cond_resched()
> in some path.
> sleepable bpf progs are protected by rcu_tasks_trace.
> That's a very different one vs rcu_tasks.

Suppose that the various cond_resched() invocations scattered throughout
the kernel acted as RCU Tasks quiescent states, so that as soon as a
given task executed a cond_resched(), synchronize_rcu_tasks() might
return or call_rcu_tasks() might invoke its callback.

Would that cause BPF any trouble?

My guess is "no", because it looks like BPF is using RCU Tasks (as you
say, as opposed to RCU Tasks Trace) only to wait for execution to leave a
trampoline.  But I trust you much more than I trust myself on this topic!

							Thanx, Paul
Yan Zhai Feb. 28, 2024, 11:53 p.m. UTC | #29
Hi Eric,

On Tue, Feb 27, 2024 at 10:44 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Hmm....
> Why napi_busy_loop() does not have a similar problem ?
>
I just tried and can reproduce similar behavior on sk busy poll.
However, the interesting thing is, this can happen if I set a super
high polling interval but just send rare packets. In my case I had a 5
sec polling interval (unlikely to be realistic in prod but just for
demonstration), then used nc to send a few packets. Here is what
bpftrace react:

Normal:
time sudo bpftrace -e 'kfunc:napi_busy_loop{@=count();}
interval:s:1{exit();} kfunc:udp_recvmsg {printf("%ld\n",
args->sk->sk_ll_usec);}'
Attaching 3 probes...

@: 0

real    0m1.527s
user    0m0.073s
sys     0m0.128s


Extra wait when polling:
time sudo bpftrace -e 'kfunc:napi_busy_loop{@=count();}
interval:s:1{exit();} kfunc:udp_recvmsg {printf("%ld\n",
args->sk->sk_ll_usec);}'
Attaching 3 probes...
5000000


@: 16

real    0m11.167s
user    0m0.070s
sys     0m0.120s

So the symptoms are the same, bpftrace cannot exit despite having an
1sec timeout. But the execution pattern for these two are probably
different: NAPI threads would keep polling by itself, whereas sk poll
program might only poll when there is no immediate data. When there
are packets, it switches to process packets instead of polling any
more.


Yan
Joel Fernandes Feb. 29, 2024, 2:21 p.m. UTC | #30
On 2/28/2024 5:58 PM, Paul E. McKenney wrote:
> On Wed, Feb 28, 2024 at 02:48:44PM -0800, Alexei Starovoitov wrote:
>> On Wed, Feb 28, 2024 at 2:31 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>>>
>>> On Wed, 28 Feb 2024 14:19:11 -0800
>>> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>>>
>>>>>>
>>>>>> Well, to your initial point, cond_resched() does eventually invoke
>>>>>> preempt_schedule_common(), so you are quite correct that as far as
>>>>>> Tasks RCU is concerned, cond_resched() is not a quiescent state.
>>>>>
>>>>>  Thanks for confirming. :-)
>>>>
>>>> However, given that the current Tasks RCU use cases wait for trampolines
>>>> to be evacuated, Tasks RCU could make the choice that cond_resched()
>>>> be a quiescent state, for example, by adjusting rcu_all_qs() and
>>>> .rcu_urgent_qs accordingly.
>>>>
>>>> But this seems less pressing given the chance that cond_resched() might
>>>> go away in favor of lazy preemption.
>>>
>>> Although cond_resched() is technically a "preemption point" and not truly a
>>> voluntary schedule, I would be happy to state that it's not allowed to be
>>> called from trampolines, or their callbacks. Now the question is, does BPF
>>> programs ever call cond_resched()? I don't think they do.
>>>
>>> [ Added Alexei ]
>>
>> I'm a bit lost in this thread :)
>> Just answering the above question.
>> bpf progs never call cond_resched() directly.
>> But there are sleepable (aka faultable) bpf progs that
>> can call some helper or kfunc that may call cond_resched()
>> in some path.
>> sleepable bpf progs are protected by rcu_tasks_trace.
>> That's a very different one vs rcu_tasks.
> 
> Suppose that the various cond_resched() invocations scattered throughout
> the kernel acted as RCU Tasks quiescent states, so that as soon as a
> given task executed a cond_resched(), synchronize_rcu_tasks() might
> return or call_rcu_tasks() might invoke its callback.
> 
> Would that cause BPF any trouble?
> 
> My guess is "no", because it looks like BPF is using RCU Tasks (as you
> say, as opposed to RCU Tasks Trace) only to wait for execution to leave a
> trampoline.  But I trust you much more than I trust myself on this topic!

But it uses RCU Tasks Trace as well (for sleepable bpf programs), not just
Tasks? Looks like that's what Alexei said above as well, and I confirmed it in
bpf/trampoline.c

        /* The trampoline without fexit and fmod_ret progs doesn't call original
         * function and doesn't use percpu_ref.
         * Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
         * Then use call_rcu_tasks() to wait for the rest of trampoline asm
         * and normal progs.
         */
        call_rcu_tasks_trace(&im->rcu, __bpf_tramp_image_put_rcu_tasks);

The code comment says it uses both.

Thanks,

 - Joel
Paul E. McKenney Feb. 29, 2024, 4:57 p.m. UTC | #31
On Thu, Feb 29, 2024 at 09:21:48AM -0500, Joel Fernandes wrote:
> 
> 
> On 2/28/2024 5:58 PM, Paul E. McKenney wrote:
> > On Wed, Feb 28, 2024 at 02:48:44PM -0800, Alexei Starovoitov wrote:
> >> On Wed, Feb 28, 2024 at 2:31 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >>>
> >>> On Wed, 28 Feb 2024 14:19:11 -0800
> >>> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> >>>
> >>>>>>
> >>>>>> Well, to your initial point, cond_resched() does eventually invoke
> >>>>>> preempt_schedule_common(), so you are quite correct that as far as
> >>>>>> Tasks RCU is concerned, cond_resched() is not a quiescent state.
> >>>>>
> >>>>>  Thanks for confirming. :-)
> >>>>
> >>>> However, given that the current Tasks RCU use cases wait for trampolines
> >>>> to be evacuated, Tasks RCU could make the choice that cond_resched()
> >>>> be a quiescent state, for example, by adjusting rcu_all_qs() and
> >>>> .rcu_urgent_qs accordingly.
> >>>>
> >>>> But this seems less pressing given the chance that cond_resched() might
> >>>> go away in favor of lazy preemption.
> >>>
> >>> Although cond_resched() is technically a "preemption point" and not truly a
> >>> voluntary schedule, I would be happy to state that it's not allowed to be
> >>> called from trampolines, or their callbacks. Now the question is, does BPF
> >>> programs ever call cond_resched()? I don't think they do.
> >>>
> >>> [ Added Alexei ]
> >>
> >> I'm a bit lost in this thread :)
> >> Just answering the above question.
> >> bpf progs never call cond_resched() directly.
> >> But there are sleepable (aka faultable) bpf progs that
> >> can call some helper or kfunc that may call cond_resched()
> >> in some path.
> >> sleepable bpf progs are protected by rcu_tasks_trace.
> >> That's a very different one vs rcu_tasks.
> > 
> > Suppose that the various cond_resched() invocations scattered throughout
> > the kernel acted as RCU Tasks quiescent states, so that as soon as a
> > given task executed a cond_resched(), synchronize_rcu_tasks() might
> > return or call_rcu_tasks() might invoke its callback.
> > 
> > Would that cause BPF any trouble?
> > 
> > My guess is "no", because it looks like BPF is using RCU Tasks (as you
> > say, as opposed to RCU Tasks Trace) only to wait for execution to leave a
> > trampoline.  But I trust you much more than I trust myself on this topic!
> 
> But it uses RCU Tasks Trace as well (for sleepable bpf programs), not just
> Tasks? Looks like that's what Alexei said above as well, and I confirmed it in
> bpf/trampoline.c
> 
>         /* The trampoline without fexit and fmod_ret progs doesn't call original
>          * function and doesn't use percpu_ref.
>          * Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
>          * Then use call_rcu_tasks() to wait for the rest of trampoline asm
>          * and normal progs.
>          */
>         call_rcu_tasks_trace(&im->rcu, __bpf_tramp_image_put_rcu_tasks);
> 
> The code comment says it uses both.

BPF does quite a few interesting things with these.

But would you like to look at the update-side uses of RCU Tasks Rude
to see if lazy preemption affects them?  I don't believe that there
are any problems here, but we do need to check.

							Thanx, Paul
Joel Fernandes Feb. 29, 2024, 5:41 p.m. UTC | #32
> On Feb 29, 2024, at 11:57 AM, Paul E. McKenney <paulmck@kernel.org> wrote:
> 
> On Thu, Feb 29, 2024 at 09:21:48AM -0500, Joel Fernandes wrote:
>> 
>> 
>>> On 2/28/2024 5:58 PM, Paul E. McKenney wrote:
>>> On Wed, Feb 28, 2024 at 02:48:44PM -0800, Alexei Starovoitov wrote:
>>>> On Wed, Feb 28, 2024 at 2:31 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>>>>> 
>>>>> On Wed, 28 Feb 2024 14:19:11 -0800
>>>>> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>>>>> 
>>>>>>>> 
>>>>>>>> Well, to your initial point, cond_resched() does eventually invoke
>>>>>>>> preempt_schedule_common(), so you are quite correct that as far as
>>>>>>>> Tasks RCU is concerned, cond_resched() is not a quiescent state.
>>>>>>> 
>>>>>>> Thanks for confirming. :-)
>>>>>> 
>>>>>> However, given that the current Tasks RCU use cases wait for trampolines
>>>>>> to be evacuated, Tasks RCU could make the choice that cond_resched()
>>>>>> be a quiescent state, for example, by adjusting rcu_all_qs() and
>>>>>> .rcu_urgent_qs accordingly.
>>>>>> 
>>>>>> But this seems less pressing given the chance that cond_resched() might
>>>>>> go away in favor of lazy preemption.
>>>>> 
>>>>> Although cond_resched() is technically a "preemption point" and not truly a
>>>>> voluntary schedule, I would be happy to state that it's not allowed to be
>>>>> called from trampolines, or their callbacks. Now the question is, does BPF
>>>>> programs ever call cond_resched()? I don't think they do.
>>>>> 
>>>>> [ Added Alexei ]
>>>> 
>>>> I'm a bit lost in this thread :)
>>>> Just answering the above question.
>>>> bpf progs never call cond_resched() directly.
>>>> But there are sleepable (aka faultable) bpf progs that
>>>> can call some helper or kfunc that may call cond_resched()
>>>> in some path.
>>>> sleepable bpf progs are protected by rcu_tasks_trace.
>>>> That's a very different one vs rcu_tasks.
>>> 
>>> Suppose that the various cond_resched() invocations scattered throughout
>>> the kernel acted as RCU Tasks quiescent states, so that as soon as a
>>> given task executed a cond_resched(), synchronize_rcu_tasks() might
>>> return or call_rcu_tasks() might invoke its callback.
>>> 
>>> Would that cause BPF any trouble?
>>> 
>>> My guess is "no", because it looks like BPF is using RCU Tasks (as you
>>> say, as opposed to RCU Tasks Trace) only to wait for execution to leave a
>>> trampoline.  But I trust you much more than I trust myself on this topic!
>> 
>> But it uses RCU Tasks Trace as well (for sleepable bpf programs), not just
>> Tasks? Looks like that's what Alexei said above as well, and I confirmed it in
>> bpf/trampoline.c
>> 
>>        /* The trampoline without fexit and fmod_ret progs doesn't call original
>>         * function and doesn't use percpu_ref.
>>         * Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
>>         * Then use call_rcu_tasks() to wait for the rest of trampoline asm
>>         * and normal progs.
>>         */
>>        call_rcu_tasks_trace(&im->rcu, __bpf_tramp_image_put_rcu_tasks);
>> 
>> The code comment says it uses both.
> 
> BPF does quite a few interesting things with these.
> 
> But would you like to look at the update-side uses of RCU Tasks Rude
> to see if lazy preemption affects them?  I don't believe that there
> are any problems here, but we do need to check.

Sure I will be happy to. I am planning look at it in detail over the 3 day weekend. Too much fun! ;-)

thanks,

- Joel



> 
>                            Thanx, Paul
Paul E. McKenney Feb. 29, 2024, 6:29 p.m. UTC | #33
On Thu, Feb 29, 2024 at 12:41:55PM -0500, Joel Fernandes wrote:
> > On Feb 29, 2024, at 11:57 AM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > On Thu, Feb 29, 2024 at 09:21:48AM -0500, Joel Fernandes wrote:
> >>> On 2/28/2024 5:58 PM, Paul E. McKenney wrote:
> >>> On Wed, Feb 28, 2024 at 02:48:44PM -0800, Alexei Starovoitov wrote:
> >>>> On Wed, Feb 28, 2024 at 2:31 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >>>>> 
> >>>>> On Wed, 28 Feb 2024 14:19:11 -0800
> >>>>> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> >>>>> 
> >>>>>>>> 
> >>>>>>>> Well, to your initial point, cond_resched() does eventually invoke
> >>>>>>>> preempt_schedule_common(), so you are quite correct that as far as
> >>>>>>>> Tasks RCU is concerned, cond_resched() is not a quiescent state.
> >>>>>>> 
> >>>>>>> Thanks for confirming. :-)
> >>>>>> 
> >>>>>> However, given that the current Tasks RCU use cases wait for trampolines
> >>>>>> to be evacuated, Tasks RCU could make the choice that cond_resched()
> >>>>>> be a quiescent state, for example, by adjusting rcu_all_qs() and
> >>>>>> .rcu_urgent_qs accordingly.
> >>>>>> 
> >>>>>> But this seems less pressing given the chance that cond_resched() might
> >>>>>> go away in favor of lazy preemption.
> >>>>> 
> >>>>> Although cond_resched() is technically a "preemption point" and not truly a
> >>>>> voluntary schedule, I would be happy to state that it's not allowed to be
> >>>>> called from trampolines, or their callbacks. Now the question is, does BPF
> >>>>> programs ever call cond_resched()? I don't think they do.
> >>>>> 
> >>>>> [ Added Alexei ]
> >>>> 
> >>>> I'm a bit lost in this thread :)
> >>>> Just answering the above question.
> >>>> bpf progs never call cond_resched() directly.
> >>>> But there are sleepable (aka faultable) bpf progs that
> >>>> can call some helper or kfunc that may call cond_resched()
> >>>> in some path.
> >>>> sleepable bpf progs are protected by rcu_tasks_trace.
> >>>> That's a very different one vs rcu_tasks.
> >>> 
> >>> Suppose that the various cond_resched() invocations scattered throughout
> >>> the kernel acted as RCU Tasks quiescent states, so that as soon as a
> >>> given task executed a cond_resched(), synchronize_rcu_tasks() might
> >>> return or call_rcu_tasks() might invoke its callback.
> >>> 
> >>> Would that cause BPF any trouble?
> >>> 
> >>> My guess is "no", because it looks like BPF is using RCU Tasks (as you
> >>> say, as opposed to RCU Tasks Trace) only to wait for execution to leave a
> >>> trampoline.  But I trust you much more than I trust myself on this topic!
> >> 
> >> But it uses RCU Tasks Trace as well (for sleepable bpf programs), not just
> >> Tasks? Looks like that's what Alexei said above as well, and I confirmed it in
> >> bpf/trampoline.c
> >> 
> >>        /* The trampoline without fexit and fmod_ret progs doesn't call original
> >>         * function and doesn't use percpu_ref.
> >>         * Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
> >>         * Then use call_rcu_tasks() to wait for the rest of trampoline asm
> >>         * and normal progs.
> >>         */
> >>        call_rcu_tasks_trace(&im->rcu, __bpf_tramp_image_put_rcu_tasks);
> >> 
> >> The code comment says it uses both.
> > 
> > BPF does quite a few interesting things with these.
> > 
> > But would you like to look at the update-side uses of RCU Tasks Rude
> > to see if lazy preemption affects them?  I don't believe that there
> > are any problems here, but we do need to check.
> 
> Sure I will be happy to. I am planning look at it in detail over the 3 day weekend. Too much fun! ;-)

Thank you, and looking forward to seeing what you come up with!

The canonical concern would be that someone somewhere is using either
call_rcu_tasks_rude() or synchronize_rcu_tasks_rude() to wait for
non-preemptible regions of code that does not account for the possibility
of preemption in CONFIG_PREEMPT_NONE or PREEMPT_PREEMPT_VOLUNTARY kernels.

I *think* that these are used only to handle the possibility
of tracepoints on functions on the entry/exit path and on the
RCU-not-watching portions of the idle loop.  If so, then there is no
difference in behavior for lazy preemption.  But who knows?

						Thanx, Paul
Joel Fernandes March 2, 2024, 2:24 a.m. UTC | #34
(Shrinking CC a bit)

On Thu, Feb 29, 2024 at 1:29 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Thu, Feb 29, 2024 at 12:41:55PM -0500, Joel Fernandes wrote:
> > > On Feb 29, 2024, at 11:57 AM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > On Thu, Feb 29, 2024 at 09:21:48AM -0500, Joel Fernandes wrote:
> > >>> On 2/28/2024 5:58 PM, Paul E. McKenney wrote:
> > >>> On Wed, Feb 28, 2024 at 02:48:44PM -0800, Alexei Starovoitov wrote:
> > >>>> On Wed, Feb 28, 2024 at 2:31 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> > >>>>>
> > >>>>> On Wed, 28 Feb 2024 14:19:11 -0800
> > >>>>> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> > >>>>>
> > >>>>>>>>
> > >>>>>>>> Well, to your initial point, cond_resched() does eventually invoke
> > >>>>>>>> preempt_schedule_common(), so you are quite correct that as far as
> > >>>>>>>> Tasks RCU is concerned, cond_resched() is not a quiescent state.
> > >>>>>>>
> > >>>>>>> Thanks for confirming. :-)
> > >>>>>>
> > >>>>>> However, given that the current Tasks RCU use cases wait for trampolines
> > >>>>>> to be evacuated, Tasks RCU could make the choice that cond_resched()
> > >>>>>> be a quiescent state, for example, by adjusting rcu_all_qs() and
> > >>>>>> .rcu_urgent_qs accordingly.
> > >>>>>>
> > >>>>>> But this seems less pressing given the chance that cond_resched() might
> > >>>>>> go away in favor of lazy preemption.
> > >>>>>
> > >>>>> Although cond_resched() is technically a "preemption point" and not truly a
> > >>>>> voluntary schedule, I would be happy to state that it's not allowed to be
> > >>>>> called from trampolines, or their callbacks. Now the question is, does BPF
> > >>>>> programs ever call cond_resched()? I don't think they do.
> > >>>>>
> > >>>>> [ Added Alexei ]
> > >>>>
> > >>>> I'm a bit lost in this thread :)
> > >>>> Just answering the above question.
> > >>>> bpf progs never call cond_resched() directly.
> > >>>> But there are sleepable (aka faultable) bpf progs that
> > >>>> can call some helper or kfunc that may call cond_resched()
> > >>>> in some path.
> > >>>> sleepable bpf progs are protected by rcu_tasks_trace.
> > >>>> That's a very different one vs rcu_tasks.
> > >>>
> > >>> Suppose that the various cond_resched() invocations scattered throughout
> > >>> the kernel acted as RCU Tasks quiescent states, so that as soon as a
> > >>> given task executed a cond_resched(), synchronize_rcu_tasks() might
> > >>> return or call_rcu_tasks() might invoke its callback.
> > >>>
> > >>> Would that cause BPF any trouble?
> > >>>
> > >>> My guess is "no", because it looks like BPF is using RCU Tasks (as you
> > >>> say, as opposed to RCU Tasks Trace) only to wait for execution to leave a
> > >>> trampoline.  But I trust you much more than I trust myself on this topic!
> > >>
> > >> But it uses RCU Tasks Trace as well (for sleepable bpf programs), not just
> > >> Tasks? Looks like that's what Alexei said above as well, and I confirmed it in
> > >> bpf/trampoline.c
> > >>
> > >>        /* The trampoline without fexit and fmod_ret progs doesn't call original
> > >>         * function and doesn't use percpu_ref.
> > >>         * Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
> > >>         * Then use call_rcu_tasks() to wait for the rest of trampoline asm
> > >>         * and normal progs.
> > >>         */
> > >>        call_rcu_tasks_trace(&im->rcu, __bpf_tramp_image_put_rcu_tasks);
> > >>
> > >> The code comment says it uses both.
> > >
> > > BPF does quite a few interesting things with these.
> > >
> > > But would you like to look at the update-side uses of RCU Tasks Rude
> > > to see if lazy preemption affects them?  I don't believe that there
> > > are any problems here, but we do need to check.
> >
> > Sure I will be happy to. I am planning look at it in detail over the 3 day weekend. Too much fun! ;-)
>
> Thank you, and looking forward to seeing what you come up with!
>
> The canonical concern would be that someone somewhere is using either
> call_rcu_tasks_rude() or synchronize_rcu_tasks_rude() to wait for
> non-preemptible regions of code that does not account for the possibility
> of preemption in CONFIG_PREEMPT_NONE or PREEMPT_PREEMPT_VOLUNTARY kernels.
>
> I *think* that these are used only to handle the possibility
> of tracepoints on functions on the entry/exit path and on the
> RCU-not-watching portions of the idle loop.  If so, then there is no
> difference in behavior for lazy preemption.  But who knows?

Hi Paul, regarding CONFIG_PREEMPT_AUTO, for Tasks RCU rude, I think
the following patch will address your concern about quiescent states
on CPUs spinning away in kernel mode:

"sched/fair: handle tick expiry under lazy preemption"
Link: https://lore.kernel.org/all/20240213055554.1802415-24-ankur.a.arora@oracle.com/

In this patch Ankur makes sure that the scheduling-clock interrupt
will reschedule the CPU after a tick and not let queued tasks starve
due to lazy re-scheduling. So my impression is the
"schedule_on_each_cpu()" should schedule a worker thread in time to
apply the implied Tasks RCU quiescent state even if the rescheduling
was a LAZY-reschedule.

Also, not sure if the "voluntary mode" of CONFIG_PREEMPT_AUTO behaves
differently. My feeling is regardless of preemption mode,
CONFIG_PREEMPT_AUTO should always preempt after a tick if something
else needs to run. It just will not preempt immediately like before
(although CFS did already have some wakeup preemption logic to slow it
down a bit). I am reviewing Ankur's patches more to confirm that and
also reviewing his patches more to see how it could affect.

thanks,

 - Joel
Paul E. McKenney March 3, 2024, 12:25 a.m. UTC | #35
On Fri, Mar 01, 2024 at 09:24:15PM -0500, Joel Fernandes wrote:
> (Shrinking CC a bit)
> 
> On Thu, Feb 29, 2024 at 1:29 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Thu, Feb 29, 2024 at 12:41:55PM -0500, Joel Fernandes wrote:
> > > > On Feb 29, 2024, at 11:57 AM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > On Thu, Feb 29, 2024 at 09:21:48AM -0500, Joel Fernandes wrote:
> > > >>> On 2/28/2024 5:58 PM, Paul E. McKenney wrote:
> > > >>> On Wed, Feb 28, 2024 at 02:48:44PM -0800, Alexei Starovoitov wrote:
> > > >>>> On Wed, Feb 28, 2024 at 2:31 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> > > >>>>>
> > > >>>>> On Wed, 28 Feb 2024 14:19:11 -0800
> > > >>>>> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> > > >>>>>
> > > >>>>>>>>
> > > >>>>>>>> Well, to your initial point, cond_resched() does eventually invoke
> > > >>>>>>>> preempt_schedule_common(), so you are quite correct that as far as
> > > >>>>>>>> Tasks RCU is concerned, cond_resched() is not a quiescent state.
> > > >>>>>>>
> > > >>>>>>> Thanks for confirming. :-)
> > > >>>>>>
> > > >>>>>> However, given that the current Tasks RCU use cases wait for trampolines
> > > >>>>>> to be evacuated, Tasks RCU could make the choice that cond_resched()
> > > >>>>>> be a quiescent state, for example, by adjusting rcu_all_qs() and
> > > >>>>>> .rcu_urgent_qs accordingly.
> > > >>>>>>
> > > >>>>>> But this seems less pressing given the chance that cond_resched() might
> > > >>>>>> go away in favor of lazy preemption.
> > > >>>>>
> > > >>>>> Although cond_resched() is technically a "preemption point" and not truly a
> > > >>>>> voluntary schedule, I would be happy to state that it's not allowed to be
> > > >>>>> called from trampolines, or their callbacks. Now the question is, does BPF
> > > >>>>> programs ever call cond_resched()? I don't think they do.
> > > >>>>>
> > > >>>>> [ Added Alexei ]
> > > >>>>
> > > >>>> I'm a bit lost in this thread :)
> > > >>>> Just answering the above question.
> > > >>>> bpf progs never call cond_resched() directly.
> > > >>>> But there are sleepable (aka faultable) bpf progs that
> > > >>>> can call some helper or kfunc that may call cond_resched()
> > > >>>> in some path.
> > > >>>> sleepable bpf progs are protected by rcu_tasks_trace.
> > > >>>> That's a very different one vs rcu_tasks.
> > > >>>
> > > >>> Suppose that the various cond_resched() invocations scattered throughout
> > > >>> the kernel acted as RCU Tasks quiescent states, so that as soon as a
> > > >>> given task executed a cond_resched(), synchronize_rcu_tasks() might
> > > >>> return or call_rcu_tasks() might invoke its callback.
> > > >>>
> > > >>> Would that cause BPF any trouble?
> > > >>>
> > > >>> My guess is "no", because it looks like BPF is using RCU Tasks (as you
> > > >>> say, as opposed to RCU Tasks Trace) only to wait for execution to leave a
> > > >>> trampoline.  But I trust you much more than I trust myself on this topic!
> > > >>
> > > >> But it uses RCU Tasks Trace as well (for sleepable bpf programs), not just
> > > >> Tasks? Looks like that's what Alexei said above as well, and I confirmed it in
> > > >> bpf/trampoline.c
> > > >>
> > > >>        /* The trampoline without fexit and fmod_ret progs doesn't call original
> > > >>         * function and doesn't use percpu_ref.
> > > >>         * Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
> > > >>         * Then use call_rcu_tasks() to wait for the rest of trampoline asm
> > > >>         * and normal progs.
> > > >>         */
> > > >>        call_rcu_tasks_trace(&im->rcu, __bpf_tramp_image_put_rcu_tasks);
> > > >>
> > > >> The code comment says it uses both.
> > > >
> > > > BPF does quite a few interesting things with these.
> > > >
> > > > But would you like to look at the update-side uses of RCU Tasks Rude
> > > > to see if lazy preemption affects them?  I don't believe that there
> > > > are any problems here, but we do need to check.
> > >
> > > Sure I will be happy to. I am planning look at it in detail over the 3 day weekend. Too much fun! ;-)
> >
> > Thank you, and looking forward to seeing what you come up with!
> >
> > The canonical concern would be that someone somewhere is using either
> > call_rcu_tasks_rude() or synchronize_rcu_tasks_rude() to wait for
> > non-preemptible regions of code that does not account for the possibility
> > of preemption in CONFIG_PREEMPT_NONE or PREEMPT_PREEMPT_VOLUNTARY kernels.
> >
> > I *think* that these are used only to handle the possibility
> > of tracepoints on functions on the entry/exit path and on the
> > RCU-not-watching portions of the idle loop.  If so, then there is no
> > difference in behavior for lazy preemption.  But who knows?
> 
> Hi Paul, regarding CONFIG_PREEMPT_AUTO, for Tasks RCU rude, I think
> the following patch will address your concern about quiescent states
> on CPUs spinning away in kernel mode:
> 
> "sched/fair: handle tick expiry under lazy preemption"
> Link: https://lore.kernel.org/all/20240213055554.1802415-24-ankur.a.arora@oracle.com/
> 
> In this patch Ankur makes sure that the scheduling-clock interrupt
> will reschedule the CPU after a tick and not let queued tasks starve
> due to lazy re-scheduling. So my impression is the
> "schedule_on_each_cpu()" should schedule a worker thread in time to
> apply the implied Tasks RCU quiescent state even if the rescheduling
> was a LAZY-reschedule.
> 
> Also, not sure if the "voluntary mode" of CONFIG_PREEMPT_AUTO behaves
> differently. My feeling is regardless of preemption mode,
> CONFIG_PREEMPT_AUTO should always preempt after a tick if something
> else needs to run. It just will not preempt immediately like before
> (although CFS did already have some wakeup preemption logic to slow it
> down a bit). I am reviewing Ankur's patches more to confirm that and
> also reviewing his patches more to see how it could affect.

Thank you for the info!

As you noted, one thing that Ankur's series changes is that preemption
can occur anywhere that it is not specifically disabled in kernels
built with CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y.  This in
turn changes Tasks Rude RCU's definition of a quiescent state for these
kernels, adding all code regions where preemption is not specifically
disabled to the list of such quiescent states.

Although from what I know, this is OK, it would be good to check the
calls to call_rcu_tasks_rude() or synchronize_rcu_tasks_rude() are set
up so as to expect these new quiescent states.  One example where it
would definitely be OK is if there was a call to synchronize_rcu_tasks()
right before or after that call to synchronize_rcu_tasks_rude().

Would you be willing to check the call sites to verify that they
are OK with this change in semantics?

							Thanx, Paul
Joel Fernandes March 3, 2024, 1:01 a.m. UTC | #36
On Sat, Mar 2, 2024 at 7:25 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Fri, Mar 01, 2024 at 09:24:15PM -0500, Joel Fernandes wrote:
> > (Shrinking CC a bit)
> >
> > On Thu, Feb 29, 2024 at 1:29 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Thu, Feb 29, 2024 at 12:41:55PM -0500, Joel Fernandes wrote:
> > > > > On Feb 29, 2024, at 11:57 AM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > On Thu, Feb 29, 2024 at 09:21:48AM -0500, Joel Fernandes wrote:
> > > > >>> On 2/28/2024 5:58 PM, Paul E. McKenney wrote:
> > > > >>> On Wed, Feb 28, 2024 at 02:48:44PM -0800, Alexei Starovoitov wrote:
> > > > >>>> On Wed, Feb 28, 2024 at 2:31 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> > > > >>>>>
> > > > >>>>> On Wed, 28 Feb 2024 14:19:11 -0800
> > > > >>>>> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> > > > >>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> Well, to your initial point, cond_resched() does eventually invoke
> > > > >>>>>>>> preempt_schedule_common(), so you are quite correct that as far as
> > > > >>>>>>>> Tasks RCU is concerned, cond_resched() is not a quiescent state.
> > > > >>>>>>>
> > > > >>>>>>> Thanks for confirming. :-)
> > > > >>>>>>
> > > > >>>>>> However, given that the current Tasks RCU use cases wait for trampolines
> > > > >>>>>> to be evacuated, Tasks RCU could make the choice that cond_resched()
> > > > >>>>>> be a quiescent state, for example, by adjusting rcu_all_qs() and
> > > > >>>>>> .rcu_urgent_qs accordingly.
> > > > >>>>>>
> > > > >>>>>> But this seems less pressing given the chance that cond_resched() might
> > > > >>>>>> go away in favor of lazy preemption.
> > > > >>>>>
> > > > >>>>> Although cond_resched() is technically a "preemption point" and not truly a
> > > > >>>>> voluntary schedule, I would be happy to state that it's not allowed to be
> > > > >>>>> called from trampolines, or their callbacks. Now the question is, does BPF
> > > > >>>>> programs ever call cond_resched()? I don't think they do.
> > > > >>>>>
> > > > >>>>> [ Added Alexei ]
> > > > >>>>
> > > > >>>> I'm a bit lost in this thread :)
> > > > >>>> Just answering the above question.
> > > > >>>> bpf progs never call cond_resched() directly.
> > > > >>>> But there are sleepable (aka faultable) bpf progs that
> > > > >>>> can call some helper or kfunc that may call cond_resched()
> > > > >>>> in some path.
> > > > >>>> sleepable bpf progs are protected by rcu_tasks_trace.
> > > > >>>> That's a very different one vs rcu_tasks.
> > > > >>>
> > > > >>> Suppose that the various cond_resched() invocations scattered throughout
> > > > >>> the kernel acted as RCU Tasks quiescent states, so that as soon as a
> > > > >>> given task executed a cond_resched(), synchronize_rcu_tasks() might
> > > > >>> return or call_rcu_tasks() might invoke its callback.
> > > > >>>
> > > > >>> Would that cause BPF any trouble?
> > > > >>>
> > > > >>> My guess is "no", because it looks like BPF is using RCU Tasks (as you
> > > > >>> say, as opposed to RCU Tasks Trace) only to wait for execution to leave a
> > > > >>> trampoline.  But I trust you much more than I trust myself on this topic!
> > > > >>
> > > > >> But it uses RCU Tasks Trace as well (for sleepable bpf programs), not just
> > > > >> Tasks? Looks like that's what Alexei said above as well, and I confirmed it in
> > > > >> bpf/trampoline.c
> > > > >>
> > > > >>        /* The trampoline without fexit and fmod_ret progs doesn't call original
> > > > >>         * function and doesn't use percpu_ref.
> > > > >>         * Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
> > > > >>         * Then use call_rcu_tasks() to wait for the rest of trampoline asm
> > > > >>         * and normal progs.
> > > > >>         */
> > > > >>        call_rcu_tasks_trace(&im->rcu, __bpf_tramp_image_put_rcu_tasks);
> > > > >>
> > > > >> The code comment says it uses both.
> > > > >
> > > > > BPF does quite a few interesting things with these.
> > > > >
> > > > > But would you like to look at the update-side uses of RCU Tasks Rude
> > > > > to see if lazy preemption affects them?  I don't believe that there
> > > > > are any problems here, but we do need to check.
> > > >
> > > > Sure I will be happy to. I am planning look at it in detail over the 3 day weekend. Too much fun! ;-)
> > >
> > > Thank you, and looking forward to seeing what you come up with!
> > >
> > > The canonical concern would be that someone somewhere is using either
> > > call_rcu_tasks_rude() or synchronize_rcu_tasks_rude() to wait for
> > > non-preemptible regions of code that does not account for the possibility
> > > of preemption in CONFIG_PREEMPT_NONE or PREEMPT_PREEMPT_VOLUNTARY kernels.
> > >
> > > I *think* that these are used only to handle the possibility
> > > of tracepoints on functions on the entry/exit path and on the
> > > RCU-not-watching portions of the idle loop.  If so, then there is no
> > > difference in behavior for lazy preemption.  But who knows?
> >
> > Hi Paul, regarding CONFIG_PREEMPT_AUTO, for Tasks RCU rude, I think
> > the following patch will address your concern about quiescent states
> > on CPUs spinning away in kernel mode:
> >
> > "sched/fair: handle tick expiry under lazy preemption"
> > Link: https://lore.kernel.org/all/20240213055554.1802415-24-ankur.a.arora@oracle.com/
> >
> > In this patch Ankur makes sure that the scheduling-clock interrupt
> > will reschedule the CPU after a tick and not let queued tasks starve
> > due to lazy re-scheduling. So my impression is the
> > "schedule_on_each_cpu()" should schedule a worker thread in time to
> > apply the implied Tasks RCU quiescent state even if the rescheduling
> > was a LAZY-reschedule.
> >
> > Also, not sure if the "voluntary mode" of CONFIG_PREEMPT_AUTO behaves
> > differently. My feeling is regardless of preemption mode,
> > CONFIG_PREEMPT_AUTO should always preempt after a tick if something
> > else needs to run. It just will not preempt immediately like before
> > (although CFS did already have some wakeup preemption logic to slow it
> > down a bit). I am reviewing Ankur's patches more to confirm that and
> > also reviewing his patches more to see how it could affect.
>
> Thank you for the info!
>
> As you noted, one thing that Ankur's series changes is that preemption
> can occur anywhere that it is not specifically disabled in kernels
> built with CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y.  This in
> turn changes Tasks Rude RCU's definition of a quiescent state for these
> kernels, adding all code regions where preemption is not specifically
> disabled to the list of such quiescent states.
>
> Although from what I know, this is OK, it would be good to check the
> calls to call_rcu_tasks_rude() or synchronize_rcu_tasks_rude() are set
> up so as to expect these new quiescent states.  One example where it
> would definitely be OK is if there was a call to synchronize_rcu_tasks()
> right before or after that call to synchronize_rcu_tasks_rude().
>
> Would you be willing to check the call sites to verify that they
> are OK with this change in semantics?

Yes, I will analyze and make sure those users did not unexpectedly
assume something about AUTO (i.e. preempt enabled sections using
readers).

Btw, as I think you mentioned, with Ankur's patch even with
CONFIG_PREEMPT_NONE=y, a preemption on the tick boundary can occur (in
preempt=none mode)!

Btw, For RUDE - If we wish to preempt sooner on "preempt=voluntary" of
future CONFIG_PREEMPT_AUTO=y kernels, then we can potentially replace
the schedule_on_each_cpu() with a higher priority (higher class)
per-CPU threads like RT. Then wake them all up and waiting till the
next tick is not needed for a CPU to be marked quiescent. Would
something like that be of interest?

Thanks.
Joel Fernandes March 4, 2024, 9:16 a.m. UTC | #37
Hi Paul,

On 3/2/2024 8:01 PM, Joel Fernandes wrote:
>> As you noted, one thing that Ankur's series changes is that preemption
>> can occur anywhere that it is not specifically disabled in kernels
>> built with CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y.  This in
>> turn changes Tasks Rude RCU's definition of a quiescent state for these
>> kernels, adding all code regions where preemption is not specifically
>> disabled to the list of such quiescent states.
>>
>> Although from what I know, this is OK, it would be good to check the
>> calls to call_rcu_tasks_rude() or synchronize_rcu_tasks_rude() are set
>> up so as to expect these new quiescent states.  One example where it
>> would definitely be OK is if there was a call to synchronize_rcu_tasks()
>> right before or after that call to synchronize_rcu_tasks_rude().
>>
>> Would you be willing to check the call sites to verify that they
>> are OK with this change in 
> Yes, I will analyze and make sure those users did not unexpectedly
> assume something about AUTO (i.e. preempt enabled sections using
> readers).

Other than RCU test code, there are just 3 call sites for RUDE right now, all in
ftrace.c.

(Long story short, PREEMPT_AUTO should not cause wreckage in TASKS_RCU_RUDE
other than any preexisting wreckage that !PREEMPT_AUTO already had. Steve is on
CC as well to CMIIW).

Case 1: For !CONFIG_DYNAMIC_FTRACE update of ftrace_trace_function

This config is itself expected to be slow. However seeing what it does, it is
trying to make sure the global function pointer "ftrace_trace_function" is
updated and any readers of that pointers would have finished reading it. I don't
personally think preemption has to be disabled across the entirety of the
section that calls into this function. So sensitivity to preempt disabling
should not be relevant for this case IMO, but lets see if ftrace folks disagree
(on CC). It has more to do with, any callers of this function pointer are no
longer calling into the old function.

Case 2: Trampoline structures accessing

For this there is a code comment that says preemption will disabled so it should
not be dependent on any of the preemptiblity modes, because preempt_disable()
should disable preempt with PREEMPT_AUTO.

		/*
		 * We need to do a hard force of sched synchronization.
		 * This is because we use preempt_disable() to do RCU, but
		 * the function tracers can be called where RCU is not watching
		 * (like before user_exit()). We can not rely on the RCU
		 * infrastructure to do the synchronization, thus we must do it
		 * ourselves.
		 */
		synchronize_rcu_tasks_rude();
		[...]
		ftrace_trampoline_free(ops);

Code comment probably needs update because it says 'can not rely on RCU..' ;-)

My *guess* is the preempt_disable() mentioned in this case is
ftrace_ops_trampoline() where trampoline-related datas tructures are accessed
for stack unwinding purposes. This is a data structure protection thing AFAICS
and nothing to do with "trampoline execution" itself which needs "Tasks RCU" to
allow for preemption in trampolines.

Case 3: This has to do with update of function graph tracing and there is the
same comment as case 2, where preempt will be disabled in readers, so it should
be safe for PREEMPT_AUTO (famous last words).

Though I am not yet able to locate that preempt_disable() which is not an
PREEMPT_AUTO-related issue anyway. Maybe its buried in function graph tracing
logic somewhere?

Finally, my thought also was, if any of these thread usages/cases of Tasks RCU
RUDE assume working only on a CONFIG_PREEMPT_NONE=y or
CONFIG_PREEMPT_VOLUNTARY=y kernel, that could be worrying but AFAICS, they don't
assume anything related to that.

thanks,

 - Joel
Joel Fernandes March 4, 2024, 9:16 a.m. UTC | #38
Hi Paul,

On 3/2/2024 8:01 PM, Joel Fernandes wrote:
>> As you noted, one thing that Ankur's series changes is that preemption
>> can occur anywhere that it is not specifically disabled in kernels
>> built with CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y.  This in
>> turn changes Tasks Rude RCU's definition of a quiescent state for these
>> kernels, adding all code regions where preemption is not specifically
>> disabled to the list of such quiescent states.
>>
>> Although from what I know, this is OK, it would be good to check the
>> calls to call_rcu_tasks_rude() or synchronize_rcu_tasks_rude() are set
>> up so as to expect these new quiescent states.  One example where it
>> would definitely be OK is if there was a call to synchronize_rcu_tasks()
>> right before or after that call to synchronize_rcu_tasks_rude().
>>
>> Would you be willing to check the call sites to verify that they
>> are OK with this change in 
> Yes, I will analyze and make sure those users did not unexpectedly
> assume something about AUTO (i.e. preempt enabled sections using
> readers).

Other than RCU test code, there are just 3 call sites for RUDE right now, all in
ftrace.c.

(Long story short, PREEMPT_AUTO should not cause wreckage in TASKS_RCU_RUDE
other than any preexisting wreckage that !PREEMPT_AUTO already had. Steve is on
CC as well to CMIIW).

Case 1: For !CONFIG_DYNAMIC_FTRACE update of ftrace_trace_function

This config is itself expected to be slow. However seeing what it does, it is
trying to make sure the global function pointer "ftrace_trace_function" is
updated and any readers of that pointers would have finished reading it. I don't
personally think preemption has to be disabled across the entirety of the
section that calls into this function. So sensitivity to preempt disabling
should not be relevant for this case IMO, but lets see if ftrace folks disagree
(on CC). It has more to do with, any callers of this function pointer are no
longer calling into the old function.

Case 2: Trampoline structures accessing

For this there is a code comment that says preemption will disabled so it should
not be dependent on any of the preemptiblity modes, because preempt_disable()
should disable preempt with PREEMPT_AUTO.

		/*
		 * We need to do a hard force of sched synchronization.
		 * This is because we use preempt_disable() to do RCU, but
		 * the function tracers can be called where RCU is not watching
		 * (like before user_exit()). We can not rely on the RCU
		 * infrastructure to do the synchronization, thus we must do it
		 * ourselves.
		 */
		synchronize_rcu_tasks_rude();
		[...]
		ftrace_trampoline_free(ops);

Code comment probably needs update because it says 'can not rely on RCU..' ;-)

My *guess* is the preempt_disable() mentioned in this case is
ftrace_ops_trampoline() where trampoline-related datas tructures are accessed
for stack unwinding purposes. This is a data structure protection thing AFAICS
and nothing to do with "trampoline execution" itself which needs "Tasks RCU" to
allow for preemption in trampolines.

Case 3: This has to do with update of function graph tracing and there is the
same comment as case 2, where preempt will be disabled in readers, so it should
be safe for PREEMPT_AUTO (famous last words).

Though I am not yet able to locate that preempt_disable() which is not an
PREEMPT_AUTO-related issue anyway. Maybe its buried in function graph tracing
logic somewhere?

Finally, my thought also was, if any of these thread usages/cases of Tasks RCU
RUDE assume working only on a CONFIG_PREEMPT_NONE=y or
CONFIG_PREEMPT_VOLUNTARY=y kernel, that could be worrying but AFAICS, they don't
assume anything related to that.

thanks,

 - Joel
Paul E. McKenney March 5, 2024, 5:53 p.m. UTC | #39
On Mon, Mar 04, 2024 at 04:16:01AM -0500, Joel Fernandes wrote:
> Hi Paul,

Thank you, Joel!

> On 3/2/2024 8:01 PM, Joel Fernandes wrote:
> >> As you noted, one thing that Ankur's series changes is that preemption
> >> can occur anywhere that it is not specifically disabled in kernels
> >> built with CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y.  This in
> >> turn changes Tasks Rude RCU's definition of a quiescent state for these
> >> kernels, adding all code regions where preemption is not specifically
> >> disabled to the list of such quiescent states.
> >>
> >> Although from what I know, this is OK, it would be good to check the
> >> calls to call_rcu_tasks_rude() or synchronize_rcu_tasks_rude() are set
> >> up so as to expect these new quiescent states.  One example where it
> >> would definitely be OK is if there was a call to synchronize_rcu_tasks()
> >> right before or after that call to synchronize_rcu_tasks_rude().
> >>
> >> Would you be willing to check the call sites to verify that they
> >> are OK with this change in 
> > Yes, I will analyze and make sure those users did not unexpectedly
> > assume something about AUTO (i.e. preempt enabled sections using
> > readers).
> 
> Other than RCU test code, there are just 3 call sites for RUDE right now, all in
> ftrace.c.
> 
> (Long story short, PREEMPT_AUTO should not cause wreckage in TASKS_RCU_RUDE
> other than any preexisting wreckage that !PREEMPT_AUTO already had. Steve is on
> CC as well to CMIIW).
> 
> Case 1: For !CONFIG_DYNAMIC_FTRACE update of ftrace_trace_function
> 
> This config is itself expected to be slow. However seeing what it does, it is
> trying to make sure the global function pointer "ftrace_trace_function" is
> updated and any readers of that pointers would have finished reading it. I don't
> personally think preemption has to be disabled across the entirety of the
> section that calls into this function. So sensitivity to preempt disabling
> should not be relevant for this case IMO, but lets see if ftrace folks disagree
> (on CC). It has more to do with, any callers of this function pointer are no
> longer calling into the old function.

Assuming the loads from the function pointer aren't torn by the compiler,
they will be loaded by a single instruction, which as you say cannot
be preempted.  Might be good to have READ_ONCE() if they aren't already
in place.

> Case 2: Trampoline structures accessing
> 
> For this there is a code comment that says preemption will disabled so it should
> not be dependent on any of the preemptiblity modes, because preempt_disable()
> should disable preempt with PREEMPT_AUTO.
> 
> 		/*
> 		 * We need to do a hard force of sched synchronization.
> 		 * This is because we use preempt_disable() to do RCU, but
> 		 * the function tracers can be called where RCU is not watching
> 		 * (like before user_exit()). We can not rely on the RCU
> 		 * infrastructure to do the synchronization, thus we must do it
> 		 * ourselves.
> 		 */
> 		synchronize_rcu_tasks_rude();
> 		[...]
> 		ftrace_trampoline_free(ops);
> 
> Code comment probably needs update because it says 'can not rely on RCU..' ;-)

My guess is that this comment is left over from when that call to
synchronize_rcu_tasks_rude() was open-coded.  ;-)

Maybe "We can not rely on vanilla RCU to do..."?

> My *guess* is the preempt_disable() mentioned in this case is
> ftrace_ops_trampoline() where trampoline-related datas tructures are accessed
> for stack unwinding purposes. This is a data structure protection thing AFAICS
> and nothing to do with "trampoline execution" itself which needs "Tasks RCU" to
> allow for preemption in trampolines.

Sounds plausible to me, but let's see what Steve's thoughts are.

> Case 3: This has to do with update of function graph tracing and there is the
> same comment as case 2, where preempt will be disabled in readers, so it should
> be safe for PREEMPT_AUTO (famous last words).
> 
> Though I am not yet able to locate that preempt_disable() which is not an
> PREEMPT_AUTO-related issue anyway. Maybe its buried in function graph tracing
> logic somewhere?

With the trampolines, isn't synchronize_rcu_tasks_rude() paired with
a call to synchronize_rcu_tasks()?  In that case, rude's only job is
getting all CPUs out of their previous sojourn in either the entry/exit
code or the deep idle loop.  RCU Tasks waits for each task to voluntarily
block, which takes care of all tasks executing elsewhere.  (Recall that
RCU Tasks ignores the idle tasks.)

> Finally, my thought also was, if any of these thread usages/cases of Tasks RCU
> RUDE assume working only on a CONFIG_PREEMPT_NONE=y or
> CONFIG_PREEMPT_VOLUNTARY=y kernel, that could be worrying but AFAICS, they don't
> assume anything related to that.

Good point, most generic code should need to tolerate preemption in
any case.  But I have nine commits queued thus far that handle some
CONFIG_AUTO breakage or another, so a little paranoia won't go too
far amiss.  ;-)

Remaining on my list are uses of the various CONFIG_PREEMPT* Kconfig
options, both in code and in Makefiles.  Though who knows?  Perhaps Ankur
or Thomas have already done that.

							Thanx, Paul
Mark Rutland March 5, 2024, 7:57 p.m. UTC | #40
On Tue, Mar 05, 2024 at 09:53:42AM -0800, Paul E. McKenney wrote:
> On Mon, Mar 04, 2024 at 04:16:01AM -0500, Joel Fernandes wrote:
> > Hi Paul,
> 
> Thank you, Joel!
> 
> > On 3/2/2024 8:01 PM, Joel Fernandes wrote:
> > >> As you noted, one thing that Ankur's series changes is that preemption
> > >> can occur anywhere that it is not specifically disabled in kernels
> > >> built with CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y.  This in
> > >> turn changes Tasks Rude RCU's definition of a quiescent state for these
> > >> kernels, adding all code regions where preemption is not specifically
> > >> disabled to the list of such quiescent states.
> > >>
> > >> Although from what I know, this is OK, it would be good to check the
> > >> calls to call_rcu_tasks_rude() or synchronize_rcu_tasks_rude() are set
> > >> up so as to expect these new quiescent states.  One example where it
> > >> would definitely be OK is if there was a call to synchronize_rcu_tasks()
> > >> right before or after that call to synchronize_rcu_tasks_rude().
> > >>
> > >> Would you be willing to check the call sites to verify that they
> > >> are OK with this change in 
> > > Yes, I will analyze and make sure those users did not unexpectedly
> > > assume something about AUTO (i.e. preempt enabled sections using
> > > readers).
> > 
> > Other than RCU test code, there are just 3 call sites for RUDE right now, all in
> > ftrace.c.
> > 
> > (Long story short, PREEMPT_AUTO should not cause wreckage in TASKS_RCU_RUDE
> > other than any preexisting wreckage that !PREEMPT_AUTO already had. Steve is on
> > CC as well to CMIIW).
> > 
> > Case 1: For !CONFIG_DYNAMIC_FTRACE update of ftrace_trace_function
> > 
> > This config is itself expected to be slow. However seeing what it does, it is
> > trying to make sure the global function pointer "ftrace_trace_function" is
> > updated and any readers of that pointers would have finished reading it. I don't
> > personally think preemption has to be disabled across the entirety of the
> > section that calls into this function. So sensitivity to preempt disabling
> > should not be relevant for this case IMO, but lets see if ftrace folks disagree
> > (on CC). It has more to do with, any callers of this function pointer are no
> > longer calling into the old function.
> 
> Assuming the loads from the function pointer aren't torn by the compiler,
> they will be loaded by a single instruction, which as you say cannot
> be preempted.  Might be good to have READ_ONCE() if they aren't already
> in place.

As a heads-up I'm actively digging through case 1 now and I think the existing
code is actually redundant or broken depending on architecture and
configuration (but largely redundant, hence not seeing any reports of an
issue).

I've dug through v3.14 up to v5.4, and I'll hopefully have a writeup of that
out tomorrow, or in the next couple of hours if I continue after dinner...

I haven't yet looked at cases 2 or 3 yet, and I haven't convinced myself on how
the CONFIG_DYNAMIC_FTRACE=y case works either.

Mark.

> > Case 2: Trampoline structures accessing
> > 
> > For this there is a code comment that says preemption will disabled so it should
> > not be dependent on any of the preemptiblity modes, because preempt_disable()
> > should disable preempt with PREEMPT_AUTO.
> > 
> > 		/*
> > 		 * We need to do a hard force of sched synchronization.
> > 		 * This is because we use preempt_disable() to do RCU, but
> > 		 * the function tracers can be called where RCU is not watching
> > 		 * (like before user_exit()). We can not rely on the RCU
> > 		 * infrastructure to do the synchronization, thus we must do it
> > 		 * ourselves.
> > 		 */
> > 		synchronize_rcu_tasks_rude();
> > 		[...]
> > 		ftrace_trampoline_free(ops);
> > 
> > Code comment probably needs update because it says 'can not rely on RCU..' ;-)
> 
> My guess is that this comment is left over from when that call to
> synchronize_rcu_tasks_rude() was open-coded.  ;-)
> 
> Maybe "We can not rely on vanilla RCU to do..."?
> 
> > My *guess* is the preempt_disable() mentioned in this case is
> > ftrace_ops_trampoline() where trampoline-related datas tructures are accessed
> > for stack unwinding purposes. This is a data structure protection thing AFAICS
> > and nothing to do with "trampoline execution" itself which needs "Tasks RCU" to
> > allow for preemption in trampolines.
> 
> Sounds plausible to me, but let's see what Steve's thoughts are.
> 
> > Case 3: This has to do with update of function graph tracing and there is the
> > same comment as case 2, where preempt will be disabled in readers, so it should
> > be safe for PREEMPT_AUTO (famous last words).
> > 
> > Though I am not yet able to locate that preempt_disable() which is not an
> > PREEMPT_AUTO-related issue anyway. Maybe its buried in function graph tracing
> > logic somewhere?
> 
> With the trampolines, isn't synchronize_rcu_tasks_rude() paired with
> a call to synchronize_rcu_tasks()?  In that case, rude's only job is
> getting all CPUs out of their previous sojourn in either the entry/exit
> code or the deep idle loop.  RCU Tasks waits for each task to voluntarily
> block, which takes care of all tasks executing elsewhere.  (Recall that
> RCU Tasks ignores the idle tasks.)
> 
> > Finally, my thought also was, if any of these thread usages/cases of Tasks RCU
> > RUDE assume working only on a CONFIG_PREEMPT_NONE=y or
> > CONFIG_PREEMPT_VOLUNTARY=y kernel, that could be worrying but AFAICS, they don't
> > assume anything related to that.
> 
> Good point, most generic code should need to tolerate preemption in
> any case.  But I have nine commits queued thus far that handle some
> CONFIG_AUTO breakage or another, so a little paranoia won't go too
> far amiss.  ;-)
> 
> Remaining on my list are uses of the various CONFIG_PREEMPT* Kconfig
> options, both in code and in Makefiles.  Though who knows?  Perhaps Ankur
> or Thomas have already done that.
> 
> 							Thanx, Paul
>
Paul E. McKenney March 5, 2024, 9:52 p.m. UTC | #41
On Tue, Mar 05, 2024 at 07:57:40PM +0000, Mark Rutland wrote:
> On Tue, Mar 05, 2024 at 09:53:42AM -0800, Paul E. McKenney wrote:
> > On Mon, Mar 04, 2024 at 04:16:01AM -0500, Joel Fernandes wrote:
> > > Hi Paul,
> > 
> > Thank you, Joel!
> > 
> > > On 3/2/2024 8:01 PM, Joel Fernandes wrote:
> > > >> As you noted, one thing that Ankur's series changes is that preemption
> > > >> can occur anywhere that it is not specifically disabled in kernels
> > > >> built with CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y.  This in
> > > >> turn changes Tasks Rude RCU's definition of a quiescent state for these
> > > >> kernels, adding all code regions where preemption is not specifically
> > > >> disabled to the list of such quiescent states.
> > > >>
> > > >> Although from what I know, this is OK, it would be good to check the
> > > >> calls to call_rcu_tasks_rude() or synchronize_rcu_tasks_rude() are set
> > > >> up so as to expect these new quiescent states.  One example where it
> > > >> would definitely be OK is if there was a call to synchronize_rcu_tasks()
> > > >> right before or after that call to synchronize_rcu_tasks_rude().
> > > >>
> > > >> Would you be willing to check the call sites to verify that they
> > > >> are OK with this change in 
> > > > Yes, I will analyze and make sure those users did not unexpectedly
> > > > assume something about AUTO (i.e. preempt enabled sections using
> > > > readers).
> > > 
> > > Other than RCU test code, there are just 3 call sites for RUDE right now, all in
> > > ftrace.c.
> > > 
> > > (Long story short, PREEMPT_AUTO should not cause wreckage in TASKS_RCU_RUDE
> > > other than any preexisting wreckage that !PREEMPT_AUTO already had. Steve is on
> > > CC as well to CMIIW).
> > > 
> > > Case 1: For !CONFIG_DYNAMIC_FTRACE update of ftrace_trace_function
> > > 
> > > This config is itself expected to be slow. However seeing what it does, it is
> > > trying to make sure the global function pointer "ftrace_trace_function" is
> > > updated and any readers of that pointers would have finished reading it. I don't
> > > personally think preemption has to be disabled across the entirety of the
> > > section that calls into this function. So sensitivity to preempt disabling
> > > should not be relevant for this case IMO, but lets see if ftrace folks disagree
> > > (on CC). It has more to do with, any callers of this function pointer are no
> > > longer calling into the old function.
> > 
> > Assuming the loads from the function pointer aren't torn by the compiler,
> > they will be loaded by a single instruction, which as you say cannot
> > be preempted.  Might be good to have READ_ONCE() if they aren't already
> > in place.
> 
> As a heads-up I'm actively digging through case 1 now and I think the existing
> code is actually redundant or broken depending on architecture and
> configuration (but largely redundant, hence not seeing any reports of an
> issue).
> 
> I've dug through v3.14 up to v5.4, and I'll hopefully have a writeup of that
> out tomorrow, or in the next couple of hours if I continue after dinner...
> 
> I haven't yet looked at cases 2 or 3 yet, and I haven't convinced myself on how
> the CONFIG_DYNAMIC_FTRACE=y case works either.

Ouch!  Looking forward to seeing what you come up with.

							Thanx, Paul

> Mark.
> 
> > > Case 2: Trampoline structures accessing
> > > 
> > > For this there is a code comment that says preemption will disabled so it should
> > > not be dependent on any of the preemptiblity modes, because preempt_disable()
> > > should disable preempt with PREEMPT_AUTO.
> > > 
> > > 		/*
> > > 		 * We need to do a hard force of sched synchronization.
> > > 		 * This is because we use preempt_disable() to do RCU, but
> > > 		 * the function tracers can be called where RCU is not watching
> > > 		 * (like before user_exit()). We can not rely on the RCU
> > > 		 * infrastructure to do the synchronization, thus we must do it
> > > 		 * ourselves.
> > > 		 */
> > > 		synchronize_rcu_tasks_rude();
> > > 		[...]
> > > 		ftrace_trampoline_free(ops);
> > > 
> > > Code comment probably needs update because it says 'can not rely on RCU..' ;-)
> > 
> > My guess is that this comment is left over from when that call to
> > synchronize_rcu_tasks_rude() was open-coded.  ;-)
> > 
> > Maybe "We can not rely on vanilla RCU to do..."?
> > 
> > > My *guess* is the preempt_disable() mentioned in this case is
> > > ftrace_ops_trampoline() where trampoline-related datas tructures are accessed
> > > for stack unwinding purposes. This is a data structure protection thing AFAICS
> > > and nothing to do with "trampoline execution" itself which needs "Tasks RCU" to
> > > allow for preemption in trampolines.
> > 
> > Sounds plausible to me, but let's see what Steve's thoughts are.
> > 
> > > Case 3: This has to do with update of function graph tracing and there is the
> > > same comment as case 2, where preempt will be disabled in readers, so it should
> > > be safe for PREEMPT_AUTO (famous last words).
> > > 
> > > Though I am not yet able to locate that preempt_disable() which is not an
> > > PREEMPT_AUTO-related issue anyway. Maybe its buried in function graph tracing
> > > logic somewhere?
> > 
> > With the trampolines, isn't synchronize_rcu_tasks_rude() paired with
> > a call to synchronize_rcu_tasks()?  In that case, rude's only job is
> > getting all CPUs out of their previous sojourn in either the entry/exit
> > code or the deep idle loop.  RCU Tasks waits for each task to voluntarily
> > block, which takes care of all tasks executing elsewhere.  (Recall that
> > RCU Tasks ignores the idle tasks.)
> > 
> > > Finally, my thought also was, if any of these thread usages/cases of Tasks RCU
> > > RUDE assume working only on a CONFIG_PREEMPT_NONE=y or
> > > CONFIG_PREEMPT_VOLUNTARY=y kernel, that could be worrying but AFAICS, they don't
> > > assume anything related to that.
> > 
> > Good point, most generic code should need to tolerate preemption in
> > any case.  But I have nine commits queued thus far that handle some
> > CONFIG_AUTO breakage or another, so a little paranoia won't go too
> > far amiss.  ;-)
> > 
> > Remaining on my list are uses of the various CONFIG_PREEMPT* Kconfig
> > options, both in code and in Makefiles.  Though who knows?  Perhaps Ankur
> > or Thomas have already done that.
> > 
> > 							Thanx, Paul
> >
Steven Rostedt March 6, 2024, 4:56 p.m. UTC | #42
On Tue, 5 Mar 2024 09:53:42 -0800
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> > Case 1: For !CONFIG_DYNAMIC_FTRACE update of ftrace_trace_function
> > 
> > This config is itself expected to be slow. However seeing what it does, it is
> > trying to make sure the global function pointer "ftrace_trace_function" is
> > updated and any readers of that pointers would have finished reading it. I don't
> > personally think preemption has to be disabled across the entirety of the
> > section that calls into this function. So sensitivity to preempt disabling
> > should not be relevant for this case IMO, but lets see if ftrace folks disagree
> > (on CC). It has more to do with, any callers of this function pointer are no
> > longer calling into the old function.  
> 
> Assuming the loads from the function pointer aren't torn by the compiler,
> they will be loaded by a single instruction, which as you say cannot
> be preempted.  Might be good to have READ_ONCE() if they aren't already
> in place.

See my previous reply about READ_ONCE() logic, but going back to the reason
for rcu_synchronize_rude() we have this:

from kernel/trace/ftrace.c: update_ftrace_function()

> 	ftrace_trace_function = ftrace_ops_list_func;

The reason for switching to ftrace_ops_list_func is because the registered
ftrace_ops is what is to be passed to the function callback. For
non-dynamic ftrace, if we switch ftrace_trace_function to the function,
it's going to get the ftrace_ops of the old function, and that could be
disastrous if the callback is expecting to get its own ftrace_ops.

The ftrace_ops_list_func is an iterator of all the registered callbacks,
and will pass each callback its own ftrace_ops. So by switching to the
list_func, we guarantee that the callback will get its own ftrace_ops.

> 	/*
> 	 * Make sure all CPUs see this. Yes this is slow, but static
> 	 * tracing is slow and nasty to have enabled.
> 	 */
> 	synchronize_rcu_tasks_rude();

Now perhaps we don't need rude here. We just need to make sure that there's
no function that's about to call the old ftrace_trace_function.

> 	/* Now all cpus are using the list ops. */
> 	function_trace_op = set_function_trace_op;

We update the ftrace_ops that gets passed to whatever is registered to
ftrace_trace_function (but ftrace_ops_list_func ignores this parameter, as
it's just a helper function to call other callbacks).

> 	/* Make sure the function_trace_op is visible on all CPUs */
> 	smp_wmb();
> 	/* Nasty way to force a rmb on all cpus */
> 	smp_call_function(ftrace_sync_ipi, NULL, 1);

Needs to make sure everything will see the new ftarce_trace_op.

> 	/* OK, we are all set to update the ftrace_trace_function now! */
> #endif /* !CONFIG_DYNAMIC_FTRACE */
> 
> 	ftrace_trace_function = func;

Finally, we update what will be called by the trampoline, as func expects
to get the set_function_trace_op here. And it should.

> 
> > Case 2: Trampoline structures accessing
> > 
> > For this there is a code comment that says preemption will disabled so
> > it should not be dependent on any of the preemptiblity modes, because
> > preempt_disable() should disable preempt with PREEMPT_AUTO.

If ftrace_ops itself was allocated, meaning that when ftrace_shutdown()
returns, it had better not have anything calling the callback that will
reference the ftrace_ops, because after ftrace_shutdown() returns, it is OK
to free ftrace_ops.

Thus, the trampoline does not call the ftrace_ops callback directly, but
instead once again calls ftrace_ops_list_func! That's because
ftrace_ops_list_func has:

> 	bit = trace_test_and_set_recursion(ip, parent_ip, TRACE_LIST_START);

Where the above disables preemption.

> 	if (bit < 0)
> 		return;
> 
> 	do_for_each_ftrace_op(op, ftrace_ops_list) {
> 		/* Stub functions don't need to be called nor tested */
> 		if (op->flags & FTRACE_OPS_FL_STUB)
> 			continue;
> 		/*
> 		 * Check the following for each ops before calling their func:
> 		 *  if RCU flag is set, then rcu_is_watching() must be true
> 		 *  Otherwise test if the ip matches the ops filter
> 		 *
> 		 * If any of the above fails then the op->func() is not executed.
> 		 */
> 		if ((!(op->flags & FTRACE_OPS_FL_RCU) || rcu_is_watching()) &&
> 		    ftrace_ops_test(op, ip, regs)) {

If the callback expects RCU to be watching, then it is skipped if RCU is
not watching. If we have fixed all the places that function tracing can
happen to make sure RCU is always watching, we could possibly get rid of
this code.

This may be the case, because trace_test_and_set_recursion() has:

#ifdef CONFIG_ARCH_WANTS_NO_INSTR
# define trace_warn_on_no_rcu(ip)					\
	({								\
		bool __ret = !rcu_is_watching();			\
		if (__ret && !trace_recursion_test(TRACE_RECORD_RECURSION_BIT)) { \
			trace_recursion_set(TRACE_RECORD_RECURSION_BIT); \
			WARN_ONCE(true, "RCU not on for: %pS\n", (void *)ip); \
			trace_recursion_clear(TRACE_RECORD_RECURSION_BIT); \
		}							\
		__ret;							\
	})
#else
# define trace_warn_on_no_rcu(ip)	false
#endif

> 			if (FTRACE_WARN_ON(!op->func)) {
> 				pr_warn("op=%p %pS\n", op, op);
> 				goto out;
> 			}
> 			op->func(ip, parent_ip, op, fregs);
> 		}
> 	} while_for_each_ftrace_op(op);
> out:
> 	trace_clear_recursion(bit);

Thus, if RCU is always watching for all function calls, then we can remove
the rude here. Perhaps we could remove rude everywhere?


> > 
> > 		/*
> > 		 * We need to do a hard force of sched synchronization.
> > 		 * This is because we use preempt_disable() to do RCU,
> > but
> > 		 * the function tracers can be called where RCU is not
> > watching
> > 		 * (like before user_exit()). We can not rely on the RCU
> > 		 * infrastructure to do the synchronization, thus we
> > must do it
> > 		 * ourselves.
> > 		 */
> > 		synchronize_rcu_tasks_rude();
> > 		[...]
> > 		ftrace_trampoline_free(ops);
> > 
> > Code comment probably needs update because it says 'can not rely on
> > RCU..' ;-)  

Can not rely on normal RCU ;-)

> 
> My guess is that this comment is left over from when that call to
> synchronize_rcu_tasks_rude() was open-coded.  ;-)

Correct.

> 
> Maybe "We can not rely on vanilla RCU to do..."?

If RCU is always watching, then we can. But is that the case for all archs?

> 
> > My *guess* is the preempt_disable() mentioned in this case is
> > ftrace_ops_trampoline() where trampoline-related datas tructures are
> > accessed for stack unwinding purposes. This is a data structure
> > protection thing AFAICS and nothing to do with "trampoline execution"
> > itself which needs "Tasks RCU" to allow for preemption in trampolines.  
> 
> Sounds plausible to me, but let's see what Steve's thoughts are.

As mentioned above, the preempt_disable() is in the ftrace_ops_list_func()
that these callbacks are forced to go through.

> 
> > Case 3: This has to do with update of function graph tracing and there
> > is the same comment as case 2, where preempt will be disabled in
> > readers, so it should be safe for PREEMPT_AUTO (famous last words).

The preempt_disable here is in kernel/trace/trace.h:

  ftrace_graph_addr()

-- Steve

> > 
> > Though I am not yet able to locate that preempt_disable() which is not
> > an PREEMPT_AUTO-related issue anyway. Maybe its buried in function
> > graph tracing logic somewhere?  
> 
> With the trampolines, isn't synchronize_rcu_tasks_rude() paired with
> a call to synchronize_rcu_tasks()?  In that case, rude's only job is
> getting all CPUs out of their previous sojourn in either the entry/exit
> code or the deep idle loop.  RCU Tasks waits for each task to voluntarily
> block, which takes care of all tasks executing elsewhere.  (Recall that
> RCU Tasks ignores the idle tasks.)
> 
> > Finally, my thought also was, if any of these thread usages/cases of
> > Tasks RCU RUDE assume working only on a CONFIG_PREEMPT_NONE=y or
> > CONFIG_PREEMPT_VOLUNTARY=y kernel, that could be worrying but AFAICS,
> > they don't assume anything related to that.  
> 
> Good point, most generic code should need to tolerate preemption in
> any case.  But I have nine commits queued thus far that handle some
> CONFIG_AUTO breakage or another, so a little paranoia won't go too
> far amiss.  ;-)
> 
> Remaining on my list are uses of the various CONFIG_PREEMPT* Kconfig
> options, both in code and in Makefiles.  Though who knows?  Perhaps Ankur
> or Thomas have already done that.
Mark Rutland March 7, 2024, 4:57 p.m. UTC | #43
Hi Joel, Paul, Steve,

On Mon, Mar 04, 2024 at 04:16:01AM -0500, Joel Fernandes wrote:
> Hi Paul,
> 
> On 3/2/2024 8:01 PM, Joel Fernandes wrote:
> >> As you noted, one thing that Ankur's series changes is that preemption
> >> can occur anywhere that it is not specifically disabled in kernels
> >> built with CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y.  This in
> >> turn changes Tasks Rude RCU's definition of a quiescent state for these
> >> kernels, adding all code regions where preemption is not specifically
> >> disabled to the list of such quiescent states.
> >>
> >> Although from what I know, this is OK, it would be good to check the
> >> calls to call_rcu_tasks_rude() or synchronize_rcu_tasks_rude() are set
> >> up so as to expect these new quiescent states.  One example where it
> >> would definitely be OK is if there was a call to synchronize_rcu_tasks()
> >> right before or after that call to synchronize_rcu_tasks_rude().
> >>
> >> Would you be willing to check the call sites to verify that they
> >> are OK with this change in 
> > Yes, I will analyze and make sure those users did not unexpectedly
> > assume something about AUTO (i.e. preempt enabled sections using
> > readers).
> 
> Other than RCU test code, there are just 3 call sites for RUDE right now, all in
> ftrace.c.
> 
> (Long story short, PREEMPT_AUTO should not cause wreckage in TASKS_RCU_RUDE
> other than any preexisting wreckage that !PREEMPT_AUTO already had. Steve is on
> CC as well to CMIIW).
> 
> Case 1: For !CONFIG_DYNAMIC_FTRACE update of ftrace_trace_function
> 
> This config is itself expected to be slow. However seeing what it does, it is
> trying to make sure the global function pointer "ftrace_trace_function" is
> updated and any readers of that pointers would have finished reading it. I don't
> personally think preemption has to be disabled across the entirety of the
> section that calls into this function. So sensitivity to preempt disabling
> should not be relevant for this case IMO, but lets see if ftrace folks disagree
> (on CC). It has more to do with, any callers of this function pointer are no
> longer calling into the old function.

I've been looking into this case for the last couple of days, and the short
story is that the existing code is broken today for PREEMPT_FULL, and the code
for CONFIG_DYNAMIC_FTRACE=y is similarly broken. A number of architectures have
also implemented the entry assembly incorrectly...

The !CONFIG_DYNAMIC_FTRACE code in update_ftrace_function() is intended to
ensure that 'ftrace_trace_function' and 'function_trace_op' are used as a
consistent pair; i.e. for a given 'op', op->func() should be called with that
specific 'op' as an argument. That was introduced back in commit:

  405e1d834807e51b ("ftrace: Synchronize setting function_trace_op with ftrace_trace_function")

... and subsequently reworked to use synchronize_rcu_tasks_rude() in commit:

  e5a971d76d701dbf ("ftrace: Use synchronize_rcu_tasks_rude() instead of ftrace_sync())"

The key detail is that the ftrace entry code doesn't disable preemption, and so
when full preemption is used the entry code can be preempted between loading
'ftrace_trace_function' and 'function_trace_op', and there's a race of the
form:

	Thread A				Thread B

	// in ftrace entry asm			// in update_ftrace_function()

	tmp_func = ftrace_trace_function;
	< preempted >

						ftrace_trace_function = ftrace_ops_list_func;
						/*   
						 * Make sure all CPUs see this. Yes this is slow, but static
						 * tracing is slow and nasty to have enabled.
						 */
						synchronize_rcu_tasks_rude();
						/* Now all cpus are using the list ops. */
						function_trace_op = set_function_trace_op;

	< restored > 
	tmp_ops = function_trace_op;

	tmp_func(..., tmp_ops, ...);

... and so we can end up using the old func with the new ops.



For static ftrace, this is only a problem when full preemption is used *and*
the architecture defines ARCH_SUPPORTS_FTRACE_OPS=1. Otherwise architecture
code isn't expected to provide the ops, and core code forces the use of the
list func, which iterates over all the ops to call them, e.g.

	do_for_each_ftrace_op(op, ftrace_ops_list) {
		...
		op->func(ip, parent_ip, op, fregs);
	} while_for_each_ftrace_op(op);

For static ftrace, a simple fix would be to force the use of the list func, and
not support static ftrace && ARCH_SUPPORTS_FTRACE_OPS=1. That's what x86 does today.

Another option is to follow the example of DYNAMIC_FTRACE_WITH_CALL_OPS,
and rewrite all the static ftrace asm to load the 'ops' pointer, and then load
'ops->func' from that, which would ensure the two are always in sync.

As an aside, a few architectures which unconditionally define
ARCH_SUPPORTS_FTRACE_OPS=1 fail to pass the 'fregs' argument; that'll need
cleaning up too...



For dynamic ftrace, there's a similar problem between the ftrace entry assembly and
ftrace_modify_all_code(), which looks like:

	Thread A				Thread B

	// in ftrace entry asm			// in ftrace_modify_all_code()

	tmp_ops = function_trace_op;
	< preempted (or stalled) >

						if (update) {
							update_ftrace_func(ftrace_ops_list_func);
							...
						}

						...

						if (update && ftrace_trace_function != ftrace_ops_list_func) {
							function_trace_op = set_function_trace_op;
							smp_wmb();
							smp_call_function(ftrace_sync_ipi, NULL, 1);
							update_ftrace_func(ftrace_trace_function);
						}

	< restored (or stopped stalling) >
	patched_func_call(tmp_ops);
				
... and so we can use the ops ops with the new func.

It's worth noting that (currently) trampolines don't save us here, as we can have a sequence:

	register_ftrace_function(ops_a)

	// callsites call trampoline for ops_a
	// ftrace_caller points at ops_a and calls ops_a->func

	register_ftrace_function(ops_b);

	// callsites call common ftrace_caller
	// ftrace_caller points to ops_a and calls arch_ftrace_ops_list_func

	unregister_ftrace_function(ops_a)

	// callsites call trampoline for ops_b
	// ftrace_caller points at ops_a and calls ops_b->func
	// threads already in ftrace_caller might have stale pointer to ops_a

... and I have a test (included at the end of this mail) which demonstrates
this for x86_64 defconfig + DYNAMIC_FTRACE + preempt=full:

| Kernel panic - not syncing: trace_func_c(tracee+0x4/0x10, fstress_tracee_thread+0x22/0x40, ops_b+0x0/0xc0, 0xffff9000c01dfe38) expected ops ops_c+0x0/0xc0
| CPU: 0 PID: 57 Comm: fstress_18 Not tainted 6.8.0-rc7-00001-ga4fa96d75cf9 #1
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
| Call Trace:
|  <TASK>
|  dump_stack_lvl+0x36/0x50
|  panic+0x317/0x340
|  ? tracee+0x4/0x10
|  ? tracee+0x4/0x10
|  ? fstress_tracee_thread+0x22/0x40
|  ? __pfx_fstress_tracee_thread+0x10/0x10
|  trace_func_c+0x3b/0x40
|  ftrace_call+0x5/0x44
|  ? __pfx_fstress_tracee_thread+0x10/0x10
|  ? tracee+0x9/0x10
|  tracee+0x9/0x10
|  fstress_tracee_thread+0x22/0x40
|  kthread+0xd7/0x100
|  ? __pfx_kthread+0x10/0x10
|  ret_from_fork+0x34/0x50
|  ? __pfx_kthread+0x10/0x10
|  ret_from_fork_asm+0x1b/0x30
|  </TASK>
| Kernel Offset: 0x2ba00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
| ---[ end Kernel panic - not syncing: trace_func_c(tracee+0x4/0x10, fstress_tracee_thread+0x22/0x40, ops_b+0x0/0xc0, 0xffff9000c01dfe38) expected ops ops_c+0x0/0xc0 ]---

On arm64, when using CALL_OPS we avoid the problem by construction.

On arm64 without CALL_OPS, the race is possible but *very* difficult to hit as
the load of 'function_trace_op' and the patched branch are back-to-back. With a
delay hacked in that explodes similarly:

| Kernel panic - not syncing: trace_func_b(tracee+0x4/0xc, fstress_tracee_thread+0x28/0x4c, ops_d+0x0/0xb8, 0xffff800083443dc0) expected ops ops_b+0x0/0xb8
| CPU: 0 PID: 60 Comm: fstress_16 Not tainted 6.8.0-rc7-00001-gc8e32c288657-dirty #5
| Hardware name: linux,dummy-virt (DT)
| Call trace:
|  dump_backtrace+0x98/0xf0
|  show_stack+0x20/0x2c
|  dump_stack_lvl+0x48/0x60
|  dump_stack+0x18/0x24
|  panic+0x39c/0x3d0
|  trace_func_c+0x0/0x54
|  ftrace_call+0x4/0x2c
|  tracee+0x8/0xc
|  fstress_tracee_thread+0x28/0x4c
|  kthread+0x118/0x11c
|  ret_from_fork+0x10/0x20
| Kernel Offset: disabled
| CPU features: 0x0,01000000,14020148,21004203
| Memory Limit: none

I believe our options are:

* Avoid the mismatch by construction:

  - On architectures with trampolines, we could ensure that the list_ops gets
    its own trampoline and that we *always* use a trampoline, never using a
    common ftrace_caller. When we switch callers from one trampoline to another
    they'd atomically get the new func+ops.

    I reckon that might be a good option for x86-64?

  - On architectures without trampolines we could 
    require that that the ftrace_caller 
    loads ops->func from the ops pointer.
    
    That'd mean removing the 'ftrace_trace_function' pointer and removing
    patching of the call to the trace function (but the actual tracee callsites
    would still be patched).

    I'd be in favour of this for arm64 since that matches the way CALL_OPS
    works; the only difference is we'd load a global ops pointer rather than a
    per-callsite ops pointer.

* Use rcu_tasks_trace to synchronize updates?

* Something else?

I've included my test case below. Beware that you might need to play with
NR_TRACEE_THREADS vs the number of CPUs you have, and since it uses
ftrace_set_filter_ip() it won't currently work for !DYNAMIC_FTRACE, but that
can easily be hacked around.

Mark.

---->8----
From a4fa96d75cf9eda7c4b1b936dd331e14153385c8 Mon Sep 17 00:00:00 2001
From: Mark Rutland <mark.rutland@arm.com>
Date: Wed, 6 Mar 2024 13:23:47 +0000
Subject: [PATCH] HACK: ftrace stress test

Try to check that ops->func() is called with the correct ops.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
---
 lib/Makefile  |   3 +
 lib/fstress.c | 158 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 161 insertions(+)
 create mode 100644 lib/fstress.c

diff --git a/lib/Makefile b/lib/Makefile
index 6b09731d8e619..678a916d4f0a0 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -36,6 +36,9 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 nmi_backtrace.o win_minmax.o memcat_p.o \
 	 buildid.o objpool.o
 
+obj-y += fstress.o
+CFLAGS_fstress.o += $(CC_FLAGS_FTRACE)
+
 lib-$(CONFIG_PRINTK) += dump_stack.o
 lib-$(CONFIG_SMP) += cpumask.o
 
diff --git a/lib/fstress.c b/lib/fstress.c
new file mode 100644
index 0000000000000..1ec2303f67f6f
--- /dev/null
+++ b/lib/fstress.c
@@ -0,0 +1,158 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#define pr_fmt(fmt)	"fstress: " fmt
+
+#include <linux/bug.h>
+#include <linux/errno.h>
+#include <linux/ftrace.h>
+#include <linux/init.h>
+#include <linux/kthread.h>
+#include <linux/module.h>
+
+#define DEFINE_OPS(x)							\
+	static void trace_func_##x(unsigned long ip,			\
+				   unsigned long parent_ip,		\
+				   struct ftrace_ops *op,		\
+				   struct ftrace_regs *fregs);		\
+									\
+	struct ftrace_ops ops_##x =  {					\
+		.func = trace_func_##x,					\
+	};								\
+									\
+	static void trace_func_##x(unsigned long ip,			\
+				   unsigned long parent_ip,		\
+				   struct ftrace_ops *op,		\
+				   struct ftrace_regs *fregs)		\
+	{								\
+		struct ftrace_ops *expected = &ops_##x;			\
+		if (op == expected)					\
+			return;						\
+		panic("%s(%pS, %pS, %pS, %pS) expected ops %pS\n",	\
+		      __func__, (void *)ip, (void *)parent_ip,		\
+		      op, fregs, expected);				\
+	}
+
+DEFINE_OPS(a);
+DEFINE_OPS(b);
+DEFINE_OPS(c);
+DEFINE_OPS(d);
+
+static struct ftrace_ops *fstress_ops[] = {
+	&ops_a,
+	&ops_b,
+	&ops_c,
+	&ops_d,
+};
+
+static noinline void tracee(void)
+{
+	/*
+	 * Do nothing, but prevent compiler from eliding calls to this
+	 * function.
+	 */
+	barrier();
+}
+
+static noinline int fstress_tracee_thread(void *unused)
+{
+	while (!kthread_should_stop()) {
+		for (int i = 0; i < 100; i++)
+			tracee();
+		cond_resched();
+	}
+
+	return 0;
+}
+
+static int fstress_manager_thread(void *unused)
+{
+	while (!kthread_should_stop()) {
+		int nr_ops = ARRAY_SIZE(fstress_ops);
+
+		for (int i = 0; i <= nr_ops; i++) {
+			if (i < nr_ops)
+				WARN_ON_ONCE(register_ftrace_function(fstress_ops[i]));
+			if (i > 0)
+				WARN_ON_ONCE(unregister_ftrace_function(fstress_ops[i - 1]));
+
+			/*
+			 * TODO: introduce some delay here, so that we spent
+			 * more time with a single tracer enabled (rather than
+			 * the list func).
+			 */
+		}
+
+		cond_resched();
+	}
+
+	return 0;
+}
+
+#define NR_TRACEE_THREADS	25
+
+struct task_struct *tracee_threads[NR_TRACEE_THREADS];
+struct task_struct *manager_thread;
+
+static __init int fstress_init(void)
+{
+	int ret;
+
+	pr_info("Initializing ops\n");
+	for (int i = 0; i < ARRAY_SIZE(fstress_ops); i++) {
+		ret = ftrace_set_filter_ip(fstress_ops[i], (unsigned long)tracee, 0, 0);
+		ret = 0;
+		if (ret) {
+			pr_err("Cannot set filter: %d\n", ret);
+			goto out_free_filters;
+		}
+	}
+
+	pr_info("Creating tracee threads\n");
+	for (int i = 0; i < NR_TRACEE_THREADS; i++) {
+		tracee_threads[i] = kthread_create(fstress_tracee_thread, NULL, "fstress_%d", i);
+		if (!tracee_threads[i]) {
+			pr_err("Cannot create tracee thread %d\n", i);
+			goto out_free_tracees;
+		}
+	}
+
+	pr_info("Creating manager thread\n");
+	manager_thread = kthread_create(fstress_manager_thread, NULL, "fstress_manager");
+	if (!manager_thread) {
+		pr_err("Cannot create manager thread\n");
+		goto out_free_tracees;
+	}
+
+	pr_info("Starting threads\n");
+	for (int i = 0; i < NR_TRACEE_THREADS; i++)
+		wake_up_process(tracee_threads[i]);
+	wake_up_process(manager_thread);
+
+	return 0;
+
+out_free_tracees:
+	for (int i = 0; i < NR_TRACEE_THREADS; i++) {
+		if (tracee_threads[i])
+			kthread_stop(tracee_threads[i]);
+	}
+out_free_filters:
+	for (int i = 0; i < ARRAY_SIZE(fstress_ops); i++)
+		ftrace_free_filter(fstress_ops[i]);
+
+	return -EAGAIN;
+}
+
+static __exit void fstress_exit(void)
+{
+	kthread_stop(manager_thread);
+
+	for (int i = 0; i < NR_TRACEE_THREADS; i++)
+		kthread_stop(tracee_threads[i]);
+
+	for (int i = 0; i < ARRAY_SIZE(fstress_ops); i++)
+		ftrace_free_filter(fstress_ops[i]);
+}
+
+module_init(fstress_init);
+module_exit(fstress_exit);
+MODULE_LICENSE("GPL");
Mark Rutland March 7, 2024, 6:34 p.m. UTC | #44
On Thu, Mar 07, 2024 at 04:57:33PM +0000, Mark Rutland wrote:
> On Mon, Mar 04, 2024 at 04:16:01AM -0500, Joel Fernandes wrote:
> > On 3/2/2024 8:01 PM, Joel Fernandes wrote:
> > Case 1: For !CONFIG_DYNAMIC_FTRACE update of ftrace_trace_function
> > 
> > This config is itself expected to be slow. However seeing what it does, it is
> > trying to make sure the global function pointer "ftrace_trace_function" is
> > updated and any readers of that pointers would have finished reading it. I don't
> > personally think preemption has to be disabled across the entirety of the
> > section that calls into this function. So sensitivity to preempt disabling
> > should not be relevant for this case IMO, but lets see if ftrace folks disagree
> > (on CC). It has more to do with, any callers of this function pointer are no
> > longer calling into the old function.
> 
> I've been looking into this case for the last couple of days, and the short
> story is that the existing code is broken today for PREEMPT_FULL, and the code
> for CONFIG_DYNAMIC_FTRACE=y is similarly broken. A number of architectures have
> also implemented the entry assembly incorrectly...

> I believe our options are:
> 
> * Avoid the mismatch by construction:
> 
>   - On architectures with trampolines, we could ensure that the list_ops gets
>     its own trampoline and that we *always* use a trampoline, never using a
>     common ftrace_caller. When we switch callers from one trampoline to another
>     they'd atomically get the new func+ops.
> 
>     I reckon that might be a good option for x86-64?
> 
>   - On architectures without trampolines we could 
>     require that that the ftrace_caller 
>     loads ops->func from the ops pointer.
>     
>     That'd mean removing the 'ftrace_trace_function' pointer and removing
>     patching of the call to the trace function (but the actual tracee callsites
>     would still be patched).
> 
>     I'd be in favour of this for arm64 since that matches the way CALL_OPS
>     works; the only difference is we'd load a global ops pointer rather than a
>     per-callsite ops pointer.
> 
> * Use rcu_tasks_trace to synchronize updates?

Having acquainted myself with the RCU flavours, I think the RCU Tasks Trace
suggestion wouldn't help, but *some* flavour of RCU might give us what we need.

That said, my preference is the "avoid the mismatch by construction" camp, as
even if we need to wait for uses of the old func+ops to finish, we'd have fewer
transitions (and consequently less patching) if we have:

	switch_to_new_ops();
	wait_for_old_ops_usage_to_finish();

... rather than:

	switch_to_list_func();
	wait_for_old_ops_usage_to_finish();
	switch_to_new_ops();
	ensure_new_ops_are_visible();
	switch_to_new_func();

Mark.
Steven Rostedt March 7, 2024, 6:52 p.m. UTC | #45
On Thu, 7 Mar 2024 16:57:33 +0000
Mark Rutland <mark.rutland@arm.com> wrote:

> * Use rcu_tasks_trace to synchronize updates?

Yes. I think I wanted both. The above to make sure it covers all cases
where something could be preempted, and a case for those covered when RCU
isn't watching (which always has preemption disabled).

My mistake was I thought synchronize_rcu_tasks_rude() did both. But I just
found out recently that it is not a superset of synchronize_rcu_tasks().

But it really needs it in every location that synchronize_rcu_tasks_rude()
is called.

-- Steve
Paul E. McKenney March 7, 2024, 6:58 p.m. UTC | #46
On Thu, Mar 07, 2024 at 01:52:15PM -0500, Steven Rostedt wrote:
> On Thu, 7 Mar 2024 16:57:33 +0000
> Mark Rutland <mark.rutland@arm.com> wrote:
> 
> > * Use rcu_tasks_trace to synchronize updates?
> 
> Yes. I think I wanted both. The above to make sure it covers all cases
> where something could be preempted, and a case for those covered when RCU
> isn't watching (which always has preemption disabled).
> 
> My mistake was I thought synchronize_rcu_tasks_rude() did both. But I just
> found out recently that it is not a superset of synchronize_rcu_tasks().
> 
> But it really needs it in every location that synchronize_rcu_tasks_rude()
> is called.

Should any RCU Tasks Rude grace-period request also wait for an RCU Tasks
grace period?

I would feel better about proposing this, especially for
call_rcu_tasks_rude(), if RCU Tasks Rude was not supposed to be going
away after all the noinstr tags are in place.

							Thanx, Paul
diff mbox series

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index 275fd5259a4a..6e41263ff5d3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6773,6 +6773,10 @@  static int napi_threaded_poll(void *data)
 				net_rps_action_and_irq_enable(sd);
 			}
 			skb_defer_free_flush(sd);
+
+			if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+				rcu_softirq_qs();
+
 			local_bh_enable();
 
 			if (!repoll)