
[net-next,14/15,v2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.

Message ID 20240510162121.f-tvqcyf@linutronix.de (mailing list archive)
State Not Applicable
Delegated to: Netdev Maintainers
Series None

Commit Message

Sebastian Andrzej Siewior May 10, 2024, 4:21 p.m. UTC
The XDP redirect process is two staged:
- bpf_prog_run_xdp() is invoked to run an eBPF program which inspects the
  packet and makes decisions. While doing that, the per-CPU variable
  bpf_redirect_info is used.

- Afterwards xdp_do_redirect() is invoked; it accesses bpf_redirect_info
  and may also access other per-CPU variables like xskmap_flush_list.

At the very end of the NAPI callback, xdp_do_flush() is invoked which
does not access bpf_redirect_info but will touch the individual per-CPU
lists.
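To make the call sequence concrete, here is a simplified sketch of how a
driver's NAPI poll typically drives these stages (the function and its
arguments are illustrative; bpf_prog_run_xdp(), xdp_do_redirect() and
xdp_do_flush() are the existing entry points):

/* Illustrative only: per-packet setup and error handling omitted. */
static int example_napi_poll(struct napi_struct *napi, int budget,
			     struct net_device *dev, struct bpf_prog *prog)
{
	int done = 0;

	while (done < budget) {
		struct xdp_buff xdp;

		/* ... fill xdp from the next RX descriptor ... */

		/* Stage 1: run the eBPF program; it records its verdict in
		 * bpf_redirect_info.
		 */
		u32 act = bpf_prog_run_xdp(prog, &xdp);

		/* Stage 2: consume bpf_redirect_info (and possibly
		 * xskmap_flush_list) for redirected packets.
		 */
		if (act == XDP_REDIRECT && xdp_do_redirect(dev, &xdp, prog) < 0)
			act = XDP_ABORTED;

		done++;
	}

	/* End of the NAPI callback: flush the per-CPU lists filled above. */
	xdp_do_flush();
	return done;
}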

The per-CPU variables are only used in the NAPI callback, hence disabling
bottom halves is the only protection mechanism. Users in preemptible
context (like cpu_map_kthread_run()) explicitly disable bottom halves
for protection reasons.
Since local_bh_disable() does not provide locking on PREEMPT_RT, this
data structure requires explicit locking there.

PREEMPT_RT has forced-threaded interrupts enabled and every NAPI
callback runs in a thread. If each thread has its own data structure,
then locking can be avoided.

Create a struct bpf_net_context which contains struct bpf_redirect_info.
Define the variable on the stack and use bpf_net_ctx_set() to save a
pointer to it. Use the __free() annotation to automatically reset the
pointer once the function returns.
bpf_net_ctx_set() may nest. For instance, a function can be used from
within NET_RX_SOFTIRQ/net_rx_action(), which uses bpf_net_ctx_set(), and
from NET_TX_SOFTIRQ, which does not. Therefore only the first invocation
updates the pointer.
Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct
bpf_redirect_info.
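For illustration, the resulting usage pattern (the function here is just an
example; the real call sites are in the changes to net/core/dev.c and
kernel/bpf/cpumap.c below):

static void example_napi_callback(void)
{
	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
	struct bpf_net_context __bpf_net_ctx;

	/* Only the outermost caller sets the pointer; a nested caller gets
	 * NULL back, so its bpf_net_ctx_clear() is a no-op.
	 */
	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);

	/* ... run eBPF programs; they reach the redirect state via
	 * bpf_net_ctx_get_ri() ...
	 */

	/* __free(bpf_net_ctx_clear) resets the pointer when bpf_net_ctx goes
	 * out of scope.
	 */
}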

On PREEMPT_RT the pointer to bpf_net_context is saved in the task's
task_struct. On non-PREEMPT_RT builds the pointer is saved in a per-CPU
variable (which is always node-local memory). Always using the
bpf_net_context approach has the advantage that there are almost zero
differences between PREEMPT_RT and non-PREEMPT_RT builds.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: bpf@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/filter.h | 42 ++++++++++++++++++++++++++++++++-----
 include/linux/sched.h  |  3 +++
 kernel/bpf/cpumap.c    |  3 +++
 kernel/fork.c          |  1 +
 net/bpf/test_run.c     | 11 +++++++++-
 net/core/dev.c         | 19 ++++++++++++++++-
 net/core/filter.c      | 47 +++++++++++++++++++-----------------------
 net/core/lwt_bpf.c     |  3 +++
 8 files changed, 96 insertions(+), 33 deletions(-)

Comments

Sebastian Andrzej Siewior May 10, 2024, 4:22 p.m. UTC | #1
On 2024-05-10 18:21:24 [+0200], To Jesper Dangaard Brouer wrote:
> The XDP redirect process is two staged:
…
On 2024-05-07 15:27:44 [+0200], Jesper Dangaard Brouer wrote:
> 
> I need/want to echo Toke's request to benchmark these changes.

I have:
boxA: ixgbe
boxB: i40e

Both are bigger NUMA boxes. I had to patch ixgbe to ignore the 64-CPU
limit and I boot the box with only 64 CPUs. The IOMMU has been disabled on
both boxes, as have the CPU mitigations. The link is 10G.

The base for testing I have is commit a17ef9e6c2c1c ("net_sched:
sch_sfq: annotate data-races around q->perturb_period") which I used to
rebase my series on top of.

pktgen_sample03_burst_single_flow.sh has been used to send packets and
"xdp-bench drop $nic -e" to receive them.

baseline
~~~~~~~~
boxB -> boxA | gov performance
-t2 (to pktgen)
| receive total 14,854,233 pkt/s        14,854,233 drop/s                0 error/s      

-t1 (to pktgen)
| receive total 10,642,895 pkt/s        10,642,895 drop/s                0 error/s      


boxB -> boxA | gov powersave
-t2 (to pktgen)
  receive total 10,196,085 pkt/s        10,196,085 drop/s                0 error/s      
  receive total 10,187,254 pkt/s        10,187,254 drop/s                0 error/s      
  receive total 10,553,298 pkt/s        10,553,298 drop/s                0 error/s

-t1
  receive total 10,427,732 pkt/s        10,427,732 drop/s                0 error/s      

======
boxA -> boxB (-t1) gov performance
performance:
  receive total 13,171,962 pkt/s        13,171,962 drop/s                0 error/s      
  receive total 13,368,344 pkt/s        13,368,344 drop/s                0 error/s

powersave:
  receive total 13,343,136 pkt/s        13,343,136 drop/s                0 error/s      
  receive total 13,220,326 pkt/s        13,220,326 drop/s                0 error/s      

(The CPU governor had no impact, just noise.)

The series applied (with updated 14/15)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
boxB -> boxA | gov performance
-t2:
  receive total  14,880,199 pkt/s        14,880,199 drop/s                0 error/s

-t1:
  receive total  10,769,082 pkt/s        10,769,082 drop/s                0 error/s      

boxB -> boxA | gov powersave
-t2:
 receive total   11,163,323 pkt/s        11,163,323 drop/s                0 error/s      

-t1:
 receive total   10,756,515 pkt/s        10,756,515 drop/s                0 error/s      

boxA -> boxB | gov performance

 receive total  13,395,919 pkt/s        13,395,919 drop/s                0 error/s      

boxA -> boxB | gov performance
 receive total  13,290,527 pkt/s        13,290,527 drop/s                0 error/s


Based on my numbers, there is just noise.  BoxA hits the CPU limit during
receive when lowering the CPU frequency. BoxB seems to be unaffected by
lowering the CPU frequency during receive.

I can't comment on anything >10G due to HW limits.

Sebastian
Jesper Dangaard Brouer May 14, 2024, 5:07 a.m. UTC | #2
On 10/05/2024 18.22, Sebastian Andrzej Siewior wrote:
> On 2024-05-10 18:21:24 [+0200], To Jesper Dangaard Brouer wrote:
>> The XDP redirect process is two staged:
> …
> On 2024-05-07 15:27:44 [+0200], Jesper Dangaard Brouer wrote:
>>
>> I need/want to echo Toke's request to benchmark these changes.
> 
> I have:
> boxA: ixgbe
> boxB: i40e
> 
> Both are bigger NUMA boxes. I have to patch ixgbe to ignore the 64CPU
> limit and I boot box with only 64CPUs. The IOMMU has been disabled on
> both box as well as CPU mitigations. The link is 10G.
> 
> The base for testing I have is commit a17ef9e6c2c1c ("net_sched:
> sch_sfq: annotate data-races around q->perturb_period") which I used to
> rebase my series on top of.
> 
> pktgen_sample03_burst_single_flow.sh has been used to send packets and
> "xdp-bench drop $nic -e" to receive them.
> 

Sorry, but an XDP_DROP test will not activate the code you are modifying.
Thus, this test is invalid and doesn't tell us anything about your code
changes.

The code is modifying the XDP_REDIRECT handling system. Thus, the
benchmark test needs to activate this code.


> baseline
> ~~~~~~~~
> boxB -> boxA | gov performance
> -t2 (to pktgen)
> | receive total 14,854,233 pkt/s        14,854,233 drop/s                0 error/s
> 
> -t1 (to pktgen)
> | receive total 10,642,895 pkt/s        10,642,895 drop/s                0 error/s
> 
> 
> boxB -> boxA | gov powersave
> -t2 (to pktgen)
>    receive total 10,196,085 pkt/s        10,196,085 drop/s                0 error/s
>    receive total 10,187,254 pkt/s        10,187,254 drop/s                0 error/s
>    receive total 10,553,298 pkt/s        10,553,298 drop/s                0 error/s
> 
> -t1
>    receive total 10,427,732 pkt/s        10,427,732 drop/s                0 error/s
> 
> ======
> boxA -> boxB (-t1) gov performance
> performace:
>    receive total 13,171,962 pkt/s        13,171,962 drop/s                0 error/s
>    receive total 13,368,344 pkt/s        13,368,344 drop/s                0 error/s
> 
> powersave:
>    receive total 13,343,136 pkt/s        13,343,136 drop/s                0 error/s
>    receive total 13,220,326 pkt/s        13,220,326 drop/s                0 error/s
> 
> (I the CPU governor had no impact, just noise)
> 
> The series applied (with updated 14/15)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> boxB -> boxA | gov performance
> -t2:
>    receive total  14,880,199 pkt/s        14,880,199 drop/s                0 error/s
> 
> -t1:
>    receive total  10,769,082 pkt/s        10,769,082 drop/s                0 error/s
> 
> boxB -> boxA | gov powersave
> -t2:
>   receive total   11,163,323 pkt/s        11,163,323 drop/s                0 error/s
> 
> -t1:
>   receive total   10,756,515 pkt/s        10,756,515 drop/s                0 error/s
> 
> boxA -> boxB | gov perfomance
> 
>   receive total  13,395,919 pkt/s        13,395,919 drop/s                0 error/s
> 
> boxA -> boxB | gov perfomance
>   receive total  13,290,527 pkt/s        13,290,527 drop/s                0 error/s
> 
> 
> Based on my numbers, there is just noise.  BoxA hit the CPU limit during
> receive while lowering the CPU-freq. BoxB seems to be unaffected by
> lowing CPU frequency during receive.
> 
> I can't comment on anything >10G due to HW limits.
> 
> Sebastian
Sebastian Andrzej Siewior May 14, 2024, 5:43 a.m. UTC | #3
On 2024-05-14 07:07:21 [+0200], Jesper Dangaard Brouer wrote:
> > pktgen_sample03_burst_single_flow.sh has been used to send packets and
> > "xdp-bench drop $nic -e" to receive them.
> > 
> 
> Sorry, but a XDP_DROP test will not activate the code you are modifying.
> Thus, this test is invalid and doesn't tell us anything about your code
> changes.
> 
> The code is modifying the XDP_REDIRECT handling system. Thus, the
> benchmark test needs to activate this code.

This was a misunderstanding on my side then. What do you suggest
instead? Same setup but "redirect" on the same interface instead of
"drop"?

Sebastian
Toke Høiland-Jørgensen May 14, 2024, 11:54 a.m. UTC | #4
Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> The XDP redirect process is two staged:
> - bpf_prog_run_xdp() is invoked to run a eBPF program which inspects the
>   packet and makes decisions. While doing that, the per-CPU variable
>   bpf_redirect_info is used.
>
> - Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
>   and it may also access other per-CPU variables like xskmap_flush_list.
>
> At the very end of the NAPI callback, xdp_do_flush() is invoked which
> does not access bpf_redirect_info but will touch the individual per-CPU
> lists.
>
> The per-CPU variables are only used in the NAPI callback hence disabling
> bottom halves is the only protection mechanism. Users from preemptible
> context (like cpu_map_kthread_run()) explicitly disable bottom halves
> for protections reasons.
> Without locking in local_bh_disable() on PREEMPT_RT this data structure
> requires explicit locking.
>
> PREEMPT_RT has forced-threaded interrupts enabled and every
> NAPI-callback runs in a thread. If each thread has its own data
> structure then locking can be avoided.
>
> Create a struct bpf_net_context which contains struct bpf_redirect_info.
> Define the variable on stack, use bpf_net_ctx_set() to save a pointer to
> it. Use the __free() annotation to automatically reset the pointer once
> function returns.
> The bpf_net_ctx_set() may nest. For instance a function can be used from
> within NET_RX_SOFTIRQ/ net_rx_action which uses bpf_net_ctx_set() and
> NET_TX_SOFTIRQ which does not. Therefore only the first invocations
> updates the pointer.
> Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct
> bpf_redirect_info.
>
> On PREEMPT_RT the pointer to bpf_net_context is saved task's
> task_struct. On non-PREEMPT_RT builds the pointer saved in a per-CPU
> variable (which is always NODE-local memory). Using always the
> bpf_net_context approach has the advantage that there is almost zero
> differences between PREEMPT_RT and non-PREEMPT_RT builds.
>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Andrii Nakryiko <andrii@kernel.org>
> Cc: Eduard Zingerman <eddyz87@gmail.com>
> Cc: Hao Luo <haoluo@google.com>
> Cc: Jesper Dangaard Brouer <hawk@kernel.org>
> Cc: Jiri Olsa <jolsa@kernel.org>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: KP Singh <kpsingh@kernel.org>
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Song Liu <song@kernel.org>
> Cc: Stanislav Fomichev <sdf@google.com>
> Cc: Toke Høiland-Jørgensen <toke@redhat.com>
> Cc: Yonghong Song <yonghong.song@linux.dev>
> Cc: bpf@vger.kernel.org
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  include/linux/filter.h | 42 ++++++++++++++++++++++++++++++++-----
>  include/linux/sched.h  |  3 +++
>  kernel/bpf/cpumap.c    |  3 +++
>  kernel/fork.c          |  1 +
>  net/bpf/test_run.c     | 11 +++++++++-
>  net/core/dev.c         | 19 ++++++++++++++++-
>  net/core/filter.c      | 47 +++++++++++++++++++-----------------------
>  net/core/lwt_bpf.c     |  3 +++
>  8 files changed, 96 insertions(+), 33 deletions(-)
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index d5fea03cb6e61..6db5a68db6ee1 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -744,7 +744,39 @@ struct bpf_redirect_info {
>  	struct bpf_nh_params nh;
>  };
>  
> -DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
> +struct bpf_net_context {
> +	struct bpf_redirect_info ri;
> +};
> +
> +static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bpf_net_ctx)
> +{
> +	struct task_struct *tsk = current;
> +
> +	if (tsk->bpf_net_context != NULL)
> +		return NULL;
> +	tsk->bpf_net_context = bpf_net_ctx;
> +	return bpf_net_ctx;
> +}
> +
> +static inline void bpf_net_ctx_clear(struct bpf_net_context *bpf_net_ctx)
> +{
> +	if (bpf_net_ctx)
> +		current->bpf_net_context = NULL;
> +}
> +
> +static inline struct bpf_net_context *bpf_net_ctx_get(void)
> +{
> +	return current->bpf_net_context;
> +}
> +
> +static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
> +{
> +	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
> +
> +	return &bpf_net_ctx->ri;
> +}
> +
> +DEFINE_FREE(bpf_net_ctx_clear, struct bpf_net_context *, bpf_net_ctx_clear(_T));
>  
>  /* flags for bpf_redirect_info kern_flags */
>  #define BPF_RI_F_RF_NO_DIRECT	BIT(0)	/* no napi_direct on return_frame */
> @@ -1021,21 +1053,21 @@ void bpf_clear_redirect_map(struct bpf_map *map);
>  
>  static inline bool xdp_return_frame_no_direct(void)
>  {
> -	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
>  
>  	return ri->kern_flags & BPF_RI_F_RF_NO_DIRECT;
>  }
>  
>  static inline void xdp_set_return_frame_no_direct(void)
>  {
> -	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
>  
>  	ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
>  }
>  
>  static inline void xdp_clear_return_frame_no_direct(void)
>  {
> -	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
>  
>  	ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
>  }
> @@ -1591,7 +1623,7 @@ static __always_inline long __bpf_xdp_redirect_map(struct bpf_map *map, u64 inde
>  						   u64 flags, const u64 flag_mask,
>  						   void *lookup_elem(struct bpf_map *map, u32 key))
>  {
> -	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
>  	const u64 action_mask = XDP_ABORTED | XDP_DROP | XDP_PASS | XDP_TX;
>  
>  	/* Lower bits of the flags are used as return code on lookup failure */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 6779d3b8f2578..cc9be45de6606 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -53,6 +53,7 @@ struct bio_list;
>  struct blk_plug;
>  struct bpf_local_storage;
>  struct bpf_run_ctx;
> +struct bpf_net_context;
>  struct capture_control;
>  struct cfs_rq;
>  struct fs_struct;
> @@ -1504,6 +1505,8 @@ struct task_struct {
>  	/* Used for BPF run context */
>  	struct bpf_run_ctx		*bpf_ctx;
>  #endif
> +	/* Used by BPF for per-TASK xdp storage */
> +	struct bpf_net_context		*bpf_net_context;

Okay, so if we are going the route of always putting this in 'current',
why not just embed the whole struct bpf_net_context inside task_struct,
instead of mucking about with the stack-allocated structures and
setting/clearing of pointers?

-Toke
Jesper Dangaard Brouer May 14, 2024, 12:20 p.m. UTC | #5
On 14/05/2024 07.43, Sebastian Andrzej Siewior wrote:
> On 2024-05-14 07:07:21 [+0200], Jesper Dangaard Brouer wrote:
>>> pktgen_sample03_burst_single_flow.sh has been used to send packets and
>>> "xdp-bench drop $nic -e" to receive them.
>>>
>>
>> Sorry, but a XDP_DROP test will not activate the code you are modifying.
>> Thus, this test is invalid and doesn't tell us anything about your code
>> changes.
>>
>> The code is modifying the XDP_REDIRECT handling system. Thus, the
>> benchmark test needs to activate this code.
> 
> This was a misunderstanding on my side then. What do you suggest
> instead? Same setup but "redirect" on the same interface instead of
> "drop"?
> 

Redirect is more flexible, and redirecting back out the same interface is
one option, but I've often seen this give issues, because it overloads the
traffic generator (without people realizing this), leading to false
results. Thus, verify that the packet generator is sending faster than the
rates you are collecting. (I use this tool[2] on the generator machine, in
another terminal, to see if something funky is happening with the ethtool
stats.)

To work around this issue, I've previously redirected to the device 'lo'
(localhost), which is obviously invalid so the packet gets dropped, but I
can see that when we converted from the kernel samples/bpf/ to this tool,
this trick/hack was no longer supported.

The xdp-bench[1] tool also provides a number of redirect sub-commands,
e.g. redirect / redirect-cpu / redirect-map / redirect-multi.
Given you also modify the CPU-map code, I would say we also need a
'redirect-cpu' test case.

Trick for CPU-map to do early drop on remote CPU:

  # ./xdp-bench redirect-cpu --cpu 3 --remote-action drop ixgbe1

I recommend using Ctrl+\ while running to show more info like the CPUs
being used and what the kthread consumes, to catch issues, e.g. if you are
redirecting to the same CPU that RX happens to run on.

--Jesper

[1] https://github.com/xdp-project/xdp-tools/tree/master/xdp-bench
[2] 
https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
Sebastian Andrzej Siewior May 15, 2024, 1:43 p.m. UTC | #6
On 2024-05-14 13:54:43 [+0200], Toke Høiland-Jørgensen wrote:
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1504,6 +1505,8 @@ struct task_struct {
> >  	/* Used for BPF run context */
> >  	struct bpf_run_ctx		*bpf_ctx;
> >  #endif
> > +	/* Used by BPF for per-TASK xdp storage */
> > +	struct bpf_net_context		*bpf_net_context;
> 
> Okay, so if we are going the route of always putting this in 'current',
> why not just embed the whole struct bpf_net_context inside task_struct,
> instead of mucking about with the stack-allocated structures and
> setting/clearing of pointers?

The whole struct bpf_net_context is 112 bytes. task_struct is 12352
bytes with my debug config or 7296 bytes with defconfig on x86-64. Adding
it unconditionally would grow task_struct by ~1%, but it would make
things way easier: the NULL case goes away, the assignment and cleanup
go away, and the INIT_LIST_HEAD can be moved to fork(). If the size
increase is not an issue then why not. Let me prepare…
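For reference, a rough sketch of that embedded variant (not the patch
below; the field placement and the simplified accessor are illustrative):

struct task_struct {
	/* ... */
	struct bpf_net_context		bpf_net_context;	/* always valid */
	/* ... */
};

static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
{
	/* No set/clear and no NULL case: the storage always exists. */
	return &current->bpf_net_context.ri;
}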

> -Toke

Sebastian
Sebastian Andrzej Siewior May 17, 2024, 4:15 p.m. UTC | #7
On 2024-05-14 14:20:03 [+0200], Jesper Dangaard Brouer wrote:
> Trick for CPU-map to do early drop on remote CPU:
> 
>  # ./xdp-bench redirect-cpu --cpu 3 --remote-action drop ixgbe1
> 
> I recommend using Ctrl+\ while running to show more info like CPUs being
> used and what kthread consumes.  To catch issues e.g. if you are CPU
> redirecting to same CPU as RX happen to run on.

Okay. So I reworked the last two patches to make the struct part of
task_struct and then did as you suggested:

Unpatched:
|Sending:
|Show adapter(s) (eno2np1) statistics (ONLY that changed!)
|Ethtool(eno2np1 ) stat:    952102520 (    952,102,520) <= port.tx_bytes /sec
|Ethtool(eno2np1 ) stat:     14876602 (     14,876,602) <= port.tx_size_64 /sec
|Ethtool(eno2np1 ) stat:     14876602 (     14,876,602) <= port.tx_unicast /sec
|Ethtool(eno2np1 ) stat:    446045897 (    446,045,897) <= tx-0.bytes /sec
|Ethtool(eno2np1 ) stat:      7434098 (      7,434,098) <= tx-0.packets /sec
|Ethtool(eno2np1 ) stat:    446556042 (    446,556,042) <= tx-1.bytes /sec
|Ethtool(eno2np1 ) stat:      7442601 (      7,442,601) <= tx-1.packets /sec
|Ethtool(eno2np1 ) stat:    892592523 (    892,592,523) <= tx_bytes /sec
|Ethtool(eno2np1 ) stat:     14876542 (     14,876,542) <= tx_packets /sec
|Ethtool(eno2np1 ) stat:            2 (              2) <= tx_restart /sec
|Ethtool(eno2np1 ) stat:            2 (              2) <= tx_stopped /sec
|Ethtool(eno2np1 ) stat:     14876622 (     14,876,622) <= tx_unicast /sec
|
|Receive:
|eth1->?                 8,732,508 rx/s                  0 err,drop/s
|  receive total         8,732,508 pkt/s                 0 drop/s                0 error/s
|    cpu:10              8,732,508 pkt/s                 0 drop/s                0 error/s
|  enqueue to cpu 3      8,732,510 pkt/s                 0 drop/s             7.00 bulk-avg
|    cpu:10->3           8,732,510 pkt/s                 0 drop/s             7.00 bulk-avg
|  kthread total         8,732,506 pkt/s                 0 drop/s          205,650 sched
|    cpu:3               8,732,506 pkt/s                 0 drop/s          205,650 sched
|    xdp_stats                   0 pass/s        8,732,506 drop/s                0 redir/s
|      cpu:3                     0 pass/s        8,732,506 drop/s                0 redir/s
|  redirect_err                  0 error/s
|  xdp_exception                 0 hit/s

I verified that the "drop only" case hits 14M packets/s while this
redirect part reports 8M packets/s.

Patched:
|Sending:
|Show adapter(s) (eno2np1) statistics (ONLY that changed!)
|Ethtool(eno2np1 ) stat:    952635404 (    952,635,404) <= port.tx_bytes /sec
|Ethtool(eno2np1 ) stat:     14884934 (     14,884,934) <= port.tx_size_64 /sec
|Ethtool(eno2np1 ) stat:     14884928 (     14,884,928) <= port.tx_unicast /sec
|Ethtool(eno2np1 ) stat:    446496117 (    446,496,117) <= tx-0.bytes /sec
|Ethtool(eno2np1 ) stat:      7441602 (      7,441,602) <= tx-0.packets /sec
|Ethtool(eno2np1 ) stat:    446603461 (    446,603,461) <= tx-1.bytes /sec
|Ethtool(eno2np1 ) stat:      7443391 (      7,443,391) <= tx-1.packets /sec
|Ethtool(eno2np1 ) stat:    893086506 (    893,086,506) <= tx_bytes /sec
|Ethtool(eno2np1 ) stat:     14884775 (     14,884,775) <= tx_packets /sec
|Ethtool(eno2np1 ) stat:           14 (             14) <= tx_restart /sec
|Ethtool(eno2np1 ) stat:           14 (             14) <= tx_stopped /sec
|Ethtool(eno2np1 ) stat:     14884937 (     14,884,937) <= tx_unicast /sec
|
|Receive:
|eth1->?                 8,735,198 rx/s                  0 err,drop/s
|  receive total         8,735,198 pkt/s                 0 drop/s                0 error/s
|    cpu:6               8,735,198 pkt/s                 0 drop/s                0 error/s
|  enqueue to cpu 3      8,735,193 pkt/s                 0 drop/s             7.00 bulk-avg
|    cpu:6->3            8,735,193 pkt/s                 0 drop/s             7.00 bulk-avg
|  kthread total         8,735,191 pkt/s                 0 drop/s          208,054 sched
|    cpu:3               8,735,191 pkt/s                 0 drop/s          208,054 sched
|    xdp_stats                   0 pass/s        8,735,191 drop/s                0 redir/s
|      cpu:3                     0 pass/s        8,735,191 drop/s                0 redir/s
|  redirect_err                  0 error/s
|  xdp_exception                 0 hit/s

This looks to be in the same range/noise level. top-wise I have
ksoftirqd at 100% and cpumap/./map at ~60%, so I hit the CPU speed limit
on a 10G link. perf top shows
|   18.37%  bpf_prog_4f0ffbb35139c187_cpumap_l4_hash         [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash
|   13.15%  [kernel]                                         [k] cpu_map_kthread_run
|   12.96%  [kernel]                                         [k] ixgbe_poll
|    6.78%  [kernel]                                         [k] page_frag_free
|    5.62%  [kernel]                                         [k] xdp_do_redirect

for the top 5. Is this something that looks reasonable?

Sebastian
Alexei Starovoitov May 21, 2024, 1:52 a.m. UTC | #8
On Wed, May 15, 2024 at 6:43 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2024-05-14 13:54:43 [+0200], Toke Høiland-Jørgensen wrote:
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -1504,6 +1505,8 @@ struct task_struct {
> > >     /* Used for BPF run context */
> > >     struct bpf_run_ctx              *bpf_ctx;
> > >  #endif
> > > +   /* Used by BPF for per-TASK xdp storage */
> > > +   struct bpf_net_context          *bpf_net_context;
> >
> > Okay, so if we are going the route of always putting this in 'current',
> > why not just embed the whole struct bpf_net_context inside task_struct,
> > instead of mucking about with the stack-allocated structures and
> > setting/clearing of pointers?
>
> The whole struct bpf_net_context has 112 bytes. task_struct has 12352
> bytes in my debug-config or 7296 bytes with defconfig on x86-64. Adding
> it unconditionally would grow task_struct by ~1% but it would make
> things way easier: The NULL case goes away, the assignment and cleanup
> goes away, the INIT_LIST_HEAD can be moved to fork(). If the size
> increase is not an issue then why not. Let me prepare…

I think 112 bytes, or whatever the size of bpf_net_context is, is a bit
too much to consume in task_struct.
Yes, it's big, but there are systems with 1M threads. 112 Mbyte is not
that small.
bpf_net_ctx_set/get are not in the critical path and get_ri will be
inlined without any conditionals, so performance should be the same.
Jesper Dangaard Brouer May 22, 2024, 7:09 a.m. UTC | #9
On 17/05/2024 18.15, Sebastian Andrzej Siewior wrote:
> On 2024-05-14 14:20:03 [+0200], Jesper Dangaard Brouer wrote:
>> Trick for CPU-map to do early drop on remote CPU:
>>
>>   # ./xdp-bench redirect-cpu --cpu 3 --remote-action drop ixgbe1
>>
>> I recommend using Ctrl+\ while running to show more info like CPUs being
>> used and what kthread consumes.  To catch issues e.g. if you are CPU
>> redirecting to same CPU as RX happen to run on.
> 
> Okay. So I reworked the last two patches make the struct part of
> task_struct and then did as you suggested:
> 
> Unpatched:
> |Sending:
> |Show adapter(s) (eno2np1) statistics (ONLY that changed!)
> |Ethtool(eno2np1 ) stat:    952102520 (    952,102,520) <= port.tx_bytes /sec
> |Ethtool(eno2np1 ) stat:     14876602 (     14,876,602) <= port.tx_size_64 /sec
> |Ethtool(eno2np1 ) stat:     14876602 (     14,876,602) <= port.tx_unicast /sec
> |Ethtool(eno2np1 ) stat:    446045897 (    446,045,897) <= tx-0.bytes /sec
> |Ethtool(eno2np1 ) stat:      7434098 (      7,434,098) <= tx-0.packets /sec
> |Ethtool(eno2np1 ) stat:    446556042 (    446,556,042) <= tx-1.bytes /sec
> |Ethtool(eno2np1 ) stat:      7442601 (      7,442,601) <= tx-1.packets /sec
> |Ethtool(eno2np1 ) stat:    892592523 (    892,592,523) <= tx_bytes /sec
> |Ethtool(eno2np1 ) stat:     14876542 (     14,876,542) <= tx_packets /sec
> |Ethtool(eno2np1 ) stat:            2 (              2) <= tx_restart /sec
> |Ethtool(eno2np1 ) stat:            2 (              2) <= tx_stopped /sec
> |Ethtool(eno2np1 ) stat:     14876622 (     14,876,622) <= tx_unicast /sec
> |
> |Receive:
> |eth1->?                 8,732,508 rx/s                  0 err,drop/s
> |  receive total         8,732,508 pkt/s                 0 drop/s                0 error/s
> |    cpu:10              8,732,508 pkt/s                 0 drop/s                0 error/s
> |  enqueue to cpu 3      8,732,510 pkt/s                 0 drop/s             7.00 bulk-avg
> |    cpu:10->3           8,732,510 pkt/s                 0 drop/s             7.00 bulk-avg
> |  kthread total         8,732,506 pkt/s                 0 drop/s          205,650 sched
> |    cpu:3               8,732,506 pkt/s                 0 drop/s          205,650 sched
> |    xdp_stats                   0 pass/s        8,732,506 drop/s                0 redir/s
> |      cpu:3                     0 pass/s        8,732,506 drop/s                0 redir/s
> |  redirect_err                  0 error/s
> |  xdp_exception                 0 hit/s
> 
> I verified that the "drop only" case hits 14M packets/s while this
> redirect part reports 8M packets/s.
>

Great, this is a good test.

The transmit speed of 14.88 Mpps is 10G wirespeed at the smallest Ethernet
packet size (84 bytes with overhead + interframe gap: 10*10^9/(84*8) = 14,880,952).


> Patched:
> |Sending:
> |Show adapter(s) (eno2np1) statistics (ONLY that changed!)
> |Ethtool(eno2np1 ) stat:    952635404 (    952,635,404) <= port.tx_bytes /sec
> |Ethtool(eno2np1 ) stat:     14884934 (     14,884,934) <= port.tx_size_64 /sec
> |Ethtool(eno2np1 ) stat:     14884928 (     14,884,928) <= port.tx_unicast /sec
> |Ethtool(eno2np1 ) stat:    446496117 (    446,496,117) <= tx-0.bytes /sec
> |Ethtool(eno2np1 ) stat:      7441602 (      7,441,602) <= tx-0.packets /sec
> |Ethtool(eno2np1 ) stat:    446603461 (    446,603,461) <= tx-1.bytes /sec
> |Ethtool(eno2np1 ) stat:      7443391 (      7,443,391) <= tx-1.packets /sec
> |Ethtool(eno2np1 ) stat:    893086506 (    893,086,506) <= tx_bytes /sec
> |Ethtool(eno2np1 ) stat:     14884775 (     14,884,775) <= tx_packets /sec
> |Ethtool(eno2np1 ) stat:           14 (             14) <= tx_restart /sec
> |Ethtool(eno2np1 ) stat:           14 (             14) <= tx_stopped /sec
> |Ethtool(eno2np1 ) stat:     14884937 (     14,884,937) <= tx_unicast /sec
> |
> |Receive:
> |eth1->?                 8,735,198 rx/s                  0 err,drop/s
> |  receive total         8,735,198 pkt/s                 0 drop/s                0 error/s
> |    cpu:6               8,735,198 pkt/s                 0 drop/s                0 error/s
> |  enqueue to cpu 3      8,735,193 pkt/s                 0 drop/s             7.00 bulk-avg
> |    cpu:6->3            8,735,193 pkt/s                 0 drop/s             7.00 bulk-avg
> |  kthread total         8,735,191 pkt/s                 0 drop/s          208,054 sched
> |    cpu:3               8,735,191 pkt/s                 0 drop/s          208,054 sched
> |    xdp_stats                   0 pass/s        8,735,191 drop/s                0 redir/s
> |      cpu:3                     0 pass/s        8,735,191 drop/s                0 redir/s
> |  redirect_err                  0 error/s
> |  xdp_exception                 0 hit/s
> 

Great, basically zero overhead. Awesome that you verified this!


> This looks to be in the same range/ noise level. top wise I have
> ksoftirqd at 100% and cpumap/./map at ~60% so I hit CPU speed limit on a
> 10G link. 

For our purpose of testing the XDP_REDIRECT code that you are modifying,
this is what we want: the RX CPU/NAPI is the bottleneck, while the remote
cpumap CPU has idle cycles (also indicated by the 208,054 sched stats).

> perf top shows

I appreciate getting this perf data.

As we are explicitly dealing with splitting the workload across CPUs, it
is worth mentioning that perf supports displaying and filtering on CPUs.

This perf command includes the CPU number (zero-indexed):
  # perf report --sort cpu,comm,dso,symbol --no-children

For this benchmark, to focus, I would reduce this to:
   # perf report --sort cpu,symbol --no-children

The perf tool can also use -C to filter on some CPUs like:

  # perf report --sort cpu,symbol --no-children -C 3,6


> |   18.37%  bpf_prog_4f0ffbb35139c187_cpumap_l4_hash         [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash

This bpf_prog_4f0ffbb35139c187_cpumap_l4_hash is running on the RX CPU,
doing the load balancing.

> |   13.15%  [kernel]                                         [k] cpu_map_kthread_run

This runs on remote cpumap CPU (in this case CPU 3).

> |   12.96%  [kernel]                                         [k] ixgbe_poll
> |    6.78%  [kernel]                                         [k] page_frag_free

The page_frag_free call might run on remote cpumap CPU.

> |    5.62%  [kernel]                                         [k] xdp_do_redirect
> 
> for the top 5. Is this something that looks reasonable?

Yes, except I had to guess how the workload was split between CPUs ;-)

Thanks for doing these benchmarks! :-)
--Jesper
Sebastian Andrzej Siewior May 24, 2024, 7:54 a.m. UTC | #10
On 2024-05-22 09:09:45 [+0200], Jesper Dangaard Brouer wrote:
> 
> Yes, except I had to guess how the workload was split between CPUs ;-)
> 
> Thanks for doing these benchmarks! :-)

Thank you for all the explanations and walking me through it.
I'm going to update the patches as discussed and redo the numbers as
suggested here.
Thanks.

> --Jesper

Sebastian
Sebastian Andrzej Siewior May 24, 2024, 1:59 p.m. UTC | #11
On 2024-05-22 09:09:45 [+0200], Jesper Dangaard Brouer wrote:
> For this benchmark, to focus, I would reduce this to:
>   # perf report --sort cpu,symbol --no-children

Keeping bpf_net_ctx_set()/clear(), removing the NULL checks (to align
with Alexei's last email).
Performance-wise, I'm using
	xdp-bench redirect-cpu --cpu 3 --remote-action drop eth1 -e

unpatched:

| eth1->?                 9,427,705 rx/s                  0 err,drop/s
|   receive total         9,427,705 pkt/s                 0 drop/s                0 error/s
|     cpu:17              9,427,705 pkt/s                 0 drop/s                0 error/s
|   enqueue to cpu 3      9,427,708 pkt/s                 0 drop/s             8.00 bulk-avg
|     cpu:17->3           9,427,708 pkt/s                 0 drop/s             8.00 bulk-avg
|   kthread total         9,427,710 pkt/s                 0 drop/s          147,276 sched
|     cpu:3               9,427,710 pkt/s                 0 drop/s          147,276 sched
|     xdp_stats                   0 pass/s        9,427,710 drop/s                0 redir/s
|       cpu:3                     0 pass/s        9,427,710 drop/s                0 redir/s
|   redirect_err                  0 error/s
|   xdp_exception                 0 hit/s

Patched:
| eth1->?                 9,557,170 rx/s                  0 err,drop/s
|   receive total         9,557,170 pkt/s                 0 drop/s                0 error/s
|     cpu:9               9,557,170 pkt/s                 0 drop/s                0 error/s
|   enqueue to cpu 3      9,557,170 pkt/s                 0 drop/s             8.00 bulk-avg
|     cpu:9->3            9,557,170 pkt/s                 0 drop/s             8.00 bulk-avg
|   kthread total         9,557,195 pkt/s                 0 drop/s          126,164 sched
|     cpu:3               9,557,195 pkt/s                 0 drop/s          126,164 sched
|     xdp_stats                   0 pass/s        9,557,195 drop/s                0 redir/s
|       cpu:3                     0 pass/s        9,557,195 drop/s                0 redir/s
|   redirect_err                  0 error/s
|   xdp_exception                 0 hit/s

I think this is noise. perf output as suggested (perf report --sort
cpu,symbol --no-children).

unpatched:
|  19.05%  017  [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash
|  11.40%  017  [k] ixgbe_poll
|  10.68%  003  [k] cpu_map_kthread_run
|   7.62%  003  [k] intel_idle
|   6.11%  017  [k] xdp_do_redirect
|   6.01%  003  [k] page_frag_free
|   4.72%  017  [k] bq_flush_to_queue
|   3.74%  017  [k] cpu_map_redirect
|   2.35%  003  [k] xdp_return_frame
|   1.55%  003  [k] bpf_prog_57cd311f2e27366b_cpumap_drop
|   1.49%  017  [k] dma_sync_single_for_device
|   1.41%  017  [k] ixgbe_alloc_rx_buffers
|   1.26%  017  [k] cpu_map_enqueue
|   1.24%  017  [k] dma_sync_single_for_cpu
|   1.12%  003  [k] __xdp_return
|   0.83%  017  [k] bpf_trace_run4
|   0.77%  003  [k] __switch_to

patched:
|  18.20%  009  [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash
|  11.64%  009  [k] ixgbe_poll
|   7.74%  003  [k] page_frag_free
|   6.69%  003  [k] cpu_map_bpf_prog_run_xdp
|   6.02%  003  [k] intel_idle
|   5.96%  009  [k] xdp_do_redirect
|   4.45%  003  [k] cpu_map_kthread_run
|   3.71%  009  [k] cpu_map_redirect
|   3.23%  009  [k] bq_flush_to_queue
|   2.55%  003  [k] xdp_return_frame
|   1.67%  003  [k] bpf_prog_57cd311f2e27366b_cpumap_drop
|   1.60%  009  [k] _raw_spin_lock
|   1.57%  009  [k] bpf_prog_d7eca17ddc334d36_tp_xdp_cpumap_enqueue
|   1.48%  009  [k] dma_sync_single_for_device
|   1.47%  009  [k] ixgbe_alloc_rx_buffers
|   1.39%  009  [k] dma_sync_single_for_cpu
|   1.33%  009  [k] cpu_map_enqueue
|   1.19%  003  [k] __xdp_return
|   0.66%  003  [k] __switch_to

I'm going to repost the series once the merge window closes unless there
is something you want me to do.

> --Jesper

Sebastian

Patch

diff --git a/include/linux/filter.h b/include/linux/filter.h
index d5fea03cb6e61..6db5a68db6ee1 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -744,7 +744,39 @@  struct bpf_redirect_info {
 	struct bpf_nh_params nh;
 };
 
-DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
+struct bpf_net_context {
+	struct bpf_redirect_info ri;
+};
+
+static inline struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *bpf_net_ctx)
+{
+	struct task_struct *tsk = current;
+
+	if (tsk->bpf_net_context != NULL)
+		return NULL;
+	tsk->bpf_net_context = bpf_net_ctx;
+	return bpf_net_ctx;
+}
+
+static inline void bpf_net_ctx_clear(struct bpf_net_context *bpf_net_ctx)
+{
+	if (bpf_net_ctx)
+		current->bpf_net_context = NULL;
+}
+
+static inline struct bpf_net_context *bpf_net_ctx_get(void)
+{
+	return current->bpf_net_context;
+}
+
+static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
+{
+	struct bpf_net_context *bpf_net_ctx = bpf_net_ctx_get();
+
+	return &bpf_net_ctx->ri;
+}
+
+DEFINE_FREE(bpf_net_ctx_clear, struct bpf_net_context *, bpf_net_ctx_clear(_T));
 
 /* flags for bpf_redirect_info kern_flags */
 #define BPF_RI_F_RF_NO_DIRECT	BIT(0)	/* no napi_direct on return_frame */
@@ -1021,21 +1053,21 @@  void bpf_clear_redirect_map(struct bpf_map *map);
 
 static inline bool xdp_return_frame_no_direct(void)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	return ri->kern_flags & BPF_RI_F_RF_NO_DIRECT;
 }
 
 static inline void xdp_set_return_frame_no_direct(void)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
 }
 
 static inline void xdp_clear_return_frame_no_direct(void)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
 }
@@ -1591,7 +1623,7 @@  static __always_inline long __bpf_xdp_redirect_map(struct bpf_map *map, u64 inde
 						   u64 flags, const u64 flag_mask,
 						   void *lookup_elem(struct bpf_map *map, u32 key))
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	const u64 action_mask = XDP_ABORTED | XDP_DROP | XDP_PASS | XDP_TX;
 
 	/* Lower bits of the flags are used as return code on lookup failure */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6779d3b8f2578..cc9be45de6606 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -53,6 +53,7 @@  struct bio_list;
 struct blk_plug;
 struct bpf_local_storage;
 struct bpf_run_ctx;
+struct bpf_net_context;
 struct capture_control;
 struct cfs_rq;
 struct fs_struct;
@@ -1504,6 +1505,8 @@  struct task_struct {
 	/* Used for BPF run context */
 	struct bpf_run_ctx		*bpf_ctx;
 #endif
+	/* Used by BPF for per-TASK xdp storage */
+	struct bpf_net_context		*bpf_net_context;
 
 #ifdef CONFIG_GCC_PLUGIN_STACKLEAK
 	unsigned long			lowest_stack;
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index a8e34416e960f..66974bd027109 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -240,12 +240,14 @@  static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 				int xdp_n, struct xdp_cpumap_stats *stats,
 				struct list_head *list)
 {
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	int nframes;
 
 	if (!rcpu->prog)
 		return xdp_n;
 
 	rcu_read_lock_bh();
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 
 	nframes = cpu_map_bpf_prog_run_xdp(rcpu, frames, xdp_n, stats);
 
@@ -255,6 +257,7 @@  static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 	if (unlikely(!list_empty(list)))
 		cpu_map_bpf_prog_run_skb(rcpu, list, stats);
 
+	bpf_net_ctx_clear(bpf_net_ctx);
 	rcu_read_unlock_bh(); /* resched point, may call do_softirq() */
 
 	return nframes;
diff --git a/kernel/fork.c b/kernel/fork.c
index aebb3e6c96dc6..aa61c9bca62be 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2355,6 +2355,7 @@  __latent_entropy struct task_struct *copy_process(
 	RCU_INIT_POINTER(p->bpf_storage, NULL);
 	p->bpf_ctx = NULL;
 #endif
+	p->bpf_net_context =  NULL;
 
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index f6aad4ed2ab2f..600cc8e428c1a 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -283,9 +283,10 @@  static int xdp_recv_frames(struct xdp_frame **frames, int nframes,
 static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 			      u32 repeat)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	int err = 0, act, ret, i, nframes = 0, batch_sz;
 	struct xdp_frame **frames = xdp->frames;
+	struct bpf_redirect_info *ri;
 	struct xdp_page_head *head;
 	struct xdp_frame *frm;
 	bool redirect = false;
@@ -295,6 +296,8 @@  static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 	batch_sz = min_t(u32, repeat, xdp->batch_size);
 
 	local_bh_disable();
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+	ri = bpf_net_ctx_get_ri();
 	xdp_set_return_frame_no_direct();
 
 	for (i = 0; i < batch_sz; i++) {
@@ -359,6 +362,7 @@  static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
 	}
 
 	xdp_clear_return_frame_no_direct();
+	bpf_net_ctx_clear(bpf_net_ctx);
 	local_bh_enable();
 	return err;
 }
@@ -394,6 +398,7 @@  static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx,
 static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
 			u32 *retval, u32 *time, bool xdp)
 {
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	struct bpf_prog_array_item item = {.prog = prog};
 	struct bpf_run_ctx *old_ctx;
 	struct bpf_cg_run_ctx run_ctx;
@@ -419,10 +424,14 @@  static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
 	do {
 		run_ctx.prog_item = &item;
 		local_bh_disable();
+		bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+
 		if (xdp)
 			*retval = bpf_prog_run_xdp(prog, ctx);
 		else
 			*retval = bpf_prog_run(prog, ctx);
+
+		bpf_net_ctx_clear(bpf_net_ctx);
 		local_bh_enable();
 	} while (bpf_test_timer_continue(&t, 1, repeat, &ret, time));
 	bpf_reset_run_ctx(old_ctx);
diff --git a/net/core/dev.c b/net/core/dev.c
index 1503883ce15a4..42c05ab53ee86 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4031,11 +4031,15 @@  sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 		   struct net_device *orig_dev, bool *another)
 {
 	struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress);
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_INGRESS;
+	struct bpf_net_context __bpf_net_ctx;
 	int sch_ret;
 
 	if (!entry)
 		return skb;
+
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 	if (*pt_prev) {
 		*ret = deliver_skb(skb, *pt_prev, orig_dev);
 		*pt_prev = NULL;
@@ -4086,13 +4090,17 @@  sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 static __always_inline struct sk_buff *
 sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 {
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	struct bpf_mprog_entry *entry = rcu_dereference_bh(dev->tcx_egress);
 	enum skb_drop_reason drop_reason = SKB_DROP_REASON_TC_EGRESS;
+	struct bpf_net_context __bpf_net_ctx;
 	int sch_ret;
 
 	if (!entry)
 		return skb;
 
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+
 	/* qdisc_skb_cb(skb)->pkt_len & tcx_set_ingress() was
 	 * already set by the caller.
 	 */
@@ -6357,13 +6365,15 @@  static void __napi_busy_loop(unsigned int napi_id,
 		      bool (*loop_end)(void *, unsigned long),
 		      void *loop_end_arg, unsigned flags, u16 budget)
 {
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	unsigned long start_time = loop_end ? busy_loop_current_time() : 0;
 	int (*napi_poll)(struct napi_struct *napi, int budget);
+	struct bpf_net_context __bpf_net_ctx;
 	void *have_poll_lock = NULL;
 	struct napi_struct *napi;
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
-
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 restart:
 	napi_poll = NULL;
 
@@ -6834,6 +6844,7 @@  static int napi_thread_wait(struct napi_struct *napi)
 
 static void napi_threaded_poll_loop(struct napi_struct *napi)
 {
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	struct softnet_data *sd;
 	unsigned long last_qs = jiffies;
 
@@ -6842,6 +6853,8 @@  static void napi_threaded_poll_loop(struct napi_struct *napi)
 		void *have;
 
 		local_bh_disable();
+		bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+
 		sd = this_cpu_ptr(&softnet_data);
 		sd->in_napi_threaded_poll = true;
 
@@ -6857,6 +6870,7 @@  static void napi_threaded_poll_loop(struct napi_struct *napi)
 			net_rps_action_and_irq_enable(sd);
 		}
 		skb_defer_free_flush(sd);
+		bpf_net_ctx_clear(bpf_net_ctx);
 		local_bh_enable();
 
 		if (!repoll)
@@ -6879,13 +6893,16 @@  static int napi_threaded_poll(void *data)
 
 static __latent_entropy void net_rx_action(struct softirq_action *h)
 {
+	struct bpf_net_context *bpf_net_ctx __free(bpf_net_ctx_clear) = NULL;
 	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
 	unsigned long time_limit = jiffies +
 		usecs_to_jiffies(READ_ONCE(net_hotdata.netdev_budget_usecs));
 	int budget = READ_ONCE(net_hotdata.netdev_budget);
+	struct bpf_net_context __bpf_net_ctx;
 	LIST_HEAD(list);
 	LIST_HEAD(repoll);
 
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 start:
 	sd->in_net_rx_action = true;
 	local_irq_disable();
diff --git a/net/core/filter.c b/net/core/filter.c
index e95b235a1e4f4..25ec2a76afb42 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2475,9 +2475,6 @@  static const struct bpf_func_proto bpf_clone_redirect_proto = {
 	.arg3_type      = ARG_ANYTHING,
 };
 
-DEFINE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
-EXPORT_PER_CPU_SYMBOL_GPL(bpf_redirect_info);
-
 static struct net_device *skb_get_peer_dev(struct net_device *dev)
 {
 	const struct net_device_ops *ops = dev->netdev_ops;
@@ -2490,7 +2487,7 @@  static struct net_device *skb_get_peer_dev(struct net_device *dev)
 
 int skb_do_redirect(struct sk_buff *skb)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	struct net *net = dev_net(skb->dev);
 	struct net_device *dev;
 	u32 flags = ri->flags;
@@ -2523,7 +2520,7 @@  int skb_do_redirect(struct sk_buff *skb)
 
 BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	if (unlikely(flags & (~(BPF_F_INGRESS) | BPF_F_REDIRECT_INTERNAL)))
 		return TC_ACT_SHOT;
@@ -2544,7 +2541,7 @@  static const struct bpf_func_proto bpf_redirect_proto = {
 
 BPF_CALL_2(bpf_redirect_peer, u32, ifindex, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	if (unlikely(flags))
 		return TC_ACT_SHOT;
@@ -2566,7 +2563,7 @@  static const struct bpf_func_proto bpf_redirect_peer_proto = {
 BPF_CALL_4(bpf_redirect_neigh, u32, ifindex, struct bpf_redir_neigh *, params,
 	   int, plen, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	if (unlikely((plen && plen < sizeof(*params)) || flags))
 		return TC_ACT_SHOT;
@@ -4294,19 +4291,17 @@  void xdp_do_check_flushed(struct napi_struct *napi)
 
 void bpf_clear_redirect_map(struct bpf_map *map)
 {
-	struct bpf_redirect_info *ri;
-	int cpu;
-
-	for_each_possible_cpu(cpu) {
-		ri = per_cpu_ptr(&bpf_redirect_info, cpu);
-		/* Avoid polluting remote cacheline due to writes if
-		 * not needed. Once we pass this test, we need the
-		 * cmpxchg() to make sure it hasn't been changed in
-		 * the meantime by remote CPU.
-		 */
-		if (unlikely(READ_ONCE(ri->map) == map))
-			cmpxchg(&ri->map, map, NULL);
-	}
+	/* ri->map is assigned in __bpf_xdp_redirect_map() from within a eBPF
+	 * program/ during NAPI callback. It is used during
+	 * xdp_do_generic_redirect_map()/ __xdp_do_redirect_frame() from the
+	 * redirect callback afterwards. ri->map is cleared after usage.
+	 * The path has no explicit RCU read section but the local_bh_disable()
+	 * is also a RCU read section which makes the complete softirq callback
+	 * RCU protected. This in turn makes ri->map RCU protected and it is
+	 * sufficient to wait a grace period to ensure that no "ri->map == map"
+	 * exists. dev_map_free() removes the map from the list and then
+	 * invokes synchronize_rcu() after calling this function.
+	 */
 }
 
 DEFINE_STATIC_KEY_FALSE(bpf_master_redirect_enabled_key);
@@ -4314,8 +4309,8 @@  EXPORT_SYMBOL_GPL(bpf_master_redirect_enabled_key);
 
 u32 xdp_master_redirect(struct xdp_buff *xdp)
 {
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	struct net_device *master, *slave;
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
 	master = netdev_master_upper_dev_get_rcu(xdp->rxq->dev);
 	slave = master->netdev_ops->ndo_xdp_get_xmit_slave(master, xdp);
@@ -4432,7 +4427,7 @@  static __always_inline int __xdp_do_redirect_frame(struct bpf_redirect_info *ri,
 int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 		    struct bpf_prog *xdp_prog)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	enum bpf_map_type map_type = ri->map_type;
 
 	if (map_type == BPF_MAP_TYPE_XSKMAP)
@@ -4446,7 +4441,7 @@  EXPORT_SYMBOL_GPL(xdp_do_redirect);
 int xdp_do_redirect_frame(struct net_device *dev, struct xdp_buff *xdp,
 			  struct xdp_frame *xdpf, struct bpf_prog *xdp_prog)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	enum bpf_map_type map_type = ri->map_type;
 
 	if (map_type == BPF_MAP_TYPE_XSKMAP)
@@ -4463,7 +4458,7 @@  static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       enum bpf_map_type map_type, u32 map_id,
 				       u32 flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	struct bpf_map *map;
 	int err;
 
@@ -4517,7 +4512,7 @@  static int xdp_do_generic_redirect_map(struct net_device *dev,
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 			    struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 	enum bpf_map_type map_type = ri->map_type;
 	void *fwd = ri->tgt_value;
 	u32 map_id = ri->map_id;
@@ -4553,7 +4548,7 @@  int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 
 BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
 {
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();
 
 	if (unlikely(flags))
 		return XDP_ABORTED;
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index a94943681e5aa..afb05f58b64c5 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -38,12 +38,14 @@  static inline struct bpf_lwt *bpf_lwt_lwtunnel(struct lwtunnel_state *lwt)
 static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 		       struct dst_entry *dst, bool can_redirect)
 {
+	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
 	int ret;
 
 	/* Disabling BH is needed to protect per-CPU bpf_redirect_info between
 	 * BPF prog and skb_do_redirect().
 	 */
 	local_bh_disable();
+	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 	bpf_compute_data_pointers(skb);
 	ret = bpf_prog_run_save_cb(lwt->prog, skb);
 
@@ -76,6 +78,7 @@  static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 		break;
 	}
 
+	bpf_net_ctx_clear(bpf_net_ctx);
 	local_bh_enable();
 
 	return ret;