diff mbox series

[bpf-next,v3,1/7] bpf, sockmap: add BPF_F_PERMANENT flag for skmsg redirect

Message ID 20230824143959.1134019-2-liujian56@huawei.com (mailing list archive)
State Superseded
Delegated to: BPF
Headers show
Series add BPF_F_PERMANENT flag for sockmap skmsg redirect | expand

Checks

Context Check Description
bpf/vmtest-bpf-next-VM_Test-11 pending Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-15 pending Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-25 pending Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-8 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-18 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-19 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-21 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-24 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-26 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-27 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-28 success Logs for veristat
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 3084 this patch: 3084
netdev/cc_maintainers success CCed 19 of 19 maintainers
netdev/build_clang success Errors and warnings before: 1532 this patch: 1532
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 3116 this patch: 3116
netdev/checkpatch warning WARNING: line length of 81 exceeds 80 columns WARNING: please, no space before tabs
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-3 fail Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-0 success Logs for ${{ matrix.test }} on ${{ matrix.arch }} with ${{ matrix.toolchain_full }}
bpf/vmtest-bpf-next-VM_Test-6 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-5 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-7 success Logs for veristat

Commit Message

liujian (CE) Aug. 24, 2023, 2:39 p.m. UTC
If the sockmap msg redirection function is used only to forward packets
and no other operation, the execution result of the BPF_SK_MSG_VERDICT
program is the same each time. In this case, the BPF program only needs to
be run once. Add BPF_F_PERMANENT flag to bpf_msg_redirect_map() and
bpf_msg_redirect_hash() to implement this ability.

Then we can enable this function in the bpf program as follows:
bpf_msg_redirect_hash(xx, xx, xx, BPF_F_INGRESS | BPF_F_PERMANENT);

Test results using netperf  TCP_STREAM mode:
for i in 1 64 128 512 1k 2k 32k 64k 100k 500k 1m;then
netperf -T 1,2 -t TCP_STREAM -H 127.0.0.1 -l 20 -- -m $i -s 100m,100m -S 100m,100m
done

before:
3.84 246.52 496.89 1885.03 3415.29 6375.03 40749.09 48764.40 51611.34 55678.26 55992.78
after:
4.43 279.20 555.82 2080.79 3870.70 7105.44 41836.41 49709.75 51861.56 55211.00 54566.85

Signed-off-by: Liu Jian <liujian56@huawei.com>
---
 include/linux/skmsg.h          |  1 +
 include/uapi/linux/bpf.h       | 15 +++++++++++++--
 net/core/skmsg.c               |  5 +++++
 net/core/sock_map.c            |  4 ++--
 net/ipv4/tcp_bpf.c             | 18 +++++++++++++-----
 tools/include/uapi/linux/bpf.h | 15 +++++++++++++--
 6 files changed, 47 insertions(+), 11 deletions(-)

Comments

Jakub Sitnicki Aug. 25, 2023, 1:04 p.m. UTC | #1
On Thu, Aug 24, 2023 at 10:39 PM +08, Liu Jian wrote:
> If the sockmap msg redirection function is used only to forward packets
> and no other operation, the execution result of the BPF_SK_MSG_VERDICT
> program is the same each time. In this case, the BPF program only needs to
> be run once. Add BPF_F_PERMANENT flag to bpf_msg_redirect_map() and
> bpf_msg_redirect_hash() to implement this ability.
>
> Then we can enable this function in the bpf program as follows:
> bpf_msg_redirect_hash(xx, xx, xx, BPF_F_INGRESS | BPF_F_PERMANENT);
>
> Test results using netperf  TCP_STREAM mode:
> for i in 1 64 128 512 1k 2k 32k 64k 100k 500k 1m;then
> netperf -T 1,2 -t TCP_STREAM -H 127.0.0.1 -l 20 -- -m $i -s 100m,100m -S 100m,100m
> done
>
> before:
> 3.84 246.52 496.89 1885.03 3415.29 6375.03 40749.09 48764.40 51611.34 55678.26 55992.78
> after:
> 4.43 279.20 555.82 2080.79 3870.70 7105.44 41836.41 49709.75 51861.56 55211.00 54566.85
>
> Signed-off-by: Liu Jian <liujian56@huawei.com>
> ---
>  include/linux/skmsg.h          |  1 +
>  include/uapi/linux/bpf.h       | 15 +++++++++++++--
>  net/core/skmsg.c               |  5 +++++
>  net/core/sock_map.c            |  4 ++--
>  net/ipv4/tcp_bpf.c             | 18 +++++++++++++-----
>  tools/include/uapi/linux/bpf.h | 15 +++++++++++++--
>  6 files changed, 47 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index 054d7911bfc9..2f4e9811ff85 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -82,6 +82,7 @@ struct sk_psock {
>  	u32				cork_bytes;
>  	u32				eval;
>  	bool				redir_ingress; /* undefined if sk_redir is null */
> +	bool				redir_permanent;
>  	struct sk_msg			*cork;
>  	struct sk_psock_progs		progs;
>  #if IS_ENABLED(CONFIG_BPF_STREAM_PARSER)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 70da85200695..f4de1ba390b4 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3004,7 +3004,12 @@ union bpf_attr {
>   * 		egress interfaces can be used for redirection. The
>   * 		**BPF_F_INGRESS** value in *flags* is used to make the
>   * 		distinction (ingress path is selected if the flag is present,
> - * 		egress path otherwise). This is the only flag supported for now.
> + * 		egress path otherwise). The **BPF_F_PERMANENT** value in
> + *		*flags* is used to indicates whether the eBPF result is
> + *		permanently (please note that, BPF_F_PERMANENT does not work with
> + *		msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or
> + *		msg_cork_bytes() is configured, the BPF_F_PERMANENT function is
> + *		automatically disabled).

There are some grammar mistakes here we need to fix. Hint - I find it
helpful to run the text through an online grammar checker or an AI
chatbot when unsure.

Either way, let's reword so the flags are clearly listed out, since we
now have two of them:

The following *flags* are supported:

**BPF_F_INGRESS**
        Both ingress and egress interfaces can be used for redirection.
        The **BPF_F_INGRESS** value in *flags* is used to make the
        distinction. Ingress path is selected if the flag is present,
        egress path otherwise.
**BPF_F_PERMANENT**
        Indicates that redirect verdict and the target socket should be
        remembered. The verdict program will not be run for subsequent
        packets, unless an error occurs when forwarding packets.

        **BPF_F_PERMANENT** cannot be use together with
        **bpf_msg_apply_bytes**\ () and **bpf_msg_cork_bytes**\ (). If
        either has been called, the **BPF_F_PERMANENT** flag is ignored.


Please check the formatting is correct with:

./scripts/bpf_doc.py --filename include/uapi/linux/bpf.h \
  | rst2man /dev/stdin | man /dev/stdin

That said, I'm not sure I like these semantics - flag being ignored
under some circumstances. It leads to a silent failure.

If I've asked for a permanent redirect, but it is cannot work in my BPF
program, I'd rather get an error from bpf_msg_redirect_map/hash(), than
be surprised that it is getting executed for every packet and have to
troubleshoot why.

>   * 	Return
>   * 		**SK_PASS** on success, or **SK_DROP** on error.
>   *
> @@ -3276,7 +3281,12 @@ union bpf_attr {
>   *		egress interfaces can be used for redirection. The
>   *		**BPF_F_INGRESS** value in *flags* is used to make the
>   *		distinction (ingress path is selected if the flag is present,
> - *		egress path otherwise). This is the only flag supported for now.
> + *		egress path otherwise). The **BPF_F_PERMANENT** value in
> + *		*flags* is used to indicates whether the eBPF result is
> + *		permanently (please note that, BPF_F_PERMANENT does not work with
> + *		msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or
> + *		msg_cork_bytes() is configured, the BPF_F_PERMANENT function is
> + *		automatically disabled).
>   *	Return
>   *		**SK_PASS** on success, or **SK_DROP** on error.
>   *
> @@ -5872,6 +5882,7 @@ enum {
>  /* BPF_FUNC_clone_redirect and BPF_FUNC_redirect flags. */
>  enum {
>  	BPF_F_INGRESS			= (1ULL << 0),
> +	BPF_F_PERMANENT			= (1ULL << 1),
>  };
>  
>  /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index a29508e1ff35..df1443cf5fbd 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -885,6 +885,11 @@ int sk_psock_msg_verdict(struct sock *sk, struct sk_psock *psock,
>  			goto out;
>  		}
>  		psock->redir_ingress = sk_msg_to_ingress(msg);
> +		if (!msg->apply_bytes && !msg->cork_bytes)
> +			psock->redir_permanent =
> +				msg->flags & BPF_F_PERMANENT;
> +		else
> +			psock->redir_permanent = false;

Above can be rewritten as:

		psock->redir_permanent = !msg->apply_bytes &&
					 !msg->cork_bytes &&
					 (msg->flags & BPF_F_PERMANENT);

But as I wrote earlier, I don't think it's a good idea to ignore the
flag. We can detect this conflict at the time the bpf_msg_sk_redirect_*
helper is called and return an error.

Naturally that means that that bpf_msg_{cork,apply}_bytes helpers need
to be adjusted to return an error if BPF_F_PERMANENT has been set.

>  		psock->sk_redir = msg->sk_redir;
>  		sock_hold(psock->sk_redir);
>  	}
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index 08ab108206bf..35a361614f5e 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -662,7 +662,7 @@ BPF_CALL_4(bpf_msg_redirect_map, struct sk_msg *, msg,
>  {
>  	struct sock *sk;
>  
> -	if (unlikely(flags & ~(BPF_F_INGRESS)))
> +	if (unlikely(flags & ~(BPF_F_INGRESS | BPF_F_PERMANENT)))
>  		return SK_DROP;
>  
>  	sk = __sock_map_lookup_elem(map, key);
> @@ -1261,7 +1261,7 @@ BPF_CALL_4(bpf_msg_redirect_hash, struct sk_msg *, msg,
>  {
>  	struct sock *sk;
>  
> -	if (unlikely(flags & ~(BPF_F_INGRESS)))
> +	if (unlikely(flags & ~(BPF_F_INGRESS | BPF_F_PERMANENT)))
>  		return SK_DROP;
>  
>  	sk = __sock_hash_lookup_elem(map, key);
> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> index 81f0dff69e0b..b53e356562a6 100644
> --- a/net/ipv4/tcp_bpf.c
> +++ b/net/ipv4/tcp_bpf.c
> @@ -419,8 +419,10 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
>  		if (!psock->apply_bytes) {
>  			/* Clean up before releasing the sock lock. */
>  			eval = psock->eval;
> -			psock->eval = __SK_NONE;
> -			psock->sk_redir = NULL;
> +			if (!psock->redir_permanent) {
> +				psock->eval = __SK_NONE;
> +				psock->sk_redir = NULL;
> +			}
>  		}
>  		if (psock->cork) {
>  			cork = true;
> @@ -433,9 +435,15 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
>  		ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
>  					    msg, tosend, flags);
>  		sent = origsize - msg->sg.size;
> +		/* disable the ability when something wrong */
> +		if (unlikely(ret < 0))
> +			psock->redir_permanent = 0;
>  
> -		if (eval == __SK_REDIRECT)
> +		if (!psock->redir_permanent && eval == __SK_REDIRECT) {

I believe eval == __SK_REDIRECT is always true here, and the eval local
variable is redundant. We will be in this switch branch only if
psock->eval == __SK_REDIRECT. It's something we missed during the review
of cd9733f5d75c ("tcp_bpf: Fix one concurrency problem in the
tcp_bpf_send_verdict function").

>  			sock_put(sk_redir);
> +			psock->sk_redir = NULL;
> +			psock->eval = __SK_NONE;
> +		}
>  
>  		lock_sock(sk);
>  		if (unlikely(ret < 0)) {
> @@ -460,8 +468,8 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
>  	}
>  
>  	if (likely(!ret)) {
> -		if (!psock->apply_bytes) {
> -			psock->eval =  __SK_NONE;
> +		if (!psock->apply_bytes && !psock->redir_permanent) {
> +			psock->eval = __SK_NONE;
>  			if (psock->sk_redir) {
>  				sock_put(psock->sk_redir);
>  				psock->sk_redir = NULL;
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 70da85200695..f4de1ba390b4 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -3004,7 +3004,12 @@ union bpf_attr {
>   * 		egress interfaces can be used for redirection. The
>   * 		**BPF_F_INGRESS** value in *flags* is used to make the
>   * 		distinction (ingress path is selected if the flag is present,
> - * 		egress path otherwise). This is the only flag supported for now.
> + * 		egress path otherwise). The **BPF_F_PERMANENT** value in
> + *		*flags* is used to indicates whether the eBPF result is
> + *		permanently (please note that, BPF_F_PERMANENT does not work with
> + *		msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or
> + *		msg_cork_bytes() is configured, the BPF_F_PERMANENT function is
> + *		automatically disabled).
>   * 	Return
>   * 		**SK_PASS** on success, or **SK_DROP** on error.
>   *
> @@ -3276,7 +3281,12 @@ union bpf_attr {
>   *		egress interfaces can be used for redirection. The
>   *		**BPF_F_INGRESS** value in *flags* is used to make the
>   *		distinction (ingress path is selected if the flag is present,
> - *		egress path otherwise). This is the only flag supported for now.
> + *		egress path otherwise). The **BPF_F_PERMANENT** value in
> + *		*flags* is used to indicates whether the eBPF result is
> + *		permanently (please note that, BPF_F_PERMANENT does not work with
> + *		msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or
> + *		msg_cork_bytes() is configured, the BPF_F_PERMANENT function is
> + *		automatically disabled).
>   *	Return
>   *		**SK_PASS** on success, or **SK_DROP** on error.
>   *
> @@ -5872,6 +5882,7 @@ enum {
>  /* BPF_FUNC_clone_redirect and BPF_FUNC_redirect flags. */
>  enum {
>  	BPF_F_INGRESS			= (1ULL << 0),
> +	BPF_F_PERMANENT			= (1ULL << 1),
>  };
>  
>  /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
John Fastabend Aug. 26, 2023, 1:32 a.m. UTC | #2
Jakub Sitnicki wrote:
> On Thu, Aug 24, 2023 at 10:39 PM +08, Liu Jian wrote:
> > If the sockmap msg redirection function is used only to forward packets
> > and no other operation, the execution result of the BPF_SK_MSG_VERDICT
> > program is the same each time. In this case, the BPF program only needs to
> > be run once. Add BPF_F_PERMANENT flag to bpf_msg_redirect_map() and
> > bpf_msg_redirect_hash() to implement this ability.
> >
> > Then we can enable this function in the bpf program as follows:
> > bpf_msg_redirect_hash(xx, xx, xx, BPF_F_INGRESS | BPF_F_PERMANENT);
> >
> > Test results using netperf  TCP_STREAM mode:
> > for i in 1 64 128 512 1k 2k 32k 64k 100k 500k 1m;then
> > netperf -T 1,2 -t TCP_STREAM -H 127.0.0.1 -l 20 -- -m $i -s 100m,100m -S 100m,100m
> > done
> >
> > before:
> > 3.84 246.52 496.89 1885.03 3415.29 6375.03 40749.09 48764.40 51611.34 55678.26 55992.78
> > after:
> > 4.43 279.20 555.82 2080.79 3870.70 7105.44 41836.41 49709.75 51861.56 55211.00 54566.85
> >
> > Signed-off-by: Liu Jian <liujian56@huawei.com>

[...]

> >  /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
> > diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> > index a29508e1ff35..df1443cf5fbd 100644
> > --- a/net/core/skmsg.c
> > +++ b/net/core/skmsg.c
> > @@ -885,6 +885,11 @@ int sk_psock_msg_verdict(struct sock *sk, struct sk_psock *psock,
> >  			goto out;
> >  		}
> >  		psock->redir_ingress = sk_msg_to_ingress(msg);
> > +		if (!msg->apply_bytes && !msg->cork_bytes)
> > +			psock->redir_permanent =
> > +				msg->flags & BPF_F_PERMANENT;
> > +		else
> > +			psock->redir_permanent = false;
> 
> Above can be rewritten as:
> 
> 		psock->redir_permanent = !msg->apply_bytes &&
> 					 !msg->cork_bytes &&
> 					 (msg->flags & BPF_F_PERMANENT);
> 
> But as I wrote earlier, I don't think it's a good idea to ignore the
> flag. We can detect this conflict at the time the bpf_msg_sk_redirect_*
> helper is called and return an error.
> 
> Naturally that means that that bpf_msg_{cork,apply}_bytes helpers need
> to be adjusted to return an error if BPF_F_PERMANENT has been set.

So far we've not really done much to protect a user from doing
rather silly things. The following will all do something without
errors,

  bpf_msg_apply_bytes()
  bpf_msg_apply_bytes() <- reset apply bytes

  bpf_msg_cork_bytes()
  bpf_msg_cork_bytes() <- resets cork byte

also,

  bpf_msg_redirect(..., BPF_F_INGRESS);
  bpf_msg_redirect(..., 0); <- resets sk_redir and flags

maybe there is some valid reason to even do above if further parsing
identifies some reason to redirect to a alert socket or something.

My original thinking was in the interest of not having a bunch of
extra checks for performance reasons we shouldn't add guard rails
unless something really unexpected might happen like a kernel
panic or what not.

This does feel a bit different though because before we
didn't have calls that could impact other calls. My best idea
is to just create a precedence and follow it. I would propose,

'If BPF_F_PERMANENT is set apply_bytes and cork_bytes are
 ignored.'

The other direction (what is above?) has a bit of an inconsistency
where these two flows are different?

  bpf_apply_bytes()
  bpf_msg_redirect(..., BPF_F_PERMANENT)

and

  bpf_msg_redirect(..., BPF_F_PERMANENT)
  bpf_apply_bytes()

It would be best if order of operations doesn't change the
outcome because that starts to get really hard to reason about.

This avoids having to add checks all over the place and then
if users want we could give some mechanisms to read apply
and cork bytes so people could write macros over those if
they really want the hard error.

WDYT?

[...]

Thanks!
liujian (CE) Aug. 26, 2023, 11:53 a.m. UTC | #3
On 2023/8/25 21:04, Jakub Sitnicki wrote:
> On Thu, Aug 24, 2023 at 10:39 PM +08, Liu Jian wrote:
>> If the sockmap msg redirection function is used only to forward packets
>> and no other operation, the execution result of the BPF_SK_MSG_VERDICT
>> program is the same each time. In this case, the BPF program only needs to
>> be run once. Add BPF_F_PERMANENT flag to bpf_msg_redirect_map() and
>> bpf_msg_redirect_hash() to implement this ability.
>>
>> Then we can enable this function in the bpf program as follows:
>> bpf_msg_redirect_hash(xx, xx, xx, BPF_F_INGRESS | BPF_F_PERMANENT);
>>
>> Test results using netperf  TCP_STREAM mode:
>> for i in 1 64 128 512 1k 2k 32k 64k 100k 500k 1m;then
>> netperf -T 1,2 -t TCP_STREAM -H 127.0.0.1 -l 20 -- -m $i -s 100m,100m -S 100m,100m
>> done
>>
>> before:
>> 3.84 246.52 496.89 1885.03 3415.29 6375.03 40749.09 48764.40 51611.34 55678.26 55992.78
>> after:
>> 4.43 279.20 555.82 2080.79 3870.70 7105.44 41836.41 49709.75 51861.56 55211.00 54566.85
>>
>> Signed-off-by: Liu Jian <liujian56@huawei.com>
>> ---
>>   include/linux/skmsg.h          |  1 +
>>   include/uapi/linux/bpf.h       | 15 +++++++++++++--
>>   net/core/skmsg.c               |  5 +++++
>>   net/core/sock_map.c            |  4 ++--
>>   net/ipv4/tcp_bpf.c             | 18 +++++++++++++-----
>>   tools/include/uapi/linux/bpf.h | 15 +++++++++++++--
>>   6 files changed, 47 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
>> index 054d7911bfc9..2f4e9811ff85 100644
>> --- a/include/linux/skmsg.h
>> +++ b/include/linux/skmsg.h
>> @@ -82,6 +82,7 @@ struct sk_psock {
>>   	u32				cork_bytes;
>>   	u32				eval;
>>   	bool				redir_ingress; /* undefined if sk_redir is null */
>> +	bool				redir_permanent;
>>   	struct sk_msg			*cork;
>>   	struct sk_psock_progs		progs;
>>   #if IS_ENABLED(CONFIG_BPF_STREAM_PARSER)
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 70da85200695..f4de1ba390b4 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -3004,7 +3004,12 @@ union bpf_attr {
>>    * 		egress interfaces can be used for redirection. The
>>    * 		**BPF_F_INGRESS** value in *flags* is used to make the
>>    * 		distinction (ingress path is selected if the flag is present,
>> - * 		egress path otherwise). This is the only flag supported for now.
>> + * 		egress path otherwise). The **BPF_F_PERMANENT** value in
>> + *		*flags* is used to indicates whether the eBPF result is
>> + *		permanently (please note that, BPF_F_PERMANENT does not work with
>> + *		msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or
>> + *		msg_cork_bytes() is configured, the BPF_F_PERMANENT function is
>> + *		automatically disabled).
> 
> There are some grammar mistakes here we need to fix. Hint - I find it
> helpful to run the text through an online grammar checker or an AI
> chatbot when unsure.
> 
> Either way, let's reword so the flags are clearly listed out, since we
> now have two of them:
> 
> The following *flags* are supported:
> 
> **BPF_F_INGRESS**
>          Both ingress and egress interfaces can be used for redirection.
>          The **BPF_F_INGRESS** value in *flags* is used to make the
>          distinction. Ingress path is selected if the flag is present,
>          egress path otherwise.
> **BPF_F_PERMANENT**
>          Indicates that redirect verdict and the target socket should be
>          remembered. The verdict program will not be run for subsequent
>          packets, unless an error occurs when forwarding packets.
> 
>          **BPF_F_PERMANENT** cannot be use together with
>          **bpf_msg_apply_bytes**\ () and **bpf_msg_cork_bytes**\ (). If
>          either has been called, the **BPF_F_PERMANENT** flag is ignored.
> 
> 
Thanks a lot.
> Please check the formatting is correct with:
> 
> ./scripts/bpf_doc.py --filename include/uapi/linux/bpf.h \
>    | rst2man /dev/stdin | man /dev/stdin
> 
> That said, I'm not sure I like these semantics - flag being ignored
> under some circumstances. It leads to a silent failure.
> 
> If I've asked for a permanent redirect, but it is cannot work in my BPF
> program, I'd rather get an error from bpf_msg_redirect_map/hash(), than
> be surprised that it is getting executed for every packet and have to
> troubleshoot why.
> 
>>    * 	Return
>>    * 		**SK_PASS** on success, or **SK_DROP** on error.
>>    *
>> @@ -3276,7 +3281,12 @@ union bpf_attr {
>>    *		egress interfaces can be used for redirection. The
>>    *		**BPF_F_INGRESS** value in *flags* is used to make the
>>    *		distinction (ingress path is selected if the flag is present,
>> - *		egress path otherwise). This is the only flag supported for now.
>> + *		egress path otherwise). The **BPF_F_PERMANENT** value in
>> + *		*flags* is used to indicates whether the eBPF result is
>> + *		permanently (please note that, BPF_F_PERMANENT does not work with
>> + *		msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or
>> + *		msg_cork_bytes() is configured, the BPF_F_PERMANENT function is
>> + *		automatically disabled).
>>    *	Return
>>    *		**SK_PASS** on success, or **SK_DROP** on error.
>>    *
>> @@ -5872,6 +5882,7 @@ enum {
>>   /* BPF_FUNC_clone_redirect and BPF_FUNC_redirect flags. */
>>   enum {
>>   	BPF_F_INGRESS			= (1ULL << 0),
>> +	BPF_F_PERMANENT			= (1ULL << 1),
>>   };
>>   
>>   /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
>> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
>> index a29508e1ff35..df1443cf5fbd 100644
>> --- a/net/core/skmsg.c
>> +++ b/net/core/skmsg.c
>> @@ -885,6 +885,11 @@ int sk_psock_msg_verdict(struct sock *sk, struct sk_psock *psock,
>>   			goto out;
>>   		}
>>   		psock->redir_ingress = sk_msg_to_ingress(msg);
>> +		if (!msg->apply_bytes && !msg->cork_bytes)
>> +			psock->redir_permanent =
>> +				msg->flags & BPF_F_PERMANENT;
>> +		else
>> +			psock->redir_permanent = false;
> 
> Above can be rewritten as:
> 
> 		psock->redir_permanent = !msg->apply_bytes &&
> 					 !msg->cork_bytes &&
> 					 (msg->flags & BPF_F_PERMANENT);
> 
Will change it in v4. Thanks.

> But as I wrote earlier, I don't think it's a good idea to ignore the
> flag. We can detect this conflict at the time the bpf_msg_sk_redirect_*
> helper is called and return an error.
>
> Naturally that means that that bpf_msg_{cork,apply}_bytes helpers need
> to be adjusted to return an error if BPF_F_PERMANENT has been set.
OK. If so, is the configuration that is set first and has a higher priority?

> 
>>   		psock->sk_redir = msg->sk_redir;
>>   		sock_hold(psock->sk_redir);
>>   	}
>> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
>> index 08ab108206bf..35a361614f5e 100644
>> --- a/net/core/sock_map.c
>> +++ b/net/core/sock_map.c
>> @@ -662,7 +662,7 @@ BPF_CALL_4(bpf_msg_redirect_map, struct sk_msg *, msg,
>>   {
>>   	struct sock *sk;
>>   
>> -	if (unlikely(flags & ~(BPF_F_INGRESS)))
>> +	if (unlikely(flags & ~(BPF_F_INGRESS | BPF_F_PERMANENT)))
>>   		return SK_DROP;
>>   
>>   	sk = __sock_map_lookup_elem(map, key);
>> @@ -1261,7 +1261,7 @@ BPF_CALL_4(bpf_msg_redirect_hash, struct sk_msg *, msg,
>>   {
>>   	struct sock *sk;
>>   
>> -	if (unlikely(flags & ~(BPF_F_INGRESS)))
>> +	if (unlikely(flags & ~(BPF_F_INGRESS | BPF_F_PERMANENT)))
>>   		return SK_DROP;
>>   
>>   	sk = __sock_hash_lookup_elem(map, key);
>> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
>> index 81f0dff69e0b..b53e356562a6 100644
>> --- a/net/ipv4/tcp_bpf.c
>> +++ b/net/ipv4/tcp_bpf.c
>> @@ -419,8 +419,10 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
>>   		if (!psock->apply_bytes) {
>>   			/* Clean up before releasing the sock lock. */
>>   			eval = psock->eval;
>> -			psock->eval = __SK_NONE;
>> -			psock->sk_redir = NULL;
>> +			if (!psock->redir_permanent) {
>> +				psock->eval = __SK_NONE;
>> +				psock->sk_redir = NULL;
>> +			}
>>   		}
>>   		if (psock->cork) {
>>   			cork = true;
>> @@ -433,9 +435,15 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
>>   		ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
>>   					    msg, tosend, flags);
>>   		sent = origsize - msg->sg.size;
>> +		/* disable the ability when something wrong */
>> +		if (unlikely(ret < 0))
>> +			psock->redir_permanent = 0;
>>   
>> -		if (eval == __SK_REDIRECT)
>> +		if (!psock->redir_permanent && eval == __SK_REDIRECT) {
> 
> I believe eval == __SK_REDIRECT is always true here, and the eval local
> variable is redundant. We will be in this switch branch only if
> psock->eval == __SK_REDIRECT. It's something we missed during the review
> of cd9733f5d75c ("tcp_bpf: Fix one concurrency problem in the
> tcp_bpf_send_verdict function").
> 
When apply_bytes is set, eval == __SK_REDIRECT is not true.
>>   			sock_put(sk_redir);
>> +			psock->sk_redir = NULL;
>> +			psock->eval = __SK_NONE;
>> +		}
>>   
>>   		lock_sock(sk);
>>   		if (unlikely(ret < 0)) {
>> @@ -460,8 +468,8 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
>>   	}
>>   
>>   	if (likely(!ret)) {
>> -		if (!psock->apply_bytes) {
>> -			psock->eval =  __SK_NONE;
>> +		if (!psock->apply_bytes && !psock->redir_permanent) {
>> +			psock->eval = __SK_NONE;
>>   			if (psock->sk_redir) {
>>   				sock_put(psock->sk_redir);
>>   				psock->sk_redir = NULL;
>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>> index 70da85200695..f4de1ba390b4 100644
>> --- a/tools/include/uapi/linux/bpf.h
>> +++ b/tools/include/uapi/linux/bpf.h
>> @@ -3004,7 +3004,12 @@ union bpf_attr {
>>    * 		egress interfaces can be used for redirection. The
>>    * 		**BPF_F_INGRESS** value in *flags* is used to make the
>>    * 		distinction (ingress path is selected if the flag is present,
>> - * 		egress path otherwise). This is the only flag supported for now.
>> + * 		egress path otherwise). The **BPF_F_PERMANENT** value in
>> + *		*flags* is used to indicates whether the eBPF result is
>> + *		permanently (please note that, BPF_F_PERMANENT does not work with
>> + *		msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or
>> + *		msg_cork_bytes() is configured, the BPF_F_PERMANENT function is
>> + *		automatically disabled).
>>    * 	Return
>>    * 		**SK_PASS** on success, or **SK_DROP** on error.
>>    *
>> @@ -3276,7 +3281,12 @@ union bpf_attr {
>>    *		egress interfaces can be used for redirection. The
>>    *		**BPF_F_INGRESS** value in *flags* is used to make the
>>    *		distinction (ingress path is selected if the flag is present,
>> - *		egress path otherwise). This is the only flag supported for now.
>> + *		egress path otherwise). The **BPF_F_PERMANENT** value in
>> + *		*flags* is used to indicates whether the eBPF result is
>> + *		permanently (please note that, BPF_F_PERMANENT does not work with
>> + *		msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or
>> + *		msg_cork_bytes() is configured, the BPF_F_PERMANENT function is
>> + *		automatically disabled).
>>    *	Return
>>    *		**SK_PASS** on success, or **SK_DROP** on error.
>>    *
>> @@ -5872,6 +5882,7 @@ enum {
>>   /* BPF_FUNC_clone_redirect and BPF_FUNC_redirect flags. */
>>   enum {
>>   	BPF_F_INGRESS			= (1ULL << 0),
>> +	BPF_F_PERMANENT			= (1ULL << 1),
>>   };
>>   
>>   /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
>
liujian (CE) Aug. 26, 2023, 11:54 a.m. UTC | #4
On 2023/8/26 9:32, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> On Thu, Aug 24, 2023 at 10:39 PM +08, Liu Jian wrote:
>>> If the sockmap msg redirection function is used only to forward packets
>>> and no other operation, the execution result of the BPF_SK_MSG_VERDICT
>>> program is the same each time. In this case, the BPF program only needs to
>>> be run once. Add BPF_F_PERMANENT flag to bpf_msg_redirect_map() and
>>> bpf_msg_redirect_hash() to implement this ability.
>>>
>>> Then we can enable this function in the bpf program as follows:
>>> bpf_msg_redirect_hash(xx, xx, xx, BPF_F_INGRESS | BPF_F_PERMANENT);
>>>
>>> Test results using netperf  TCP_STREAM mode:
>>> for i in 1 64 128 512 1k 2k 32k 64k 100k 500k 1m;then
>>> netperf -T 1,2 -t TCP_STREAM -H 127.0.0.1 -l 20 -- -m $i -s 100m,100m -S 100m,100m
>>> done
>>>
>>> before:
>>> 3.84 246.52 496.89 1885.03 3415.29 6375.03 40749.09 48764.40 51611.34 55678.26 55992.78
>>> after:
>>> 4.43 279.20 555.82 2080.79 3870.70 7105.44 41836.41 49709.75 51861.56 55211.00 54566.85
>>>
>>> Signed-off-by: Liu Jian <liujian56@huawei.com>
> 
> [...]
> 
>>>   /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
>>> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
>>> index a29508e1ff35..df1443cf5fbd 100644
>>> --- a/net/core/skmsg.c
>>> +++ b/net/core/skmsg.c
>>> @@ -885,6 +885,11 @@ int sk_psock_msg_verdict(struct sock *sk, struct sk_psock *psock,
>>>   			goto out;
>>>   		}
>>>   		psock->redir_ingress = sk_msg_to_ingress(msg);
>>> +		if (!msg->apply_bytes && !msg->cork_bytes)
>>> +			psock->redir_permanent =
>>> +				msg->flags & BPF_F_PERMANENT;
>>> +		else
>>> +			psock->redir_permanent = false;
>>
>> Above can be rewritten as:
>>
>> 		psock->redir_permanent = !msg->apply_bytes &&
>> 					 !msg->cork_bytes &&
>> 					 (msg->flags & BPF_F_PERMANENT);
>>
>> But as I wrote earlier, I don't think it's a good idea to ignore the
>> flag. We can detect this conflict at the time the bpf_msg_sk_redirect_*
>> helper is called and return an error.
>>
>> Naturally that means that that bpf_msg_{cork,apply}_bytes helpers need
>> to be adjusted to return an error if BPF_F_PERMANENT has been set.
> 
> So far we've not really done much to protect a user from doing
> rather silly things. The following will all do something without
> errors,
> 
>    bpf_msg_apply_bytes()
>    bpf_msg_apply_bytes() <- reset apply bytes
> 
>    bpf_msg_cork_bytes()
>    bpf_msg_cork_bytes() <- resets cork byte
> 
> also,
> 
>    bpf_msg_redirect(..., BPF_F_INGRESS);
>    bpf_msg_redirect(..., 0); <- resets sk_redir and flags
> 
> maybe there is some valid reason to even do above if further parsing
> identifies some reason to redirect to a alert socket or something.
> 
> My original thinking was in the interest of not having a bunch of
> extra checks for performance reasons we shouldn't add guard rails
> unless something really unexpected might happen like a kernel
> panic or what not.
> 
> This does feel a bit different though because before we
> didn't have calls that could impact other calls. My best idea
> is to just create a precedence and follow it. I would propose,
> 
> 'If BPF_F_PERMANENT is set apply_bytes and cork_bytes are
>   ignored.'
> 
I think it's better.
Both low-priority or high-priority are ok for me. But I think it's 
better that BPF_F_PERMANENT has a low priority. Because BPF_F_PERMANEN 
is only for performance, and apply_bytes or cork_bytes may be used to a 
user logic function.

> The other direction (what is above?) has a bit of an inconsistency
> where these two flows are different?
> 
>    bpf_apply_bytes()
>    bpf_msg_redirect(..., BPF_F_PERMANENT)
> 
> and
> 
>    bpf_msg_redirect(..., BPF_F_PERMANENT)
>    bpf_apply_bytes()
> 
> It would be best if order of operations doesn't change the
> outcome because that starts to get really hard to reason about.
> 
> This avoids having to add checks all over the place and then
> if users want we could give some mechanisms to read apply
> and cork bytes so people could write macros over those if
> they really want the hard error.
> 
> WDYT?
> 
> [...]
> 
> Thanks!
Jakub Sitnicki Aug. 28, 2023, 8:13 a.m. UTC | #5
On Fri, Aug 25, 2023 at 06:32 PM -07, John Fastabend wrote:
> Jakub Sitnicki wrote:

[...]

>> But as I wrote earlier, I don't think it's a good idea to ignore the
>> flag. We can detect this conflict at the time the bpf_msg_sk_redirect_*
>> helper is called and return an error.
>> 
>> Naturally that means that that bpf_msg_{cork,apply}_bytes helpers need
>> to be adjusted to return an error if BPF_F_PERMANENT has been set.
>
> So far we've not really done much to protect a user from doing
> rather silly things. The following will all do something without
> errors,
>
>   bpf_msg_apply_bytes()
>   bpf_msg_apply_bytes() <- reset apply bytes
>
>   bpf_msg_cork_bytes()
>   bpf_msg_cork_bytes() <- resets cork byte
>
> also,
>
>   bpf_msg_redirect(..., BPF_F_INGRESS);
>   bpf_msg_redirect(..., 0); <- resets sk_redir and flags
>
> maybe there is some valid reason to even do above if further parsing
> identifies some reason to redirect to a alert socket or something.
>
> My original thinking was in the interest of not having a bunch of
> extra checks for performance reasons we shouldn't add guard rails
> unless something really unexpected might happen like a kernel
> panic or what not.
>
> This does feel a bit different though because before we
> didn't have calls that could impact other calls. My best idea
> is to just create a precedence and follow it. I would propose,
>
> 'If BPF_F_PERMANENT is set apply_bytes and cork_bytes are
>  ignored.'
>
> The other direction (what is above?) has a bit of an inconsistency
> where these two flows are different?
>
>   bpf_apply_bytes()
>   bpf_msg_redirect(..., BPF_F_PERMANENT)
>
> and
>
>   bpf_msg_redirect(..., BPF_F_PERMANENT)
>   bpf_apply_bytes()
>
> It would be best if order of operations doesn't change the
> outcome because that starts to get really hard to reason about.
>
> This avoids having to add checks all over the place and then
> if users want we could give some mechanisms to read apply
> and cork bytes so people could write macros over those if
> they really want the hard error.
>
> WDYT?

These semantics sound sane to me. Easy to explain:

BPF_F_PERMANENT takes precedence over apply/cork_bytes.

Good point about order of operations.
diff mbox series

Patch

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 054d7911bfc9..2f4e9811ff85 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -82,6 +82,7 @@  struct sk_psock {
 	u32				cork_bytes;
 	u32				eval;
 	bool				redir_ingress; /* undefined if sk_redir is null */
+	bool				redir_permanent;
 	struct sk_msg			*cork;
 	struct sk_psock_progs		progs;
 #if IS_ENABLED(CONFIG_BPF_STREAM_PARSER)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 70da85200695..f4de1ba390b4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3004,7 +3004,12 @@  union bpf_attr {
  * 		egress interfaces can be used for redirection. The
  * 		**BPF_F_INGRESS** value in *flags* is used to make the
  * 		distinction (ingress path is selected if the flag is present,
- * 		egress path otherwise). This is the only flag supported for now.
+ * 		egress path otherwise). The **BPF_F_PERMANENT** value in
+ *		*flags* is used to indicates whether the eBPF result is
+ *		permanently (please note that, BPF_F_PERMANENT does not work with
+ *		msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or
+ *		msg_cork_bytes() is configured, the BPF_F_PERMANENT function is
+ *		automatically disabled).
  * 	Return
  * 		**SK_PASS** on success, or **SK_DROP** on error.
  *
@@ -3276,7 +3281,12 @@  union bpf_attr {
  *		egress interfaces can be used for redirection. The
  *		**BPF_F_INGRESS** value in *flags* is used to make the
  *		distinction (ingress path is selected if the flag is present,
- *		egress path otherwise). This is the only flag supported for now.
+ *		egress path otherwise). The **BPF_F_PERMANENT** value in
+ *		*flags* is used to indicates whether the eBPF result is
+ *		permanently (please note that, BPF_F_PERMANENT does not work with
+ *		msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or
+ *		msg_cork_bytes() is configured, the BPF_F_PERMANENT function is
+ *		automatically disabled).
  *	Return
  *		**SK_PASS** on success, or **SK_DROP** on error.
  *
@@ -5872,6 +5882,7 @@  enum {
 /* BPF_FUNC_clone_redirect and BPF_FUNC_redirect flags. */
 enum {
 	BPF_F_INGRESS			= (1ULL << 0),
+	BPF_F_PERMANENT			= (1ULL << 1),
 };
 
 /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index a29508e1ff35..df1443cf5fbd 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -885,6 +885,11 @@  int sk_psock_msg_verdict(struct sock *sk, struct sk_psock *psock,
 			goto out;
 		}
 		psock->redir_ingress = sk_msg_to_ingress(msg);
+		if (!msg->apply_bytes && !msg->cork_bytes)
+			psock->redir_permanent =
+				msg->flags & BPF_F_PERMANENT;
+		else
+			psock->redir_permanent = false;
 		psock->sk_redir = msg->sk_redir;
 		sock_hold(psock->sk_redir);
 	}
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 08ab108206bf..35a361614f5e 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -662,7 +662,7 @@  BPF_CALL_4(bpf_msg_redirect_map, struct sk_msg *, msg,
 {
 	struct sock *sk;
 
-	if (unlikely(flags & ~(BPF_F_INGRESS)))
+	if (unlikely(flags & ~(BPF_F_INGRESS | BPF_F_PERMANENT)))
 		return SK_DROP;
 
 	sk = __sock_map_lookup_elem(map, key);
@@ -1261,7 +1261,7 @@  BPF_CALL_4(bpf_msg_redirect_hash, struct sk_msg *, msg,
 {
 	struct sock *sk;
 
-	if (unlikely(flags & ~(BPF_F_INGRESS)))
+	if (unlikely(flags & ~(BPF_F_INGRESS | BPF_F_PERMANENT)))
 		return SK_DROP;
 
 	sk = __sock_hash_lookup_elem(map, key);
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 81f0dff69e0b..b53e356562a6 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -419,8 +419,10 @@  static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
 		if (!psock->apply_bytes) {
 			/* Clean up before releasing the sock lock. */
 			eval = psock->eval;
-			psock->eval = __SK_NONE;
-			psock->sk_redir = NULL;
+			if (!psock->redir_permanent) {
+				psock->eval = __SK_NONE;
+				psock->sk_redir = NULL;
+			}
 		}
 		if (psock->cork) {
 			cork = true;
@@ -433,9 +435,15 @@  static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
 		ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
 					    msg, tosend, flags);
 		sent = origsize - msg->sg.size;
+		/* disable the ability when something wrong */
+		if (unlikely(ret < 0))
+			psock->redir_permanent = 0;
 
-		if (eval == __SK_REDIRECT)
+		if (!psock->redir_permanent && eval == __SK_REDIRECT) {
 			sock_put(sk_redir);
+			psock->sk_redir = NULL;
+			psock->eval = __SK_NONE;
+		}
 
 		lock_sock(sk);
 		if (unlikely(ret < 0)) {
@@ -460,8 +468,8 @@  static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
 	}
 
 	if (likely(!ret)) {
-		if (!psock->apply_bytes) {
-			psock->eval =  __SK_NONE;
+		if (!psock->apply_bytes && !psock->redir_permanent) {
+			psock->eval = __SK_NONE;
 			if (psock->sk_redir) {
 				sock_put(psock->sk_redir);
 				psock->sk_redir = NULL;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 70da85200695..f4de1ba390b4 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3004,7 +3004,12 @@  union bpf_attr {
  * 		egress interfaces can be used for redirection. The
  * 		**BPF_F_INGRESS** value in *flags* is used to make the
  * 		distinction (ingress path is selected if the flag is present,
- * 		egress path otherwise). This is the only flag supported for now.
+ * 		egress path otherwise). The **BPF_F_PERMANENT** value in
+ *		*flags* is used to indicates whether the eBPF result is
+ *		permanently (please note that, BPF_F_PERMANENT does not work with
+ *		msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or
+ *		msg_cork_bytes() is configured, the BPF_F_PERMANENT function is
+ *		automatically disabled).
  * 	Return
  * 		**SK_PASS** on success, or **SK_DROP** on error.
  *
@@ -3276,7 +3281,12 @@  union bpf_attr {
  *		egress interfaces can be used for redirection. The
  *		**BPF_F_INGRESS** value in *flags* is used to make the
  *		distinction (ingress path is selected if the flag is present,
- *		egress path otherwise). This is the only flag supported for now.
+ *		egress path otherwise). The **BPF_F_PERMANENT** value in
+ *		*flags* is used to indicates whether the eBPF result is
+ *		permanently (please note that, BPF_F_PERMANENT does not work with
+ *		msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or
+ *		msg_cork_bytes() is configured, the BPF_F_PERMANENT function is
+ *		automatically disabled).
  *	Return
  *		**SK_PASS** on success, or **SK_DROP** on error.
  *
@@ -5872,6 +5882,7 @@  enum {
 /* BPF_FUNC_clone_redirect and BPF_FUNC_redirect flags. */
 enum {
 	BPF_F_INGRESS			= (1ULL << 0),
+	BPF_F_PERMANENT			= (1ULL << 1),
 };
 
 /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */