diff mbox series

[bpf-next,1/2] bpf: return correct -ENOBUFS from bpf_clone_redirect

Message ID 20230908210007.1469091-1-sdf@google.com (mailing list archive)
State Superseded
Delegated to: BPF
Headers show
Series [bpf-next,1/2] bpf: return correct -ENOBUFS from bpf_clone_redirect | expand

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for bpf-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1366 this patch: 1366
netdev/cc_maintainers warning 6 maintainers not CCed: pabeni@redhat.com davem@davemloft.net edumazet@google.com yonghong.song@linux.dev netdev@vger.kernel.org kuba@kernel.org
netdev/build_clang success Errors and warnings before: 1364 this patch: 1364
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1389 this patch: 1389
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 9 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-25 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-15 success Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-11 success Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for test_maps on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-0 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-5 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-1 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-6 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-26 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for veristat
bpf/vmtest-bpf-next-VM_Test-8 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-19 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-16 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-27 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_parallel on x86_64 with llvm-16

Commit Message

Stanislav Fomichev Sept. 8, 2023, 9 p.m. UTC
Commit 151e887d8ff9 ("veth: Fixing transmit return status for dropped
packets") exposed the fact that bpf_clone_redirect is capable of
returning raw NET_XMIT_XXX return codes.

This is in the conflict with its UAPI doc which says the following:
"0 on success, or a negative error in case of failure."

Let's wrap dev_queue_xmit's return value (in __bpf_tx_skb) into
net_xmit_errno to make sure we correctly propagate NET_XMIT_DROP
as -ENOBUFS instead of 1.

Note, this is technically breaking existing UAPI where we used to
return 1 and now will do -ENOBUFS. The alternative is to
document that bpf_clone_redirect can return 1 for DROP and 2 for CN.

Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 net/core/filter.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Martin KaFai Lau Sept. 9, 2023, 7:31 a.m. UTC | #1
On 9/8/23 2:00 PM, Stanislav Fomichev wrote:
> Commit 151e887d8ff9 ("veth: Fixing transmit return status for dropped
> packets") exposed the fact that bpf_clone_redirect is capable of
> returning raw NET_XMIT_XXX return codes.
> 
> This is in the conflict with its UAPI doc which says the following:
> "0 on success, or a negative error in case of failure."
> 
> Let's wrap dev_queue_xmit's return value (in __bpf_tx_skb) into
> net_xmit_errno to make sure we correctly propagate NET_XMIT_DROP
> as -ENOBUFS instead of 1.
> 
> Note, this is technically breaking existing UAPI where we used to
> return 1 and now will do -ENOBUFS. The alternative is to
> document that bpf_clone_redirect can return 1 for DROP and 2 for CN.
> 
> Reported-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>   net/core/filter.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index a094694899c9..9e297931b02f 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2129,6 +2129,9 @@ static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb)
>   	ret = dev_queue_xmit(skb);
>   	dev_xmit_recursion_dec();
>   
> +	if (ret > 0)
> +		ret = net_xmit_errno(ret);

I think it is better to have bpf_clone_redirect returning -ENOBUFS instead of 
leaking NET_XMIT_XXX to the uapi. The bpf_clone_redirect in the uapi/bpf.h also 
mentions

  *      Return
  *              0 on success, or a negative error in case of failure.

If -ENOBUFS is returned in __bpf_tx_skb, should the same be done for 
__bpf_rx_skb? and should net_xmit_errno() only be done for bpf_clone_redirect()? 
  __bpf_{tx,rx}_skb is also used by skb_do_redirect() which also calls 
__bpf_redirect_neigh() that returns NET_XMIT_xxx but no caller seems to care the 
NET_XMIT_xxx value now.

Daniel should know more here. I would wait for Daniel to comment.

~~~~

For the selftest, may be another option is to use a 28 bytes data_in for the lwt 
program redirecting to veth? 14 bytes used by bpf_prog_test_run_skb and leave 14 
bytes for veth_xmit. It seems the original intention of the "veth ETH_HLEN+1 
packet ingress" test is expecting it to succeed also.

> +
>   	return ret;
>   }
>
Stanislav Fomichev Sept. 11, 2023, 5:11 p.m. UTC | #2
On 09/09, Martin KaFai Lau wrote:
> On 9/8/23 2:00 PM, Stanislav Fomichev wrote:
> > Commit 151e887d8ff9 ("veth: Fixing transmit return status for dropped
> > packets") exposed the fact that bpf_clone_redirect is capable of
> > returning raw NET_XMIT_XXX return codes.
> > 
> > This is in the conflict with its UAPI doc which says the following:
> > "0 on success, or a negative error in case of failure."
> > 
> > Let's wrap dev_queue_xmit's return value (in __bpf_tx_skb) into
> > net_xmit_errno to make sure we correctly propagate NET_XMIT_DROP
> > as -ENOBUFS instead of 1.
> > 
> > Note, this is technically breaking existing UAPI where we used to
> > return 1 and now will do -ENOBUFS. The alternative is to
> > document that bpf_clone_redirect can return 1 for DROP and 2 for CN.
> > 
> > Reported-by: Daniel Borkmann <daniel@iogearbox.net>
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >   net/core/filter.c | 3 +++
> >   1 file changed, 3 insertions(+)
> > 
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index a094694899c9..9e297931b02f 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -2129,6 +2129,9 @@ static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb)
> >   	ret = dev_queue_xmit(skb);
> >   	dev_xmit_recursion_dec();
> > +	if (ret > 0)
> > +		ret = net_xmit_errno(ret);
> 
> I think it is better to have bpf_clone_redirect returning -ENOBUFS instead
> of leaking NET_XMIT_XXX to the uapi. The bpf_clone_redirect in the
> uapi/bpf.h also mentions
> 
>  *      Return
>  *              0 on success, or a negative error in case of failure.
> 
> If -ENOBUFS is returned in __bpf_tx_skb, should the same be done for
> __bpf_rx_skb? and should net_xmit_errno() only be done for
> bpf_clone_redirect()?  __bpf_{tx,rx}_skb is also used by skb_do_redirect()
> which also calls __bpf_redirect_neigh() that returns NET_XMIT_xxx but no
> caller seems to care the NET_XMIT_xxx value now.

__bpf_rx_skb seems to only add to backlog and doesn't seem to return any
of the NET_XMIT_xxx. But I might be wrong and haven't looked too deep
into that.

> Daniel should know more here. I would wait for Daniel to comment.

Ack, sure!

> ~~~~
> 
> For the selftest, may be another option is to use a 28 bytes data_in for the
> lwt program redirecting to veth? 14 bytes used by bpf_prog_test_run_skb and
> leave 14 bytes for veth_xmit. It seems the original intention of the "veth
> ETH_HLEN+1 packet ingress" test is expecting it to succeed also.

IIUC, you're suggesting to pass full ipv4 or ipv6 packet for veth tests
to make them actually succeed with the forwarding, right?

Sure, I can do that. But let's keep this entry with the -NOBUFS as well?
Just for the sake of ensuring that we don't export NET_XMIT_xxx from
uapi.
Daniel Borkmann Sept. 11, 2023, 5:23 p.m. UTC | #3
On 9/11/23 7:11 PM, Stanislav Fomichev wrote:
> On 09/09, Martin KaFai Lau wrote:
>> On 9/8/23 2:00 PM, Stanislav Fomichev wrote:
>>> Commit 151e887d8ff9 ("veth: Fixing transmit return status for dropped
>>> packets") exposed the fact that bpf_clone_redirect is capable of
>>> returning raw NET_XMIT_XXX return codes.
>>>
>>> This is in the conflict with its UAPI doc which says the following:
>>> "0 on success, or a negative error in case of failure."
>>>
>>> Let's wrap dev_queue_xmit's return value (in __bpf_tx_skb) into
>>> net_xmit_errno to make sure we correctly propagate NET_XMIT_DROP
>>> as -ENOBUFS instead of 1.
>>>
>>> Note, this is technically breaking existing UAPI where we used to
>>> return 1 and now will do -ENOBUFS. The alternative is to
>>> document that bpf_clone_redirect can return 1 for DROP and 2 for CN.
>>>
>>> Reported-by: Daniel Borkmann <daniel@iogearbox.net>
>>> Signed-off-by: Stanislav Fomichev <sdf@google.com>
>>> ---
>>>    net/core/filter.c | 3 +++
>>>    1 file changed, 3 insertions(+)
>>>
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index a094694899c9..9e297931b02f 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -2129,6 +2129,9 @@ static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb)
>>>    	ret = dev_queue_xmit(skb);
>>>    	dev_xmit_recursion_dec();
>>> +	if (ret > 0)
>>> +		ret = net_xmit_errno(ret);
>>
>> I think it is better to have bpf_clone_redirect returning -ENOBUFS instead
>> of leaking NET_XMIT_XXX to the uapi. The bpf_clone_redirect in the
>> uapi/bpf.h also mentions
>>
>>   *      Return
>>   *              0 on success, or a negative error in case of failure.
>>
>> If -ENOBUFS is returned in __bpf_tx_skb, should the same be done for
>> __bpf_rx_skb? and should net_xmit_errno() only be done for
>> bpf_clone_redirect()?  __bpf_{tx,rx}_skb is also used by skb_do_redirect()
>> which also calls __bpf_redirect_neigh() that returns NET_XMIT_xxx but no
>> caller seems to care the NET_XMIT_xxx value now.
> 
> __bpf_rx_skb seems to only add to backlog and doesn't seem to return any
> of the NET_XMIT_xxx. But I might be wrong and haven't looked too deep
> into that.
> 
>> Daniel should know more here. I would wait for Daniel to comment.
> 
> Ack, sure!

I think my preference would be to just document it in the helper UAPI, what
Stan was suggesting below:

| Note, this is technically breaking existing UAPI where we used to
| return 1 and now will do -ENOBUFS. The alternative is to
| document that bpf_clone_redirect can return 1 for DROP and 2 for CN.

And then only adjusting the test case.

Programs checking for ret < 0 will continue to behave as before. Technically
the bpf_clone_redirect() did its job just that on the veth side things were
dropped. Other drivers such as tun, vrf, ipvlan, bond could already have
returned NET_XMIT_DROP, so technically it's not a new situation where it is
possible. And having a ret > 0 could then also be clearly used to differentiate
that something came from driver side rather than helper side.

>> For the selftest, may be another option is to use a 28 bytes data_in for the
>> lwt program redirecting to veth? 14 bytes used by bpf_prog_test_run_skb and
>> leave 14 bytes for veth_xmit. It seems the original intention of the "veth
>> ETH_HLEN+1 packet ingress" test is expecting it to succeed also.
> 
> IIUC, you're suggesting to pass full ipv4 or ipv6 packet for veth tests
> to make them actually succeed with the forwarding, right?
> 
> Sure, I can do that. But let's keep this entry with the -NOBUFS as well?
> Just for the sake of ensuring that we don't export NET_XMIT_xxx from
> uapi.
>
Stanislav Fomichev Sept. 11, 2023, 5:41 p.m. UTC | #4
On 09/11, Daniel Borkmann wrote:
> On 9/11/23 7:11 PM, Stanislav Fomichev wrote:
> > On 09/09, Martin KaFai Lau wrote:
> > > On 9/8/23 2:00 PM, Stanislav Fomichev wrote:
> > > > Commit 151e887d8ff9 ("veth: Fixing transmit return status for dropped
> > > > packets") exposed the fact that bpf_clone_redirect is capable of
> > > > returning raw NET_XMIT_XXX return codes.
> > > > 
> > > > This is in the conflict with its UAPI doc which says the following:
> > > > "0 on success, or a negative error in case of failure."
> > > > 
> > > > Let's wrap dev_queue_xmit's return value (in __bpf_tx_skb) into
> > > > net_xmit_errno to make sure we correctly propagate NET_XMIT_DROP
> > > > as -ENOBUFS instead of 1.
> > > > 
> > > > Note, this is technically breaking existing UAPI where we used to
> > > > return 1 and now will do -ENOBUFS. The alternative is to
> > > > document that bpf_clone_redirect can return 1 for DROP and 2 for CN.
> > > > 
> > > > Reported-by: Daniel Borkmann <daniel@iogearbox.net>
> > > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > > ---
> > > >    net/core/filter.c | 3 +++
> > > >    1 file changed, 3 insertions(+)
> > > > 
> > > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > > index a094694899c9..9e297931b02f 100644
> > > > --- a/net/core/filter.c
> > > > +++ b/net/core/filter.c
> > > > @@ -2129,6 +2129,9 @@ static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb)
> > > >    	ret = dev_queue_xmit(skb);
> > > >    	dev_xmit_recursion_dec();
> > > > +	if (ret > 0)
> > > > +		ret = net_xmit_errno(ret);
> > > 
> > > I think it is better to have bpf_clone_redirect returning -ENOBUFS instead
> > > of leaking NET_XMIT_XXX to the uapi. The bpf_clone_redirect in the
> > > uapi/bpf.h also mentions
> > > 
> > >   *      Return
> > >   *              0 on success, or a negative error in case of failure.
> > > 
> > > If -ENOBUFS is returned in __bpf_tx_skb, should the same be done for
> > > __bpf_rx_skb? and should net_xmit_errno() only be done for
> > > bpf_clone_redirect()?  __bpf_{tx,rx}_skb is also used by skb_do_redirect()
> > > which also calls __bpf_redirect_neigh() that returns NET_XMIT_xxx but no
> > > caller seems to care the NET_XMIT_xxx value now.
> > 
> > __bpf_rx_skb seems to only add to backlog and doesn't seem to return any
> > of the NET_XMIT_xxx. But I might be wrong and haven't looked too deep
> > into that.
> > 
> > > Daniel should know more here. I would wait for Daniel to comment.
> > 
> > Ack, sure!
> 
> I think my preference would be to just document it in the helper UAPI, what
> Stan was suggesting below:
> 
> | Note, this is technically breaking existing UAPI where we used to
> | return 1 and now will do -ENOBUFS. The alternative is to
> | document that bpf_clone_redirect can return 1 for DROP and 2 for CN.
> 
> And then only adjusting the test case.

In this case, would we also need something similar to our
TCP_BPF_<state> changes? Like BUILD_BUG_ON(BPF_NET_XMIT_XXX !=
NET_XMIT_XXX)? Otherwise, we risk more leakage into the UAPI.
Merely documenting doesn't seem enough?

> Programs checking for ret < 0 will continue to behave as before. Technically
> the bpf_clone_redirect() did its job just that on the veth side things were
> dropped. Other drivers such as tun, vrf, ipvlan, bond could already have
> returned NET_XMIT_DROP, so technically it's not a new situation where it is
> possible. And having a ret > 0 could then also be clearly used to differentiate
> that something came from driver side rather than helper side.
> 
> > > For the selftest, may be another option is to use a 28 bytes data_in for the
> > > lwt program redirecting to veth? 14 bytes used by bpf_prog_test_run_skb and
> > > leave 14 bytes for veth_xmit. It seems the original intention of the "veth
> > > ETH_HLEN+1 packet ingress" test is expecting it to succeed also.
> > 
> > IIUC, you're suggesting to pass full ipv4 or ipv6 packet for veth tests
> > to make them actually succeed with the forwarding, right?
> > 
> > Sure, I can do that. But let's keep this entry with the -NOBUFS as well?
> > Just for the sake of ensuring that we don't export NET_XMIT_xxx from
> > uapi.
> > 
>
Martin KaFai Lau Sept. 11, 2023, 6:08 p.m. UTC | #5
On 9/11/23 10:23 AM, Daniel Borkmann wrote:
> On 9/11/23 7:11 PM, Stanislav Fomichev wrote:
>> On 09/09, Martin KaFai Lau wrote:
>>> On 9/8/23 2:00 PM, Stanislav Fomichev wrote:
>>>> Commit 151e887d8ff9 ("veth: Fixing transmit return status for dropped
>>>> packets") exposed the fact that bpf_clone_redirect is capable of
>>>> returning raw NET_XMIT_XXX return codes.
>>>>
>>>> This is in the conflict with its UAPI doc which says the following:
>>>> "0 on success, or a negative error in case of failure."
>>>>
>>>> Let's wrap dev_queue_xmit's return value (in __bpf_tx_skb) into
>>>> net_xmit_errno to make sure we correctly propagate NET_XMIT_DROP
>>>> as -ENOBUFS instead of 1.
>>>>
>>>> Note, this is technically breaking existing UAPI where we used to
>>>> return 1 and now will do -ENOBUFS. The alternative is to
>>>> document that bpf_clone_redirect can return 1 for DROP and 2 for CN.
>>>>
>>>> Reported-by: Daniel Borkmann <daniel@iogearbox.net>
>>>> Signed-off-by: Stanislav Fomichev <sdf@google.com>
>>>> ---
>>>>    net/core/filter.c | 3 +++
>>>>    1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>>> index a094694899c9..9e297931b02f 100644
>>>> --- a/net/core/filter.c
>>>> +++ b/net/core/filter.c
>>>> @@ -2129,6 +2129,9 @@ static inline int __bpf_tx_skb(struct net_device *dev, 
>>>> struct sk_buff *skb)
>>>>        ret = dev_queue_xmit(skb);
>>>>        dev_xmit_recursion_dec();
>>>> +    if (ret > 0)
>>>> +        ret = net_xmit_errno(ret);
>>>
>>> I think it is better to have bpf_clone_redirect returning -ENOBUFS instead
>>> of leaking NET_XMIT_XXX to the uapi. The bpf_clone_redirect in the
>>> uapi/bpf.h also mentions
>>>
>>>   *      Return
>>>   *              0 on success, or a negative error in case of failure.
>>>
>>> If -ENOBUFS is returned in __bpf_tx_skb, should the same be done for
>>> __bpf_rx_skb? and should net_xmit_errno() only be done for
>>> bpf_clone_redirect()?  __bpf_{tx,rx}_skb is also used by skb_do_redirect()
>>> which also calls __bpf_redirect_neigh() that returns NET_XMIT_xxx but no
>>> caller seems to care the NET_XMIT_xxx value now.
>>
>> __bpf_rx_skb seems to only add to backlog and doesn't seem to return any
>> of the NET_XMIT_xxx. But I might be wrong and haven't looked too deep
>> into that.

enqueue_to_backlog could return NET_RX_DROP which happens to have the same value 
as NET_XMIT_DROP. I think this will get propagated back to __bpf_rx_skb().

>>
>>> Daniel should know more here. I would wait for Daniel to comment.
>>
>> Ack, sure!
> 
> I think my preference would be to just document it in the helper UAPI, what
> Stan was suggesting below:
> 
> | Note, this is technically breaking existing UAPI where we used to
> | return 1 and now will do -ENOBUFS. The alternative is to
> | document that bpf_clone_redirect can return 1 for DROP and 2 for CN.
> 
> And then only adjusting the test case.
> 
> Programs checking for ret < 0 will continue to behave as before. Technically
> the bpf_clone_redirect() did its job just that on the veth side things were
> dropped. Other drivers such as tun, vrf, ipvlan, bond could already have
> returned NET_XMIT_DROP, so technically it's not a new situation where it is
> possible. And having a ret > 0 could then also be clearly used to differentiate
> that something came from driver side rather than helper side.

sure. sgtm. Not sure if it will be useful to spell out the >0 meaning in uapi/bpf.h.

> 
>>> For the selftest, may be another option is to use a 28 bytes data_in for the
>>> lwt program redirecting to veth? 14 bytes used by bpf_prog_test_run_skb and
>>> leave 14 bytes for veth_xmit. It seems the original intention of the "veth
>>> ETH_HLEN+1 packet ingress" test is expecting it to succeed also.
>>
>> IIUC, you're suggesting to pass full ipv4 or ipv6 packet for veth tests
>> to make them actually succeed with the forwarding, right?
>>
>> Sure, I can do that. But let's keep this entry with the -NOBUFS as well?
>> Just for the sake of ensuring that we don't export NET_XMIT_xxx from
>> uapi.

In that case it makes sense to only change the eth+1byte test case to expect >0 
ret (or -ENOBUF as in patch 2, depending on the above discussion). No need to 
add an extra test.
Daniel Borkmann Sept. 11, 2023, 6:36 p.m. UTC | #6
On 9/11/23 7:41 PM, Stanislav Fomichev wrote:
> On 09/11, Daniel Borkmann wrote:
>> On 9/11/23 7:11 PM, Stanislav Fomichev wrote:
>>> On 09/09, Martin KaFai Lau wrote:
>>>> On 9/8/23 2:00 PM, Stanislav Fomichev wrote:
>>>>> Commit 151e887d8ff9 ("veth: Fixing transmit return status for dropped
>>>>> packets") exposed the fact that bpf_clone_redirect is capable of
>>>>> returning raw NET_XMIT_XXX return codes.
>>>>>
>>>>> This is in the conflict with its UAPI doc which says the following:
>>>>> "0 on success, or a negative error in case of failure."
>>>>>
>>>>> Let's wrap dev_queue_xmit's return value (in __bpf_tx_skb) into
>>>>> net_xmit_errno to make sure we correctly propagate NET_XMIT_DROP
>>>>> as -ENOBUFS instead of 1.
>>>>>
>>>>> Note, this is technically breaking existing UAPI where we used to
>>>>> return 1 and now will do -ENOBUFS. The alternative is to
>>>>> document that bpf_clone_redirect can return 1 for DROP and 2 for CN.
>>>>>
>>>>> Reported-by: Daniel Borkmann <daniel@iogearbox.net>
>>>>> Signed-off-by: Stanislav Fomichev <sdf@google.com>
>>>>> ---
>>>>>     net/core/filter.c | 3 +++
>>>>>     1 file changed, 3 insertions(+)
>>>>>
>>>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>>>> index a094694899c9..9e297931b02f 100644
>>>>> --- a/net/core/filter.c
>>>>> +++ b/net/core/filter.c
>>>>> @@ -2129,6 +2129,9 @@ static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb)
>>>>>     	ret = dev_queue_xmit(skb);
>>>>>     	dev_xmit_recursion_dec();
>>>>> +	if (ret > 0)
>>>>> +		ret = net_xmit_errno(ret);
>>>>
>>>> I think it is better to have bpf_clone_redirect returning -ENOBUFS instead
>>>> of leaking NET_XMIT_XXX to the uapi. The bpf_clone_redirect in the
>>>> uapi/bpf.h also mentions
>>>>
>>>>    *      Return
>>>>    *              0 on success, or a negative error in case of failure.
>>>>
>>>> If -ENOBUFS is returned in __bpf_tx_skb, should the same be done for
>>>> __bpf_rx_skb? and should net_xmit_errno() only be done for
>>>> bpf_clone_redirect()?  __bpf_{tx,rx}_skb is also used by skb_do_redirect()
>>>> which also calls __bpf_redirect_neigh() that returns NET_XMIT_xxx but no
>>>> caller seems to care the NET_XMIT_xxx value now.
>>>
>>> __bpf_rx_skb seems to only add to backlog and doesn't seem to return any
>>> of the NET_XMIT_xxx. But I might be wrong and haven't looked too deep
>>> into that.
>>>
>>>> Daniel should know more here. I would wait for Daniel to comment.
>>>
>>> Ack, sure!
>>
>> I think my preference would be to just document it in the helper UAPI, what
>> Stan was suggesting below:
>>
>> | Note, this is technically breaking existing UAPI where we used to
>> | return 1 and now will do -ENOBUFS. The alternative is to
>> | document that bpf_clone_redirect can return 1 for DROP and 2 for CN.
>>
>> And then only adjusting the test case.
> 
> In this case, would we also need something similar to our
> TCP_BPF_<state> changes? Like BUILD_BUG_ON(BPF_NET_XMIT_XXX !=
> NET_XMIT_XXX)? Otherwise, we risk more leakage into the UAPI.
> Merely documenting doesn't seem enough?

We could probably just mention that a positive, non-zero code indicates
that the skb clone got forwarded to the target netdevice but got dropped
from driver side. This is somewhat also driver dependent e.g. if you look
at dummy which does drop-all, it returns NETDEV_TX_OK. Anything more
specific in the helper doc such as defining BPF_NET_XMIT_* would be more
confusing.

>> Programs checking for ret < 0 will continue to behave as before. Technically
>> the bpf_clone_redirect() did its job just that on the veth side things were
>> dropped. Other drivers such as tun, vrf, ipvlan, bond could already have
>> returned NET_XMIT_DROP, so technically it's not a new situation where it is
>> possible. And having a ret > 0 could then also be clearly used to differentiate
>> that something came from driver side rather than helper side.
>>
>>>> For the selftest, may be another option is to use a 28 bytes data_in for the
>>>> lwt program redirecting to veth? 14 bytes used by bpf_prog_test_run_skb and
>>>> leave 14 bytes for veth_xmit. It seems the original intention of the "veth
>>>> ETH_HLEN+1 packet ingress" test is expecting it to succeed also.
>>>
>>> IIUC, you're suggesting to pass full ipv4 or ipv6 packet for veth tests
>>> to make them actually succeed with the forwarding, right?
>>>
>>> Sure, I can do that. But let's keep this entry with the -NOBUFS as well?
>>> Just for the sake of ensuring that we don't export NET_XMIT_xxx from
>>> uapi.
Stanislav Fomichev Sept. 11, 2023, 6:52 p.m. UTC | #7
On 09/11, Daniel Borkmann wrote:
> On 9/11/23 7:41 PM, Stanislav Fomichev wrote:
> > On 09/11, Daniel Borkmann wrote:
> > > On 9/11/23 7:11 PM, Stanislav Fomichev wrote:
> > > > On 09/09, Martin KaFai Lau wrote:
> > > > > On 9/8/23 2:00 PM, Stanislav Fomichev wrote:
> > > > > > Commit 151e887d8ff9 ("veth: Fixing transmit return status for dropped
> > > > > > packets") exposed the fact that bpf_clone_redirect is capable of
> > > > > > returning raw NET_XMIT_XXX return codes.
> > > > > > 
> > > > > > This is in the conflict with its UAPI doc which says the following:
> > > > > > "0 on success, or a negative error in case of failure."
> > > > > > 
> > > > > > Let's wrap dev_queue_xmit's return value (in __bpf_tx_skb) into
> > > > > > net_xmit_errno to make sure we correctly propagate NET_XMIT_DROP
> > > > > > as -ENOBUFS instead of 1.
> > > > > > 
> > > > > > Note, this is technically breaking existing UAPI where we used to
> > > > > > return 1 and now will do -ENOBUFS. The alternative is to
> > > > > > document that bpf_clone_redirect can return 1 for DROP and 2 for CN.
> > > > > > 
> > > > > > Reported-by: Daniel Borkmann <daniel@iogearbox.net>
> > > > > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > > > > ---
> > > > > >     net/core/filter.c | 3 +++
> > > > > >     1 file changed, 3 insertions(+)
> > > > > > 
> > > > > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > > > > index a094694899c9..9e297931b02f 100644
> > > > > > --- a/net/core/filter.c
> > > > > > +++ b/net/core/filter.c
> > > > > > @@ -2129,6 +2129,9 @@ static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb)
> > > > > >     	ret = dev_queue_xmit(skb);
> > > > > >     	dev_xmit_recursion_dec();
> > > > > > +	if (ret > 0)
> > > > > > +		ret = net_xmit_errno(ret);
> > > > > 
> > > > > I think it is better to have bpf_clone_redirect returning -ENOBUFS instead
> > > > > of leaking NET_XMIT_XXX to the uapi. The bpf_clone_redirect in the
> > > > > uapi/bpf.h also mentions
> > > > > 
> > > > >    *      Return
> > > > >    *              0 on success, or a negative error in case of failure.
> > > > > 
> > > > > If -ENOBUFS is returned in __bpf_tx_skb, should the same be done for
> > > > > __bpf_rx_skb? and should net_xmit_errno() only be done for
> > > > > bpf_clone_redirect()?  __bpf_{tx,rx}_skb is also used by skb_do_redirect()
> > > > > which also calls __bpf_redirect_neigh() that returns NET_XMIT_xxx but no
> > > > > caller seems to care the NET_XMIT_xxx value now.
> > > > 
> > > > __bpf_rx_skb seems to only add to backlog and doesn't seem to return any
> > > > of the NET_XMIT_xxx. But I might be wrong and haven't looked too deep
> > > > into that.
> > > > 
> > > > > Daniel should know more here. I would wait for Daniel to comment.
> > > > 
> > > > Ack, sure!
> > > 
> > > I think my preference would be to just document it in the helper UAPI, what
> > > Stan was suggesting below:
> > > 
> > > | Note, this is technically breaking existing UAPI where we used to
> > > | return 1 and now will do -ENOBUFS. The alternative is to
> > > | document that bpf_clone_redirect can return 1 for DROP and 2 for CN.
> > > 
> > > And then only adjusting the test case.
> > 
> > In this case, would we also need something similar to our
> > TCP_BPF_<state> changes? Like BUILD_BUG_ON(BPF_NET_XMIT_XXX !=
> > NET_XMIT_XXX)? Otherwise, we risk more leakage into the UAPI.
> > Merely documenting doesn't seem enough?
> 
> We could probably just mention that a positive, non-zero code indicates
> that the skb clone got forwarded to the target netdevice but got dropped
> from driver side. This is somewhat also driver dependent e.g. if you look
> at dummy which does drop-all, it returns NETDEV_TX_OK. Anything more
> specific in the helper doc such as defining BPF_NET_XMIT_* would be more
> confusing.

Something like the following?

Return
	0 on success, or a negative error in case of failure. Positive
	error indicates a potential drop or congestion in the target
	device. The particular positive error codes are not defined.
Daniel Borkmann Sept. 11, 2023, 7:05 p.m. UTC | #8
On 9/11/23 8:52 PM, Stanislav Fomichev wrote:
> On 09/11, Daniel Borkmann wrote:
>> On 9/11/23 7:41 PM, Stanislav Fomichev wrote:
>>> On 09/11, Daniel Borkmann wrote:
>>>> On 9/11/23 7:11 PM, Stanislav Fomichev wrote:
>>>>> On 09/09, Martin KaFai Lau wrote:
>>>>>> On 9/8/23 2:00 PM, Stanislav Fomichev wrote:
[...]
>>>> I think my preference would be to just document it in the helper UAPI, what
>>>> Stan was suggesting below:
>>>>
>>>> | Note, this is technically breaking existing UAPI where we used to
>>>> | return 1 and now will do -ENOBUFS. The alternative is to
>>>> | document that bpf_clone_redirect can return 1 for DROP and 2 for CN.
>>>>
>>>> And then only adjusting the test case.
>>>
>>> In this case, would we also need something similar to our
>>> TCP_BPF_<state> changes? Like BUILD_BUG_ON(BPF_NET_XMIT_XXX !=
>>> NET_XMIT_XXX)? Otherwise, we risk more leakage into the UAPI.
>>> Merely documenting doesn't seem enough?
>>
>> We could probably just mention that a positive, non-zero code indicates
>> that the skb clone got forwarded to the target netdevice but got dropped
>> from driver side. This is somewhat also driver dependent e.g. if you look
>> at dummy which does drop-all, it returns NETDEV_TX_OK. Anything more
>> specific in the helper doc such as defining BPF_NET_XMIT_* would be more
>> confusing.
> 
> Something like the following?
> 
> Return
> 	0 on success, or a negative error in case of failure. Positive
> 	error indicates a potential drop or congestion in the target
> 	device. The particular positive error codes are not defined.
> 

yeap, sgtm
diff mbox series

Patch

diff --git a/net/core/filter.c b/net/core/filter.c
index a094694899c9..9e297931b02f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2129,6 +2129,9 @@  static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb)
 	ret = dev_queue_xmit(skb);
 	dev_xmit_recursion_dec();
 
+	if (ret > 0)
+		ret = net_xmit_errno(ret);
+
 	return ret;
 }