diff mbox series

[bpf-next,v3,1/2] net: bpf: Always call BPF cgroup filters for egress.

Message ID 20230620171409.166001-2-kuifeng@meta.com (mailing list archive)
State Superseded
Delegated to: BPF
Headers show
Series Fix missing synack in BPF cgroup_skb filters | expand

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-5 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-6 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-11 success Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-15 success Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-19 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-22 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-25 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-27 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-29 success Logs for veristat
bpf/vmtest-bpf-next-VM_Test-8 fail Logs for test_maps on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-26 fail Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for test_progs on s390x with gcc
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1070 this patch: 1070
netdev/cc_maintainers success CCed 12 of 12 maintainers
netdev/build_clang success Errors and warnings before: 142 this patch: 142
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1077 this patch: 1077
netdev/checkpatch warning WARNING: From:/Signed-off-by: email address mismatch: 'From: Kui-Feng Lee <thinker.li@gmail.com>' != 'Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>'
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Kui-Feng Lee June 20, 2023, 5:14 p.m. UTC
Always call BPF filters if CGROUP BPF is enabled for EGRESS without
checking skb->sk against sk.

The filters were called only if skb is owned by the sock that the
skb is sent out through.  In another words, skb->sk should point to
the sock that it is sending through its egress.  However, the filters would
miss SYNACK skbs that they are owned by a request_sock but sent through
the listening sock, that is the socket listening incoming connections.
This is an unnecessary restrict.

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
 include/linux/bpf-cgroup.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Yonghong Song June 22, 2023, 3:37 a.m. UTC | #1
On 6/20/23 10:14 AM, Kui-Feng Lee wrote:
> Always call BPF filters if CGROUP BPF is enabled for EGRESS without
> checking skb->sk against sk.
> 
> The filters were called only if skb is owned by the sock that the
> skb is sent out through.  In another words, skb->sk should point to
> the sock that it is sending through its egress.  However, the filters would
> miss SYNACK skbs that they are owned by a request_sock but sent through
> the listening sock, that is the socket listening incoming connections.
> This is an unnecessary restrict.

The original patch which introduced 'sk == skb->sk' is
   3007098494be  cgroup: add support for eBPF programs
There are no mentioning in commit message why 'sk == skb->sk'
is needed. So it is possible that this is just restricted
for use cases at that moment. Now there are use cases
where 'sk != skb->sk' so removing this check can enable
the new use case. Maybe you can add this into your commit
message so people can understand the history of 'sk == skb->sk'.

> 
> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> ---
>   include/linux/bpf-cgroup.h | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> index 57e9e109257e..e656da531f9f 100644
> --- a/include/linux/bpf-cgroup.h
> +++ b/include/linux/bpf-cgroup.h
> @@ -199,7 +199,7 @@ static inline bool cgroup_bpf_sock_enabled(struct sock *sk,
>   #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb)			       \
>   ({									       \
>   	int __ret = 0;							       \
> -	if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk && sk == skb->sk) { \
> +	if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk) {		       \
>   		typeof(sk) __sk = sk_to_full_sk(sk);			       \
>   		if (sk_fullsock(__sk) &&				       \
>   		    cgroup_bpf_sock_enabled(__sk, CGROUP_INET_EGRESS))	       \
Kui-Feng Lee June 22, 2023, 3:34 p.m. UTC | #2
On 6/21/23 20:37, Yonghong Song wrote:
> 
> 
> On 6/20/23 10:14 AM, Kui-Feng Lee wrote:
>> Always call BPF filters if CGROUP BPF is enabled for EGRESS without
>> checking skb->sk against sk.
>>
>> The filters were called only if skb is owned by the sock that the
>> skb is sent out through.  In another words, skb->sk should point to
>> the sock that it is sending through its egress.  However, the filters 
>> would
>> miss SYNACK skbs that they are owned by a request_sock but sent through
>> the listening sock, that is the socket listening incoming connections.
>> This is an unnecessary restrict.
> 
> The original patch which introduced 'sk == skb->sk' is
>    3007098494be  cgroup: add support for eBPF programs
> There are no mentioning in commit message why 'sk == skb->sk'
> is needed. So it is possible that this is just restricted
> for use cases at that moment. Now there are use cases
> where 'sk != skb->sk' so removing this check can enable
> the new use case. Maybe you can add this into your commit
> message so people can understand the history of 'sk == skb->sk'.

I will put it down on the next version.

> 
>>
>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>> ---
>>   include/linux/bpf-cgroup.h | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
>> index 57e9e109257e..e656da531f9f 100644
>> --- a/include/linux/bpf-cgroup.h
>> +++ b/include/linux/bpf-cgroup.h
>> @@ -199,7 +199,7 @@ static inline bool cgroup_bpf_sock_enabled(struct 
>> sock *sk,
>>   #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb)                   \
>>   ({                                           \
>>       int __ret = 0;                                   \
>> -    if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk && sk == 
>> skb->sk) { \
>> +    if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk) {               \
>>           typeof(sk) __sk = sk_to_full_sk(sk);                   \
>>           if (sk_fullsock(__sk) &&                       \
>>               cgroup_bpf_sock_enabled(__sk, 
>> CGROUP_INET_EGRESS))           \
>
Kui-Feng Lee June 22, 2023, 5:15 p.m. UTC | #3
On 6/21/23 20:37, Yonghong Song wrote:
> 
> 
> On 6/20/23 10:14 AM, Kui-Feng Lee wrote:
>> Always call BPF filters if CGROUP BPF is enabled for EGRESS without
>> checking skb->sk against sk.
>>
>> The filters were called only if skb is owned by the sock that the
>> skb is sent out through.  In another words, skb->sk should point to
>> the sock that it is sending through its egress.  However, the filters 
>> would
>> miss SYNACK skbs that they are owned by a request_sock but sent through
>> the listening sock, that is the socket listening incoming connections.
>> This is an unnecessary restrict.
> 
> The original patch which introduced 'sk == skb->sk' is
>    3007098494be  cgroup: add support for eBPF programs
> There are no mentioning in commit message why 'sk == skb->sk'
> is needed. So it is possible that this is just restricted
> for use cases at that moment. Now there are use cases
> where 'sk != skb->sk' so removing this check can enable
> the new use case. Maybe you can add this into your commit
> message so people can understand the history of 'sk == skb->sk'.

After checking the code and the Alexei's comment[1] again, this check
may be different from what I thought. In another post[2],
Daniel Borkmann mentioned

     Wouldn't that mean however, when you go through stacked devices that
     you'd run the same eBPF cgroup program for skb->sk multiple times?

I read this paragraph several times.
This check ensures the filters are only called for the device on
the top of a stack.  So, I probably should change the check to

     sk == skb_to_full_sk(skb)

instead of removing it.  If we remove the check, egress filters
could be called multiple times for a skb, just like what Daniel said.

Does that make sense?

[1] 
https://lore.kernel.org/all/CAADnVQKi0c=Mf3b=z43=b6n2xBVhwPw4QoV_au5+pFE29iLkaQ@mail.gmail.com/
[2] https://lore.kernel.org/all/58193E9D.7040201@iogearbox.net/

> 
>>
>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>> ---
>>   include/linux/bpf-cgroup.h | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
>> index 57e9e109257e..e656da531f9f 100644
>> --- a/include/linux/bpf-cgroup.h
>> +++ b/include/linux/bpf-cgroup.h
>> @@ -199,7 +199,7 @@ static inline bool cgroup_bpf_sock_enabled(struct 
>> sock *sk,
>>   #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb)                   \
>>   ({                                           \
>>       int __ret = 0;                                   \
>> -    if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk && sk == 
>> skb->sk) { \
>> +    if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk) {               \
>>           typeof(sk) __sk = sk_to_full_sk(sk);                   \
>>           if (sk_fullsock(__sk) &&                       \
>>               cgroup_bpf_sock_enabled(__sk, 
>> CGROUP_INET_EGRESS))           \
>
Yonghong Song June 22, 2023, 6:28 p.m. UTC | #4
On 6/22/23 10:15 AM, Kui-Feng Lee wrote:
> 
> 
> On 6/21/23 20:37, Yonghong Song wrote:
>>
>>
>> On 6/20/23 10:14 AM, Kui-Feng Lee wrote:
>>> Always call BPF filters if CGROUP BPF is enabled for EGRESS without
>>> checking skb->sk against sk.
>>>
>>> The filters were called only if skb is owned by the sock that the
>>> skb is sent out through.  In another words, skb->sk should point to
>>> the sock that it is sending through its egress.  However, the filters 
>>> would
>>> miss SYNACK skbs that they are owned by a request_sock but sent through
>>> the listening sock, that is the socket listening incoming connections.
>>> This is an unnecessary restrict.
>>
>> The original patch which introduced 'sk == skb->sk' is
>>    3007098494be  cgroup: add support for eBPF programs
>> There are no mentioning in commit message why 'sk == skb->sk'
>> is needed. So it is possible that this is just restricted
>> for use cases at that moment. Now there are use cases
>> where 'sk != skb->sk' so removing this check can enable
>> the new use case. Maybe you can add this into your commit
>> message so people can understand the history of 'sk == skb->sk'.
> 
> After checking the code and the Alexei's comment[1] again, this check
> may be different from what I thought. In another post[2],
> Daniel Borkmann mentioned
> 
>      Wouldn't that mean however, when you go through stacked devices that
>      you'd run the same eBPF cgroup program for skb->sk multiple times?
> 
> I read this paragraph several times.
> This check ensures the filters are only called for the device on
> the top of a stack.  So, I probably should change the check to
> 
>      sk == skb_to_full_sk(skb)

I think this should work. It exactly covers your use case:
   they are owned by a request_sock but sent through
   the listening sock, that is the socket listening incoming connections
and sk == skb->sk for non request_sock/listening_sock case.

I originally though whether you could do
   sk == skb->sk || skb->sk->sk_state == TCP_NEW_SYN_RECV
but obviously your approach is better.

> 
> instead of removing it.  If we remove the check, egress filters
> could be called multiple times for a skb, just like what Daniel said.
> 
> Does that make sense?
> 
> [1] 
> https://lore.kernel.org/all/CAADnVQKi0c=Mf3b=z43=b6n2xBVhwPw4QoV_au5+pFE29iLkaQ@mail.gmail.com/
> [2] https://lore.kernel.org/all/58193E9D.7040201@iogearbox.net/
> 
>>
>>>
>>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>>> ---
>>>   include/linux/bpf-cgroup.h | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
>>> index 57e9e109257e..e656da531f9f 100644
>>> --- a/include/linux/bpf-cgroup.h
>>> +++ b/include/linux/bpf-cgroup.h
>>> @@ -199,7 +199,7 @@ static inline bool cgroup_bpf_sock_enabled(struct 
>>> sock *sk,
>>>   #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb)                   \
>>>   ({                                           \
>>>       int __ret = 0;                                   \
>>> -    if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk && sk == 
>>> skb->sk) { \
>>> +    if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk) {               \
>>>           typeof(sk) __sk = sk_to_full_sk(sk);                   \
>>>           if (sk_fullsock(__sk) &&                       \
>>>               cgroup_bpf_sock_enabled(__sk, 
>>> CGROUP_INET_EGRESS))           \
>>
Daniel Borkmann June 22, 2023, 8:06 p.m. UTC | #5
On 6/22/23 8:28 PM, Yonghong Song wrote:
> On 6/22/23 10:15 AM, Kui-Feng Lee wrote:
>> On 6/21/23 20:37, Yonghong Song wrote:
>>> On 6/20/23 10:14 AM, Kui-Feng Lee wrote:
>>>> Always call BPF filters if CGROUP BPF is enabled for EGRESS without
>>>> checking skb->sk against sk.
>>>>
>>>> The filters were called only if skb is owned by the sock that the
>>>> skb is sent out through.  In another words, skb->sk should point to
>>>> the sock that it is sending through its egress.  However, the filters would
>>>> miss SYNACK skbs that they are owned by a request_sock but sent through
>>>> the listening sock, that is the socket listening incoming connections.
>>>> This is an unnecessary restrict.
>>>
>>> The original patch which introduced 'sk == skb->sk' is
>>>    3007098494be  cgroup: add support for eBPF programs
>>> There are no mentioning in commit message why 'sk == skb->sk'
>>> is needed. So it is possible that this is just restricted
>>> for use cases at that moment. Now there are use cases
>>> where 'sk != skb->sk' so removing this check can enable
>>> the new use case. Maybe you can add this into your commit
>>> message so people can understand the history of 'sk == skb->sk'.
>>
>> After checking the code and the Alexei's comment[1] again, this check
>> may be different from what I thought. In another post[2],
>> Daniel Borkmann mentioned
>>
>>      Wouldn't that mean however, when you go through stacked devices that
>>      you'd run the same eBPF cgroup program for skb->sk multiple times?
>>
>> I read this paragraph several times.
>> This check ensures the filters are only called for the device on
>> the top of a stack.  So, I probably should change the check to
>>
>>      sk == skb_to_full_sk(skb)
> 
> I think this should work. It exactly covers your use case:
>    they are owned by a request_sock but sent through
>    the listening sock, that is the socket listening incoming connections
> and sk == skb->sk for non request_sock/listening_sock case.

Just a thought, should the test look like the below?

         int __ret = 0;                                                         \
         if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk) {                    \
                 typeof(sk) __sk = sk_to_full_sk(sk);                           \
                 if (sk_fullsock(__sk) && __sk == skb_to_full_sk(skb) &&        \
                     cgroup_bpf_sock_enabled(__sk, CGROUP_INET_EGRESS))         \
                         __ret = __cgroup_bpf_run_filter_skb(__sk, skb,         \
                                                       CGROUP_INET_EGRESS); \
         }                                                                      \

Iow, we do already convert __sk to full sk, so we should then also use that
for the test with skb_to_full_sk(skb).

Thanks,
Daniel
Kui-Feng Lee June 22, 2023, 11:55 p.m. UTC | #6
On 6/22/23 13:06, Daniel Borkmann wrote:
> On 6/22/23 8:28 PM, Yonghong Song wrote:
>> On 6/22/23 10:15 AM, Kui-Feng Lee wrote:
>>> On 6/21/23 20:37, Yonghong Song wrote:
>>>> On 6/20/23 10:14 AM, Kui-Feng Lee wrote:
>>>>> Always call BPF filters if CGROUP BPF is enabled for EGRESS without
>>>>> checking skb->sk against sk.
>>>>>
>>>>> The filters were called only if skb is owned by the sock that the
>>>>> skb is sent out through.  In another words, skb->sk should point to
>>>>> the sock that it is sending through its egress.  However, the 
>>>>> filters would
>>>>> miss SYNACK skbs that they are owned by a request_sock but sent 
>>>>> through
>>>>> the listening sock, that is the socket listening incoming connections.
>>>>> This is an unnecessary restrict.
>>>>
>>>> The original patch which introduced 'sk == skb->sk' is
>>>>    3007098494be  cgroup: add support for eBPF programs
>>>> There are no mentioning in commit message why 'sk == skb->sk'
>>>> is needed. So it is possible that this is just restricted
>>>> for use cases at that moment. Now there are use cases
>>>> where 'sk != skb->sk' so removing this check can enable
>>>> the new use case. Maybe you can add this into your commit
>>>> message so people can understand the history of 'sk == skb->sk'.
>>>
>>> After checking the code and the Alexei's comment[1] again, this check
>>> may be different from what I thought. In another post[2],
>>> Daniel Borkmann mentioned
>>>
>>>      Wouldn't that mean however, when you go through stacked devices 
>>> that
>>>      you'd run the same eBPF cgroup program for skb->sk multiple times?
>>>
>>> I read this paragraph several times.
>>> This check ensures the filters are only called for the device on
>>> the top of a stack.  So, I probably should change the check to
>>>
>>>      sk == skb_to_full_sk(skb)
>>
>> I think this should work. It exactly covers your use case:
>>    they are owned by a request_sock but sent through
>>    the listening sock, that is the socket listening incoming connections
>> and sk == skb->sk for non request_sock/listening_sock case.
> 
> Just a thought, should the test look like the below?
> 
>          int __ret = 
> 0;                                                         \
>          if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk) 
> {                    \
>                  typeof(sk) __sk = 
> sk_to_full_sk(sk);                           \
>                  if (sk_fullsock(__sk) && __sk == skb_to_full_sk(skb) 
> &&        \
>                      cgroup_bpf_sock_enabled(__sk, 
> CGROUP_INET_EGRESS))         \
>                          __ret = __cgroup_bpf_run_filter_skb(__sk, 
> skb,         \
>                                                        
> CGROUP_INET_EGRESS); \
>          
> }                                                                      \
> 
> Iow, we do already convert __sk to full sk, so we should then also use that
> for the test with skb_to_full_sk(skb).

Agree!

> 
> Thanks,
> Daniel
Daniel Borkmann June 23, 2023, 8:50 a.m. UTC | #7
On 6/23/23 1:55 AM, Kui-Feng Lee wrote:
> On 6/22/23 13:06, Daniel Borkmann wrote:
>> On 6/22/23 8:28 PM, Yonghong Song wrote:
>>> On 6/22/23 10:15 AM, Kui-Feng Lee wrote:
>>>> On 6/21/23 20:37, Yonghong Song wrote:
>>>>> On 6/20/23 10:14 AM, Kui-Feng Lee wrote:
>>>>>> Always call BPF filters if CGROUP BPF is enabled for EGRESS without
>>>>>> checking skb->sk against sk.
>>>>>>
>>>>>> The filters were called only if skb is owned by the sock that the
>>>>>> skb is sent out through.  In another words, skb->sk should point to
>>>>>> the sock that it is sending through its egress.  However, the filters would
>>>>>> miss SYNACK skbs that they are owned by a request_sock but sent through
>>>>>> the listening sock, that is the socket listening incoming connections.
>>>>>> This is an unnecessary restrict.
>>>>>
>>>>> The original patch which introduced 'sk == skb->sk' is
>>>>>    3007098494be  cgroup: add support for eBPF programs
>>>>> There are no mentioning in commit message why 'sk == skb->sk'
>>>>> is needed. So it is possible that this is just restricted
>>>>> for use cases at that moment. Now there are use cases
>>>>> where 'sk != skb->sk' so removing this check can enable
>>>>> the new use case. Maybe you can add this into your commit
>>>>> message so people can understand the history of 'sk == skb->sk'.
>>>>
>>>> After checking the code and the Alexei's comment[1] again, this check
>>>> may be different from what I thought. In another post[2],
>>>> Daniel Borkmann mentioned
>>>>
>>>>      Wouldn't that mean however, when you go through stacked devices that
>>>>      you'd run the same eBPF cgroup program for skb->sk multiple times?
>>>>
>>>> I read this paragraph several times.
>>>> This check ensures the filters are only called for the device on
>>>> the top of a stack.  So, I probably should change the check to
>>>>
>>>>      sk == skb_to_full_sk(skb)
>>>
>>> I think this should work. It exactly covers your use case:
>>>    they are owned by a request_sock but sent through
>>>    the listening sock, that is the socket listening incoming connections
>>> and sk == skb->sk for non request_sock/listening_sock case.
>>
>> Just a thought, should the test look like the below?
>>
>>          int __ret = 0;                                                         \
>>          if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk) {                    \
>>                  typeof(sk) __sk = sk_to_full_sk(sk);                           \
>>                  if (sk_fullsock(__sk) && __sk == skb_to_full_sk(skb) &&        \
>>                      cgroup_bpf_sock_enabled(__sk, CGROUP_INET_EGRESS))         \
>>                          __ret = __cgroup_bpf_run_filter_skb(__sk, skb,         \
>> CGROUP_INET_EGRESS); \
>> }                                                                      \
>>
>> Iow, we do already convert __sk to full sk, so we should then also use that
>> for the test with skb_to_full_sk(skb).
> 
> Agree!

It would also be useful to do an in-depth analysis for the commit msg in which
cases the sk == skb->sk matches and sk was not a full sock (but __sk is) given
the __sk = sk_to_full_sk(sk) exists in the code to document which situation this
is covering in the existing code (... perhaps it used to work back then for
synack just that later changes altered it without anyone noticing until now).

Thanks,
Daniel
Kui-Feng Lee June 23, 2023, 4:30 p.m. UTC | #8
On 6/23/23 01:50, Daniel Borkmann wrote:
> On 6/23/23 1:55 AM, Kui-Feng Lee wrote:
>> On 6/22/23 13:06, Daniel Borkmann wrote:
>>> On 6/22/23 8:28 PM, Yonghong Song wrote:
>>>> On 6/22/23 10:15 AM, Kui-Feng Lee wrote:
>>>>> On 6/21/23 20:37, Yonghong Song wrote:
>>>>>> On 6/20/23 10:14 AM, Kui-Feng Lee wrote:
>>>>>>> Always call BPF filters if CGROUP BPF is enabled for EGRESS without
>>>>>>> checking skb->sk against sk.
>>>>>>>
>>>>>>> The filters were called only if skb is owned by the sock that the
>>>>>>> skb is sent out through.  In another words, skb->sk should point to
>>>>>>> the sock that it is sending through its egress.  However, the 
>>>>>>> filters would
>>>>>>> miss SYNACK skbs that they are owned by a request_sock but sent 
>>>>>>> through
>>>>>>> the listening sock, that is the socket listening incoming 
>>>>>>> connections.
>>>>>>> This is an unnecessary restrict.
>>>>>>
>>>>>> The original patch which introduced 'sk == skb->sk' is
>>>>>>    3007098494be  cgroup: add support for eBPF programs
>>>>>> There are no mentioning in commit message why 'sk == skb->sk'
>>>>>> is needed. So it is possible that this is just restricted
>>>>>> for use cases at that moment. Now there are use cases
>>>>>> where 'sk != skb->sk' so removing this check can enable
>>>>>> the new use case. Maybe you can add this into your commit
>>>>>> message so people can understand the history of 'sk == skb->sk'.
>>>>>
>>>>> After checking the code and the Alexei's comment[1] again, this check
>>>>> may be different from what I thought. In another post[2],
>>>>> Daniel Borkmann mentioned
>>>>>
>>>>>      Wouldn't that mean however, when you go through stacked 
>>>>> devices that
>>>>>      you'd run the same eBPF cgroup program for skb->sk multiple 
>>>>> times?
>>>>>
>>>>> I read this paragraph several times.
>>>>> This check ensures the filters are only called for the device on
>>>>> the top of a stack.  So, I probably should change the check to
>>>>>
>>>>>      sk == skb_to_full_sk(skb)
>>>>
>>>> I think this should work. It exactly covers your use case:
>>>>    they are owned by a request_sock but sent through
>>>>    the listening sock, that is the socket listening incoming 
>>>> connections
>>>> and sk == skb->sk for non request_sock/listening_sock case.
>>>
>>> Just a thought, should the test look like the below?
>>>
>>>          int __ret = 
>>> 0;                                                         \
>>>          if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk) 
>>> {                    \
>>>                  typeof(sk) __sk = 
>>> sk_to_full_sk(sk);                           \
>>>                  if (sk_fullsock(__sk) && __sk == skb_to_full_sk(skb) 
>>> &&        \
>>>                      cgroup_bpf_sock_enabled(__sk, 
>>> CGROUP_INET_EGRESS))         \
>>>                          __ret = __cgroup_bpf_run_filter_skb(__sk, 
>>> skb,         \
>>> CGROUP_INET_EGRESS); \
>>> }                                                                      \
>>>
>>> Iow, we do already convert __sk to full sk, so we should then also 
>>> use that
>>> for the test with skb_to_full_sk(skb).
>>
>> Agree!
> 
> It would also be useful to do an in-depth analysis for the commit msg in 
> which
> cases the sk == skb->sk matches and sk was not a full sock (but __sk is) 
> given
> the __sk = sk_to_full_sk(sk) exists in the code to document which 
> situation this
> is covering in the existing code (... perhaps it used to work back then for
> synack just that later changes altered it without anyone noticing until 
> now).

I did a test that trace how a packet going through L2TP
devices. I am going to include the analysis of the test and other
related links of discussions in the commit log.
diff mbox series

Patch

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 57e9e109257e..e656da531f9f 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -199,7 +199,7 @@  static inline bool cgroup_bpf_sock_enabled(struct sock *sk,
 #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb)			       \
 ({									       \
 	int __ret = 0;							       \
-	if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk && sk == skb->sk) { \
+	if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk) {		       \
 		typeof(sk) __sk = sk_to_full_sk(sk);			       \
 		if (sk_fullsock(__sk) &&				       \
 		    cgroup_bpf_sock_enabled(__sk, CGROUP_INET_EGRESS))	       \