diff mbox series

[v2,bpf-next,4/5] bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself

Message ID 20220923224518.2353383-1-kafai@fb.com (mailing list archive)
State Superseded
Delegated to: BPF
Headers show
Series bpf: Remove recursion check for struct_ops prog | expand

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 2133 this patch: 2133
netdev/cc_maintainers warning 8 maintainers not CCed: sdf@google.com john.fastabend@gmail.com yhs@fb.com haoluo@google.com jolsa@kernel.org kpsingh@kernel.org song@kernel.org martin.lau@linux.dev
netdev/build_clang success Errors and warnings before: 545 this patch: 545
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 2263 this patch: 2263
netdev/checkpatch warning WARNING: line length of 81 exceeds 80 columns WARNING: line length of 85 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-4 success Logs for llvm-toolchain
bpf/vmtest-bpf-next-VM_Test-5 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-1 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-11 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-12 success Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-15 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-8 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-6 success Logs for test_maps on s390x with gcc

Commit Message

Martin KaFai Lau Sept. 23, 2022, 10:45 p.m. UTC
From: Martin KaFai Lau <martin.lau@kernel.org>

When a bad bpf prog '.init' calls
bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop:

.init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ...
... => .init => bpf_setsockopt(tcp_cc).

It was prevented by the prog->active counter before but the prog->active
detection cannot be used in struct_ops as explained in the earlier
patch of the set.

In this patch, the second bpf_setsockopt(tcp_cc) is not allowed
in order to break the loop.  This is done by using a bit of
an existing 1 byte hole in tcp_sock to check if there is
on-going bpf_setsockopt(TCP_CONGESTION) in this tcp_sock.

Note that this essentially limits only the first '.init' can
call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. peer
does not support ECN) and the second '.init' cannot fallback to
another cc.  This applies even the second
bpf_setsockopt(TCP_CONGESTION) will not cause a loop.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
---
 include/linux/tcp.h |  6 ++++++
 net/core/filter.c   | 28 +++++++++++++++++++++++++++-
 2 files changed, 33 insertions(+), 1 deletion(-)

Comments

Alexei Starovoitov Sept. 27, 2022, 3:34 a.m. UTC | #1
On Fri, Sep 23, 2022 at 3:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> From: Martin KaFai Lau <martin.lau@kernel.org>
>
> When a bad bpf prog '.init' calls
> bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop:
>
> .init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ...
> ... => .init => bpf_setsockopt(tcp_cc).
>
> It was prevented by the prog->active counter before but the prog->active
> detection cannot be used in struct_ops as explained in the earlier
> patch of the set.
>
> In this patch, the second bpf_setsockopt(tcp_cc) is not allowed
> in order to break the loop.  This is done by using a bit of
> an existing 1 byte hole in tcp_sock to check if there is
> on-going bpf_setsockopt(TCP_CONGESTION) in this tcp_sock.
>
> Note that this essentially limits only the first '.init' can
> call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. peer
> does not support ECN) and the second '.init' cannot fallback to
> another cc.  This applies even the second
> bpf_setsockopt(TCP_CONGESTION) will not cause a loop.
>
> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
> ---
>  include/linux/tcp.h |  6 ++++++
>  net/core/filter.c   | 28 +++++++++++++++++++++++++++-
>  2 files changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index a9fbe22732c3..3bdf687e2fb3 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -388,6 +388,12 @@ struct tcp_sock {
>         u8      bpf_sock_ops_cb_flags;  /* Control calling BPF programs
>                                          * values defined in uapi/linux/tcp.h
>                                          */
> +       u8      bpf_chg_cc_inprogress:1; /* In the middle of
> +                                         * bpf_setsockopt(TCP_CONGESTION),
> +                                         * it is to avoid the bpf_tcp_cc->init()
> +                                         * to recur itself by calling
> +                                         * bpf_setsockopt(TCP_CONGESTION, "itself").
> +                                         */
>  #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
>  #else
>  #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 96f2f7a65e65..ac4c45c02da5 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5105,6 +5105,9 @@ static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname,
>  static int sol_tcp_sockopt_congestion(struct sock *sk, char *optval,
>                                       int *optlen, bool getopt)
>  {
> +       struct tcp_sock *tp;
> +       int ret;
> +
>         if (*optlen < 2)
>                 return -EINVAL;
>
> @@ -5125,8 +5128,31 @@ static int sol_tcp_sockopt_congestion(struct sock *sk, char *optval,
>         if (*optlen >= sizeof("cdg") - 1 && !strncmp("cdg", optval, *optlen))
>                 return -ENOTSUPP;
>
> -       return do_tcp_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
> +       /* It stops this looping
> +        *
> +        * .init => bpf_setsockopt(tcp_cc) => .init =>
> +        * bpf_setsockopt(tcp_cc)" => .init => ....
> +        *
> +        * The second bpf_setsockopt(tcp_cc) is not allowed
> +        * in order to break the loop when both .init
> +        * are the same bpf prog.
> +        *
> +        * This applies even the second bpf_setsockopt(tcp_cc)
> +        * does not cause a loop.  This limits only the first
> +        * '.init' can call bpf_setsockopt(TCP_CONGESTION) to
> +        * pick a fallback cc (eg. peer does not support ECN)
> +        * and the second '.init' cannot fallback to
> +        * another.
> +        */
> +       tp = tcp_sk(sk);
> +       if (tp->bpf_chg_cc_inprogress)
> +               return -EBUSY;
> +
> +       tp->bpf_chg_cc_inprogress = 1;
> +       ret = do_tcp_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
>                                 KERNEL_SOCKPTR(optval), *optlen);
> +       tp->bpf_chg_cc_inprogress = 0;
> +       return ret;

Eric,

Could you please ack this patch?
Alexei Starovoitov Sept. 29, 2022, 1:12 a.m. UTC | #2
Eric,

Ping! This is an important fix for anyone using bpf-based tcp-cc.

On Mon, Sep 26, 2022 at 8:34 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Sep 23, 2022 at 3:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > From: Martin KaFai Lau <martin.lau@kernel.org>
> >
> > When a bad bpf prog '.init' calls
> > bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop:
> >
> > .init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ...
> > ... => .init => bpf_setsockopt(tcp_cc).
> >
> > It was prevented by the prog->active counter before but the prog->active
> > detection cannot be used in struct_ops as explained in the earlier
> > patch of the set.
> >
> > In this patch, the second bpf_setsockopt(tcp_cc) is not allowed
> > in order to break the loop.  This is done by using a bit of
> > an existing 1 byte hole in tcp_sock to check if there is
> > on-going bpf_setsockopt(TCP_CONGESTION) in this tcp_sock.
> >
> > Note that this essentially limits only the first '.init' can
> > call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. peer
> > does not support ECN) and the second '.init' cannot fallback to
> > another cc.  This applies even the second
> > bpf_setsockopt(TCP_CONGESTION) will not cause a loop.
> >
> > Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
> > ---
> >  include/linux/tcp.h |  6 ++++++
> >  net/core/filter.c   | 28 +++++++++++++++++++++++++++-
> >  2 files changed, 33 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> > index a9fbe22732c3..3bdf687e2fb3 100644
> > --- a/include/linux/tcp.h
> > +++ b/include/linux/tcp.h
> > @@ -388,6 +388,12 @@ struct tcp_sock {
> >         u8      bpf_sock_ops_cb_flags;  /* Control calling BPF programs
> >                                          * values defined in uapi/linux/tcp.h
> >                                          */
> > +       u8      bpf_chg_cc_inprogress:1; /* In the middle of
> > +                                         * bpf_setsockopt(TCP_CONGESTION),
> > +                                         * it is to avoid the bpf_tcp_cc->init()
> > +                                         * to recur itself by calling
> > +                                         * bpf_setsockopt(TCP_CONGESTION, "itself").
> > +                                         */
> >  #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
> >  #else
> >  #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 96f2f7a65e65..ac4c45c02da5 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -5105,6 +5105,9 @@ static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname,
> >  static int sol_tcp_sockopt_congestion(struct sock *sk, char *optval,
> >                                       int *optlen, bool getopt)
> >  {
> > +       struct tcp_sock *tp;
> > +       int ret;
> > +
> >         if (*optlen < 2)
> >                 return -EINVAL;
> >
> > @@ -5125,8 +5128,31 @@ static int sol_tcp_sockopt_congestion(struct sock *sk, char *optval,
> >         if (*optlen >= sizeof("cdg") - 1 && !strncmp("cdg", optval, *optlen))
> >                 return -ENOTSUPP;
> >
> > -       return do_tcp_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
> > +       /* It stops this looping
> > +        *
> > +        * .init => bpf_setsockopt(tcp_cc) => .init =>
> > +        * bpf_setsockopt(tcp_cc)" => .init => ....
> > +        *
> > +        * The second bpf_setsockopt(tcp_cc) is not allowed
> > +        * in order to break the loop when both .init
> > +        * are the same bpf prog.
> > +        *
> > +        * This applies even the second bpf_setsockopt(tcp_cc)
> > +        * does not cause a loop.  This limits only the first
> > +        * '.init' can call bpf_setsockopt(TCP_CONGESTION) to
> > +        * pick a fallback cc (eg. peer does not support ECN)
> > +        * and the second '.init' cannot fallback to
> > +        * another.
> > +        */
> > +       tp = tcp_sk(sk);
> > +       if (tp->bpf_chg_cc_inprogress)
> > +               return -EBUSY;
> > +
> > +       tp->bpf_chg_cc_inprogress = 1;
> > +       ret = do_tcp_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
> >                                 KERNEL_SOCKPTR(optval), *optlen);
> > +       tp->bpf_chg_cc_inprogress = 0;
> > +       return ret;
>
> Eric,
>
> Could you please ack this patch?
Eric Dumazet Sept. 29, 2022, 2:04 a.m. UTC | #3
On Fri, Sep 23, 2022 at 3:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> From: Martin KaFai Lau <martin.lau@kernel.org>
>
> When a bad bpf prog '.init' calls
> bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop:
>
> .init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ...
> ... => .init => bpf_setsockopt(tcp_cc).
>
> It was prevented by the prog->active counter before but the prog->active
> detection cannot be used in struct_ops as explained in the earlier
> patch of the set.
>
> In this patch, the second bpf_setsockopt(tcp_cc) is not allowed
> in order to break the loop.  This is done by using a bit of
> an existing 1 byte hole in tcp_sock to check if there is
> on-going bpf_setsockopt(TCP_CONGESTION) in this tcp_sock.
>
> Note that this essentially limits only the first '.init' can
> call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. peer
> does not support ECN) and the second '.init' cannot fallback to
> another cc.  This applies even the second
> bpf_setsockopt(TCP_CONGESTION) will not cause a loop.
>
> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
> ---
>  include/linux/tcp.h |  6 ++++++
>  net/core/filter.c   | 28 +++++++++++++++++++++++++++-
>  2 files changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index a9fbe22732c3..3bdf687e2fb3 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -388,6 +388,12 @@ struct tcp_sock {
>         u8      bpf_sock_ops_cb_flags;  /* Control calling BPF programs
>                                          * values defined in uapi/linux/tcp.h
>                                          */
> +       u8      bpf_chg_cc_inprogress:1; /* In the middle of
> +                                         * bpf_setsockopt(TCP_CONGESTION),
> +                                         * it is to avoid the bpf_tcp_cc->init()
> +                                         * to recur itself by calling
> +                                         * bpf_setsockopt(TCP_CONGESTION, "itself").
> +                                         */
>  #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
>  #else
>  #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 96f2f7a65e65..ac4c45c02da5 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5105,6 +5105,9 @@ static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname,
>  static int sol_tcp_sockopt_congestion(struct sock *sk, char *optval,
>                                       int *optlen, bool getopt)
>  {
> +       struct tcp_sock *tp;
> +       int ret;
> +
>         if (*optlen < 2)
>                 return -EINVAL;
>
> @@ -5125,8 +5128,31 @@ static int sol_tcp_sockopt_congestion(struct sock *sk, char *optval,
>         if (*optlen >= sizeof("cdg") - 1 && !strncmp("cdg", optval, *optlen))
>                 return -ENOTSUPP;
>
> -       return do_tcp_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
> +       /* It stops this looping
> +        *
> +        * .init => bpf_setsockopt(tcp_cc) => .init =>
> +        * bpf_setsockopt(tcp_cc)" => .init => ....
> +        *
> +        * The second bpf_setsockopt(tcp_cc) is not allowed
> +        * in order to break the loop when both .init
> +        * are the same bpf prog.
> +        *
> +        * This applies even the second bpf_setsockopt(tcp_cc)
> +        * does not cause a loop.  This limits only the first
> +        * '.init' can call bpf_setsockopt(TCP_CONGESTION) to
> +        * pick a fallback cc (eg. peer does not support ECN)
> +        * and the second '.init' cannot fallback to
> +        * another.
> +        */
> +       tp = tcp_sk(sk);
> +       if (tp->bpf_chg_cc_inprogress)
> +               return -EBUSY;
> +

Is the socket locked (and owned by current thread) at this point ?
If not, changing bpf_chg_cc_inprogress would be racy.


> +       tp->bpf_chg_cc_inprogress = 1;
> +       ret = do_tcp_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
>                                 KERNEL_SOCKPTR(optval), *optlen);
> +       tp->bpf_chg_cc_inprogress = 0;
> +       return ret;
>  }
>
>  static int sol_tcp_sockopt(struct sock *sk, int optname,
> --
> 2.30.2
>
Martin KaFai Lau Sept. 29, 2022, 5:31 a.m. UTC | #4
On 9/28/22 7:04 PM, Eric Dumazet wrote:
> On Fri, Sep 23, 2022 at 3:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
>>
>> From: Martin KaFai Lau <martin.lau@kernel.org>
>>
>> When a bad bpf prog '.init' calls
>> bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop:
>>
>> .init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ...
>> ... => .init => bpf_setsockopt(tcp_cc).
>>
>> It was prevented by the prog->active counter before but the prog->active
>> detection cannot be used in struct_ops as explained in the earlier
>> patch of the set.
>>
>> In this patch, the second bpf_setsockopt(tcp_cc) is not allowed
>> in order to break the loop.  This is done by using a bit of
>> an existing 1 byte hole in tcp_sock to check if there is
>> on-going bpf_setsockopt(TCP_CONGESTION) in this tcp_sock.
>>
>> Note that this essentially limits only the first '.init' can
>> call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. peer
>> does not support ECN) and the second '.init' cannot fallback to
>> another cc.  This applies even the second
>> bpf_setsockopt(TCP_CONGESTION) will not cause a loop.
>>
>> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
>> ---
>>   include/linux/tcp.h |  6 ++++++
>>   net/core/filter.c   | 28 +++++++++++++++++++++++++++-
>>   2 files changed, 33 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
>> index a9fbe22732c3..3bdf687e2fb3 100644
>> --- a/include/linux/tcp.h
>> +++ b/include/linux/tcp.h
>> @@ -388,6 +388,12 @@ struct tcp_sock {
>>          u8      bpf_sock_ops_cb_flags;  /* Control calling BPF programs
>>                                           * values defined in uapi/linux/tcp.h
>>                                           */
>> +       u8      bpf_chg_cc_inprogress:1; /* In the middle of
>> +                                         * bpf_setsockopt(TCP_CONGESTION),
>> +                                         * it is to avoid the bpf_tcp_cc->init()
>> +                                         * to recur itself by calling
>> +                                         * bpf_setsockopt(TCP_CONGESTION, "itself").
>> +                                         */
>>   #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
>>   #else
>>   #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 96f2f7a65e65..ac4c45c02da5 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -5105,6 +5105,9 @@ static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname,
>>   static int sol_tcp_sockopt_congestion(struct sock *sk, char *optval,
>>                                        int *optlen, bool getopt)
>>   {
>> +       struct tcp_sock *tp;
>> +       int ret;
>> +
>>          if (*optlen < 2)
>>                  return -EINVAL;
>>
>> @@ -5125,8 +5128,31 @@ static int sol_tcp_sockopt_congestion(struct sock *sk, char *optval,
>>          if (*optlen >= sizeof("cdg") - 1 && !strncmp("cdg", optval, *optlen))
>>                  return -ENOTSUPP;
>>
>> -       return do_tcp_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
>> +       /* It stops this looping
>> +        *
>> +        * .init => bpf_setsockopt(tcp_cc) => .init =>
>> +        * bpf_setsockopt(tcp_cc)" => .init => ....
>> +        *
>> +        * The second bpf_setsockopt(tcp_cc) is not allowed
>> +        * in order to break the loop when both .init
>> +        * are the same bpf prog.
>> +        *
>> +        * This applies even the second bpf_setsockopt(tcp_cc)
>> +        * does not cause a loop.  This limits only the first
>> +        * '.init' can call bpf_setsockopt(TCP_CONGESTION) to
>> +        * pick a fallback cc (eg. peer does not support ECN)
>> +        * and the second '.init' cannot fallback to
>> +        * another.
>> +        */
>> +       tp = tcp_sk(sk);
>> +       if (tp->bpf_chg_cc_inprogress)
>> +               return -EBUSY;
>> +
> 
> Is the socket locked (and owned by current thread) at this point ?
> If not, changing bpf_chg_cc_inprogress would be racy.

Yes, the socket is locked and owned.  There is a sock_owned_by_me check earlier 
in _bpf_setsockopt().

> 
> 
>> +       tp->bpf_chg_cc_inprogress = 1;
>> +       ret = do_tcp_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
>>                                  KERNEL_SOCKPTR(optval), *optlen);
>> +       tp->bpf_chg_cc_inprogress = 0;
>> +       return ret;
>>   }
>>
>>   static int sol_tcp_sockopt(struct sock *sk, int optname,
>> --
>> 2.30.2
>>
Eric Dumazet Sept. 29, 2022, 5:37 a.m. UTC | #5
On Wed, Sep 28, 2022 at 10:31 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 9/28/22 7:04 PM, Eric Dumazet wrote:
> > On Fri, Sep 23, 2022 at 3:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >>
> >> From: Martin KaFai Lau <martin.lau@kernel.org>
> >>
> >> When a bad bpf prog '.init' calls
> >> bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop:
> >>
> >> .init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ...
> >> ... => .init => bpf_setsockopt(tcp_cc).
> >>
> >> It was prevented by the prog->active counter before but the prog->active
> >> detection cannot be used in struct_ops as explained in the earlier
> >> patch of the set.
> >>
> >> In this patch, the second bpf_setsockopt(tcp_cc) is not allowed
> >> in order to break the loop.  This is done by using a bit of
> >> an existing 1 byte hole in tcp_sock to check if there is
> >> on-going bpf_setsockopt(TCP_CONGESTION) in this tcp_sock.
> >>
> >> Note that this essentially limits only the first '.init' can
> >> call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. peer
> >> does not support ECN) and the second '.init' cannot fallback to
> >> another cc.  This applies even the second
> >> bpf_setsockopt(TCP_CONGESTION) will not cause a loop.
> >>
> >> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
> >> ---
> >>   include/linux/tcp.h |  6 ++++++
> >>   net/core/filter.c   | 28 +++++++++++++++++++++++++++-
> >>   2 files changed, 33 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> >> index a9fbe22732c3..3bdf687e2fb3 100644
> >> --- a/include/linux/tcp.h
> >> +++ b/include/linux/tcp.h
> >> @@ -388,6 +388,12 @@ struct tcp_sock {
> >>          u8      bpf_sock_ops_cb_flags;  /* Control calling BPF programs
> >>                                           * values defined in uapi/linux/tcp.h
> >>                                           */
> >> +       u8      bpf_chg_cc_inprogress:1; /* In the middle of
> >> +                                         * bpf_setsockopt(TCP_CONGESTION),
> >> +                                         * it is to avoid the bpf_tcp_cc->init()
> >> +                                         * to recur itself by calling
> >> +                                         * bpf_setsockopt(TCP_CONGESTION, "itself").
> >> +                                         */
> >>   #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
> >>   #else
> >>   #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
> >> diff --git a/net/core/filter.c b/net/core/filter.c
> >> index 96f2f7a65e65..ac4c45c02da5 100644
> >> --- a/net/core/filter.c
> >> +++ b/net/core/filter.c
> >> @@ -5105,6 +5105,9 @@ static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname,
> >>   static int sol_tcp_sockopt_congestion(struct sock *sk, char *optval,
> >>                                        int *optlen, bool getopt)
> >>   {
> >> +       struct tcp_sock *tp;
> >> +       int ret;
> >> +
> >>          if (*optlen < 2)
> >>                  return -EINVAL;
> >>
> >> @@ -5125,8 +5128,31 @@ static int sol_tcp_sockopt_congestion(struct sock *sk, char *optval,
> >>          if (*optlen >= sizeof("cdg") - 1 && !strncmp("cdg", optval, *optlen))
> >>                  return -ENOTSUPP;
> >>
> >> -       return do_tcp_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
> >> +       /* It stops this looping
> >> +        *
> >> +        * .init => bpf_setsockopt(tcp_cc) => .init =>
> >> +        * bpf_setsockopt(tcp_cc)" => .init => ....
> >> +        *
> >> +        * The second bpf_setsockopt(tcp_cc) is not allowed
> >> +        * in order to break the loop when both .init
> >> +        * are the same bpf prog.
> >> +        *
> >> +        * This applies even the second bpf_setsockopt(tcp_cc)
> >> +        * does not cause a loop.  This limits only the first
> >> +        * '.init' can call bpf_setsockopt(TCP_CONGESTION) to
> >> +        * pick a fallback cc (eg. peer does not support ECN)
> >> +        * and the second '.init' cannot fallback to
> >> +        * another.
> >> +        */
> >> +       tp = tcp_sk(sk);
> >> +       if (tp->bpf_chg_cc_inprogress)
> >> +               return -EBUSY;
> >> +
> >
> > Is the socket locked (and owned by current thread) at this point ?
> > If not, changing bpf_chg_cc_inprogress would be racy.
>
> Yes, the socket is locked and owned.  There is a sock_owned_by_me check earlier
> in _bpf_setsockopt().

Good to know. Note a listener can be cloned without socket lock being held.

In order to avoid surprises, I would clear bpf_chg_cc_inprogress in
tcp_create_openreq_child()
Martin KaFai Lau Sept. 29, 2022, 6:17 a.m. UTC | #6
On 9/28/22 10:37 PM, Eric Dumazet wrote:
> On Wed, Sep 28, 2022 at 10:31 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 9/28/22 7:04 PM, Eric Dumazet wrote:
>>> On Fri, Sep 23, 2022 at 3:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
>>>>
>>>> From: Martin KaFai Lau <martin.lau@kernel.org>
>>>>
>>>> When a bad bpf prog '.init' calls
>>>> bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop:
>>>>
>>>> .init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ...
>>>> ... => .init => bpf_setsockopt(tcp_cc).
>>>>
>>>> It was prevented by the prog->active counter before but the prog->active
>>>> detection cannot be used in struct_ops as explained in the earlier
>>>> patch of the set.
>>>>
>>>> In this patch, the second bpf_setsockopt(tcp_cc) is not allowed
>>>> in order to break the loop.  This is done by using a bit of
>>>> an existing 1 byte hole in tcp_sock to check if there is
>>>> on-going bpf_setsockopt(TCP_CONGESTION) in this tcp_sock.
>>>>
>>>> Note that this essentially limits only the first '.init' can
>>>> call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. peer
>>>> does not support ECN) and the second '.init' cannot fallback to
>>>> another cc.  This applies even the second
>>>> bpf_setsockopt(TCP_CONGESTION) will not cause a loop.
>>>>
>>>> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
>>>> ---
>>>>    include/linux/tcp.h |  6 ++++++
>>>>    net/core/filter.c   | 28 +++++++++++++++++++++++++++-
>>>>    2 files changed, 33 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
>>>> index a9fbe22732c3..3bdf687e2fb3 100644
>>>> --- a/include/linux/tcp.h
>>>> +++ b/include/linux/tcp.h
>>>> @@ -388,6 +388,12 @@ struct tcp_sock {
>>>>           u8      bpf_sock_ops_cb_flags;  /* Control calling BPF programs
>>>>                                            * values defined in uapi/linux/tcp.h
>>>>                                            */
>>>> +       u8      bpf_chg_cc_inprogress:1; /* In the middle of
>>>> +                                         * bpf_setsockopt(TCP_CONGESTION),
>>>> +                                         * it is to avoid the bpf_tcp_cc->init()
>>>> +                                         * to recur itself by calling
>>>> +                                         * bpf_setsockopt(TCP_CONGESTION, "itself").
>>>> +                                         */
>>>>    #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
>>>>    #else
>>>>    #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
>>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>>> index 96f2f7a65e65..ac4c45c02da5 100644
>>>> --- a/net/core/filter.c
>>>> +++ b/net/core/filter.c
>>>> @@ -5105,6 +5105,9 @@ static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname,
>>>>    static int sol_tcp_sockopt_congestion(struct sock *sk, char *optval,
>>>>                                         int *optlen, bool getopt)
>>>>    {
>>>> +       struct tcp_sock *tp;
>>>> +       int ret;
>>>> +
>>>>           if (*optlen < 2)
>>>>                   return -EINVAL;
>>>>
>>>> @@ -5125,8 +5128,31 @@ static int sol_tcp_sockopt_congestion(struct sock *sk, char *optval,
>>>>           if (*optlen >= sizeof("cdg") - 1 && !strncmp("cdg", optval, *optlen))
>>>>                   return -ENOTSUPP;
>>>>
>>>> -       return do_tcp_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
>>>> +       /* It stops this looping
>>>> +        *
>>>> +        * .init => bpf_setsockopt(tcp_cc) => .init =>
>>>> +        * bpf_setsockopt(tcp_cc)" => .init => ....
>>>> +        *
>>>> +        * The second bpf_setsockopt(tcp_cc) is not allowed
>>>> +        * in order to break the loop when both .init
>>>> +        * are the same bpf prog.
>>>> +        *
>>>> +        * This applies even the second bpf_setsockopt(tcp_cc)
>>>> +        * does not cause a loop.  This limits only the first
>>>> +        * '.init' can call bpf_setsockopt(TCP_CONGESTION) to
>>>> +        * pick a fallback cc (eg. peer does not support ECN)
>>>> +        * and the second '.init' cannot fallback to
>>>> +        * another.
>>>> +        */
>>>> +       tp = tcp_sk(sk);
>>>> +       if (tp->bpf_chg_cc_inprogress)
>>>> +               return -EBUSY;
>>>> +
>>>
>>> Is the socket locked (and owned by current thread) at this point ?
>>> If not, changing bpf_chg_cc_inprogress would be racy.
>>
>> Yes, the socket is locked and owned.  There is a sock_owned_by_me check earlier
>> in _bpf_setsockopt().
> 
> Good to know. Note a listener can be cloned without socket lock being held.
> 
> In order to avoid surprises, I would clear bpf_chg_cc_inprogress in
> tcp_create_openreq_child()

Ah, make sense.  I will re-spin.
diff mbox series

Patch

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index a9fbe22732c3..3bdf687e2fb3 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -388,6 +388,12 @@  struct tcp_sock {
 	u8	bpf_sock_ops_cb_flags;  /* Control calling BPF programs
 					 * values defined in uapi/linux/tcp.h
 					 */
+	u8	bpf_chg_cc_inprogress:1; /* In the middle of
+					  * bpf_setsockopt(TCP_CONGESTION),
+					  * it is to avoid the bpf_tcp_cc->init()
+					  * to recur itself by calling
+					  * bpf_setsockopt(TCP_CONGESTION, "itself").
+					  */
 #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
 #else
 #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
diff --git a/net/core/filter.c b/net/core/filter.c
index 96f2f7a65e65..ac4c45c02da5 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5105,6 +5105,9 @@  static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname,
 static int sol_tcp_sockopt_congestion(struct sock *sk, char *optval,
 				      int *optlen, bool getopt)
 {
+	struct tcp_sock *tp;
+	int ret;
+
 	if (*optlen < 2)
 		return -EINVAL;
 
@@ -5125,8 +5128,31 @@  static int sol_tcp_sockopt_congestion(struct sock *sk, char *optval,
 	if (*optlen >= sizeof("cdg") - 1 && !strncmp("cdg", optval, *optlen))
 		return -ENOTSUPP;
 
-	return do_tcp_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
+	/* It stops this looping
+	 *
+	 * .init => bpf_setsockopt(tcp_cc) => .init =>
+	 * bpf_setsockopt(tcp_cc)" => .init => ....
+	 *
+	 * The second bpf_setsockopt(tcp_cc) is not allowed
+	 * in order to break the loop when both .init
+	 * are the same bpf prog.
+	 *
+	 * This applies even the second bpf_setsockopt(tcp_cc)
+	 * does not cause a loop.  This limits only the first
+	 * '.init' can call bpf_setsockopt(TCP_CONGESTION) to
+	 * pick a fallback cc (eg. peer does not support ECN)
+	 * and the second '.init' cannot fallback to
+	 * another.
+	 */
+	tp = tcp_sk(sk);
+	if (tp->bpf_chg_cc_inprogress)
+		return -EBUSY;
+
+	tp->bpf_chg_cc_inprogress = 1;
+	ret = do_tcp_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
 				KERNEL_SOCKPTR(optval), *optlen);
+	tp->bpf_chg_cc_inprogress = 0;
+	return ret;
 }
 
 static int sol_tcp_sockopt(struct sock *sk, int optname,