
[RFC,v7,0/8] net_sched: Introduce eBPF based Qdisc

Message ID cover.1705432850.git.amery.hung@bytedance.com (mailing list archive)

Message

Amery Hung Jan. 17, 2024, 9:56 p.m. UTC
Hi, 

I am continuing the work of eBPF-based Qdisc based on Cong’s previous
RFC. The following are some use cases of eBPF Qdisc:

1. Allow customizing Qdiscs in an easier way, so that people don't
   have to write a complete Qdisc kernel module just to experiment
   with a new queuing theory.

2. Solve EDT's problem. EDT calculates the "tokens" in clsact, which
   runs before enqueue, so it is impossible to adjust those "tokens"
   after packets get dropped in enqueue. With eBPF Qdisc, this is
   easily solved with a map shared between clsact and sch_bpf.

3. Replace qevents, as now the user gains much more control over the
   skb and queues.

4. Provide a new way to reuse TC filters. Currently TC relies on filter
   chain and block to reuse the TC filters, but they are too complicated
   to understand. With eBPF helper bpf_skb_tc_classify(), we can invoke
   TC filters on _any_ Qdisc (even on a different netdev) to do the
   classification.

5. Potentially pave a way for ingress to queue packets, although the
   current implementation is still only for egress.

I’ve combed through previous comments and appreciate the feedback.
The major changes in this RFC are the use of a kptr to the skb to
maintain the validity of the skb during its lifetime in the Qdisc,
the dropping of rbtree maps, and the inclusion of two examples.

Some questions for discussion:

1. We now pass a trusted kptr of sk_buff to the program instead of
   __sk_buff. This makes most helpers using __sk_buff incompatible
   with eBPF qdisc. An alternative is to still use __sk_buff in the
   context and use bpf_cast_to_kern_ctx() to acquire the kptr. However,
   this can only be applied to the enqueue program, since in the dequeue
   program skbs do not come from the ctx but are kptrs exchanged out of
   maps (i.e., there is no __sk_buff). Any suggestions for making the
   skb kptr and helper functions compatible?

2. The current patchset uses netlink. Do we also want to use bpf_link
   for attachment?

3. People have suggested struct_ops. We chose not to use struct_ops since
   users might want to create multiple bpf qdiscs with different
   implementations. Current struct_ops attachment model does not seem
   to support replacing only functions of a specific instance of a module,
   but I might be wrong.
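
For reference, the bpf_cast_to_kern_ctx() alternative in question 1
would look roughly like this on the enqueue side (the section name and
return code are illustrative only; dequeue has no equivalent since its
skbs come out of maps as kptrs):

```c
/* Sketch only: keep __sk_buff as the enqueue context and derive the
 * trusted kernel skb pointer from it with bpf_cast_to_kern_ctx(). */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

extern void *bpf_cast_to_kern_ctx(void *obj) __ksym;

SEC("qdisc/enqueue")			/* illustrative section name */
int enqueue_prog(struct __sk_buff *ctx)
{
	struct sk_buff *skb = bpf_cast_to_kern_ctx(ctx);

	/* skb is now a trusted kptr, usable with skb kfuncs, while ctx
	 * remains usable with existing __sk_buff-based helpers. */
	bpf_printk("enqueue skb len=%u", skb->len);
	return 0;			/* illustrative return code */
}
```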

Todo:
  - Add selftest

  - Make bpf list/rbtree use list/rbnode in skb so that developers
    don't need to allocate bpf objects for storing skb kptrs.
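
As things stand, storing skb kptrs in a bpf rbtree needs a wrapper
object along these lines (a sketch; the field names are illustrative,
and container_of/bpf_rb_node follow the conventions of the samples'
bpf_experimental.h):

```c
/* Sketch only: an allocated bpf object that carries the skb kptr so
 * it can be linked into a bpf rbtree. The Todo above would let the
 * rbtree use a node embedded in the skb itself instead. */
struct skb_node {
	__u64 tstamp;			/* sort key, e.g. EDT time */
	struct sk_buff __kptr *skb;	/* skb kptr exchanged out of a map */
	struct bpf_rb_node node;	/* links this object into the rbtree */
};

/* less() callback ordering the rbtree by tstamp */
static bool skb_less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
{
	struct skb_node *na = container_of(a, struct skb_node, node);
	struct skb_node *nb = container_of(b, struct skb_node, node);

	return na->tstamp < nb->tstamp;
}
```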

Note:
  - This patchset requires bpf support of exchanging kptr into allocated
    objects (local kptr), which Dave Marchevsky is working on.

  - The user space programs in the samples are adapted from the example
    Peihao Yang wrote in the RFC v5 thread.

---
v7: Reference skb using kptr to sk_buff instead of __sk_buff
    Use the new bpf rbtree/list for skb queues
    Add reset and init programs
    Add a bpf fq qdisc sample
    Add a bpf netem qdisc sample

v6: switch to kptr based approach

v5: mv kernel/bpf/skb_map.c net/core/skb_map.c
    implement flow map as map-in-map
    rename bpf_skb_tc_classify() and move it to net/sched/cls_api.c
    clean up eBPF qdisc program context

v4: get rid of PIFO, use rbtree directly

v3: move priority queue from sch_bpf to skb map
    introduce skb map and its helpers
    introduce bpf_skb_classify()
    use netdevice notifier to reset skb's
    Rebase on latest bpf-next

v2: Rebase on latest net-next
    Make the code more complete (but still incomplete)

Amery Hung (5):
  net_sched: Add reset program
  net_sched: Add init program
  tools/libbpf: Add support for BPF_PROG_TYPE_QDISC
  samples/bpf: Add an example of bpf fq qdisc
  samples/bpf: Add an example of bpf netem qdisc

Cong Wang (3):
  net_sched: Introduce eBPF based Qdisc
  net_sched: Add kfuncs for working with skb
  net_sched: Introduce kfunc bpf_skb_tc_classify()

 include/linux/bpf_types.h       |   4 +
 include/uapi/linux/bpf.h        |  23 +
 include/uapi/linux/pkt_sched.h  |  24 ++
 kernel/bpf/btf.c                |   5 +
 kernel/bpf/helpers.c            |   1 +
 kernel/bpf/syscall.c            |  10 +
 net/core/filter.c               | 100 +++++
 net/sched/Kconfig               |  15 +
 net/sched/Makefile              |   1 +
 net/sched/sch_bpf.c             | 729 ++++++++++++++++++++++++++++++++
 samples/bpf/Makefile            |  14 +-
 samples/bpf/bpf_experimental.h  | 134 ++++++
 samples/bpf/tc_clsact_edt.bpf.c | 103 +++++
 samples/bpf/tc_sch_fq.bpf.c     | 666 +++++++++++++++++++++++++++++
 samples/bpf/tc_sch_fq.c         | 321 ++++++++++++++
 samples/bpf/tc_sch_netem.bpf.c  | 256 +++++++++++
 samples/bpf/tc_sch_netem.c      | 347 +++++++++++++++
 tools/include/uapi/linux/bpf.h  |  23 +
 tools/lib/bpf/libbpf.c          |   4 +
 19 files changed, 2779 insertions(+), 1 deletion(-)
 create mode 100644 net/sched/sch_bpf.c
 create mode 100644 samples/bpf/bpf_experimental.h
 create mode 100644 samples/bpf/tc_clsact_edt.bpf.c
 create mode 100644 samples/bpf/tc_sch_fq.bpf.c
 create mode 100644 samples/bpf/tc_sch_fq.c
 create mode 100644 samples/bpf/tc_sch_netem.bpf.c
 create mode 100644 samples/bpf/tc_sch_netem.c

Comments

Stanislav Fomichev Jan. 23, 2024, 9:13 p.m. UTC | #1
On 01/17, Amery Hung wrote:
> [...]
> 
> 2. The current patchset uses netlink. Do we also want to use bpf_link
>    for attachment?

[..]

> 3. People have suggested struct_ops. We chose not to use struct_ops since
>    users might want to create multiple bpf qdiscs with different
>    implementations. Current struct_ops attachment model does not seem
>    to support replacing only functions of a specific instance of a module,
>    but I might be wrong.

I still feel like it deserves at least a try. Maybe we can find some potential
path where struct_ops can allow different implementations (Martin probably
has some ideas about that). I looked at the bpf qdisc itself and it doesn't
really have anything complicated (besides trying to play nicely with other
tc classes/actions, but I'm not sure how relevant that is).

With struct_ops you can also get your (2) addressed.
Daniel Borkmann Jan. 24, 2024, 10:10 a.m. UTC | #2
On 1/23/24 10:13 PM, Stanislav Fomichev wrote:
> On 01/17, Amery Hung wrote:
>> [...]
> 
>> 3. People have suggested struct_ops. We chose not to use struct_ops since
>>     users might want to create multiple bpf qdiscs with different
>>     implementations. Current struct_ops attachment model does not seem
>>     to support replacing only functions of a specific instance of a module,
>>     but I might be wrong.
> 
> I still feel like it deserves at least a try. Maybe we can find some potential
> path where struct_ops can allow different implementations (Martin probably
> has some ideas about that). I looked at the bpf qdisc itself and it doesn't
> really have anything complicated (besides trying to play nicely with other
> tc classes/actions, but I'm not sure how relevant that is).

Plus it's also not used in the two sample implementations, given you can
implement this as part of the enqueue operation in bpf. It would make sense
to drop the kfunc from the set. One other note: the BPF samples have been
bitrotting for quite a while, so please convert this into a proper BPF
selftest so that BPF CI can run it.

> With struct_ops you can also get your (2) addressed.

+1

Thanks,
Daniel
Jamal Hadi Salim Jan. 24, 2024, 12:09 p.m. UTC | #3
On Tue, Jan 23, 2024 at 4:13 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On 01/17, Amery Hung wrote:
> > [...]
>
> > 3. People have suggested struct_ops. We chose not to use struct_ops since
> >    users might want to create multiple bpf qdiscs with different
> >    implementations. Current struct_ops attachment model does not seem
> >    to support replacing only functions of a specific instance of a module,
> >    but I might be wrong.
>
> I still feel like it deserves at least a try. Maybe we can find some potential
> path where struct_ops can allow different implementations (Martin probably
> has some ideas about that). I looked at the bpf qdisc itself and it doesn't
> really have anything complicated (besides trying to play nicely with other
> tc classes/actions, but I'm not sure how relevant that is).

Are you suggesting that it is a nuisance to integrate with the
existing infra? I would consider it being a lot more than "trying to
play nicely". Besides, it's a kfunc and people will not be forced to
use it.

cheers,
jamal

Daniel Borkmann Jan. 24, 2024, 1:07 p.m. UTC | #4
On 1/24/24 1:09 PM, Jamal Hadi Salim wrote:
> On Tue, Jan 23, 2024 at 4:13 PM Stanislav Fomichev <sdf@google.com> wrote:
>> On 01/17, Amery Hung wrote:
>>> [...]
>>
>>> 3. People have suggested struct_ops. We chose not to use struct_ops since
>>>     users might want to create multiple bpf qdiscs with different
>>>     implementations. Current struct_ops attachment model does not seem
>>>     to support replacing only functions of a specific instance of a module,
>>>     but I might be wrong.
>>
>> I still feel like it deserves at least a try. Maybe we can find some potential
>> path where struct_ops can allow different implementations (Martin probably
>> has some ideas about that). I looked at the bpf qdisc itself and it doesn't
>> really have anything complicated (besides trying to play nicely with other
>> tc classes/actions, but I'm not sure how relevant that is).
> 
> Are you suggesting that it is a nuisance to integrate with the
> existing infra? I would consider it being a lot more than "trying to
> play nicely". Besides, it's a kfunc and people will not be forced to
> use it.

What's the use case? If you already go that route to implement your own
qdisc, why is there a need to take the performance hit and go all the
way into the old-style cls/act infra when it can be done in a more
straightforward way natively? For the vast majority of cases this will
be some very lightweight classification anyway (if not outsourced to tc
egress given the lock). If there is a concrete production need, it could
be added; otherwise, if there is no immediate use case which cannot be
solved some other way, I would not add unnecessary kfuncs.

Cheers,
Daniel
Jamal Hadi Salim Jan. 24, 2024, 2:11 p.m. UTC | #5
On Wed, Jan 24, 2024 at 8:08 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 1/24/24 1:09 PM, Jamal Hadi Salim wrote:
> > On Tue, Jan 23, 2024 at 4:13 PM Stanislav Fomichev <sdf@google.com> wrote:
> >> On 01/17, Amery Hung wrote:
> >>> [...]
> >>
> >>> 3. People have suggested struct_ops. We chose not to use struct_ops since
> >>>     users might want to create multiple bpf qdiscs with different
> >>>     implementations. Current struct_ops attachment model does not seem
> >>>     to support replacing only functions of a specific instance of a module,
> >>>     but I might be wrong.
> >>
> >> I still feel like it deserves at least a try. Maybe we can find some potential
> >> path where struct_ops can allow different implementations (Martin probably
> >> has some ideas about that). I looked at the bpf qdisc itself and it doesn't
> >> really have anything complicated (besides trying to play nicely with other
> >> tc classes/actions, but I'm not sure how relevant that is).
> >
> > Are you suggesting that it is a nuisance to integrate with the
> > existing infra? I would consider it being a lot more than "trying to
> > play nicely". Besides, it's a kfunc and people will not be forced to
> > use it.
>
> What's the use case?

What's the use case for enabling existing infra to work? Sure, let's
rewrite everything from scratch in ebpf. And then introduce new
tooling which well-funded companies are capable of owning the right
resources to build and manage. Open source is about choices and
sharing and this is about choices and sharing.

> If you already go that route to implement your own
> qdisc, why is there a need to take the performance hit and go all the
> way into old style cls/act infra when it can be done in a more
> straightforward way natively?

Who is forcing you to use the kfunc? This is about choice.
What is ebpf these days anyways? Is it a) a programming environment or
b) is it the only way to do things? I see it as available infra i.e #a
not as the answer looking for a question.  IOW, as something we can
use to build the infra we need and use kfuncs when it makes sense. Not
everybody has infinite resources to keep hacking things into ebpf.

> For the vast majority of cases this will be some
> very lightweight classification anyway (if not outsourced to tc egress
> given the lock). If there is a concrete production need, it could be
> added, otherwise if there is no immediate use case which cannot be solved
> otherwise I would not add unnecessary kfuncs.

"Unnecessary" is really your view.

cheers,
jamal

Daniel Borkmann Jan. 24, 2024, 3:26 p.m. UTC | #6
On 1/24/24 3:11 PM, Jamal Hadi Salim wrote:
> On Wed, Jan 24, 2024 at 8:08 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 1/24/24 1:09 PM, Jamal Hadi Salim wrote:
>>> On Tue, Jan 23, 2024 at 4:13 PM Stanislav Fomichev <sdf@google.com> wrote:
>>>> On 01/17, Amery Hung wrote:
>>>>> [...]
>>>>
>>>>> 3. People have suggested struct_ops. We chose not to use struct_ops since
>>>>>      users might want to create multiple bpf qdiscs with different
>>>>>      implementations. Current struct_ops attachment model does not seem
>>>>>      to support replacing only functions of a specific instance of a module,
>>>>>      but I might be wrong.
>>>>
>>>> I still feel like it deserves at least a try. Maybe we can find some potential
>>>> path where struct_ops can allow different implementations (Martin probably
>>>> has some ideas about that). I looked at the bpf qdisc itself and it doesn't
>>>> really have anything complicated (besides trying to play nicely with other
>>>> tc classes/actions, but I'm not sure how relevant that is).
>>>
>>> Are you suggesting that it is a nuisance to integrate with the
>>> existing infra? I would consider it being a lot more than "trying to
>>> play nicely". Besides, it's a kfunc and people will not be forced to
>>> use it.
>>
>> What's the use case?
> 
> What's the use case for enabling existing infra to work? Sure, let's
> rewrite everything from scratch in ebpf. And then introduce new
> tooling which well funded companies are capable of owning the right
> resources to build and manage. Open source is about choices and
> sharing and this is about choices and sharing.
> 
>> If you already go that route to implement your own
>> qdisc, why is there a need to take the performance hit and go all the
>> way into old style cls/act infra when it can be done in a more
>> straightforward way natively?
> 
> Who is forcing you to use the kfunc? This is about choice.
> What is ebpf these days anyways? Is it a) a programming environment or
> b) is it the only way to do things? I see it as available infra i.e #a
> not as the answer looking for a question.  IOW, as something we can
> use to build the infra we need and use kfuncs when it makes sense. Not
> everybody has infinite resources to keep hacking things into ebpf.
> 
>> For the vast majority of cases this will be some
>> very lightweight classification anyway (if not outsourced to tc egress
>> given the lock). If there is a concrete production need, it could be
>> added, otherwise if there is no immediate use case which cannot be solved
>> otherwise I would not add unnecessary kfuncs.
> 
> "Unnecessary" is really your view.

Looks like we're talking past each other? If there is no plan to use it
in production (I assume Amery would be able to answer?), why add it right
now to the initial series, only to figure out later on (worst case in a
few years), when the time comes, that the kfunc does not fit the actual
need? You've probably seen the life cycle doc (Documentation/bpf/kfuncs.rst),
and while changes can be made, they should still be mindful of potential
breakages the longer it's out in the wild; hence my question whether it's
planned to be used, given it wasn't used in the samples.

Thanks,
Daniel
Amery Hung Jan. 24, 2024, 9:26 p.m. UTC | #7
On Wed, Jan 24, 2024 at 7:27 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 1/24/24 3:11 PM, Jamal Hadi Salim wrote:
> > On Wed, Jan 24, 2024 at 8:08 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> On 1/24/24 1:09 PM, Jamal Hadi Salim wrote:
> >>> On Tue, Jan 23, 2024 at 4:13 PM Stanislav Fomichev <sdf@google.com> wrote:
> >>>> On 01/17, Amery Hung wrote:
> >>>>> [...]
> >>>>
> >>>>> 3. People have suggested struct_ops. We chose not to use struct_ops since
> >>>>>      users might want to create multiple bpf qdiscs with different
> >>>>>      implementations. Current struct_ops attachment model does not seem
> >>>>>      to support replacing only functions of a specific instance of a module,
> >>>>>      but I might be wrong.
> >>>>
> >>>> I still feel like it deserves at least a try. Maybe we can find some potential
> >>>> path where struct_ops can allow different implementations (Martin probably
> >>>> has some ideas about that). I looked at the bpf qdisc itself and it doesn't
> >>>> really have anything complicated (besides trying to play nicely with other
> >>>> tc classes/actions, but I'm not sure how relevant that is).
> >>>
> >>> Are you suggesting that it is a nuisance to integrate with the
> >>> existing infra? I would consider it being a lot more than "trying to
> >>> play nicely". Besides, it's a kfunc and people will not be forced to
> >>> use it.
> >>
> >> What's the use case?
> >
> > What's the use case for enabling existing infra to work? Sure, let's
> > rewrite everything from scratch in ebpf. And then introduce new
> > tooling that only well-funded companies have the resources to build
> > and manage. Open source is about choices and sharing, and this is
> > about choices and sharing.
> >
> >> If you already go that route to implement your own
> >> qdisc, why is there a need to take the performance hit and go all the
> >> way into old style cls/act infra when it can be done in a more
> >> straightforward way natively?
> >
> > Who is forcing you to use the kfunc? This is about choice.
> > What is ebpf these days anyways? Is it a) a programming environment,
> > or b) the only way to do things? I see it as available infra, i.e. #a,
> > not as an answer looking for a question. IOW, as something we can
> > use to build the infra we need and use kfuncs when it makes sense. Not
> > everybody has infinite resources to keep hacking things into ebpf.
> >
> >> For the vast majority of cases this will be some
> >> very lightweight classification anyway (if not outsourced to tc egress
> >> given the lock). If there is a concrete production need, it could be
> >> added; otherwise, if there is no immediate use case that cannot be
> >> solved some other way, I would not add unnecessary kfuncs.
> >
> > "Unnecessary" is really your view.
>
> Looks like we're talking past each other? If there is no plan to use it
> in production (I assume Amery would be able to answer?), why add it right
> now to the initial series, only to figure out later on (worst case in
> few years) when the time comes that the kfunc does not fit the actual
> need? You've probably seen the life cycle doc (Documentation/bpf/kfuncs.rst)
> and while changes can be made, they should still be mindful about potential
> breakages the longer it's out in the wild, hence my question whether it's
> planned to be used, given it wasn't in the samples.
>

We would like to reuse existing TC filters. As Jamal says, changing
filter rules in production can be done easily with existing tooling.
Besides, when the user is only interested in exploring scheduling
algorithms but not in classifying traffic, they don't need to replicate
the filters again in bpf. I can add bpf_skb_tc_classify() test cases in
the next series if that helps.
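As a rough sketch, the BPF side of such a test case could look something
like the following (the section name, program signature, and the
bpf_skb_tc_classify() prototype here are only illustrative placeholders;
the real definitions are the ones introduced by this series):

```c
// SPDX-License-Identifier: GPL-2.0
/* Illustrative sketch only. The SEC() name, the enqueue program
 * signature, and the bpf_skb_tc_classify() prototype below are
 * assumptions for the purpose of this example; the authoritative
 * definitions are those in this patch series.
 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

/* Assumed prototype: run the TC filters attached to the qdisc
 * identified by (ifindex, handle) on skb and return a class id,
 * or a negative value on failure.
 */
int bpf_skb_tc_classify(struct sk_buff *skb, int ifindex,
			__u32 handle) __ksym;

SEC("qdisc/enqueue")
int enqueue_prog(struct sk_buff *skb)
{
	/* Reuse the existing TC filters attached to qdisc 1:0 on the
	 * netdev with ifindex 2 (both values hypothetical) instead of
	 * re-implementing the classification in bpf.
	 */
	int classid = bpf_skb_tc_classify(skb, 2, 0x10000);

	if (classid < 0)
		return classid; /* e.g. drop on classification failure */

	/* ... enqueue skb into the queue selected by classid ... */
	return 0;
}

char _license[] SEC("license") = "GPL";
```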

Thanks,
Amery

> Thanks,
> Daniel
Daniel Borkmann Jan. 25, 2024, 11:57 a.m. UTC | #8
On 1/24/24 10:26 PM, Amery Hung wrote:
> [...]
> 
> We would like to reuse existing TC filters. As Jamal says, changing
> filter rules in production can be done easily with existing tooling.
> Besides, when the user is only interested in exploring scheduling
> algorithms but not in classifying traffic, they don't need to replicate
> the filters again in bpf. I can add bpf_skb_tc_classify() test cases in
> the next series if that helps.

In that case, please add a BPF selftest for exercising the kfunc, and also
expand the commit description with the above rationale.

Thanks,
Daniel