Message ID | 20231201182904.532825-16-jhs@mojatatu.com
---|---
State | Changes Requested
Delegated to | Netdev Maintainers
Series | Introducing P4TC
Jamal Hadi Salim wrote: > Introduce P4 tc classifier. A tc filter instantiated on this classifier > is used to bind a P4 pipeline to one or more netdev ports. To use P4 > classifier you must specify a pipeline name that will be associated to > this filter, a s/w parser and datapath ebpf program. The pipeline must have > already been created via a template. > For example, if we were to add a filter to ingress of network interface > device $P0 and associate it to P4 pipeline simple_l3 we'd issue the > following command: In addition to my comments from last iteration. > > tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \ > action bpf obj $PARSER.o section prog/tc-parser \ > action bpf obj $PROGNAME.o section prog/tc-ingress Having multiple object files is a mistake IMO and will cost performance. Have a single object file avoid stitching together metadata and run to completion. And then run entirely from XDP this is how we have been getting good performance numbers. > > $PROGNAME.o and $PARSER.o is a compilation of the eBPF programs generated > by the P4 compiler and will be the representation of the P4 program. > Note that filter understands that $PARSER.o is a parser to be loaded > at the tc level. The datapath program is merely an eBPF action. > > Note we do support a distinct way of loading the parser as opposed to > making it be an action, the above example would be: > > tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \ > prog type tc obj $PARSER.o ... \ > action bpf obj $PROGNAME.o section prog/tc-ingress > > We support two types of loadings of these initial programs in the pipeline > and differentiate between what gets loaded at tc vs xdp by using syntax of > > either "prog type tc obj" or "prog type xdp obj" > > For XDP: > > tc filter add dev $P0 ingress protocol all prio 1 p4 pname simple_l3 \ > prog type xdp obj $PARSER.o section parser/xdp \ > pinned_link /sys/fs/bpf/mylink \ > action bpf obj $PROGNAME.o section prog/tc-ingress I don't think tc should be loading xdp programs. XDP is not 'tc'. > > The theory of operations is as follows: > > ================================1. PARSING================================ > > The packet first encounters the parser. > The parser is implemented in ebpf residing either at the TC or XDP > level. The parsed header values are stored in a shared eBPF map. > When the parser runs at XDP level, we load it into XDP using tc filter > command and pin it to a file. > > =============================2. ACTIONS============================= > > In the above example, the P4 program (minus the parser) is encoded in an > action($PROGNAME.o). It should be noted that classical tc actions > continue to work: > IOW, someone could decide to add a mirred action to mirror all packets > after or before the ebpf action. > > tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \ > prog type tc obj $PARSER.o section parser/tc-ingress \ > action bpf obj $PROGNAME.o section prog/tc-ingress \ > action mirred egress mirror index 1 dev $P1 \ > action bpf obj $ANOTHERPROG.o section mysect/section-1 > > It should also be noted that it is feasible to split some of the ingress > datapath into XDP first and more into TC later (as was shown above for > example where the parser runs at XDP level). YMMV. Is there any performance value in partial XDP and partial TC? The main wins we see in XDP are when we can drop, redirect, etc the packet entirely in XDP and avoid skb altogether. 
>
> Co-developed-by: Victor Nogueira <victor@mojatatu.com>
> Signed-off-by: Victor Nogueira <victor@mojatatu.com>
> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
> ---
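For readers following the commit message quoted above, the sketch below illustrates the kind of split it describes: a parser stage stores parsed header values in a shared eBPF map and hands off to a separate datapath action via the tc action return code. The map name, struct layout, and section name are illustrative assumptions, not the actual P4 compiler output.

```c
/* Minimal sketch of a parser stage exporting parsed headers through a
 * shared, pinned map for a separate datapath program to consume.
 * All names here (parsed_hdrs, struct p4_hdrs) are hypothetical. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct p4_hdrs {
	__u32 src_ip;
	__u32 dst_ip;
	__u8  proto;
};

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, struct p4_hdrs);
	__uint(pinning, LIBBPF_PIN_BY_NAME); /* shared via /sys/fs/bpf */
} parsed_hdrs SEC(".maps");

SEC("tc") /* loaded as the first bpf action, e.g. "section prog/tc-parser" */
int p4_tc_parser(struct __sk_buff *skb)
{
	void *data = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;
	struct ethhdr *eth = data;
	struct iphdr *iph = data + sizeof(*eth);
	struct p4_hdrs *h;
	__u32 key = 0;

	if ((void *)(iph + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
		return TC_ACT_SHOT;            /* drop what we cannot parse */

	h = bpf_map_lookup_elem(&parsed_hdrs, &key);
	if (!h)
		return TC_ACT_SHOT;
	h->src_ip = iph->saddr;
	h->dst_ip = iph->daddr;
	h->proto  = iph->protocol;
	return TC_ACT_PIPE;                    /* continue to the datapath action */
}

char _license[] SEC("license") = "GPL";
```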
On 12/5/23 1:32 AM, John Fastabend wrote: > Jamal Hadi Salim wrote: >> Introduce P4 tc classifier. A tc filter instantiated on this classifier >> is used to bind a P4 pipeline to one or more netdev ports. To use P4 >> classifier you must specify a pipeline name that will be associated to >> this filter, a s/w parser and datapath ebpf program. The pipeline must have >> already been created via a template. >> For example, if we were to add a filter to ingress of network interface >> device $P0 and associate it to P4 pipeline simple_l3 we'd issue the >> following command: > > In addition to my comments from last iteration. > >> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \ >> action bpf obj $PARSER.o section prog/tc-parser \ >> action bpf obj $PROGNAME.o section prog/tc-ingress > > Having multiple object files is a mistake IMO and will cost > performance. Have a single object file avoid stitching together > metadata and run to completion. And then run entirely from XDP > this is how we have been getting good performance numbers. +1, fully agree. >> $PROGNAME.o and $PARSER.o is a compilation of the eBPF programs generated >> by the P4 compiler and will be the representation of the P4 program. >> Note that filter understands that $PARSER.o is a parser to be loaded >> at the tc level. The datapath program is merely an eBPF action. >> >> Note we do support a distinct way of loading the parser as opposed to >> making it be an action, the above example would be: >> >> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \ >> prog type tc obj $PARSER.o ... \ >> action bpf obj $PROGNAME.o section prog/tc-ingress >> >> We support two types of loadings of these initial programs in the pipeline >> and differentiate between what gets loaded at tc vs xdp by using syntax of >> >> either "prog type tc obj" or "prog type xdp obj" >> >> For XDP: >> >> tc filter add dev $P0 ingress protocol all prio 1 p4 pname simple_l3 \ >> prog type xdp obj $PARSER.o section parser/xdp \ >> pinned_link /sys/fs/bpf/mylink \ >> action bpf obj $PROGNAME.o section prog/tc-ingress > > I don't think tc should be loading xdp programs. XDP is not 'tc'. For XDP, we do have a separate attach API, for BPF links we have bpf_xdp_link_attach() via bpf(2) and regular progs we have the classic way via dev_change_xdp_fd() with IFLA_XDP_* attributes. Mid-term we'll also add bpf_mprog support for XDP to allow multi-user attachment. tc kernel code should not add yet another way of attaching XDP, this should just reuse existing uapi infra instead from userspace control plane side. >> The theory of operations is as follows: >> >> ================================1. PARSING================================ >> >> The packet first encounters the parser. >> The parser is implemented in ebpf residing either at the TC or XDP >> level. The parsed header values are stored in a shared eBPF map. >> When the parser runs at XDP level, we load it into XDP using tc filter >> command and pin it to a file. >> >> =============================2. ACTIONS============================= >> >> In the above example, the P4 program (minus the parser) is encoded in an >> action($PROGNAME.o). It should be noted that classical tc actions >> continue to work: >> IOW, someone could decide to add a mirred action to mirror all packets >> after or before the ebpf action. 
>> >> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \ >> prog type tc obj $PARSER.o section parser/tc-ingress \ >> action bpf obj $PROGNAME.o section prog/tc-ingress \ >> action mirred egress mirror index 1 dev $P1 \ >> action bpf obj $ANOTHERPROG.o section mysect/section-1 >> >> It should also be noted that it is feasible to split some of the ingress >> datapath into XDP first and more into TC later (as was shown above for >> example where the parser runs at XDP level). YMMV. > > Is there any performance value in partial XDP and partial TC? The main > wins we see in XDP are when we can drop, redirect, etc the packet > entirely in XDP and avoid skb altogether. > >> >> Co-developed-by: Victor Nogueira <victor@mojatatu.com> >> Signed-off-by: Victor Nogueira <victor@mojatatu.com> >> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com> >> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> >> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> The cls_p4 is roughly a copy of {cls,act}_bpf, and from a BPF community side we moved away from this some time ago for the benefit of a better management API for tc BPF programs via bpf(2) through bpf_mprog (see libbpf and BPF selftests around this), as mentioned earlier. Please use this instead for your userspace control plane, otherwise we are repeating the same mistakes from the past again that were already fixed. Therefore, from BPF side: Nacked-by: Daniel Borkmann <daniel@iogearbox.net> Cheers, Daniel
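For reference, the existing XDP attach paths Daniel points to can be driven entirely from user space with libbpf; a rough sketch follows. The object path, program name, and interface are placeholders, not part of the series.

```c
/* Sketch: attach an XDP program through the existing uapi rather than tc.
 * Path 1 is the classic netlink attach (IFLA_XDP_FD -> dev_change_xdp_fd());
 * path 2, shown in the comment, is the BPF link route (bpf(2) BPF_LINK_CREATE,
 * i.e. bpf_xdp_link_attach() in the kernel). Names are placeholders. */
#include <net/if.h>
#include <linux/if_link.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

static int attach_xdp_parser(const char *ifname, const char *objpath)
{
	struct bpf_object *obj = bpf_object__open_file(objpath, NULL);
	struct bpf_program *prog;
	int ifindex = if_nametoindex(ifname);

	if (!obj || !ifindex || bpf_object__load(obj))
		return -1;
	prog = bpf_object__find_program_by_name(obj, "p4_xdp_parser"); /* placeholder */
	if (!prog)
		return -1;

	/* Path 1: netlink-based attach */
	if (bpf_xdp_attach(ifindex, bpf_program__fd(prog), XDP_FLAGS_DRV_MODE, NULL))
		return -1;

	/* Path 2 (alternative): BPF link, which can also be pinned:
	 *   struct bpf_link *l = bpf_program__attach_xdp(prog, ifindex);
	 */
	return 0;
}
```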
On Tue, Dec 5, 2023 at 8:43 AM Daniel Borkmann <daniel@iogearbox.net> wrote: > > On 12/5/23 1:32 AM, John Fastabend wrote: > > Jamal Hadi Salim wrote: > >> Introduce P4 tc classifier. A tc filter instantiated on this classifier > >> is used to bind a P4 pipeline to one or more netdev ports. To use P4 > >> classifier you must specify a pipeline name that will be associated to > >> this filter, a s/w parser and datapath ebpf program. The pipeline must have > >> already been created via a template. > >> For example, if we were to add a filter to ingress of network interface > >> device $P0 and associate it to P4 pipeline simple_l3 we'd issue the > >> following command: > > > > In addition to my comments from last iteration. > > > >> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \ > >> action bpf obj $PARSER.o section prog/tc-parser \ > >> action bpf obj $PROGNAME.o section prog/tc-ingress > > > > Having multiple object files is a mistake IMO and will cost > > performance. Have a single object file avoid stitching together > > metadata and run to completion. And then run entirely from XDP > > this is how we have been getting good performance numbers. > > +1, fully agree. As I stated earlier: while performance is important it is not the highest priority for what we are doing, rather correctness is. We dont want to be wrestling with the verifier or some other limitation like tail call limits to gain some increase in a few kkps. We are taking a gamble with the parser which is not using any kfuncs at the moment. Putting them all in one program will increase the risk. As i responded to you earlier, we just dont want to lose functionality, some sample space: - we could have multiple pipelines with different priorities - and each pipeline may have its own logic with many tables etc (and the choice to iterate the next one is essentially encoded in the tc action codes) - we want to be able to split the pipeline into parts that can run _in unison_ in h/w, xdp, and tc - we use tc block to map groups of ports heavily - we use netlink as our control API > >> $PROGNAME.o and $PARSER.o is a compilation of the eBPF programs generated > >> by the P4 compiler and will be the representation of the P4 program. > >> Note that filter understands that $PARSER.o is a parser to be loaded > >> at the tc level. The datapath program is merely an eBPF action. > >> > >> Note we do support a distinct way of loading the parser as opposed to > >> making it be an action, the above example would be: > >> > >> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \ > >> prog type tc obj $PARSER.o ... \ > >> action bpf obj $PROGNAME.o section prog/tc-ingress > >> > >> We support two types of loadings of these initial programs in the pipeline > >> and differentiate between what gets loaded at tc vs xdp by using syntax of > >> > >> either "prog type tc obj" or "prog type xdp obj" > >> > >> For XDP: > >> > >> tc filter add dev $P0 ingress protocol all prio 1 p4 pname simple_l3 \ > >> prog type xdp obj $PARSER.o section parser/xdp \ > >> pinned_link /sys/fs/bpf/mylink \ > >> action bpf obj $PROGNAME.o section prog/tc-ingress > > > > I don't think tc should be loading xdp programs. XDP is not 'tc'. > > For XDP, we do have a separate attach API, for BPF links we have bpf_xdp_link_attach() > via bpf(2) and regular progs we have the classic way via dev_change_xdp_fd() with > IFLA_XDP_* attributes. Mid-term we'll also add bpf_mprog support for XDP to allow > multi-user attachment. 
tc kernel code should not add yet another way of attaching XDP, > this should just reuse existing uapi infra instead from userspace control plane side. I am probably missing something. We are not loading the XDP program - it is preloaded, the only thing the filter does above is grabbing a reference to it. The P4 pipeline in this case is split into a piece (the parser) that runs on XDP and some that runs on tc. And as i mentioned earlier we could go further another piece which is part of the pipeline may run in hw. And infact in the future a compiler will be able to generate code that is split across machines. For our s/w datapath on the same node the only split is between tc and XDP. > >> The theory of operations is as follows: > >> > >> ================================1. PARSING================================ > >> > >> The packet first encounters the parser. > >> The parser is implemented in ebpf residing either at the TC or XDP > >> level. The parsed header values are stored in a shared eBPF map. > >> When the parser runs at XDP level, we load it into XDP using tc filter > >> command and pin it to a file. > >> > >> =============================2. ACTIONS============================= > >> > >> In the above example, the P4 program (minus the parser) is encoded in an > >> action($PROGNAME.o). It should be noted that classical tc actions > >> continue to work: > >> IOW, someone could decide to add a mirred action to mirror all packets > >> after or before the ebpf action. > >> > >> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \ > >> prog type tc obj $PARSER.o section parser/tc-ingress \ > >> action bpf obj $PROGNAME.o section prog/tc-ingress \ > >> action mirred egress mirror index 1 dev $P1 \ > >> action bpf obj $ANOTHERPROG.o section mysect/section-1 > >> > >> It should also be noted that it is feasible to split some of the ingress > >> datapath into XDP first and more into TC later (as was shown above for > >> example where the parser runs at XDP level). YMMV. > > > > Is there any performance value in partial XDP and partial TC? The main > > wins we see in XDP are when we can drop, redirect, etc the packet > > entirely in XDP and avoid skb altogether. > > > >> > >> Co-developed-by: Victor Nogueira <victor@mojatatu.com> > >> Signed-off-by: Victor Nogueira <victor@mojatatu.com> > >> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com> > >> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> > >> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> > > The cls_p4 is roughly a copy of {cls,act}_bpf, and from a BPF community side > we moved away from this some time ago for the benefit of a better management > API for tc BPF programs via bpf(2) through bpf_mprog (see libbpf and BPF selftests > around this), as mentioned earlier. Please use this instead for your userspace > control plane, otherwise we are repeating the same mistakes from the past again > that were already fixed. Sorry, that is your use case for kubernetes and not ours. We want to use the tc infra. We want to use netlink. I could be misreading what you are saying but it seems that you are suggesting that tc infra is now obsolete as far as ebpf is concerned? Overall: It is a bit selfish to say your use case dictates how other people use ebpf. ebpf is just a means to an end for us and _is not the end goal_ - just an infra toolset. We spent a long time compromising to meet you somewhere when you asked us to use ebpf but you are pushing it now . 
If you feel we should unify the P4 classifier with the tc ebpf classifier etc., then we are going to need some changes that are not going to be useful for other people, and I don't see the point in that.

cheers,
jamal

> Therefore, from BPF side:
>
> Nacked-by: Daniel Borkmann <daniel@iogearbox.net>
>
> Cheers,
> Daniel
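To make the "encoded in the tc action codes" point above concrete: a datapath-style bpf action decides whether the remaining actions in the filter (mirred, another bpf action, the next pipeline stage) run at all through the verdict it returns. The table and lookup key below are placeholders, not generated P4 code.

```c
/* Sketch: chaining within a tc filter is driven by action return codes.
 * TC_ACT_PIPE continues to the next action in the same filter,
 * TC_ACT_OK accepts the packet, TC_ACT_SHOT drops it and stops the chain.
 * The map and the skb->mark key are placeholders for illustration. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u32);   /* e.g. a key derived from parsed headers */
	__type(value, __u32); /* 1 = drop, anything else = keep going */
} p4_verdicts SEC(".maps");

SEC("tc")
int p4_tc_datapath(struct __sk_buff *skb)
{
	__u32 key = skb->mark; /* placeholder lookup key */
	__u32 *v = bpf_map_lookup_elem(&p4_verdicts, &key);

	if (v && *v == 1)
		return TC_ACT_SHOT; /* terminate the pipeline here */
	return TC_ACT_PIPE;         /* let mirred / the next bpf action run */
}

char _license[] SEC("license") = "GPL";
```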
On 12/5/23 5:23 PM, Jamal Hadi Salim wrote: > On Tue, Dec 5, 2023 at 8:43 AM Daniel Borkmann <daniel@iogearbox.net> wrote: >> On 12/5/23 1:32 AM, John Fastabend wrote: >>> Jamal Hadi Salim wrote: >>>> Introduce P4 tc classifier. A tc filter instantiated on this classifier >>>> is used to bind a P4 pipeline to one or more netdev ports. To use P4 >>>> classifier you must specify a pipeline name that will be associated to >>>> this filter, a s/w parser and datapath ebpf program. The pipeline must have >>>> already been created via a template. >>>> For example, if we were to add a filter to ingress of network interface >>>> device $P0 and associate it to P4 pipeline simple_l3 we'd issue the >>>> following command: >>> >>> In addition to my comments from last iteration. >>> >>>> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \ >>>> action bpf obj $PARSER.o section prog/tc-parser \ >>>> action bpf obj $PROGNAME.o section prog/tc-ingress >>> >>> Having multiple object files is a mistake IMO and will cost >>> performance. Have a single object file avoid stitching together >>> metadata and run to completion. And then run entirely from XDP >>> this is how we have been getting good performance numbers. >> >> +1, fully agree. > > As I stated earlier: while performance is important it is not the > highest priority for what we are doing, rather correctness is. We dont > want to be wrestling with the verifier or some other limitation like > tail call limits to gain some increase in a few kkps. We are taking a > gamble with the parser which is not using any kfuncs at the moment. > Putting them all in one program will increase the risk. I don't think this is a good reason, this corners you into UAPI which later on cannot be changed anymore. If you encounter such issues, then why not bringing up actual concrete examples / limitations you run into to the BPF community and help one way or another to get the verifier improved instead? (Again, see sched_ext as one example improving verifier, but also concrete example bug reports, etc could help.) > As i responded to you earlier, we just dont want to lose > functionality, some sample space: > - we could have multiple pipelines with different priorities - and > each pipeline may have its own logic with many tables etc (and the > choice to iterate the next one is essentially encoded in the tc action > codes) > - we want to be able to split the pipeline into parts that can run _in > unison_ in h/w, xdp, and tc So parser at XDP, but then you push it up the stack (instead of staying only at XDP layer) just to reach into tc layer to perform a corresponding action.. and this just to work around verifier as you say? > - we use tc block to map groups of ports heavily > - we use netlink as our control API > >>>> $PROGNAME.o and $PARSER.o is a compilation of the eBPF programs generated >>>> by the P4 compiler and will be the representation of the P4 program. >>>> Note that filter understands that $PARSER.o is a parser to be loaded >>>> at the tc level. The datapath program is merely an eBPF action. >>>> >>>> Note we do support a distinct way of loading the parser as opposed to >>>> making it be an action, the above example would be: >>>> >>>> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \ >>>> prog type tc obj $PARSER.o ... 
\ >>>> action bpf obj $PROGNAME.o section prog/tc-ingress >>>> >>>> We support two types of loadings of these initial programs in the pipeline >>>> and differentiate between what gets loaded at tc vs xdp by using syntax of >>>> >>>> either "prog type tc obj" or "prog type xdp obj" >>>> >>>> For XDP: >>>> >>>> tc filter add dev $P0 ingress protocol all prio 1 p4 pname simple_l3 \ >>>> prog type xdp obj $PARSER.o section parser/xdp \ >>>> pinned_link /sys/fs/bpf/mylink \ >>>> action bpf obj $PROGNAME.o section prog/tc-ingress >>> >>> I don't think tc should be loading xdp programs. XDP is not 'tc'. >> >> For XDP, we do have a separate attach API, for BPF links we have bpf_xdp_link_attach() >> via bpf(2) and regular progs we have the classic way via dev_change_xdp_fd() with >> IFLA_XDP_* attributes. Mid-term we'll also add bpf_mprog support for XDP to allow >> multi-user attachment. tc kernel code should not add yet another way of attaching XDP, >> this should just reuse existing uapi infra instead from userspace control plane side. > > I am probably missing something. We are not loading the XDP program - > it is preloaded, the only thing the filter does above is grabbing a > reference to it. The P4 pipeline in this case is split into a piece > (the parser) that runs on XDP and some that runs on tc. And as i > mentioned earlier we could go further another piece which is part of > the pipeline may run in hw. And infact in the future a compiler will > be able to generate code that is split across machines. For our s/w > datapath on the same node the only split is between tc and XDP. So it is even worse from a design PoV. The kernel side allows XDP program to be passed to cls_p4, but then it's not doing anything but holding a reference to that BPF program. Iow, you need anyway to go the regular way of bpf_xdp_link_attach() or dev_change_xdp_fd() to install XDP. Why is the reference even needed here, why it cannot be done in user space from your control plane? This again, feels like a shim layer which should live in user space instead. >>>> The theory of operations is as follows: >>>> >>>> ================================1. PARSING================================ >>>> >>>> The packet first encounters the parser. >>>> The parser is implemented in ebpf residing either at the TC or XDP >>>> level. The parsed header values are stored in a shared eBPF map. >>>> When the parser runs at XDP level, we load it into XDP using tc filter >>>> command and pin it to a file. >>>> >>>> =============================2. ACTIONS============================= >>>> >>>> In the above example, the P4 program (minus the parser) is encoded in an >>>> action($PROGNAME.o). It should be noted that classical tc actions >>>> continue to work: >>>> IOW, someone could decide to add a mirred action to mirror all packets >>>> after or before the ebpf action. >>>> >>>> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \ >>>> prog type tc obj $PARSER.o section parser/tc-ingress \ >>>> action bpf obj $PROGNAME.o section prog/tc-ingress \ >>>> action mirred egress mirror index 1 dev $P1 \ >>>> action bpf obj $ANOTHERPROG.o section mysect/section-1 >>>> >>>> It should also be noted that it is feasible to split some of the ingress >>>> datapath into XDP first and more into TC later (as was shown above for >>>> example where the parser runs at XDP level). YMMV. >>> >>> Is there any performance value in partial XDP and partial TC? 
The main >>> wins we see in XDP are when we can drop, redirect, etc the packet >>> entirely in XDP and avoid skb altogether. >>> >>>> Co-developed-by: Victor Nogueira <victor@mojatatu.com> >>>> Signed-off-by: Victor Nogueira <victor@mojatatu.com> >>>> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com> >>>> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> >>>> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> >> >> The cls_p4 is roughly a copy of {cls,act}_bpf, and from a BPF community side >> we moved away from this some time ago for the benefit of a better management >> API for tc BPF programs via bpf(2) through bpf_mprog (see libbpf and BPF selftests >> around this), as mentioned earlier. Please use this instead for your userspace >> control plane, otherwise we are repeating the same mistakes from the past again >> that were already fixed. > > Sorry, that is your use case for kubernetes and not ours. We want to There is nothing specific to k8s, it's generic infrastructure for tc BPF and also used outside of k8s scope; please double-check the selftests to get a picture of the API and libbpf integration. > use the tc infra. We want to use netlink. I could be misreading what > you are saying but it seems that you are suggesting that tc infra is > now obsolete as far as ebpf is concerned? Overall: It is a bit selfish > to say your use case dictates how other people use ebpf. ebpf is just > a means to an end for us and _is not the end goal_ - just an infra > toolset. Not really, the infrastructure is already there and ready to be used and it supports basic building blocks such as BPF links, relative prog/link dependency resolution, etc, where none of it can be found here. The problem is "we want to use netlink" which is even why you need to push down things like XDP prog, but it's broken by design, really. You are trying to push down a control plane into netlink which should have been a framework in user space. > If you feel we should unify the P4 classifier with the tc ebpf > classifier etc then we are going to need some changes that are not > going to be useful for other people. And i dont see the point in that. > > cheers, > jamal > >> Therefore, from BPF side: >> >> Nacked-by: Daniel Borkmann <daniel@iogearbox.net> >> >> Cheers, >> Daniel
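For comparison, attaching a tc-layer BPF program through the tcx/bpf_mprog API Daniel refers to looks roughly like the sketch below, assuming libbpf >= 1.3 and a program annotated SEC("tcx/ingress"); the program and interface names are placeholders.

```c
/* Sketch: tcx attachment via bpf(2)/libbpf instead of a cls/act pair.
 * Ordering is expressed with BPF_F_BEFORE/BPF_F_AFTER plus relative_fd
 * or relative_id in the opts rather than numeric priorities; leaving
 * the opts empty appends the program at the end of the list. */
#include <net/if.h>
#include <bpf/libbpf.h>

static struct bpf_link *attach_tcx_ingress(struct bpf_object *obj,
					   const char *ifname)
{
	LIBBPF_OPTS(bpf_tcx_opts, opts);
	struct bpf_program *prog;
	int ifindex = if_nametoindex(ifname);

	prog = bpf_object__find_program_by_name(obj, "p4_tc_datapath"); /* placeholder */
	if (!prog || !ifindex)
		return NULL;
	return bpf_program__attach_tcx(prog, ifindex, &opts);
}
```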
On Tue, Dec 5, 2023 at 5:32 PM Daniel Borkmann <daniel@iogearbox.net> wrote: > > On 12/5/23 5:23 PM, Jamal Hadi Salim wrote: > > On Tue, Dec 5, 2023 at 8:43 AM Daniel Borkmann <daniel@iogearbox.net> wrote: > >> On 12/5/23 1:32 AM, John Fastabend wrote: > >>> Jamal Hadi Salim wrote: > >>>> Introduce P4 tc classifier. A tc filter instantiated on this classifier > >>>> is used to bind a P4 pipeline to one or more netdev ports. To use P4 > >>>> classifier you must specify a pipeline name that will be associated to > >>>> this filter, a s/w parser and datapath ebpf program. The pipeline must have > >>>> already been created via a template. > >>>> For example, if we were to add a filter to ingress of network interface > >>>> device $P0 and associate it to P4 pipeline simple_l3 we'd issue the > >>>> following command: > >>> > >>> In addition to my comments from last iteration. > >>> > >>>> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \ > >>>> action bpf obj $PARSER.o section prog/tc-parser \ > >>>> action bpf obj $PROGNAME.o section prog/tc-ingress > >>> > >>> Having multiple object files is a mistake IMO and will cost > >>> performance. Have a single object file avoid stitching together > >>> metadata and run to completion. And then run entirely from XDP > >>> this is how we have been getting good performance numbers. > >> > >> +1, fully agree. > > > > As I stated earlier: while performance is important it is not the > > highest priority for what we are doing, rather correctness is. We dont > > want to be wrestling with the verifier or some other limitation like > > tail call limits to gain some increase in a few kkps. We are taking a > > gamble with the parser which is not using any kfuncs at the moment. > > Putting them all in one program will increase the risk. > > I don't think this is a good reason, this corners you into UAPI which > later on cannot be changed anymore. If you encounter such issues, then > why not bringing up actual concrete examples / limitations you run into > to the BPF community and help one way or another to get the verifier > improved instead? (Again, see sched_ext as one example improving verifier, > but also concrete example bug reports, etc could help.) > Which uapi are you talking about? The eBPF code gets generated by the compiler. Whether we generate one or 10 programs or where we place them is up to the compiler. We choose today to generate the parser separately - but we can change it in a heartbeat with zero kernel changes. > > As i responded to you earlier, we just dont want to lose > > functionality, some sample space: > > - we could have multiple pipelines with different priorities - and > > each pipeline may have its own logic with many tables etc (and the > > choice to iterate the next one is essentially encoded in the tc action > > codes) > > - we want to be able to split the pipeline into parts that can run _in > > unison_ in h/w, xdp, and tc > > So parser at XDP, but then you push it up the stack (instead of staying > only at XDP layer) just to reach into tc layer to perform a corresponding > action.. and this just to work around verifier as you say? > You are mixing things. The idea of being able to split a pipeline into hw:xdp:tc is a requirement. You can run the pipeline fully in XDP or fully in tc or split it when it makes sense. 
The idea of splitting the parser from the main p4 control block is for two reasons 1) someone else can generate or handcode the parser if they need to - we feel this is an area that may need to take advantage of features like dynptr etc in the future 2) as a precaution to ensure all P4 programs load. We have no problem putting both in one ebpf prog when we gain confidence that it will _always_ work - it is a mere change to what the compiler generates. > > - we use tc block to map groups of ports heavily > > - we use netlink as our control API > > > >>>> $PROGNAME.o and $PARSER.o is a compilation of the eBPF programs generated > >>>> by the P4 compiler and will be the representation of the P4 program. > >>>> Note that filter understands that $PARSER.o is a parser to be loaded > >>>> at the tc level. The datapath program is merely an eBPF action. > >>>> > >>>> Note we do support a distinct way of loading the parser as opposed to > >>>> making it be an action, the above example would be: > >>>> > >>>> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \ > >>>> prog type tc obj $PARSER.o ... \ > >>>> action bpf obj $PROGNAME.o section prog/tc-ingress > >>>> > >>>> We support two types of loadings of these initial programs in the pipeline > >>>> and differentiate between what gets loaded at tc vs xdp by using syntax of > >>>> > >>>> either "prog type tc obj" or "prog type xdp obj" > >>>> > >>>> For XDP: > >>>> > >>>> tc filter add dev $P0 ingress protocol all prio 1 p4 pname simple_l3 \ > >>>> prog type xdp obj $PARSER.o section parser/xdp \ > >>>> pinned_link /sys/fs/bpf/mylink \ > >>>> action bpf obj $PROGNAME.o section prog/tc-ingress > >>> > >>> I don't think tc should be loading xdp programs. XDP is not 'tc'. > >> > >> For XDP, we do have a separate attach API, for BPF links we have bpf_xdp_link_attach() > >> via bpf(2) and regular progs we have the classic way via dev_change_xdp_fd() with > >> IFLA_XDP_* attributes. Mid-term we'll also add bpf_mprog support for XDP to allow > >> multi-user attachment. tc kernel code should not add yet another way of attaching XDP, > >> this should just reuse existing uapi infra instead from userspace control plane side. > > > > I am probably missing something. We are not loading the XDP program - > > it is preloaded, the only thing the filter does above is grabbing a > > reference to it. The P4 pipeline in this case is split into a piece > > (the parser) that runs on XDP and some that runs on tc. And as i > > mentioned earlier we could go further another piece which is part of > > the pipeline may run in hw. And infact in the future a compiler will > > be able to generate code that is split across machines. For our s/w > > datapath on the same node the only split is between tc and XDP. > > So it is even worse from a design PoV. So from a wild accusation that we are loading the program to now a condescending remark we have a bad design. > The kernel side allows XDP program > to be passed to cls_p4, but then it's not doing anything but holding a > reference to that BPF program. Iow, you need anyway to go the regular way > of bpf_xdp_link_attach() or dev_change_xdp_fd() to install XDP. Why is the > reference even needed here, why it cannot be done in user space from your > control plane? This again, feels like a shim layer which should live in > user space instead. > Our control path goes through tc - where we instantiate the pipeline on typically a tc block. 
Note: there could be many pipeline instances of the same set of ebpf programs. We need to know which ebpf programs are bound to which pipelines. When a pipeline is instantiated or destroyed it sends (netlink) events to user space. It is only natural to reference the programs which are part of the pipeline at that point i.e loading for tc progs and referencing for xdp. The control is already in user space to create bpf links etc. Our concern was (if you looked at the RFC discussions earlier on) a) we dont want anyone removing or replacing the XDP program that is part of a P4 pipeline b) we wanted to ensure in the case of a split pipeline that the XDP code that ran before tc part of the pipeline was infact the one that we wanted to run. The original code (before Toke made a suggestion to use bpf links) was passing a cookie from XDP to tc which we would use to solve these concerns. By creating the link in user space we can pass the fd - which is what you are seeing here. That solves both #a and #b. Granted we may be a little paranoid but operationally an important detail is: if one dumps the tc filter with this approach they know what progs compose the pipeline. > >>>> The theory of operations is as follows: > >>>> > >>>> ================================1. PARSING================================ > >>>> > >>>> The packet first encounters the parser. > >>>> The parser is implemented in ebpf residing either at the TC or XDP > >>>> level. The parsed header values are stored in a shared eBPF map. > >>>> When the parser runs at XDP level, we load it into XDP using tc filter > >>>> command and pin it to a file. > >>>> > >>>> =============================2. ACTIONS============================= > >>>> > >>>> In the above example, the P4 program (minus the parser) is encoded in an > >>>> action($PROGNAME.o). It should be noted that classical tc actions > >>>> continue to work: > >>>> IOW, someone could decide to add a mirred action to mirror all packets > >>>> after or before the ebpf action. > >>>> > >>>> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \ > >>>> prog type tc obj $PARSER.o section parser/tc-ingress \ > >>>> action bpf obj $PROGNAME.o section prog/tc-ingress \ > >>>> action mirred egress mirror index 1 dev $P1 \ > >>>> action bpf obj $ANOTHERPROG.o section mysect/section-1 > >>>> > >>>> It should also be noted that it is feasible to split some of the ingress > >>>> datapath into XDP first and more into TC later (as was shown above for > >>>> example where the parser runs at XDP level). YMMV. > >>> > >>> Is there any performance value in partial XDP and partial TC? The main > >>> wins we see in XDP are when we can drop, redirect, etc the packet > >>> entirely in XDP and avoid skb altogether. > >>> > >>>> Co-developed-by: Victor Nogueira <victor@mojatatu.com> > >>>> Signed-off-by: Victor Nogueira <victor@mojatatu.com> > >>>> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com> > >>>> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> > >>>> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> > >> > >> The cls_p4 is roughly a copy of {cls,act}_bpf, and from a BPF community side > >> we moved away from this some time ago for the benefit of a better management > >> API for tc BPF programs via bpf(2) through bpf_mprog (see libbpf and BPF selftests > >> around this), as mentioned earlier. Please use this instead for your userspace > >> control plane, otherwise we are repeating the same mistakes from the past again > >> that were already fixed. 
> > > > Sorry, that is your use case for kubernetes and not ours. We want to > > There is nothing specific to k8s, it's generic infrastructure for tc BPF > and also used outside of k8s scope; please double-check the selftests to > get a picture of the API and libbpf integration. > I did and i couldnt see how we can do any of the tcx/mprog using tc to meet our requirements. I may be missing something very obvious but it was why i said it was for your use case not ours. I would be willing to look again if you say it works with tc but do note that I am fine with tc infra where i can add actions, all composed of different programs if i wanted to; and add addendums to use other tc existing (non-ebpf) actions if i needed to. We have what we need working fine, so there has to be a compelling reason to change. I asked you a question earlier whether in your view tc use of ebpf is deprecated. I have seen you make a claim in the past that sched_act was useless and that everyone needs to use sched_cls and you went on to say nobody needs priorities. TBH, that is _your view for your use case_. > > use the tc infra. We want to use netlink. I could be misreading what > > you are saying but it seems that you are suggesting that tc infra is > > now obsolete as far as ebpf is concerned? Overall: It is a bit selfish > > to say your use case dictates how other people use ebpf. ebpf is just > > a means to an end for us and _is not the end goal_ - just an infra > > toolset. > > Not really, the infrastructure is already there and ready to be used and > it supports basic building blocks such as BPF links, relative prog/link > dependency resolution, etc, where none of it can be found here. The > problem is "we want to use netlink" which is even why you need to push > down things like XDP prog, but it's broken by design, really. You are > trying to push down a control plane into netlink which should have been > a framework in user space. > The netlink part is not negotiable - the cover letter says why and i have explained it 10K times in these threads. You are listing all these tcx features like relativeness for which i have no use for. OTOH, like i said if it works with tc then i would be willing to look at it but there need to be compelling reasons to move to that shiny new infra. cheers, jamal
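A rough user-space sketch of the pinned_link workflow Jamal describes above: the control plane creates and pins the XDP link itself, then hands the pin path (e.g. /sys/fs/bpf/mylink in the earlier examples) to the tc filter, which only takes a reference. Object and program names are placeholders.

```c
/* Sketch: create an XDP BPF link, pin it, and later resolve which program
 * it holds. While the pinned link exists, the XDP program cannot be
 * silently detached or replaced underneath the pipeline. */
#include <net/if.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

static int pin_xdp_parser_link(const char *objpath, const char *ifname,
			       const char *pin_path)
{
	struct bpf_object *obj = bpf_object__open_file(objpath, NULL);
	struct bpf_program *prog;
	struct bpf_link *link;

	if (!obj || bpf_object__load(obj))
		return -1;
	prog = bpf_object__find_program_by_name(obj, "p4_xdp_parser"); /* placeholder */
	if (!prog)
		return -1;
	link = bpf_program__attach_xdp(prog, if_nametoindex(ifname));
	if (!link)
		return -1;
	return bpf_link__pin(link, pin_path); /* e.g. /sys/fs/bpf/mylink */
}

/* Anyone (including a dump-side tool) can resolve the pinned link back to
 * the program id it references. */
static int pinned_link_prog_id(const char *pin_path, __u32 *prog_id)
{
	struct bpf_link_info info = {};
	__u32 len = sizeof(info);
	int fd = bpf_obj_get(pin_path);

	if (fd < 0 || bpf_obj_get_info_by_fd(fd, &info, &len))
		return -1;
	*prog_id = info.prog_id;
	return 0;
}
```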
On 12/6/23 3:59 PM, Jamal Hadi Salim wrote: > On Tue, Dec 5, 2023 at 5:32 PM Daniel Borkmann <daniel@iogearbox.net> wrote: >> On 12/5/23 5:23 PM, Jamal Hadi Salim wrote: >>> On Tue, Dec 5, 2023 at 8:43 AM Daniel Borkmann <daniel@iogearbox.net> wrote: >>>> On 12/5/23 1:32 AM, John Fastabend wrote: >>>>> Jamal Hadi Salim wrote: >>>>>> Introduce P4 tc classifier. A tc filter instantiated on this classifier >>>>>> is used to bind a P4 pipeline to one or more netdev ports. To use P4 >>>>>> classifier you must specify a pipeline name that will be associated to >>>>>> this filter, a s/w parser and datapath ebpf program. The pipeline must have >>>>>> already been created via a template. >>>>>> For example, if we were to add a filter to ingress of network interface >>>>>> device $P0 and associate it to P4 pipeline simple_l3 we'd issue the >>>>>> following command: >>>>> >>>>> In addition to my comments from last iteration. >>>>> >>>>>> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \ >>>>>> action bpf obj $PARSER.o section prog/tc-parser \ >>>>>> action bpf obj $PROGNAME.o section prog/tc-ingress >>>>> >>>>> Having multiple object files is a mistake IMO and will cost >>>>> performance. Have a single object file avoid stitching together >>>>> metadata and run to completion. And then run entirely from XDP >>>>> this is how we have been getting good performance numbers. >>>> >>>> +1, fully agree. >>> >>> As I stated earlier: while performance is important it is not the >>> highest priority for what we are doing, rather correctness is. We dont >>> want to be wrestling with the verifier or some other limitation like >>> tail call limits to gain some increase in a few kkps. We are taking a >>> gamble with the parser which is not using any kfuncs at the moment. >>> Putting them all in one program will increase the risk. >> >> I don't think this is a good reason, this corners you into UAPI which >> later on cannot be changed anymore. If you encounter such issues, then >> why not bringing up actual concrete examples / limitations you run into >> to the BPF community and help one way or another to get the verifier >> improved instead? (Again, see sched_ext as one example improving verifier, >> but also concrete example bug reports, etc could help.) > > Which uapi are you talking about? The eBPF code gets generated by the > compiler. Whether we generate one or 10 programs or where we place > them is up to the compiler. > We choose today to generate the parser separately - but we can change > it in a heartbeat with zero kernel changes. With UAPI I mean to even have this parser separation. Ideally, this should just naturally be a single program as in XDP layer itself. You mentioned below you could run the pipeline just in XDP.. >>> As i responded to you earlier, we just dont want to lose >>> functionality, some sample space: >>> - we could have multiple pipelines with different priorities - and >>> each pipeline may have its own logic with many tables etc (and the >>> choice to iterate the next one is essentially encoded in the tc action >>> codes) >>> - we want to be able to split the pipeline into parts that can run _in >>> unison_ in h/w, xdp, and tc >> >> So parser at XDP, but then you push it up the stack (instead of staying >> only at XDP layer) just to reach into tc layer to perform a corresponding >> action.. and this just to work around verifier as you say? > > You are mixing things. The idea of being able to split a pipeline into > hw:xdp:tc is a requirement. 
You can run the pipeline fully in XDP or > fully in tc or split it when it makes sense. > The idea of splitting the parser from the main p4 control block is for > two reasons 1) someone else can generate or handcode the parser if > they need to - we feel this is an area that may need to take advantage > of features like dynptr etc in the future 2) as a precaution to ensure > all P4 programs load. We have no problem putting both in one ebpf prog > when we gain confidence that it will _always_ work - it is a mere > change to what the compiler generates. The cooperation between BPF progs at different layers (e.g. nfp allowed that nicely from a BPF offload PoV) makes sense, just less to split the actions within a given layer into multiple units where state needs to be transferred, packets reparsed, etc. When you say that "we have no problem putting both in one ebpf prog when we gain confidence that it will _always_ work", then should this not be the goal to start with? How do you quantify "gain confidence"? Test/conformance suite? It would be better to start out with this in the first place and fix or collaborate with whatever limits get encountered along the way. This would be the case for XDP anyway given you mention you want to support this layer. >>> - we use tc block to map groups of ports heavily >>> - we use netlink as our control API >>> >>>>>> $PROGNAME.o and $PARSER.o is a compilation of the eBPF programs generated >>>>>> by the P4 compiler and will be the representation of the P4 program. >>>>>> Note that filter understands that $PARSER.o is a parser to be loaded >>>>>> at the tc level. The datapath program is merely an eBPF action. >>>>>> >>>>>> Note we do support a distinct way of loading the parser as opposed to >>>>>> making it be an action, the above example would be: >>>>>> >>>>>> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \ >>>>>> prog type tc obj $PARSER.o ... \ >>>>>> action bpf obj $PROGNAME.o section prog/tc-ingress >>>>>> >>>>>> We support two types of loadings of these initial programs in the pipeline >>>>>> and differentiate between what gets loaded at tc vs xdp by using syntax of >>>>>> >>>>>> either "prog type tc obj" or "prog type xdp obj" >>>>>> >>>>>> For XDP: >>>>>> >>>>>> tc filter add dev $P0 ingress protocol all prio 1 p4 pname simple_l3 \ >>>>>> prog type xdp obj $PARSER.o section parser/xdp \ >>>>>> pinned_link /sys/fs/bpf/mylink \ >>>>>> action bpf obj $PROGNAME.o section prog/tc-ingress >>>>> >>>>> I don't think tc should be loading xdp programs. XDP is not 'tc'. >>>> >>>> For XDP, we do have a separate attach API, for BPF links we have bpf_xdp_link_attach() >>>> via bpf(2) and regular progs we have the classic way via dev_change_xdp_fd() with >>>> IFLA_XDP_* attributes. Mid-term we'll also add bpf_mprog support for XDP to allow >>>> multi-user attachment. tc kernel code should not add yet another way of attaching XDP, >>>> this should just reuse existing uapi infra instead from userspace control plane side. >>> >>> I am probably missing something. We are not loading the XDP program - >>> it is preloaded, the only thing the filter does above is grabbing a >>> reference to it. The P4 pipeline in this case is split into a piece >>> (the parser) that runs on XDP and some that runs on tc. And as i >>> mentioned earlier we could go further another piece which is part of >>> the pipeline may run in hw. And infact in the future a compiler will >>> be able to generate code that is split across machines. 
For our s/w >>> datapath on the same node the only split is between tc and XDP. >> >> So it is even worse from a design PoV. > > So from a wild accusation that we are loading the program to now a > condescending remark we have a bad design. It's my opinion, yes, because all the pieces don't really fit naturally together. It's all centered around the netlink layer which you call out as 'non-negotiable', whereas this would have a better fit for a s/w-based solution where you provide a framework for developers from user space. Why do you even need an XDP reference in tc layer? Even though the XDP loading happens through the regular path anyway.. just to knit the different pieces artificially together despite the different existing layers & APIs. What should actually sit in user space and orchestrate the different generic pieces of the kernel toolbox together, you now try to artificially move one one layer down in a /non-generic/ way. Given this is trying to target a s/w datapath, I just don't follow why the building blocks of this work cannot be done in a /generic/ way. Meaning, generic extensions to the kernel infra in a p4-agnostic way, so they are also useful and consumable outside of it for tc BPF or XDP users, and then in user space the control plane picks all the necessary pieces it needs. (Think of an analogy to containers today.. there is no such notion in the kernel and the user space infra picks all the necessary pieces such as netns, cgroups, etc to flexibly assemble this higher level concept.) >> The kernel side allows XDP program >> to be passed to cls_p4, but then it's not doing anything but holding a >> reference to that BPF program. Iow, you need anyway to go the regular way >> of bpf_xdp_link_attach() or dev_change_xdp_fd() to install XDP. Why is the >> reference even needed here, why it cannot be done in user space from your >> control plane? This again, feels like a shim layer which should live in >> user space instead. > > Our control path goes through tc - where we instantiate the pipeline > on typically a tc block. Note: there could be many pipeline instances > of the same set of ebpf programs. We need to know which ebpf programs > are bound to which pipelines. When a pipeline is instantiated or > destroyed it sends (netlink) events to user space. It is only natural > to reference the programs which are part of the pipeline at that point > i.e loading for tc progs and referencing for xdp. The control is > already in user space to create bpf links etc. > > Our concern was (if you looked at the RFC discussions earlier on) a) > we dont want anyone removing or replacing the XDP program that is part > of a P4 pipeline b) we wanted to ensure in the case of a split > pipeline that the XDP code that ran before tc part of the pipeline was > infact the one that we wanted to run. The original code (before Toke > made a suggestion to use bpf links) was passing a cookie from XDP to > tc which we would use to solve these concerns. By creating the link in > user space we can pass the fd - which is what you are seeing here. > That solves both #a and #b. > Granted we may be a little paranoid but operationally an important > detail is: if one dumps the tc filter with this approach they know > what progs compose the pipeline. But just holding the reference in the tc cls_p4 code on the XDP program doesn't automatically mean that this blocks anything else from happening. 
You still need a user space control plane which creates the link, maybe pins it somewhere, and when you need to update the program at the XDP layer, then that user space control plane updates the prog @ XDP link. At that point the dump in tc has a window of inconsistency given this is non-atomic, and given this two-step approach.. what happens when the control plane crashesin the middle in the worst case, then would you take the XDP link info as source of truth or the cls_p4 dump? Just operating on the XDP link without this two-step detour is a much more robust approach given you avoid this race altogether. >>>>>> The theory of operations is as follows: >>>>>> >>>>>> ================================1. PARSING================================ >>>>>> >>>>>> The packet first encounters the parser. >>>>>> The parser is implemented in ebpf residing either at the TC or XDP >>>>>> level. The parsed header values are stored in a shared eBPF map. >>>>>> When the parser runs at XDP level, we load it into XDP using tc filter >>>>>> command and pin it to a file. >>>>>> >>>>>> =============================2. ACTIONS============================= >>>>>> >>>>>> In the above example, the P4 program (minus the parser) is encoded in an >>>>>> action($PROGNAME.o). It should be noted that classical tc actions >>>>>> continue to work: >>>>>> IOW, someone could decide to add a mirred action to mirror all packets >>>>>> after or before the ebpf action. >>>>>> >>>>>> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \ >>>>>> prog type tc obj $PARSER.o section parser/tc-ingress \ >>>>>> action bpf obj $PROGNAME.o section prog/tc-ingress \ >>>>>> action mirred egress mirror index 1 dev $P1 \ >>>>>> action bpf obj $ANOTHERPROG.o section mysect/section-1 >>>>>> >>>>>> It should also be noted that it is feasible to split some of the ingress >>>>>> datapath into XDP first and more into TC later (as was shown above for >>>>>> example where the parser runs at XDP level). YMMV. >>>>> >>>>> Is there any performance value in partial XDP and partial TC? The main >>>>> wins we see in XDP are when we can drop, redirect, etc the packet >>>>> entirely in XDP and avoid skb altogether. >>>>> >>>>>> Co-developed-by: Victor Nogueira <victor@mojatatu.com> >>>>>> Signed-off-by: Victor Nogueira <victor@mojatatu.com> >>>>>> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com> >>>>>> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> >>>>>> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> >>>> >>>> The cls_p4 is roughly a copy of {cls,act}_bpf, and from a BPF community side >>>> we moved away from this some time ago for the benefit of a better management >>>> API for tc BPF programs via bpf(2) through bpf_mprog (see libbpf and BPF selftests >>>> around this), as mentioned earlier. Please use this instead for your userspace >>>> control plane, otherwise we are repeating the same mistakes from the past again >>>> that were already fixed. >>> >>> Sorry, that is your use case for kubernetes and not ours. We want to >> >> There is nothing specific to k8s, it's generic infrastructure for tc BPF >> and also used outside of k8s scope; please double-check the selftests to >> get a picture of the API and libbpf integration. > > I did and i couldnt see how we can do any of the tcx/mprog using tc to > meet our requirements. I may be missing something very obvious but it > was why i said it was for your use case not ours. 
I would be willing > to look again if you say it works with tc but do note that I am fine > with tc infra where i can add actions, all composed of different > programs if i wanted to; and add addendums to use other tc existing > (non-ebpf) actions if i needed to. We have what we need working fine, > so there has to be a compelling reason to change. > I asked you a question earlier whether in your view tc use of ebpf is > deprecated. I have seen you make a claim in the past that sched_act > was useless and that everyone needs to use sched_cls and you went on > to say nobody needs priorities. TBH, that is _your view for your use > case_. I do see act_bpf as redundant given the cls_bpf with the direct action mode can do everything that is needed with BPF, and whenever something was needed, extensions to verifier/helpers/kfuncs/etc were sufficient. We've been using this for years this way in production with complex programs and never saw a need to utilize any of the remaining actions outside of BPF or to have a split of parser/action as mentioned above. The additional machinery would also add overhead in s/w fast path which can be avoided (if it were e.g. cls_matchall + act_bpf). That said, people use cls_bpf in multi-user mode where different progs get attached. The priorities was collective BPF community feedback that these are hard to use due to the seen collisions in practice which led to various hard to debug incidents. While this was not my view initially, I agree that the new design with before/after and relative prog/link reference is a better ux. >>> use the tc infra. We want to use netlink. I could be misreading what >>> you are saying but it seems that you are suggesting that tc infra is >>> now obsolete as far as ebpf is concerned? Overall: It is a bit selfish >>> to say your use case dictates how other people use ebpf. ebpf is just >>> a means to an end for us and _is not the end goal_ - just an infra >>> toolset. >> >> Not really, the infrastructure is already there and ready to be used and >> it supports basic building blocks such as BPF links, relative prog/link >> dependency resolution, etc, where none of it can be found here. The >> problem is "we want to use netlink" which is even why you need to push >> down things like XDP prog, but it's broken by design, really. You are >> trying to push down a control plane into netlink which should have been >> a framework in user space. > > The netlink part is not negotiable - the cover letter says why and i > have explained it 10K times in these threads. You are listing all > these tcx features like relativeness for which i have no use for. > OTOH, like i said if it works with tc then i would be willing to look > at it but there need to be compelling reasons to move to that shiny > new infra. If you don't have a particular case for multi-prog, that is totally fine. You mentioned earlier on "we dont want anyone removing or replacing the XDP program that is part of a P4 pipeline", and that you are using BPF links to solve it, so I presume it would be equally important case for the tc BPF program of your P4 pipeline. I presume you use libbpf, so here the controller would do exact similar steps on tcx that you do for XDP to set up BPF links. But again, my overall comment comes down to why it cannot be broken into generic extensions as mentioned above given XDP/tc infra is in place. Thanks, Daniel
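As an illustration of the cls_bpf direct-action mode mentioned above: a single program both classifies (by setting skb->tc_classid) and returns the tc verdict itself, which is why a separate act_bpf stage is considered redundant. A minimal sketch follows; the classification rule is an arbitrary placeholder.

```c
/* Sketch: cls_bpf in direct-action ("da") mode - one program classifies
 * and returns the tc verdict, so no separate action stage is needed. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("tc")
int p4_cls_da(struct __sk_buff *skb)
{
	if (skb->protocol == bpf_htons(ETH_P_IP)) {
		skb->tc_classid = 0x10010; /* classify into e.g. class 1:10 */
		return TC_ACT_OK;          /* accept */
	}
	return TC_ACT_SHOT;                /* drop non-IP as a placeholder rule */
}

char _license[] SEC("license") = "GPL";
```

Such a program would typically be attached with something like "tc filter add dev $P0 ingress bpf da obj prog.o sec tc" on a clsact qdisc.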
On Fri, Dec 8, 2023 at 5:06 AM Daniel Borkmann <daniel@iogearbox.net> wrote: > > On 12/6/23 3:59 PM, Jamal Hadi Salim wrote: > > On Tue, Dec 5, 2023 at 5:32 PM Daniel Borkmann <daniel@iogearbox.net> wrote: > >> On 12/5/23 5:23 PM, Jamal Hadi Salim wrote: > >>> On Tue, Dec 5, 2023 at 8:43 AM Daniel Borkmann <daniel@iogearbox.net> wrote: > >>>> On 12/5/23 1:32 AM, John Fastabend wrote: > >>>>> Jamal Hadi Salim wrote: > >>>>>> Introduce P4 tc classifier. A tc filter instantiated on this classifier > >>>>>> is used to bind a P4 pipeline to one or more netdev ports. To use P4 > >>>>>> classifier you must specify a pipeline name that will be associated to > >>>>>> this filter, a s/w parser and datapath ebpf program. The pipeline must have > >>>>>> already been created via a template. > >>>>>> For example, if we were to add a filter to ingress of network interface > >>>>>> device $P0 and associate it to P4 pipeline simple_l3 we'd issue the > >>>>>> following command: > >>>>> > >>>>> In addition to my comments from last iteration. > >>>>> > >>>>>> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \ > >>>>>> action bpf obj $PARSER.o section prog/tc-parser \ > >>>>>> action bpf obj $PROGNAME.o section prog/tc-ingress > >>>>> > >>>>> Having multiple object files is a mistake IMO and will cost > >>>>> performance. Have a single object file avoid stitching together > >>>>> metadata and run to completion. And then run entirely from XDP > >>>>> this is how we have been getting good performance numbers. > >>>> > >>>> +1, fully agree. > >>> > >>> As I stated earlier: while performance is important it is not the > >>> highest priority for what we are doing, rather correctness is. We dont > >>> want to be wrestling with the verifier or some other limitation like > >>> tail call limits to gain some increase in a few kkps. We are taking a > >>> gamble with the parser which is not using any kfuncs at the moment. > >>> Putting them all in one program will increase the risk. > >> > >> I don't think this is a good reason, this corners you into UAPI which > >> later on cannot be changed anymore. If you encounter such issues, then > >> why not bringing up actual concrete examples / limitations you run into > >> to the BPF community and help one way or another to get the verifier > >> improved instead? (Again, see sched_ext as one example improving verifier, > >> but also concrete example bug reports, etc could help.) > > > > Which uapi are you talking about? The eBPF code gets generated by the > > compiler. Whether we generate one or 10 programs or where we place > > them is up to the compiler. > > We choose today to generate the parser separately - but we can change > > it in a heartbeat with zero kernel changes. > > With UAPI I mean to even have this parser separation. Ideally, this should > just naturally be a single program as in XDP layer itself. You mentioned > below you could run the pipeline just in XDP.. > Yes we can - but it doesnt break uapi. The simplest thing to do is to place the pipeline either at TC or XDP fully. Caveat being not everything can run on XDP. Probably showing XDP running the parser was not a good example - and i think we should remove it from the commit messages to avoid confusion. The intent there was to show that XDP, given its speed advantages, can do the parsing faster and infact reject anything early if thats what the P4 programm deemed it do; if it likes something then the tc layer handles the rest of the pipeline processing. 
Which split makes sense is really P4 program dependent. Consider another example which is more sensible: XDP has a fast path (pipeline branching based on runtime conditions works well in P4) and, if there are exceptions to the fast path (maybe a cache miss), processing continues in the tc layer, etc. I think we'll just remove such examples from the commit.

For the multi-prog-per-level question: if for a given P4 program the compiler (v1) generates two separate ebpf programs (as we do in this case) and the next version of the compiler (v2) puts all the logic in one ebpf program at XDP only - nothing breaks. I.e. both the v1 and v2 output continue to work; maybe the v2 output ends up being more efficient, etc.

> >>> As i responded to you earlier, we just dont want to lose
> >>> functionality, some sample space:
> >>> - we could have multiple pipelines with different priorities - and
> >>> each pipeline may have its own logic with many tables etc (and the
> >>> choice to iterate the next one is essentially encoded in the tc action
> >>> codes)
> >>> - we want to be able to split the pipeline into parts that can run _in
> >>> unison_ in h/w, xdp, and tc
> >>
> >> So parser at XDP, but then you push it up the stack (instead of staying
> >> only at XDP layer) just to reach into tc layer to perform a corresponding
> >> action.. and this just to work around verifier as you say?
> >
> > You are mixing things. The idea of being able to split a pipeline into
> > hw:xdp:tc is a requirement. You can run the pipeline fully in XDP or
> > fully in tc or split it when it makes sense.
> > The idea of splitting the parser from the main p4 control block is for
> > two reasons 1) someone else can generate or handcode the parser if
> > they need to - we feel this is an area that may need to take advantage
> > of features like dynptr etc in the future 2) as a precaution to ensure
> > all P4 programs load. We have no problem putting both in one ebpf prog
> > when we gain confidence that it will _always_ work - it is a mere
> > change to what the compiler generates.
>
> The cooperation between BPF progs at different layers (e.g. nfp allowed that
> nicely from a BPF offload PoV) makes sense, just less to split the actions
> within a given layer into multiple units where state needs to be transferred,

For the parser split, one motivation was: there are other tools that are very specialized in parsers (see Tom) and, as long as such a tool can read P4 and conform to our expectations, we should be able to use that parser as a replacement. Maybe that example is too specific and doesn't apply in the larger picture, but like I said we can change it with a compiler mod. No rush - we'll see where this goes.

> packets reparsed, etc. When you say that "we have no problem putting both in
> one ebpf prog when we gain confidence that it will _always_ work", then should
> this not be the goal to start with? How do you quantify "gain confidence"?
> Test/conformance suite? It would be better to start out with this in the first
> place and fix or collaborate with whatever limits get encountered along the
> way. This would be the case for XDP anyway given you mention you want to
> support this layer.

It's just bad experience with the eBPF tooling that drove us down this path (path explosions, pointer trickery, tail call limits, etc). Our goal is to not require eBPF expertise from tc people (who are the main consumers of this); things have to _just work_ after the compiler emits them. We don't want to maintain a bag of tricks that only works some of the time.
For our audience a goal is to lower the barrier for them and reduce dependence on "you must now be a guru at eBPF". For now we are in a grace period with the compiler (even for the parser separation, which we could end up removing) and over time feedback for usability and optimization will keep improving generated code and hopefully using new eBPF features more effectively. So i am not very concerned about this. > >>> - we use tc block to map groups of ports heavily > >>> - we use netlink as our control API > >>> > >>>>>> $PROGNAME.o and $PARSER.o is a compilation of the eBPF programs generated > >>>>>> by the P4 compiler and will be the representation of the P4 program. > >>>>>> Note that filter understands that $PARSER.o is a parser to be loaded > >>>>>> at the tc level. The datapath program is merely an eBPF action. > >>>>>> > >>>>>> Note we do support a distinct way of loading the parser as opposed to > >>>>>> making it be an action, the above example would be: > >>>>>> > >>>>>> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \ > >>>>>> prog type tc obj $PARSER.o ... \ > >>>>>> action bpf obj $PROGNAME.o section prog/tc-ingress > >>>>>> > >>>>>> We support two types of loadings of these initial programs in the pipeline > >>>>>> and differentiate between what gets loaded at tc vs xdp by using syntax of > >>>>>> > >>>>>> either "prog type tc obj" or "prog type xdp obj" > >>>>>> > >>>>>> For XDP: > >>>>>> > >>>>>> tc filter add dev $P0 ingress protocol all prio 1 p4 pname simple_l3 \ > >>>>>> prog type xdp obj $PARSER.o section parser/xdp \ > >>>>>> pinned_link /sys/fs/bpf/mylink \ > >>>>>> action bpf obj $PROGNAME.o section prog/tc-ingress > >>>>> > >>>>> I don't think tc should be loading xdp programs. XDP is not 'tc'. > >>>> > >>>> For XDP, we do have a separate attach API, for BPF links we have bpf_xdp_link_attach() > >>>> via bpf(2) and regular progs we have the classic way via dev_change_xdp_fd() with > >>>> IFLA_XDP_* attributes. Mid-term we'll also add bpf_mprog support for XDP to allow > >>>> multi-user attachment. tc kernel code should not add yet another way of attaching XDP, > >>>> this should just reuse existing uapi infra instead from userspace control plane side. > >>> > >>> I am probably missing something. We are not loading the XDP program - > >>> it is preloaded, the only thing the filter does above is grabbing a > >>> reference to it. The P4 pipeline in this case is split into a piece > >>> (the parser) that runs on XDP and some that runs on tc. And as i > >>> mentioned earlier we could go further another piece which is part of > >>> the pipeline may run in hw. And infact in the future a compiler will > >>> be able to generate code that is split across machines. For our s/w > >>> datapath on the same node the only split is between tc and XDP. > >> > >> So it is even worse from a design PoV. > > > > So from a wild accusation that we are loading the program to now a > > condescending remark we have a bad design. > > It's my opinion, yes, because all the pieces don't really fit naturally > together. It's all centered around the netlink layer which you call out > as 'non-negotiable', whereas this would have a better fit for a s/w-based > solution where you provide a framework for developers from user space. > Why do you even need an XDP reference in tc layer? Even though the XDP > loading happens through the regular path anyway.. just to knit the > different pieces artificially together despite the different existing > layers & APIs. 
The P4 abstraction is a pipeline. To give an analogy with current tc offloads (such as flower): the pipeline placement could be split between h/w and s/w (or sit entirely in one or the other), so the placement idea is already baked into the TC psyche, and the control for all of that happens at the tc layer. I can send a netlink message and it will tell me which part is in h/w and which is in s/w. We are following a similar thought process here: the pipeline is owned by TC and therefore its management and control (and source of truth) sit at TC. True, XDP is a different layer - but from an engineering perspective I don't see this as a layer violation; rather, it is the pragmatic thing to do.

> What should actually sit in user space and orchestrate
> the different generic pieces of the kernel toolbox together, you now try
> to artificially move one one layer down in a /non-generic/ way. Given this
> is trying to target a s/w datapath, I just don't follow why the building
> blocks of this work cannot be done in a /generic/ way. Meaning, generic
> extensions to the kernel infra in a p4-agnostic way, so they are also
> useful and consumable outside of it for tc BPF or XDP users, and then in
> user space the control plane picks all the necessary pieces it needs. (Think
> of an analogy to containers today.. there is no such notion in the kernel
> and the user space infra picks all the necessary pieces such as netns,
> cgroups, etc to flexibly assemble this higher level concept.)
>

If there are alternative ways to load the programs and define their dependencies that would work with tc, then it should be sufficient to just feed the fds to the p4 classifier when we instantiate a p4 pipeline (or that may not even be needed, depending on what that stitching infra is). I did look at tcx hard after my last response and I am afraid, Daniel, that you divorced us and our status right now is we are your "ex" ;-> (is that what x means in tcx?). But if such a scheme exists, the eBPF progs can still call the exposed kfuncs, meaning it will continue to work in the TC realm...

> >> The kernel side allows XDP program
> >> to be passed to cls_p4, but then it's not doing anything but holding a
> >> reference to that BPF program. Iow, you need anyway to go the regular way
> >> of bpf_xdp_link_attach() or dev_change_xdp_fd() to install XDP. Why is the
> >> reference even needed here, why it cannot be done in user space from your
> >> control plane? This again, feels like a shim layer which should live in
> >> user space instead.
> >
> > Our control path goes through tc - where we instantiate the pipeline
> > on typically a tc block. Note: there could be many pipeline instances
> > of the same set of ebpf programs. We need to know which ebpf programs
> > are bound to which pipelines. When a pipeline is instantiated or
> > destroyed it sends (netlink) events to user space. It is only natural
> > to reference the programs which are part of the pipeline at that point
> > i.e loading for tc progs and referencing for xdp. The control is
> > already in user space to create bpf links etc.
> >
> > Our concern was (if you looked at the RFC discussions earlier on) a)
> > we dont want anyone removing or replacing the XDP program that is part
> > of a P4 pipeline b) we wanted to ensure in the case of a split
> > pipeline that the XDP code that ran before tc part of the pipeline was
> > infact the one that we wanted to run.
> > The original code (before Toke
> > made a suggestion to use bpf links) was passing a cookie from XDP to
> > tc which we would use to solve these concerns. By creating the link in
> > user space we can pass the fd - which is what you are seeing here.
> > That solves both #a and #b.
> > Granted we may be a little paranoid but operationally an important
> > detail is: if one dumps the tc filter with this approach they know
> > what progs compose the pipeline.
>
> But just holding the reference in the tc cls_p4 code on the XDP program
> doesn't automatically mean that this blocks anything else from happening.
> You still need a user space control plane which creates the link, maybe
> pins it somewhere, and when you need to update the program at the XDP
> layer, then that user space control plane updates the prog @ XDP link. At
> that point the dump in tc has a window of inconsistency given this is
> non-atomic, and given this two-step approach.. what happens when the
> control plane crashes in the middle in the worst case, then would you
> take the XDP link info as source of truth or the cls_p4 dump? Just
> operating on the XDP link without this two-step detour is a much more
> robust approach given you avoid this race altogether.

See my comment above on tcx about splitting the loading from the tc runtime. My experience in SDN is that you want the kernel to be the source of truth, i.e. if I want to know which progs are running for a given p4 pipeline, and at what level, then putting that info in some user space daemon which - as you point out - may crash is not the most robust option. I should be able to just use a cli to find out the truth. I didn't quite follow your comment above on the XDP prog being replaced while a dump is going on... Am I mistaken in thinking that, as long as I hold the refcount, you can't just swap things out from underneath me?

> >>>>>> The theory of operations is as follows:
> >>>>>>
> >>>>>> ================================1. PARSING================================
> >>>>>>
> >>>>>> The packet first encounters the parser.
> >>>>>> The parser is implemented in ebpf residing either at the TC or XDP
> >>>>>> level. The parsed header values are stored in a shared eBPF map.
> >>>>>> When the parser runs at XDP level, we load it into XDP using tc filter
> >>>>>> command and pin it to a file.
> >>>>>>
> >>>>>> =============================2. ACTIONS=============================
> >>>>>>
> >>>>>> In the above example, the P4 program (minus the parser) is encoded in an
> >>>>>> action($PROGNAME.o). It should be noted that classical tc actions
> >>>>>> continue to work:
> >>>>>> IOW, someone could decide to add a mirred action to mirror all packets
> >>>>>> after or before the ebpf action.
> >>>>>>
> >>>>>> tc filter add dev $P0 parent ffff: protocol all prio 6 p4 pname simple_l3 \
> >>>>>> prog type tc obj $PARSER.o section parser/tc-ingress \
> >>>>>> action bpf obj $PROGNAME.o section prog/tc-ingress \
> >>>>>> action mirred egress mirror index 1 dev $P1 \
> >>>>>> action bpf obj $ANOTHERPROG.o section mysect/section-1
> >>>>>>
> >>>>>> It should also be noted that it is feasible to split some of the ingress
> >>>>>> datapath into XDP first and more into TC later (as was shown above for
> >>>>>> example where the parser runs at XDP level). YMMV.
> >>>>>
> >>>>> Is there any performance value in partial XDP and partial TC? The main
> >>>>> wins we see in XDP are when we can drop, redirect, etc the packet
> >>>>> entirely in XDP and avoid skb altogether.
> >>>>> > >>>>>> Co-developed-by: Victor Nogueira <victor@mojatatu.com> > >>>>>> Signed-off-by: Victor Nogueira <victor@mojatatu.com> > >>>>>> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com> > >>>>>> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> > >>>>>> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> > >>>> > >>>> The cls_p4 is roughly a copy of {cls,act}_bpf, and from a BPF community side > >>>> we moved away from this some time ago for the benefit of a better management > >>>> API for tc BPF programs via bpf(2) through bpf_mprog (see libbpf and BPF selftests > >>>> around this), as mentioned earlier. Please use this instead for your userspace > >>>> control plane, otherwise we are repeating the same mistakes from the past again > >>>> that were already fixed. > >>> > >>> Sorry, that is your use case for kubernetes and not ours. We want to > >> > >> There is nothing specific to k8s, it's generic infrastructure for tc BPF > >> and also used outside of k8s scope; please double-check the selftests to > >> get a picture of the API and libbpf integration. > > > > I did and i couldnt see how we can do any of the tcx/mprog using tc to > > meet our requirements. I may be missing something very obvious but it > > was why i said it was for your use case not ours. I would be willing > > to look again if you say it works with tc but do note that I am fine > > with tc infra where i can add actions, all composed of different > > programs if i wanted to; and add addendums to use other tc existing > > (non-ebpf) actions if i needed to. We have what we need working fine, > > so there has to be a compelling reason to change. > > I asked you a question earlier whether in your view tc use of ebpf is > > deprecated. I have seen you make a claim in the past that sched_act > > was useless and that everyone needs to use sched_cls and you went on > > to say nobody needs priorities. TBH, that is _your view for your use > > case_. > > I do see act_bpf as redundant given the cls_bpf with the direct > action mode can do everything that is needed with BPF, and whenever > something was needed, extensions to verifier/helpers/kfuncs/etc were > sufficient. We've been using this for years this way in production > with complex programs and never saw a need to utilize any of the > remaining actions outside of BPF or to have a split of parser/action > as mentioned above. When you have a very specific use case you can fix things as needed as you described a lot easier; we have a large permutation of potential progs and pipeline flows to be dictated by P4 progs. We want to make sure things work all the time without someone calling us to say "how come this doesnt load?". For that we are willing to sacrifice some performance and i am sure we'll get better over time. So if it is multi-action so be it, at least for now. Definitely, we would not have wanted to go the eBPF path without kfuncs (and XDP plays a nice role) - so i feel we are in a good place. My thinking process has been converted from prioritizing "let me squeeze those cycles by skipping a memset" to "lets make this thing usable by other people" and if i loose a few kpps because i have two actions instead of one, no big deal - we'll get better over time. > The additional machinery would also add overhead > in s/w fast path which can be avoided (if it were e.g. cls_matchall + > act_bpf). That said, people use cls_bpf in multi-user mode where > different progs get attached. 
The priorities was collective BPF > community feedback that these are hard to use due to the seen > collisions in practice which led to various hard to debug incidents. > While this was not my view initially, I agree that the new design > with before/after and relative prog/link reference is a better ux. > I empathize with the situation you faced (i note that motivation was a multi user food fight). We dont have that "collision" problem in our use cases. TBH, TC priorities and chains (which i can jump to) are sufficient for what we do. Note: I am also not objecting to getting better performance (which i am sure we'll get better over time) or finding a common ground for how to specify the collection of programs (as long as it serves our needs as well i.e tc, netlink). cheers, jamal > >>> use the tc infra. We want to use netlink. I could be misreading what > >>> you are saying but it seems that you are suggesting that tc infra is > >>> now obsolete as far as ebpf is concerned? Overall: It is a bit selfish > >>> to say your use case dictates how other people use ebpf. ebpf is just > >>> a means to an end for us and _is not the end goal_ - just an infra > >>> toolset. > >> > >> Not really, the infrastructure is already there and ready to be used and > >> it supports basic building blocks such as BPF links, relative prog/link > >> dependency resolution, etc, where none of it can be found here. The > >> problem is "we want to use netlink" which is even why you need to push > >> down things like XDP prog, but it's broken by design, really. You are > >> trying to push down a control plane into netlink which should have been > >> a framework in user space. > > > > The netlink part is not negotiable - the cover letter says why and i > > have explained it 10K times in these threads. You are listing all > > these tcx features like relativeness for which i have no use for. > > OTOH, like i said if it works with tc then i would be willing to look > > at it but there need to be compelling reasons to move to that shiny > > new infra. > > If you don't have a particular case for multi-prog, that is totally > fine. You mentioned earlier on "we dont want anyone removing or replacing > the XDP program that is part of a P4 pipeline", and that you are using > BPF links to solve it, so I presume it would be equally important case > for the tc BPF program of your P4 pipeline. I presume you use libbpf, so > here the controller would do exact similar steps on tcx that you do for > XDP to set up BPF links. But again, my overall comment comes down to > why it cannot be broken into generic extensions as mentioned above given > XDP/tc infra is in place. > Thanks, > Daniel
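For completeness, the before/after relative attachment Daniel refers to above looks roughly like this with tcx. This is only a sketch: the program names and the fd of the already attached program are hypothetical, and it assumes libbpf v1.3+.

#include <linux/bpf.h>
#include <bpf/libbpf.h>

/* Attach prog_b on tcx ingress so it runs before an already attached
 * program referenced by fd, instead of racing on a numeric priority.
 */
struct bpf_link *attach_before(struct bpf_program *prog_b, int ifindex,
			       int existing_prog_fd)
{
	LIBBPF_OPTS(bpf_tcx_opts, opts,
		    .flags = BPF_F_BEFORE,
		    .relative_fd = existing_prog_fd);

	return bpf_program__attach_tcx(prog_b, ifindex, &opts);
}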
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 75bf73742..b70ba4647 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -739,6 +739,24 @@ enum {
 
 #define TCA_MATCHALL_MAX (__TCA_MATCHALL_MAX - 1)
 
+/* P4 classifier */
+
+enum {
+	TCA_P4_UNSPEC,
+	TCA_P4_CLASSID,
+	TCA_P4_ACT,
+	TCA_P4_PNAME,
+	TCA_P4_PIPEID,
+	TCA_P4_PROG_FD,
+	TCA_P4_PROG_NAME,
+	TCA_P4_PROG_TYPE,
+	TCA_P4_PROG_ID,
+	TCA_P4_PAD,
+	__TCA_P4_MAX,
+};
+
+#define TCA_P4_MAX (__TCA_P4_MAX - 1)
+
 /* Extended Matches */
 
 struct tcf_ematch_tree_hdr {
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index df6d5e15f..dbfe5ceef 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -565,6 +565,18 @@ config NET_CLS_MATCHALL
 	  To compile this code as a module, choose M here: the module will
 	  be called cls_matchall.
 
+config NET_CLS_P4
+	tristate "P4 classifier"
+	select NET_CLS
+	select NET_P4_TC
+	help
+	  If you say Y here, you will be able to bind a P4 pipeline
+	  program. You will need to install a P4 template representing the
+	  program successfully to use this feature.
+
+	  To compile this code as a module, choose M here: the module will
+	  be called cls_p4.
+
 config NET_EMATCH
 	bool "Extended Matches"
 	select NET_CLS
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 937b8f8a9..15bd59ae3 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -73,6 +73,7 @@ obj-$(CONFIG_NET_CLS_CGROUP)	+= cls_cgroup.o
 obj-$(CONFIG_NET_CLS_BPF)	+= cls_bpf.o
 obj-$(CONFIG_NET_CLS_FLOWER)	+= cls_flower.o
 obj-$(CONFIG_NET_CLS_MATCHALL)	+= cls_matchall.o
+obj-$(CONFIG_NET_CLS_P4)	+= cls_p4.o
 obj-$(CONFIG_NET_EMATCH)	+= ematch.o
 obj-$(CONFIG_NET_EMATCH_CMP)	+= em_cmp.o
 obj-$(CONFIG_NET_EMATCH_NBYTE)	+= em_nbyte.o
diff --git a/net/sched/cls_p4.c b/net/sched/cls_p4.c
new file mode 100644
index 000000000..baa0bfe84
--- /dev/null
+++ b/net/sched/cls_p4.c
@@ -0,0 +1,447 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * net/sched/cls_p4.c - P4 Classifier
+ * Copyright (c) 2022-2023, Mojatatu Networks
+ * Copyright (c) 2022-2023, Intel Corporation.
+ * Authors:     Jamal Hadi Salim <jhs@mojatatu.com>
+ *              Victor Nogueira <victor@mojatatu.com>
+ *              Pedro Tammela <pctammela@mojatatu.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/percpu.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+
+#include <net/sch_generic.h>
+#include <net/pkt_cls.h>
+
+#include <net/p4tc.h>
+
+#include "p4tc/trace.h"
+
+#define CLS_P4_PROG_NAME_LEN 256
+
+struct p4tc_bpf_prog {
+	struct bpf_prog *p4_prog;
+	const char *p4_prog_name;
+};
+
+struct cls_p4_head {
+	struct tcf_exts exts;
+	struct tcf_result res;
+	struct rcu_work rwork;
+	struct p4tc_pipeline *pipeline;
+	struct p4tc_bpf_prog *prog;
+	u32 handle;
+};
+
+static int p4_classify(struct sk_buff *skb, const struct tcf_proto *tp,
+		       struct tcf_result *res)
+{
+	struct cls_p4_head *head = rcu_dereference_bh(tp->root);
+	bool at_ingress = skb_at_tc_ingress(skb);
+
+	if (unlikely(!head)) {
+		pr_err("P4 classifier not found\n");
+		return -1;
+	}
+
+	/* head->prog represents the eBPF program that will be first executed by
+	 * the data plane. It may or may not exist. In addition to head->prog,
+	 * we'll have another eBPF program that will execute after this one in
+	 * the form of a filter action (head->exts).
+	 * head->prog->p4_prog->type == BPF_PROG_TYPE_SCHED_ACT means this
+	 * program executes in the TC P4 filter.
+	 * head->prog->p4_prog->type == BPF_PROG_TYPE_XDP means this
+	 * program was loaded in XDP.
+	 */
+	if (head->prog) {
+		int rc = TC_ACT_PIPE;
+
+		/* If eBPF program is loaded into TC */
+		if (head->prog->p4_prog->type == BPF_PROG_TYPE_SCHED_ACT) {
+			if (at_ingress) {
+				/* It is safe to push/pull even if skb_shared() */
+				__skb_push(skb, skb->mac_len);
+				bpf_compute_data_pointers(skb);
+				rc = bpf_prog_run(head->prog->p4_prog,
+						  skb);
+				__skb_pull(skb, skb->mac_len);
+			} else {
+				bpf_compute_data_pointers(skb);
+				rc = bpf_prog_run(head->prog->p4_prog,
+						  skb);
+			}
+		}
+
+		if (rc != TC_ACT_PIPE)
+			return rc;
+	}
+
+	trace_p4_classify(skb, head->pipeline);
+
+	*res = head->res;
+
+	return tcf_exts_exec(skb, &head->exts, res);
+}
+
+static int p4_init(struct tcf_proto *tp)
+{
+	return 0;
+}
+
+static void p4_bpf_prog_destroy(struct p4tc_bpf_prog *prog)
+{
+	bpf_prog_put(prog->p4_prog);
+	kfree(prog->p4_prog_name);
+	kfree(prog);
+}
+
+static void __p4_destroy(struct cls_p4_head *head)
+{
+	tcf_exts_destroy(&head->exts);
+	tcf_exts_put_net(&head->exts);
+	if (head->prog)
+		p4_bpf_prog_destroy(head->prog);
+	p4tc_pipeline_put(head->pipeline);
+	kfree(head);
+}
+
+static void p4_destroy_work(struct work_struct *work)
+{
+	struct cls_p4_head *head =
+		container_of(to_rcu_work(work), struct cls_p4_head, rwork);
+
+	rtnl_lock();
+	__p4_destroy(head);
+	rtnl_unlock();
+}
+
+static void p4_destroy(struct tcf_proto *tp, bool rtnl_held,
+		       struct netlink_ext_ack *extack)
+{
+	struct cls_p4_head *head = rtnl_dereference(tp->root);
+
+	if (!head)
+		return;
+
+	tcf_unbind_filter(tp, &head->res);
+
+	if (tcf_exts_get_net(&head->exts))
+		tcf_queue_work(&head->rwork, p4_destroy_work);
+	else
+		__p4_destroy(head);
+}
+
+static void *p4_get(struct tcf_proto *tp, u32 handle)
+{
+	struct cls_p4_head *head = rtnl_dereference(tp->root);
+
+	if (head && head->handle == handle)
+		return head;
+
+	return NULL;
+}
+
+static const struct nla_policy p4_policy[TCA_P4_MAX + 1] = {
+	[TCA_P4_UNSPEC] = { .type = NLA_UNSPEC },
+	[TCA_P4_CLASSID] = { .type = NLA_U32 },
+	[TCA_P4_ACT] = { .type = NLA_NESTED },
+	[TCA_P4_PNAME] = { .type = NLA_STRING, .len = P4TC_PIPELINE_NAMSIZ },
+	[TCA_P4_PIPEID] = { .type = NLA_U32 },
+	[TCA_P4_PROG_FD] = { .type = NLA_U32 },
+	[TCA_P4_PROG_NAME] = { .type = NLA_STRING,
+			       .len = CLS_P4_PROG_NAME_LEN },
+	[TCA_P4_PROG_TYPE] = { .type = NLA_U32 },
+};
+
+static int cls_p4_prog_from_efd(struct nlattr **tb,
+				struct p4tc_bpf_prog *prog, u32 flags,
+				struct netlink_ext_ack *extack)
+{
+	struct bpf_prog *fp;
+	u32 prog_type;
+	char *name;
+	u32 bpf_fd;
+
+	bpf_fd = nla_get_u32(tb[TCA_P4_PROG_FD]);
+	prog_type = nla_get_u32(tb[TCA_P4_PROG_TYPE]);
+
+	if (prog_type != BPF_PROG_TYPE_XDP &&
+	    prog_type != BPF_PROG_TYPE_SCHED_ACT) {
+		NL_SET_ERR_MSG(extack,
+			       "BPF prog type must be BPF_PROG_TYPE_SCHED_ACT or BPF_PROG_TYPE_XDP");
+		return -EINVAL;
+	}
+
+	fp = bpf_prog_get_type_dev(bpf_fd, prog_type, false);
+	if (IS_ERR(fp))
+		return PTR_ERR(fp);
+
+	name = nla_memdup(tb[TCA_P4_PROG_NAME], GFP_KERNEL);
+	if (!name) {
+		bpf_prog_put(fp);
+		return -ENOMEM;
+	}
+
+	prog->p4_prog_name = name;
+	prog->p4_prog = fp;
+
+	return 0;
+}
+
+static int p4_set_parms(struct net *net, struct tcf_proto *tp,
+			struct cls_p4_head *head, unsigned long base,
+			struct nlattr **tb, struct nlattr *est, u32 flags,
+			struct netlink_ext_ack *extack)
+{
+	bool load_bpf_prog = tb[TCA_P4_PROG_NAME] && tb[TCA_P4_PROG_FD] &&
+			     tb[TCA_P4_PROG_TYPE];
+	struct p4tc_bpf_prog *prog = NULL;
+	int err;
+
+	err = tcf_exts_validate_ex(net, tp, tb, est, &head->exts, flags, 0,
+				   extack);
+	if (err < 0)
+		return err;
+
+	if (load_bpf_prog) {
+		prog = kzalloc(sizeof(*prog), GFP_KERNEL);
+		if (!prog) {
+			err = -ENOMEM;
+			goto exts_destroy;
+		}
+
+		err = cls_p4_prog_from_efd(tb, prog, flags, extack);
+		if (err < 0) {
+			kfree(prog);
+			goto exts_destroy;
+		}
+	}
+
+	if (tb[TCA_P4_CLASSID]) {
+		head->res.classid = nla_get_u32(tb[TCA_P4_CLASSID]);
+		tcf_bind_filter(tp, &head->res, base);
+	}
+
+	if (load_bpf_prog) {
+		if (head->prog) {
+			pr_notice("cls_p4: Substituting old BPF program with id %u with new one with id %u\n",
+				  head->prog->p4_prog->aux->id, prog->p4_prog->aux->id);
+			p4_bpf_prog_destroy(head->prog);
+		}
+		head->prog = prog;
+	}
+
+	return 0;
+
+exts_destroy:
+	tcf_exts_destroy(&head->exts);
+	return err;
+}
+
+static int p4_change(struct net *net, struct sk_buff *in_skb,
+		     struct tcf_proto *tp, unsigned long base, u32 handle,
+		     struct nlattr **tca, void **arg, u32 flags,
+		     struct netlink_ext_ack *extack)
+{
+	struct cls_p4_head *head = rtnl_dereference(tp->root);
+	struct p4tc_pipeline *pipeline = NULL;
+	struct nlattr *tb[TCA_P4_MAX + 1];
+	struct cls_p4_head *new_cls;
+	char *pname = NULL;
+	u32 pipeid = 0;
+	int err;
+
+	if (!tca[TCA_OPTIONS]) {
+		NL_SET_ERR_MSG(extack, "Must provide pipeline options");
+		return -EINVAL;
+	}
+
+	if (head)
+		return -EEXIST;
+
+	err = nla_parse_nested(tb, TCA_P4_MAX, tca[TCA_OPTIONS], p4_policy,
+			       extack);
+	if (err < 0)
+		return err;
+
+	if (tb[TCA_P4_PNAME])
+		pname = nla_data(tb[TCA_P4_PNAME]);
+
+	if (tb[TCA_P4_PIPEID])
+		pipeid = nla_get_u32(tb[TCA_P4_PIPEID]);
+
+	pipeline = p4tc_pipeline_find_get(net, pname, pipeid, extack);
+	if (IS_ERR(pipeline))
+		return PTR_ERR(pipeline);
+
+	if (!p4tc_pipeline_sealed(pipeline)) {
+		err = -EINVAL;
+		NL_SET_ERR_MSG(extack, "Pipeline must be sealed before use");
+		goto pipeline_put;
+	}
+
+	new_cls = kzalloc(sizeof(*new_cls), GFP_KERNEL);
+	if (!new_cls) {
+		err = -ENOMEM;
+		goto pipeline_put;
+	}
+
+	err = tcf_exts_init(&new_cls->exts, net, TCA_P4_ACT, 0);
+	if (err)
+		goto err_exts_init;
+
+	if (!handle)
+		handle = 1;
+
+	new_cls->handle = handle;
+
+	err = p4_set_parms(net, tp, new_cls, base, tb, tca[TCA_RATE], flags,
+			   extack);
+	if (err)
+		goto err_set_parms;
+
+	new_cls->pipeline = pipeline;
+	*arg = head;
+	rcu_assign_pointer(tp->root, new_cls);
+	return 0;
+
+err_set_parms:
+	tcf_exts_destroy(&new_cls->exts);
+err_exts_init:
+	kfree(new_cls);
+pipeline_put:
+	p4tc_pipeline_put(pipeline);
+	return err;
+}
+
+static int p4_delete(struct tcf_proto *tp, void *arg, bool *last,
+		     bool rtnl_held, struct netlink_ext_ack *extack)
+{
+	*last = true;
+	return 0;
+}
+
+static void p4_walk(struct tcf_proto *tp, struct tcf_walker *arg,
+		    bool rtnl_held)
+{
+	struct cls_p4_head *head = rtnl_dereference(tp->root);
+
+	if (arg->count < arg->skip)
+		goto skip;
+
+	if (!head)
+		return;
+	if (arg->fn(tp, head, arg) < 0)
+		arg->stop = 1;
+skip:
+	arg->count++;
+}
+
+static int p4_prog_dump(struct sk_buff *skb, struct p4tc_bpf_prog *prog)
+{
+	unsigned char *b = nlmsg_get_pos(skb);
+
+	if (nla_put_u32(skb, TCA_P4_PROG_ID, prog->p4_prog->aux->id))
+		goto nla_put_failure;
+
+	if (nla_put_string(skb, TCA_P4_PROG_NAME, prog->p4_prog_name))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_P4_PROG_TYPE, prog->p4_prog->type))
+		goto nla_put_failure;
+
+	return 0;
+
+nla_put_failure:
+	nlmsg_trim(skb, b);
+	return -1;
+}
+
+static int p4_dump(struct net *net, struct tcf_proto *tp, void *fh,
+		   struct sk_buff *skb, struct tcmsg *t, bool rtnl_held)
+{
+	struct cls_p4_head *head = fh;
+	struct nlattr *nest;
+
+	if (!head)
+		return skb->len;
+
+	t->tcm_handle = head->handle;
+
+	nest = nla_nest_start(skb, TCA_OPTIONS);
+	if (!nest)
+		goto nla_put_failure;
+
+	if (nla_put_string(skb, TCA_P4_PNAME, head->pipeline->common.name))
+		goto nla_put_failure;
+
+	if (head->res.classid &&
+	    nla_put_u32(skb, TCA_P4_CLASSID, head->res.classid))
+		goto nla_put_failure;
+
+	if (head->prog && p4_prog_dump(skb, head->prog))
+		goto nla_put_failure;
+
+	if (tcf_exts_dump(skb, &head->exts))
+		goto nla_put_failure;
+
+	nla_nest_end(skb, nest);
+
+	if (tcf_exts_dump_stats(skb, &head->exts) < 0)
+		goto nla_put_failure;
+
+	return skb->len;
+
+nla_put_failure:
+	nla_nest_cancel(skb, nest);
+	return -1;
+}
+
+static void p4_bind_class(void *fh, u32 classid, unsigned long cl, void *q,
+			  unsigned long base)
+{
+	struct cls_p4_head *head = fh;
+
+	if (head && head->res.classid == classid) {
+		if (cl)
+			__tcf_bind_filter(q, &head->res, base);
+		else
+			__tcf_unbind_filter(q, &head->res);
+	}
+}
+
+static struct tcf_proto_ops cls_p4_ops __read_mostly = {
+	.kind		= "p4",
+	.classify	= p4_classify,
+	.init		= p4_init,
+	.destroy	= p4_destroy,
+	.get		= p4_get,
+	.change		= p4_change,
+	.delete		= p4_delete,
+	.walk		= p4_walk,
+	.dump		= p4_dump,
+	.bind_class	= p4_bind_class,
+	.owner		= THIS_MODULE,
+};
+
+static int __init cls_p4_init(void)
+{
+	return register_tcf_proto_ops(&cls_p4_ops);
+}
+
+static void __exit cls_p4_exit(void)
+{
+	unregister_tcf_proto_ops(&cls_p4_ops);
+}
+
+module_init(cls_p4_init);
+module_exit(cls_p4_exit);
+
+MODULE_AUTHOR("Mojatatu Networks");
+MODULE_DESCRIPTION("P4 Classifier");
+MODULE_LICENSE("GPL");
diff --git a/net/sched/p4tc/Makefile b/net/sched/p4tc/Makefile
index 3fed9a853..726902f10 100644
--- a/net/sched/p4tc/Makefile
+++ b/net/sched/p4tc/Makefile
@@ -1,6 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0
 
+CFLAGS_trace.o := -I$(src)
+
 obj-y := p4tc_types.o p4tc_tmpl_api.o p4tc_pipeline.o \
 	p4tc_action.o p4tc_table.o p4tc_tbl_entry.o \
-	p4tc_runtime_api.o
+	p4tc_runtime_api.o trace.o
 obj-$(CONFIG_DEBUG_INFO_BTF) += p4tc_bpf.o
diff --git a/net/sched/p4tc/trace.c b/net/sched/p4tc/trace.c
new file mode 100644
index 000000000..683313407
--- /dev/null
+++ b/net/sched/p4tc/trace.c
@@ -0,0 +1,10 @@
+// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+#include <net/p4tc.h>
+
+#ifndef __CHECKER__
+
+#define CREATE_TRACE_POINTS
+#include "trace.h"
+EXPORT_TRACEPOINT_SYMBOL_GPL(p4_classify);
+#endif
diff --git a/net/sched/p4tc/trace.h b/net/sched/p4tc/trace.h
new file mode 100644
index 000000000..80abec13b
--- /dev/null
+++ b/net/sched/p4tc/trace.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM p4tc
+
+#if !defined(__P4TC_TRACE_H) || defined(TRACE_HEADER_MULTI_READ)
+#define __P4TC_TRACE_H
+
+#include <linux/tracepoint.h>
+
+struct p4tc_pipeline;
+
+TRACE_EVENT(p4_classify,
+	    TP_PROTO(struct sk_buff *skb, struct p4tc_pipeline *pipeline),
+
+	    TP_ARGS(skb, pipeline),
+
+	    TP_STRUCT__entry(__string(pname, pipeline->common.name)
+			     __field(u32, p_id)
+			     __field(u32, ifindex)
+			     __field(u32, ingress)
+			     ),
+
+	    TP_fast_assign(__assign_str(pname, pipeline->common.name);
+			   __entry->p_id = pipeline->common.p_id;
+			   __entry->ifindex = skb->dev->ifindex;
+			   __entry->ingress = skb_at_tc_ingress(skb);
+			   ),
+
+	    TP_printk("dev=%u dir=%s pipeline=%s p_id=%u",
+		      __entry->ifindex,
+		      __entry->ingress ? "ingress" : "egress",
+		      __get_str(pname),
+		      __entry->p_id
+		      )
+);
+
+#endif
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH .
+#undef TRACE_INCLUDE_FILE +#define TRACE_INCLUDE_FILE trace + +#include <trace/define_trace.h>
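As a rough sketch of the user-space side of the attributes added above: a front end could fill the TCA_OPTIONS nest that p4_change()/p4_policy parse roughly as follows. This uses libmnl for brevity; the surrounding RTM_NEWTFILTER framing (struct tcmsg, TCA_KIND "p4", parent/protocol/prio) is assumed to be set up by the caller, and the TCA_P4_* values are assumed to come from the patched uapi pkt_cls.h.

#include <libmnl/libmnl.h>
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/rtnetlink.h>

static void fill_p4_options(struct nlmsghdr *nlh, const char *pname,
			    int prog_fd, const char *prog_name)
{
	struct nlattr *opts;

	opts = mnl_attr_nest_start(nlh, TCA_OPTIONS);

	/* Pipeline binding: looked up by name via p4tc_pipeline_find_get() */
	mnl_attr_put_strz(nlh, TCA_P4_PNAME, pname);

	/* Optional eBPF program reference: fd + name + prog type, i.e. the
	 * triple that makes load_bpf_prog true in p4_set_parms().
	 */
	mnl_attr_put_u32(nlh, TCA_P4_PROG_FD, prog_fd);
	mnl_attr_put_strz(nlh, TCA_P4_PROG_NAME, prog_name);
	mnl_attr_put_u32(nlh, TCA_P4_PROG_TYPE, BPF_PROG_TYPE_SCHED_ACT);

	mnl_attr_nest_end(nlh, opts);
}

For an XDP-level program the only difference would be passing BPF_PROG_TYPE_XDP as TCA_P4_PROG_TYPE, matching the check in cls_p4_prog_from_efd().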