[net-next,v14,00/15] Introducing P4TC (series 1)

Message ID 20240404122338.372945-1-jhs@mojatatu.com (mailing list archive)

Jamal Hadi Salim April 4, 2024, 12:23 p.m. UTC
This is the first patchset of two. In this patchset we are submitting 15 patches
which cover the minimal viable P4 PNA architecture.
Please, if you want to discuss a slightly tangential subject like offload
or even your politics then start another thread with a different subject
line.  The way you do it is to change the subject line to for example
"<Your New Subject here> (WAS: <original subject line here>)".

In this cover letter I am restoring text I took out in V10 which stated
"our requirements".

Martin, please look at patch 14 again. The bpf selftests for kfuncs are
slotted for series 2. Paolo, please take a look at 1, 3, 6 for the changes
you suggested. Marcelo, because we made changes to patch 14, I have
removed your reviewed-by. Can you please take another look at that patch?

__Description of these Patches__

These Patches are constrained entirely within the TC domain with very tiny
changes made in patch 1-5. eBPF is used as an infrastructure component for
the software datapath and no changes are made to any eBPF code, only kfuncs
are introduced in patch 14.

Patch #1 adds infrastructure for per-netns P4 actions that can be created on
an as-needed basis for the P4 program requirement. This patch makes a small
incision into act_api. Patches 2-4 are minimalist enablers for P4TC and have
no effect on the classical tc action (example patch#2 just increases the size
of the action names from 16->64B).
Patch 5 adds infrastructure support for preallocation of dynamic actions
needed for P4.

The core P4TC code implements several P4 objects.
1) Patch #6 introduces P4 data types which are consumed by the rest of the
   code
2) Patch #7 introduces the templating API. i.e. CRUD commands for templates
3) Patch #8 introduces the concept of templating Pipelines. i.e. CRUD
   commands for P4 pipelines.
4) Patch #9 introduces the action templates and associated CRUD commands.
5) Patch #10 introduces the action runtime infrastructure.
6) Patch #11 introduces the concept of P4 table templates and associated
   CRUD commands for tables.
7) Patch #12 introduces runtime table entry infra and associated CU
   commands.
8) Patch #13 introduces runtime table entry infra and associated RD
   commands.
9) Patch #14 introduces interaction of eBPF to P4TC tables via kfunc.
10) Patch #15 introduces the TC classifier P4 used at runtime.

There are a few more patches not in this patchset that deal with externs,
test cases, etc.

What is P4?
-----------

The Programming Protocol-independent Packet Processors (P4) is an open
source, domain-specific programming language for specifying data plane
behavior.

The current P4 landscape includes an extensive range of deployments,
products, projects and services, etc[9][12]. Two major NIC vendors,
Intel[10] and AMD[11] currently offer P4-native NICs. P4 is currently
curated by the Linux Foundation[9].

A lot more on why P4 - see small treatise here:[4].

What is P4TC?
-------------

P4TC is a net-namespace aware P4 implementation over TC; meaning, a P4
program and its associated objects and state are attached to a kernel
_netns_ structure.
IOW, if we had two programs across netns' or within a netns they have no
visibility to each other's objects (unlike for example TC actions whose
kinds are "global" in nature or eBPF maps vis-a-vis bpftool).

P4TC builds on top of many years of Linux TC experiences of a netlink
control path interface coupled with a software datapath with an equivalent
offloadable hardware datapath. In this patch series we are focussing only
on the s/w datapath. The s/w and h/w path equivalence that TC provides is
relevant for a primary use case of P4 where some (currently) large consumers
of NICs provide vendors their datapath specs in P4. In such a case one could
generate specified datapaths in s/w and test/validate the requirements
before hardware acquisition(example [12]).

Unlike other approaches such as TC Flower which require kernel and user
space changes when new datapath objects like packet headers are introduced
P4TC requires zero kernel or user space changes. We refer to this as:
_kernel and user space code change independence_.
Meaning:
A P4 program describes headers, how to parse, etc alongside prescribing
the datapath processing logic; the compiler uses the P4 program as input
and generates several artifacts which are then loaded into the kernel to
manifest the intended datapath. In addition to the generated datapath,
control path constructs are generated. The process is described further
below in "P4TC Workflow".

Some History
------------

There have been many discussions and meetings within the community since
about 2015 in regards to P4 over TC[2] and we are finally proving to the
naysayers that we do get stuff done!

A lot more of the P4TC motivation is captured at:
https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md

__P4TC Architecture__

The current architecture was described at netdevconf 0x17[14] and if you
prefer academic conference papers, a short paper is available here[15].

There are 4 parts:

1) A Template CRUD provisioning API for manifesting a P4 program and its
associated objects in the kernel. The template provisioning API uses
netlink.  See patch in part 2.

2) A Runtime CRUD+ API code which is used for controlling the different
runtime behavior of the P4 objects. The runtime API uses netlink. See notes
further down. See patch descriptions...

3) P4 objects and their control interfaces: tables, actions, externs, etc.
Any object that requires control plane interaction resides in the TC domain
and is subject to the CRUD runtime API.  The intended goal is to make use
of the tc semantics of skip_sw/hw to target P4 program objects either in s/w
or h/w.

4) S/W Datapath code hooks. The s/w datapath is eBPF based and is generated
by a compiler based on the P4 spec. When accessing any P4 object that
requires control plane interfaces, the eBPF code accesses the P4TC side
from #3 above using kfuncs.

The generated eBPF code is derived from [13] with enhancements and fixes to
meet our requirements.

__P4TC Workflow__

The Development and instantiation workflow for P4TC is as follows:

  A) A developer writes a P4 program, "myprog"

  B) Compiles it using the P4C compiler[8]. The compiler generates 3
     outputs:

     a) A shell script which forms template definitions for the different P4
        objects "myprog" utilizes (tables, externs, actions etc). See #1
        above

     b) The parser and the rest of the datapath are generated as eBPF and
        need to be compiled into binaries. At the moment the parser and the
        main control block are generated as separate eBPF programs but this
        could change in the future (without affecting any kernel code).
        See #4 above.

     c) A json introspection file used for the control plane
        (by iproute2/tc).

  C) At this point the artifacts from #1,#4 could be handed to an operator
     (the operator could be the same person as the developer from #A, #B).

     i) For the eBPF part, either the operator is handed an ebpf binary or
     source which they compile at this point into a binary.
     The operator executes the shell script(s) to manifest the functional
     "myprog" into the kernel.

     ii) The operator instantiates "myprog" pipeline via the tc P4 filter
     to ingress/egress (depending on P4 arch) of one or more netdevs/ports
     (illustrated below as "block 22").

     Example instantiation where the parser is a separate action:
       "tc filter add block 22 ingress protocol all prio 10 \
        p4 pname myprog \
        action bpf obj $PARSER.o section p4tc/parse \
        action bpf obj $PROGNAME.o section p4tc/main"
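
As a rough sketch (not part of the patchset), the instantiation command above
could be assembled from an operator script. The helper name
build_p4_filter_cmd, its parameters and the object file names are
hypothetical; the tc arguments themselves are copied from the example above.

```python
import shlex

def build_p4_filter_cmd(block, pname, parser_obj, main_obj, prio=10):
    """Return the argv list for the example 'tc filter add ... p4' command."""
    return [
        "tc", "filter", "add", "block", str(block), "ingress",
        "protocol", "all", "prio", str(prio),
        "p4", "pname", pname,
        # The parser and the main control block are separate eBPF actions.
        "action", "bpf", "obj", parser_obj, "section", "p4tc/parse",
        "action", "bpf", "obj", main_obj, "section", "p4tc/main",
    ]

# Hypothetical object file names standing in for $PARSER.o and $PROGNAME.o.
argv = build_p4_filter_cmd(22, "myprog", "parser.o", "myprog.o")
print(shlex.join(argv))
```

Building an argv list rather than a single string avoids shell quoting
problems if the result is later handed to something like subprocess.run().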

See individual patches in partc for more examples tc vs xdp etc. Also see
section on "challenges" (further below on this cover letter).

Once "myprog" P4 program is instantiated one can start performing operations
on table entries and/or actions at runtime as described below.

__P4TC Runtime Control Path__

The control interface builds on past tc experience and tries to get things
right from the beginning (example filtering is separated from depending
on existing object TLVs and made generic); also the code is written in
such a way it is mostly lockless.

The P4TC control interface, using netlink, provides what we call a CRUDPS
abstraction which stands for: Create, Read(get), Update, Delete, Subscribe,
Publish.  From a high level PoV the following describes a conformant high
level API (both on netlink data model and code level):

	Create(</path/to/object, DATA>+)
	Read(</path/to/object>, [optional filter])
	Update(</path/to/object, DATA>+)
	Delete(</path/to/object>, [optional filter])
	Subscribe(</path/to/object>, [optional filter])

Note, we _don't_ treat "dump" or "flush" as special. If "path/to/object"
points to a table then a "Delete" implies "flush" and a "Read" implies dump
but if it points to an entry (by specifying a key) then "Delete" implies
deleting an entry and "Read" implies reading that single entry. It should
be noted that both "Delete" and "Read" take an optional filter parameter.
The filter can define further refinements to what the control plane wants
read or deleted.
"Subscribe" uses built in netlink event management. It, as well, takes a
filter which can further refine what events get generated to the control
plane (taken out of this patchset, to be re-added with consideration of
[16]).
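
To make the path semantics above concrete, here is a minimal in-memory sketch
(an illustration only, not the kernel implementation). The class Table and its
method names are hypothetical, and rejecting a Create on an existing key is an
assumption. With no key, "read" acts as dump and "delete" acts as flush; both
take an optional filter.

```python
class Table:
    """Toy stand-in for a P4 table; entries are indexed by key."""

    def __init__(self):
        self.entries = {}

    def create(self, key, data):
        if key in self.entries:  # assumed behaviour: create must not overwrite
            raise KeyError(f"entry {key!r} already exists")
        self.entries[key] = data

    def update(self, key, data):
        if key not in self.entries:
            raise KeyError(f"no entry {key!r}")
        self.entries[key] = data

    def read(self, key=None, flt=None):
        """Key given: read that entry. No key: dump, optionally filtered."""
        if key is not None:
            return {key: self.entries[key]}
        return {k: v for k, v in self.entries.items()
                if flt is None or flt(k, v)}

    def delete(self, key=None, flt=None):
        """Key given: delete that entry. No key: flush, optionally filtered."""
        doomed = list(self.read(key, flt))
        for k in doomed:
            del self.entries[k]
        return len(doomed)

t = Table()
t.create("10.0.1.2/32", {"action": "send_to_port", "port": "eno1"})
t.create("10.0.2.2/32", {"action": "send_to_port", "port": "eno2"})
dumped = t.read()                                           # dump
removed = t.delete(flt=lambda k, v: v["port"] == "eno1")    # filtered flush
flushed = t.delete()                                        # flush the rest
print(len(dumped), removed, flushed)
```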

Let's show some runtime samples:

..create an entry, if we match ip address 10.0.1.2 send packet out eno1
  tc p4ctrl create myprog/table/mytable \
   dstAddr 10.0.1.2/32 action send_to_port param port eno1

..Batch create entries
  tc p4ctrl create myprog/table/mytable \
  entry dstAddr 10.1.1.2/32  action send_to_port param port eno1 \
  entry dstAddr 10.1.10.2/32  action send_to_port param port eno10 \
  entry dstAddr 10.0.2.2/32  action send_to_port param port eno2

..Get an entry (note "read" is interchangeably used as "get" which is a
common semantic in tc):
  tc p4ctrl read myprog/table/mytable \
   dstAddr 10.0.2.2/32

..dump mytable
  tc p4ctrl read myprog/table/mytable

..dump mytable for all entries whose key fits within 10.1.0.0/16
  tc p4ctrl read myprog/table/mytable \
  filter key/myprog/mytable/dstAddr = 10.1.0.0/16

..dump all mytable entries which have an action send_to_port with param "eno1"
  tc p4ctrl get myprog/table/mytable \
  filter param/act/myprog/send_to_port/port = "eno1"

The filter expression is powerful, f.e you could say:

  tc p4ctrl get myprog/table/mytable \
  filter param/act/myprog/send_to_port/port = "eno1" && \
         key/myprog/mytable/dstAddr = 10.1.0.0/16

It also works on built in metadata, example in the following case dumping
entries from mytable that have seen activity in the last 10 secs:
  tc p4ctrl get myprog/table/mytable \
  filter msecs_since < 10000
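
The filter semantics shown above (a key falling within a prefix, a parameter
equal to a value, and the two joined with "&&") can be sketched in user space
as composable predicates. This is only an illustration of the idea; the
function names and the entry layout are hypothetical and say nothing about
how the kernel evaluates filters.

```python
import ipaddress

def key_within(prefix):
    """Filter: the entry's dstAddr prefix falls inside `prefix`."""
    net = ipaddress.ip_network(prefix)
    return lambda e: ipaddress.ip_network(e["dstAddr"]).subnet_of(net)

def param_equals(name, value):
    """Filter: the entry's action parameter `name` equals `value`."""
    return lambda e: e["params"].get(name) == value

def all_of(*filters):
    """Combine filters with logical AND (the '&&' in the examples above)."""
    return lambda e: all(f(e) for f in filters)

# Entries taken from the batch create example.
entries = [
    {"dstAddr": "10.1.1.2/32", "params": {"port": "eno1"}},
    {"dstAddr": "10.1.10.2/32", "params": {"port": "eno10"}},
    {"dstAddr": "10.0.2.2/32", "params": {"port": "eno2"}},
]

flt = all_of(key_within("10.1.0.0/16"), param_equals("port", "eno1"))
matched = [e["dstAddr"] for e in entries if flt(e)]
print(matched)
```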

Delete follows the same syntax as get/read, so for sake of brevity we won't
show more examples than how to flush mytable:

  tc p4ctrl delete myprog/table/mytable

Mystery question: How do we achieve iproute2-kernel independence and
how does "tc p4ctrl" as a cli know how to program the kernel given an
arbitrary command line as shown above? Answer(s): It queries the
compiler generated json file in "P4TC Workflow" #B.c above. The json file
has enough details to figure out that we have a program called "myprog"
which has a table "mytable" that has a key name "dstAddr" which happens to
be type ipv4 address prefix. The json file also provides details to show
that the table "mytable" supports an action called "send_to_port" which
accepts a parameter "port" of type netdev (see the types patch for all
supported P4 data types).
All P4 components have names, IDs, and types - so this makes it very easy
to map into netlink.
Once user space tc/p4ctrl validates the human command input, it creates
standard binary netlink structures (TLVs etc) which are sent to the kernel.
See the runtime table entry patch for more details.
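
A sketch of the validation step described above, under loudly stated
assumptions: the JSON layout and the type strings below are invented for
illustration and are NOT the format the compiler actually emits. Only the
names (myprog, mytable, dstAddr, send_to_port, port) come from the examples
in this letter.

```python
import json

# Hypothetical introspection layout; the real compiler output may differ.
INTROSPECTION = json.loads("""
{
  "pipeline": "myprog",
  "tables": {
    "mytable": {
      "keys": {"dstAddr": "ipv4_prefix"},
      "actions": {"send_to_port": {"port": "netdev"}}
    }
  }
}
""")

def validate(path, key_name, action, param):
    """Check the names used in a 'create' command against the JSON data.

    Returns the (key type, param type) pair user space would need in order
    to encode the values into netlink attributes.
    """
    prog, kind, table = path.split("/")
    if prog != INTROSPECTION["pipeline"] or kind != "table":
        raise ValueError(f"unknown path {path!r}")
    tbl = INTROSPECTION["tables"].get(table)
    if tbl is None:
        raise ValueError(f"unknown table {table!r}")
    if key_name not in tbl["keys"]:
        raise ValueError(f"unknown key {key_name!r}")
    params = tbl["actions"].get(action)
    if params is None or param not in params:
        raise ValueError(f"unknown action/param {action!r}/{param!r}")
    return tbl["keys"][key_name], params[param]

types = validate("myprog/table/mytable", "dstAddr", "send_to_port", "port")
print(types)
```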

__P4TC Datapath__

The P4TC s/w datapath execution is generated as eBPF. Any objects that
require control interfacing reside in the "P4TC domain" and are controlled
via netlink as described above. Per packet execution and state and even
objects that do not require control interfacing (like the P4 parser) are
generated as eBPF.

A packet arriving on s/w ingress of any of the ports on block 22
(illustrated in section "P4TC Workflow" above) will first be exercised via
the (generated eBPF) parser component to extract the headers (the ip
destination address labeled "dstAddr" above in section "P4TC Runtime
Control Path"). The datapath then proceeds to use "dstAddr", table ID
and pipeline ID as a key to do a lookup in myprog's "mytable" which returns
the action params which are then used to execute the action in the eBPF
datapath (eventually sending out packets to eno1).
On a table miss, mytable's default miss action (not described) is executed.
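
The lookup step can be modelled roughly as follows. This is a user-space toy,
not the kfunc interface: the IDs, the longest-prefix-match behaviour and the
"drop" default miss action are all assumptions made for the sake of the
example.

```python
import ipaddress

# Hypothetical IDs; in P4TC these are generated by the compiler.
PIPELINE_ID, TABLE_ID = 1, 1

# (pipeline id, table id, prefix) -> action parameters
entries = {
    (PIPELINE_ID, TABLE_ID, ipaddress.ip_network("10.0.1.2/32")): {"port": "eno1"},
    (PIPELINE_ID, TABLE_ID, ipaddress.ip_network("10.1.0.0/16")): {"port": "eno10"},
}

DEFAULT_MISS = {"action": "drop"}  # assumed default miss action

def lookup(pipeline_id, table_id, dst_addr):
    """Longest-prefix match on dstAddr within one pipeline/table."""
    addr = ipaddress.ip_address(dst_addr)
    best = None
    for (pid, tid, net), params in entries.items():
        if (pid, tid) == (pipeline_id, table_id) and addr in net:
            if best is None or net.prefixlen > best[0].prefixlen:
                best = (net, params)
    if best is None:
        return DEFAULT_MISS  # table miss
    # Hit: the returned params drive the action executed in the datapath.
    return {"action": "send_to_port", **best[1]}

print(lookup(PIPELINE_ID, TABLE_ID, "10.0.1.2"))
print(lookup(PIPELINE_ID, TABLE_ID, "192.168.0.1"))
```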

__Testing__

Speaking of testing - we have 200-300 tdc test cases (which will be in the
second patchset).
These tests are run on our CICD system on pull requests and after commits
are approved. The CICD does a lot of other tests (more since v2, thanks to
Simon's input) including:
checkpatch, sparse, smatch, coccinelle, 32 bit and 64 bit builds tested on
X86, ARM 64 and emulated BE via qemu s390. We trigger performance
testing in the CICD to catch performance regressions (currently only on
the control path, but in the future for the datapath).
Syzkaller runs 24/7 on dedicated hardware, originally we focussed only on
memory sanitizer but recently added support for concurrency sanitizer.
Before main releases we ensure each patch will compile on its own to help
in git bisect and run the xmas tree tool. We eventually put the code via
coverity.

In addition we are working on enabling a tool that will take a P4 program,
run it through the compiler, and generate permutations of traffic patterns
via symbolic execution that will test both positive and negative datapath
code paths. The test generator tool integration is still work in progress.
Also: We have other code that test parallelization etc which we are trying
to find a fit for in the kernel tree's testing infra.

__Restating Our Requirements__

This code is not intrusive at all because it only touches TC.
We would like to emphasize that we see eBPF as _infrastructure tooling
available to us and not the end goal_. Please help us with technical input
on for example how we can do better kfuncs, etc. If you want to critique,
then our requirements should be your guide and please be considerate that
this is about P4, not eBPF. IOW:
We would appreciate technical commentary instead of bikeshedding on how
_you_ would have implemented this probably with more eBPF or some other
clever tricks. It is sad to see there was zero input from anyone in the eBPF
world for 7 RFC postings (in a period of 9 months).
If I am ranting here it is because we have spent over a year now on this
topic - we have taken the initial input and have given you eBPF. So let's
make progress please.

The initial release was presented in October 2022[20] and RFC in January
2023 had a "scriptable" datapath (the idea built on the u32 classifier[17]
and pedit action[18] approach). Post RFC V1, we made changes to fit the
feedback to integrate eBPF to replace the "scriptable" software datapath.
On our part, the goal for the change was to meet folks in the middle as a
compromise.
No regrets on the journey after all the effort because we ended up
getting XDP which was not in the original picture. Some of our efforts are
captured at [1][3] and in the patch history.

In this section we review the original scriptable version against the
current implementation which uses eBPF and in the process re-enumerate our
requirements.

To be very clear: Our intention for P4TC is to target _the TC crowd_.
Essentially developers and ops people already familiar with and deploying TC
based infra.
More importantly the original intent for P4TC was to enable _ops folks_
more than devs (given code is being generated and doesn't need humans to
write it).

With TC, we gain the whole "familiar" package of match-action pipeline
abstraction++, meaning from the control plane(see discussion above) all
the way to the tooling infra, i.e iproute2/tc cli, netlink infra interface
(request/response, event subscribe/multicast-publish, congestion control
etc), s/w and h/w symbiosis, the autonomous kernel control, etc.
The main advantage over vendor specific implementations(which is the current
alternative) is: with P4TC we have a singular vendor-neutral interface via
the kernel using well understood mechanisms that have gained learnings from
deployment experience.

So let's list some of these requirements and compare whether moving to eBPF
affected us or gave us an advantage.

0) Understood Control Plane semantics

This requirement is unaffected.
The control plane remains as netlink and therefore we get the classical
multi-user CRUD+Publish/subscribe APIs built in.

1) Must support SW/HW equivalence

This requirement is unaffected. The control plane is netlink. Any semantics
to select between sw and hw via skip_sw/hw semantics is maintained.

2) Supporting expressibility of the universe set of P4 progs

It is a must to support 100% of all possible P4 programs. In the past the
eBPF verifier, for example in [13], had to be worked around and even then
there are cases where we couldn't avoid path explosion when branching is
involved and failed to run. So we were skeptical about using eBPF to begin
with.
Kfuncs changed our minds. Note, there are still challenges running all
potential P4 programs at the XDP level - but the pipeline could be split
between XDP and TC in such cases. The compiler can be told to generate
pieces that run on XDP and others on TC (see examples).
Summary: This requirement is unaffected.

3) Operational usability

By maintaining the TC control plane (even in presence of eBPF datapath)
runtime aspects remain unchanged. So for our target audience of folks
who have deployed tc, including offloads, the comfort zone is unchanged.

There is some loss in operational usability because we now have more knobs:
the extra compilation, loading and syncing of ebpf binaries, etc.
IOW, I can no longer just ship someone a shell script(ascii) in an email
and say "go run this and "myprog" will just work".

4) Operational and development Debuggability

If something goes wrong, the tc craftsperson is now required to have
additional knowledge of eBPF code and process.
Our intent is to compensate for this challenge with debug tools that ease the
craftsperson's debugging.

5) Opportunity for rapid prototyping of new ideas

This is not exactly a requirement but something that became a useful
feature during the P4TC development phase. When the compiler was lagging
behind in features, the workaround was often to handcode the template scripts.
Then you would dump back the template from the kernel and do a diff to
ensure the kernel didn't get something wrong. Essentially, this was a nice
debug feature. During development, we wrote scripts that covered a range of
P4 architectures(PSA, V1, etc) which required no kernel code changes.

Over time the debug feature morphed into: a) start by handcoding scripts
then b) read it back and then c) generate the P4 code.
It means one could start with the template scripts outside of the
constraints of a P4 architecture spec(PNA/PSA) or even within a P4
architecture then test some ideas and eventually feed back the concepts to
the compiler authors or modify or create a new P4 architecture and share
with the P4 standards folks.

To summarize in presence of eBPF: The debugging idea is probably still
alive.  One could dump, with proper tooling(bpftool for example), the
loaded eBPF code and be able to check for differences. But this is not the
interesting part.
The concept of going back from what's in the kernel to P4 is a lot more
difficult to implement mostly due to scoping of DSL vs general purpose. It
may be lost.  We have been discussing ways to use BTF and embedding
annotations in the eBPF code and binary but more thought is required and we
welcome suggestions.

6) Supporting per namespace program

In P4TC every program and its associated objects have unique IDs which are
generated by the compiler. Multiple or the same P4 program(s) can run
independently in different namespaces alongside their appropriate state and
object instance parameterization (despite name or ID collision).
This requirement is still met (by virtue of keeping P4 program control
objects within the TC domain and attaching to a netns).
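
A toy model of the isolation property, assuming nothing about the kernel data
structures: state is indexed by netns first, so the same program name, table
name and key can coexist in two namespaces without seeing each other. The
function names and the "ns1"/"ns2" labels are hypothetical.

```python
from collections import defaultdict

# netns name -> pipeline name -> table name -> {key: params}
state = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

def create(netns, pipeline, table, key, params):
    """Create a table entry visible only inside `netns`."""
    state[netns][pipeline][table][key] = params

def read(netns, pipeline, table):
    """Dump a table as seen from `netns`."""
    return dict(state[netns][pipeline][table])

# Same program, table and key in two namespaces; different state in each.
create("ns1", "myprog", "mytable", "10.0.1.2/32", {"port": "eno1"})
create("ns2", "myprog", "mytable", "10.0.1.2/32", {"port": "eno2"})

print(read("ns1", "myprog", "mytable"))
print(read("ns2", "myprog", "mytable"))
```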

__References__

[1]https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
[2]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#historical-perspective-for-p4tc
[3]https://2023p4workshop.sched.com/event/1KsAe/p4tc-linux-kernel-p4-implementation-approaches-and-evaluation
[4]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#so-why-p4-and-how-does-p4-help-here
[5]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#mf59be7abc5df3473cff3879c8cc3e2369c0640a6
[6]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#m783cfd79e9d755cf0e7afc1a7d5404635a5b1919
[7]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#ma8c84df0f7043d17b98f3d67aab0f4904c600469
[8]https://github.com/p4lang/p4c/tree/main/backends/tc
[9]https://p4.org/
[10]https://www.intel.com/content/www/us/en/products/details/network-io/ipu/e2000-asic.html
[11]https://www.amd.com/en/accelerators/pensando
[12]https://github.com/sonic-net/DASH/tree/main
[13]https://github.com/p4lang/p4c/tree/main/backends/ebpf
[14]https://netdevconf.info/0x17/sessions/talk/integrating-ebpf-into-the-p4tc-datapath.html
[15]https://dl.acm.org/doi/10.1145/3630047.3630193
[16]https://lore.kernel.org/netdev/20231216123001.1293639-1-jiri@resnulli.us/
[17.a]https://netdevconf.info/0x13/session.html?talk-tc-u-classifier
[17.b]man tc-u32
[18]man tc-pedit
[19]https://lore.kernel.org/netdev/20231219181623.3845083-6-victor@mojatatu.com/T/#m86e71743d1d83b728bb29d5b877797cb4942e835
[20.a]https://netdevconf.info/0x16/sessions/talk/your-network-datapath-will-be-p4-scripted.html
[20.b]https://netdevconf.info/0x16/sessions/workshop/p4tc-workshop.html

--------
HISTORY
--------

Changes in Version 14
----------------------
1) #UNDEF HWRITE/HREAD and remove unnecessary checks (Paolo)
2) Remove const cast added in v13 as a result of changes suggested
   by Paolo (Marcelo)
3) Introduce type validate for s8 caught as a result of audit from #1
4) S/GFP_KERNEL/GFP_KERNEL_ACCOUNT for types and runtime objects (Paolo)
5) Syzkaller caught an invalid netlink attribute bug that has existed
   since v5! As noted in patch0 we've been running syzkaller for months.
6) Add Marcelo's reviewed-by for patch 14 and Toke's ACK to the series.

Changes in Version 13
----------------------

1) Remove ops->print() from p4 types (Paolo).

2) Use mutex instead of rwlock for dynamic actions since rwlock is
   discouraged these days(Paolo).

3) Constify action init_ops() ops parameter (Paolo).

4) Use struct sk_buff in kfunc instead of struct __sk_buff (Martin)
   Use struct xdp_buff in kfunc instead of struct xdp_md (Martin)

5) Replace BTF_SET8_START with BTF_KFUNCS_START and replace
   BTF_SET8_END with BTF_KFUNCS_END (Martin)

6) Add params__sz argument to all kfuncs to guard against future change
   to parameter structures being passed between bpf and tc. For kfunc
   xdp/bpf_p4tc_entry_create() we already had the max(5) allowed number of
   parameters. To work around this we had to merge two structs together
   in order to maintain the number of params to 5 (Martin).

7) Add more info on commit log to explain the relation between the kfuncs
   and TC for patch #14 (Martin).

Changes in Version 12
----------------------

0) Introduce back 15 patches (v11 had 5)

1) From discussions with Daniel:
   i) Remove the XDP programs association altogether. No refcounting. nothing.
   ii) Remove prog type tc - everything is now an ebpf tc action.

2) s/PAD0/__pad0/g. Thanks to Marcelo.

3) Add extack to specify how many entries (N of M) specified in a batch for
   any of requested Create/Update/Delete succeeded. Prior to this it would
   only tell us the batch failed to complete without giving us details of
   which of M failed. Added as a debug aid.

Changes in Version 11
----------------------
1) Split the series into two. Original patches 1-5 in this patchset. The rest
   will go out after this is merged.

2) Change any references of IFNAMSIZ in the action code when referencing the
   action name size to ACTNAMSIZ. Thanks to Marcelo.

Changes in Version 10
----------------------
1) A couple of patches from the earlier version were clean enough to submit,
   so we did. This gave us room to split the two largest patches each into
   two. Even though the split is not git-bisectable and really some of it didn't
   make much sense (eg splitting a create, and update in one patch and delete and
   get into another) we made sure each of the split patches compiled
   independently. The idea is to reduce the number of lines of code to review
   and when we get sufficient reviews we will put the splits together again.
   See patch #12 and #13 as well as patches #7 and #8.

2) Add more context in patch 0. Please READ!

3) Added dump/delete filters back to the code - we had taken them out in the
   earlier patches to reduce the amount of code for review - but in retrospect
   we feel they are important enough to push earlier rather than later.


Changes In version 9
---------------------

1) Remove the largest patch (externs) to ease review.

2) Break up action patches into two to ease review bringing down the patches
   that need more scrutiny to 8 (the first 7 are almost trivial).

3) Fixup prefix naming convention to p4tc_xxx for uapi and p4a_xxx for actions
   to provide consistency(Jiri).

4) Silence sparse warning "was not declared. Should it be static?" for kfuncs
   by making them static. TBH, not sure if this is the right solution
   but it makes sparse happy and hopefully someone will comment.

Changes In Version 8
---------------------

1) Fix all the patchwork warnings and improve our ci to catch them in the future

2) Reduce the number of patches to basic max(15)  to ease review.

Changes In Version 7
-------------------------

0) First time removing the RFC tag!

1) Removed XDP cookie. It turns out as was pointed out by Toke(Thanks!) - that
using bpf links was sufficient to protect us from someone replacing or deleting
an eBPF program after it has been bound to a netdev.

2) Add some reviewed-bys from Vlad.

3) Small bug fixes from v6 based on testing for ebpf.

4) Added the counter extern as a sample extern. Illustrating this example because
   it is slightly complex since it is possible to invoke it directly from
   the P4TC domain (in case of direct counters) or from eBPF (indirect counters).
   It is not exactly the most efficient implementation (a reasonable counter impl
   should be per-cpu).

Changes In RFC Version 6
-------------------------

1) Completed integration from scriptable view to eBPF. Completed integration
   of externs.

2) Small bug fixes from v5 based on testing.

Changes In RFC Version 5
-------------------------

1) More integration from scriptable view to eBPF. Small bug fixes from last
   integration.

2) More streamlining support of externs via kfunc (create-on-miss, etc)

3) eBPF linking for XDP.

There is more eBPF integration/streamlining coming (we are getting close to
conversion from scriptable domain).

Changes In RFC Version 4
-------------------------

1) More integration from scriptable to eBPF. Small bug fixes.

2) More streamlining support of externs via kfunc (one additional kfunc).

3) Removed per-cpu scratchpad per Toke's suggestion and instead use XDP metadata.

There is more eBPF integration coming. One thing we looked at but is not in this
patchset but should be in the next is use of eBPF link in our loading (see
"challenge #1" further below).

Changes In RFC Version 3
-------------------------

These patches are still in a little bit of flux as we adjust to integrating
eBPF. So there are small constructs that are used in V1 and 2 but no longer
used in this version. We will make a V4 which will remove those.
The changes from V2 are as follows:

1) Feedback we got in V2 is to try to stick to one of the two modes. In this version
we are taking one more step and going the path of mode2 vs v2 where we had 2 modes.

2) The P4 Register extern is no longer standalone. Instead, as part of integrating
into eBPF we introduce another kfunc which encapsulates Register as part of the
extern interface.

3) We have improved our CICD to include tools pointed to us by Simon. See
   "Testing" further below. Thanks to Simon for that and other issues he caught.
   Simon, we discussed on issue [7] but decided to keep that log since we think
   it is useful.

4) A lot of small cleanups. Thanks Marcelo. There are two things we need to
   re-discuss though; see: [5], [6].

5) We removed the need for a range of IDs for dynamic actions. Thanks Jakub.

6) Clarify ambiguity caused by smatch in an if(A) else if(B) condition. We are
   guaranteed that either A or B must exist; however, lets make smatch happy.
   Thanks to Simon and Dan Carpenter.

Changes In RFC Version 2
-------------------------

Version 2 is the initial integration of the eBPF datapath.
We took into consideration suggestions provided to use eBPF and put effort into
analyzing eBPF as datapath which involved extensive testing.
We implemented 6 approaches with eBPF and ran performance analysis and presented
our results at the P4 2023 workshop in Santa Clara[see: 1, 3] on each of the 6
vs the scriptable P4TC and concluded that 2 of the approaches are sensible (4 if
you account for XDP or TC separately).

Conclusions from the exercise: We lose the simple operational model we had
prior to integrating eBPF. We do gain performance in most cases when the
datapath is less compute-bound.
For more discussion on our requirements vs journeying the eBPF path please
scroll down to "Restating Our Requirements" and "Challenges".

This patch set presented two modes.
mode1: the parser is entirely based on eBPF - whereas the rest of the
SW datapath stays as _scriptable_ as in Version 1.
mode2: All of the kernel s/w datapath (including parser) is in eBPF.

The key ingredient for eBPF, that we did not have access to in the past, is
kfunc (it made a big difference for us to reconsider eBPF).

In V2 the two modes are mutually exclusive (IOW, you get to choose one
or the other via Kconfig).

Jamal Hadi Salim (15):
  net: sched: act_api: Introduce P4 actions list
  net/sched: act_api: increase action kind string length
  net/sched: act_api: Update tc_action_ops to account for P4 actions
  net/sched: act_api: add struct p4tc_action_ops as a parameter to
    lookup callback
  net: sched: act_api: Add support for preallocated P4 action instances
  p4tc: add P4 data types
  p4tc: add template API
  p4tc: add template pipeline create, get, update, delete
  p4tc: add template action create, update, delete, get, flush and dump
  p4tc: add runtime action support
  p4tc: add template table create, update, delete, get, flush and dump
  p4tc: add runtime table entry create and update
  p4tc: add runtime table entry get, delete, flush and dump
  p4tc: add set of P4TC table kfuncs
  p4tc: add P4 classifier

 include/linux/bitops.h            |    1 +
 include/net/act_api.h             |   23 +-
 include/net/p4tc.h                |  714 +++++++
 include/net/p4tc_types.h          |   89 +
 include/net/tc_act/p4tc.h         |   79 +
 include/uapi/linux/p4tc.h         |  465 +++++
 include/uapi/linux/pkt_cls.h      |   15 +
 include/uapi/linux/rtnetlink.h    |   18 +
 include/uapi/linux/tc_act/tc_p4.h |   11 +
 net/sched/Kconfig                 |   23 +
 net/sched/Makefile                |    3 +
 net/sched/act_api.c               |  192 +-
 net/sched/cls_api.c               |    2 +-
 net/sched/cls_p4.c                |  305 +++
 net/sched/p4tc/Makefile           |    8 +
 net/sched/p4tc/p4tc_action.c      | 2419 +++++++++++++++++++++++
 net/sched/p4tc/p4tc_bpf.c         |  360 ++++
 net/sched/p4tc/p4tc_filter.c      | 1012 ++++++++++
 net/sched/p4tc/p4tc_pipeline.c    |  700 +++++++
 net/sched/p4tc/p4tc_runtime_api.c |  145 ++
 net/sched/p4tc/p4tc_table.c       | 1820 +++++++++++++++++
 net/sched/p4tc/p4tc_tbl_entry.c   | 3071 +++++++++++++++++++++++++++++
 net/sched/p4tc/p4tc_tmpl_api.c    |  440 +++++
 net/sched/p4tc/p4tc_types.c       | 1213 ++++++++++++
 net/sched/p4tc/trace.c            |   10 +
 net/sched/p4tc/trace.h            |   44 +
 security/selinux/nlmsgtab.c       |   10 +-
 27 files changed, 13156 insertions(+), 36 deletions(-)
 create mode 100644 include/net/p4tc.h
 create mode 100644 include/net/p4tc_types.h
 create mode 100644 include/net/tc_act/p4tc.h
 create mode 100644 include/uapi/linux/p4tc.h
 create mode 100644 include/uapi/linux/tc_act/tc_p4.h
 create mode 100644 net/sched/cls_p4.c
 create mode 100644 net/sched/p4tc/Makefile
 create mode 100644 net/sched/p4tc/p4tc_action.c
 create mode 100644 net/sched/p4tc/p4tc_bpf.c
 create mode 100644 net/sched/p4tc/p4tc_filter.c
 create mode 100644 net/sched/p4tc/p4tc_pipeline.c
 create mode 100644 net/sched/p4tc/p4tc_runtime_api.c
 create mode 100644 net/sched/p4tc/p4tc_table.c
 create mode 100644 net/sched/p4tc/p4tc_tbl_entry.c
 create mode 100644 net/sched/p4tc/p4tc_tmpl_api.c
 create mode 100644 net/sched/p4tc/p4tc_types.c
 create mode 100644 net/sched/p4tc/trace.c
 create mode 100644 net/sched/p4tc/trace.h

Comments

Jamal Hadi Salim April 4, 2024, 12:44 p.m. UTC | #1
On Thu, Apr 4, 2024 at 8:23 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
>
> This is the first patchset of two. In this patch we are submitting 15 which
> cover the minimal viable P4 PNA architecture.
> Please, if you want to discuss a slightly tangential subject like offload
> or even your politics then start another thread with a different subject
> line.  The way you do it is to change the subject line to for example
> "<Your New Subject here> (WAS: <original subject line here>)".
>
> In this cover letter i am restoring text i took out in V10 which stated
> "our requirements".
>
> Martin, please look at patch 14 again. The bpf selftests for kfuncs are
> slotted for series 2. Paolo, please take a look at 1, 3, 6 for the changes
> you suggested. Marcelo, because we made changes to patch 14, I have
> removed your reviewed-by. Can you please take another look at that patch?

Sorry, Marcelo - you already reviewed and we restored your reviewed-by.

cheers,
jamal

>
> __Description of these Patches__
>
> These Patches are constrained entirely within the TC domain with very tiny
> changes made in patch 1-5. eBPF is used as an infrastructure component for
> the software datapath and no changes are made to any eBPF code, only kfuncs
> are introduced in patch 14.
>
> Patch #1 adds infrastructure for per-netns P4 actions that can be created on
> an as-needed basis for the P4 program requirement. This patch makes a small
> incision into act_api. Patches 2-4 are minimalist enablers for P4TC and have
> no effect on the classical tc action (example patch#2 just increases the size
> of the action names from 16->64B).
> Patch 5 adds infrastructure support for preallocation of dynamic actions
> needed for P4.
>
> The core P4TC code implements several P4 objects.
> 1) Patch #6 introduces P4 data types which are consumed by the rest of the
>    code
> 2) Patch #7 introduces the templating API. i.e. CRUD commands for templates
> 3) Patch #8 introduces the concept of templating Pipelines. i.e CRUD
>    commands for P4 pipelines.
> 4) Patch #9 introduces the action templates and associated CRUD commands.
> 5) Patch #10 introduces the action runtime infrastructure.
> 6) Patch #11 introduces the concept of P4 table templates and associated
>    CRUD commands for tables.
> 7) Patch #12 introduces runtime table entry infra and associated CU
>    commands.
> 8) Patch #13 introduces runtime table entry infra and associated RD
>    commands.
> 9) Patch #14 introduces interaction of eBPF to P4TC tables via kfunc.
> 10) Patch #15 introduces the TC classifier P4 used at runtime.
>
> There are a few more patches not in this patchset that deal with externs,
> test cases, etc.
>
> What is P4?
> -----------
>
> The Programming Protocol-independent Packet Processors (P4) is an open
> source, domain-specific programming language for specifying data plane
> behavior.
>
> The current P4 landscape includes an extensive range of deployments,
> products, projects and services, etc[9][12]. Two major NIC vendors,
> Intel[10] and AMD[11] currently offer P4-native NICs. P4 is currently
> curated by the Linux Foundation[9].
>
> A lot more on why P4 - see small treatise here:[4].
>
> What is P4TC?
> -------------
>
> P4TC is a net-namespace aware P4 implementation over TC; meaning, a P4
> program and its associated objects and state are attached to a kernel
> _netns_ structure.
> IOW, if we had two programs across netns' or within a netns they have no
> visibility to each other's objects (unlike for example TC actions whose
> kinds are "global" in nature or eBPF maps vis-a-vis bpftool).
>
> P4TC builds on top of many years of Linux TC experiences of a netlink
> control path interface coupled with a software datapath with an equivalent
> offloadable hardware datapath. In this patch series we are focussing only
> on the s/w datapath. The s/w and h/w path equivalence that TC provides is
> relevant for a primary use case of P4 where some (currently) large consumers
> of NICs provide vendors their datapath specs in P4. In such a case one could
> generate specified datapaths in s/w and test/validate the requirements
> before hardware acquisition (example [12]).
>
> Unlike other approaches such as TC Flower which require kernel and user
> space changes when new datapath objects like packet headers are introduced
> P4TC requires zero kernel or user space changes. We refer to this as:
> _kernel and user space code change independence_.
> Meaning:
> A P4 program describes headers, how to parse, etc alongside prescribing
> the datapath processing logic; the compiler uses the P4 program as input
> and generates several artifacts which are then loaded into the kernel to
> manifest the intended datapath. In addition to the generated datapath,
> control path constructs are generated. The process is described further
> below in "P4TC Workflow".
>
> Some History
> ------------
>
> There have been many discussions and meetings within the community since
> about 2015 in regards to P4 over TC[2] and we are finally proving to the
> naysayers that we do get stuff done!
>
> A lot more of the P4TC motivation is captured at:
> https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md
>
> __P4TC Architecture__
>
> The current architecture was described at netdevconf 0x17[14] and if you
> prefer academic conference papers, a short paper is available here[15].
>
> There are 4 parts:
>
> 1) A Template CRUD provisioning API for manifesting a P4 program and its
> associated objects in the kernel. The template provisioning API uses
> netlink.  See patch in part 2.
>
> 2) A Runtime CRUD+ API code which is used for controlling the different
> runtime behavior of the P4 objects. The runtime API uses netlink. See notes
> further down. See patch descriptions...
>
> 3) P4 objects and their control interfaces: tables, actions, externs, etc.
> Any object that requires control plane interaction resides in the TC domain
> and is subject to the CRUD runtime API.  The intended goal is to make use
> of the tc semantics of skip_sw/hw to target P4 program objects either in s/w
> or h/w.
>
> 4) S/W Datapath code hooks. The s/w datapath is eBPF based and is generated
> by a compiler based on the P4 spec. When accessing any P4 object that
> requires control plane interfaces, the eBPF code accesses the P4TC side
> from #3 above using kfuncs.
>
> The generated eBPF code is derived from [13] with enhancements and fixes to
> meet our requirements.
>
> __P4TC Workflow__
>
> The Development and instantiation workflow for P4TC is as follows:
>
>   A) A developer writes a P4 program, "myprog"
>
>   B) Compiles it using the P4C compiler[8]. The compiler generates 3
>      outputs:
>
>      a) A shell script which forms template definitions for the different P4
>         objects "myprog" utilizes (tables, externs, actions etc). See #1
>         above
>
>      b) The parser and the rest of the datapath are generated as eBPF and
>         need to be compiled into binaries. At the moment the parser and the
>         main control block are generated as separate eBPF programs but this
>         could change in the future (without affecting any kernel code).
>         See #4 above.
>
>      c) A json introspection file used for the control plane
>         (by iproute2/tc).
>
>   C) At this point the artifacts from #1,#4 could be handed to an operator
>      (the operator could be the same person as the developer from #A, #B).
>
>      i) For the eBPF part, either the operator is handed an ebpf binary or
>      source which they compile at this point into a binary.
>      The operator executes the shell script(s) to manifest the functional
>      "myprog" into the kernel.
>
>      ii) The operator instantiates "myprog" pipeline via the tc P4 filter
>      to ingress/egress (depending on P4 arch) of one or more netdevs/ports
>      (illustrated below as "block 22").
>
>      Example instantiation where the parser is a separate action:
>        "tc filter add block 22 ingress protocol all prio 10 \
>         p4 pname myprog \
>         action bpf obj $PARSER.o section p4tc/parse \
>         action bpf obj $PROGNAME.o section p4tc/main"
>
> See individual patches in partc for more examples tc vs xdp etc. Also see
> section on "challenges" (further below on this cover letter).
>
> Once "myprog" P4 program is instantiated one can start performing operations
> on table entries and/or actions at runtime as described below.
>
> __P4TC Runtime Control Path__
>
> The control interface builds on past tc experience and tries to get things
> right from the beginning (example filtering is separated from depending
> on existing object TLVs and made generic); also the code is written in
> such a way it is mostly lockless.
>
> The P4TC control interface, using netlink, provides what we call a CRUDPS
> abstraction which stands for: Create, Read(get), Update, Delete, Subscribe,
> Publish.  From a high level PoV the following describes a conformant high
> level API (both on netlink data model and code level):
>
>         Create(</path/to/object, DATA>+)
>         Read(</path/to/object>, [optional filter])
>         Update(</path/to/object, DATA>+)
>         Delete(</path/to/object>, [optional filter])
>         Subscribe(</path/to/object>, [optional filter])
>
> Note, we _don't_ treat "dump" or "flush" as special. If "path/to/object"
> points to a table then a "Delete" implies "flush" and a "Read" implies dump
> but if it points to an entry (by specifying a key) then "Delete" implies
> deleting an entry and "Read" implies reading that single entry. It should
> be noted that both "Delete" and "Read" take an optional filter parameter.
> The filter can define further refinements to what the control plane wants
> read or deleted.
> "Subscribe" uses built in netlink event management. It, as well, takes a
> filter which can further refine what events get generated to the control
> plane (taken out of this patchset, to be re-added with consideration of
> [16]).
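
To make the table-vs-entry semantics above concrete, here is a minimal Python sketch. It is only a toy in-memory model, not the netlink code in these patches: the `Table` class and the `read`/`delete` helpers are hypothetical names, and the filter is assumed to be a plain predicate over (key, value).

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Toy stand-in for a P4 table: maps an entry key to its action data."""
    entries: dict = field(default_factory=dict)


def read(table, key=None, flt=None):
    """Read one entry when a key is given; otherwise dump (optionally filtered)."""
    if key is not None:
        return {key: table.entries[key]} if key in table.entries else {}
    return {k: v for k, v in table.entries.items() if flt is None or flt(k, v)}


def delete(table, key=None, flt=None):
    """Delete one entry when a key is given; otherwise flush (optionally filtered)."""
    if key is not None:
        return 1 if table.entries.pop(key, None) is not None else 0
    doomed = [k for k, v in table.entries.items() if flt is None or flt(k, v)]
    for k in doomed:
        del table.entries[k]
    return len(doomed)


mytable = Table()
mytable.entries["10.0.1.2/32"] = {"action": "send_to_port", "port": "eno1"}
mytable.entries["10.0.2.2/32"] = {"action": "send_to_port", "port": "eno2"}
mytable.entries["10.1.1.2/32"] = {"action": "send_to_port", "port": "eno1"}

single = read(mytable, key="10.0.2.2/32")         # path names an entry
dump = read(mytable)                              # path names the table
eno1_only = read(mytable, flt=lambda k, v: v["port"] == "eno1")
removed_one = delete(mytable, key="10.0.1.2/32")  # delete a single entry
flushed = delete(mytable)                         # flush whatever is left
```

The point of the sketch is that one verb covers both cases; whether it acts on one entry or the whole table depends only on what the path names.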
>
> Let's show some runtime samples:
>
> ..create an entry, if we match ip address 10.0.1.2 send packet out eno1
>   tc p4ctrl create myprog/table/mytable \
>    dstAddr 10.0.1.2/32 action send_to_port param port eno1
>
> ..Batch create entries
>   tc p4ctrl create myprog/table/mytable \
>   entry dstAddr 10.1.1.2/32  action send_to_port param port eno1 \
>   entry dstAddr 10.1.10.2/32  action send_to_port param port eno10 \
>   entry dstAddr 10.0.2.2/32  action send_to_port param port eno2
>
> ..Get an entry (note "read" is interchangeably used as "get" which is a
> common semantic in tc):
>   tc p4ctrl read myprog/table/mytable \
>    dstAddr 10.0.2.2/32
>
> ..dump mytable
>   tc p4ctrl read myprog/table/mytable
>
> ..dump mytable for all entries whose key fits within 10.1.0.0/16
>   tc p4ctrl read myprog/table/mytable \
>   filter key/myprog/mytable/dstAddr = 10.1.0.0/16
>
> ..dump all mytable entries which have an action send_to_port with param "eno1"
>   tc p4ctrl get myprog/table/mytable \
>   filter param/act/myprog/send_to_port/port = "eno1"
>
> The filter expression is powerful, e.g. you could say:
>
>   tc p4ctrl get myprog/table/mytable \
>   filter param/act/myprog/send_to_port/port = "eno1" && \
>          key/myprog/mytable/dstAddr = 10.1.0.0/16
>
> It also works on built in metadata, example in the following case dumping
> entries from mytable that have seen activity in the last 10 secs:
>   tc p4ctrl get myprog/table/mytable \
>   filter msecs_since < 10000
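
As a rough illustration of what the three filter samples above select, the following Python sketch applies equivalent predicates to a hand-made list of entries. The entry layout (`dstAddr`, `params`, `msecs_since` as dict fields) and the helper names are assumptions made for this example; the real filter is evaluated by the kernel on netlink attributes.

```python
import ipaddress


def key_within(entry, prefix):
    """True when the entry's dstAddr prefix lies inside `prefix`."""
    return ipaddress.ip_network(entry["dstAddr"]).subnet_of(
        ipaddress.ip_network(prefix))


def param_is(entry, name, value):
    """True when the entry's action parameter `name` equals `value`."""
    return entry["params"].get(name) == value


def active_within(entry, msecs):
    """True when the entry has seen activity in the last `msecs` milliseconds."""
    return entry["msecs_since"] < msecs


entries = [
    {"dstAddr": "10.1.1.2/32", "params": {"port": "eno1"}, "msecs_since": 500},
    {"dstAddr": "10.1.10.2/32", "params": {"port": "eno10"}, "msecs_since": 20000},
    {"dstAddr": "10.0.2.2/32", "params": {"port": "eno2"}, "msecs_since": 9000},
]

# port = "eno1" && dstAddr = 10.1.0.0/16
both = [e["dstAddr"] for e in entries
        if param_is(e, "port", "eno1") and key_within(e, "10.1.0.0/16")]

# msecs_since < 10000
recent = [e["dstAddr"] for e in entries if active_within(e, 10000)]
```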
>
> Delete follows the same syntax as get/read, so for sake of brevity we won't
> show more example than how to flush mytable:
>
>   tc p4ctrl delete myprog/table/mytable
>
> Mystery question: How do we achieve iproute2-kernel independence and
> how does "tc p4ctrl" as a cli know how to program the kernel given an
> arbitrary command line as shown above? Answer(s): It queries the
> compiler generated json file in "P4TC Workflow" #B.c above. The json file
> has enough details to figure out that we have a program called "myprog"
> which has a table "mytable" that has a key name "dstAddr" which happens to
> be type ipv4 address prefix. The json file also provides details to show
> that the table "mytable" supports an action called "send_to_port" which
> accepts a parameter "port" of type netdev (see the types patch for all
> supported P4 data types).
> All P4 components have names, IDs, and types - so this makes it very easy
> to map into netlink.
> Once user space tc/p4ctrl validates the human command input, it creates
> standard binary netlink structures (TLVs etc) which are sent to the kernel.
> See the runtime table entry patch for more details.
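
A minimal Python sketch of the lookup described above, i.e. how a CLI could resolve names from the command line to types using the introspection file. The JSON layout and the type strings ("ipv4", "dev") are invented for this example and are not the compiler's actual schema; `find` and `resolve` are hypothetical helpers.

```python
import json

# Hypothetical introspection content; the real compiler output differs.
INTROSPECTION = json.loads("""
{
  "pipeline": "myprog",
  "tables": [
    {
      "name": "mytable",
      "keys": [{"name": "dstAddr", "type": "ipv4"}],
      "actions": [
        {"name": "send_to_port", "params": [{"name": "port", "type": "dev"}]}
      ]
    }
  ]
}
""")


def find(items, name, what):
    """Return the item called `name`, or raise ValueError saying what is missing."""
    for item in items:
        if item["name"] == name:
            return item
    raise ValueError(f"unknown {what}: {name}")


def resolve(doc, path, key_name, action_name, param_name):
    """Map CLI names to the (key type, param type) pair needed for encoding."""
    prog, kind, table_name = path.split("/")
    if prog != doc["pipeline"] or kind != "table":
        raise ValueError(f"unknown path: {path}")
    table = find(doc["tables"], table_name, "table")
    key = find(table["keys"], key_name, "key")
    action = find(table["actions"], action_name, "action")
    param = find(action["params"], param_name, "param")
    return key["type"], param["type"]


types = resolve(INTROSPECTION, "myprog/table/mytable",
                "dstAddr", "send_to_port", "port")
```

Because every name is resolved from the file, a command that names an unknown table, key, action or parameter can be rejected in user space before any netlink message is built.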
>
> __P4TC Datapath__
>
> The P4TC s/w datapath execution is generated as eBPF. Any objects that
> require control interfacing reside in the "P4TC domain" and are controlled
> via netlink as described above. Per packet execution and state and even
> objects that do not require control interfacing (like the P4 parser) are
> generated as eBPF.
>
> A packet arriving on s/w ingress of any of the ports on block 22
> (illustrated in section "P4TC Workflow" above) will first be exercised via
> the (generated eBPF) parser component to extract the headers (the ip
> destination address labeled "dstAddr" above in section "P4TC Runtime
> Control Path"). The datapath then proceeds to use "dstAddr", table ID
> and pipeline ID as a key to do a lookup in myprog's "mytable" which returns
> the action params which are then used to execute the action in the eBPF
> datapath (eventually sending out packets to eno1).
> On a table miss, mytable's default miss action (not described) is executed.
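
The lookup step above can be pictured with the following Python sketch. It is a toy model only: longest-prefix matching is an assumption based on the key being an IPv4 prefix, the numeric pipeline and table IDs are made up, and the "drop" miss action is a placeholder since the default miss action is not described here.

```python
import ipaddress

# Placeholder miss action; the real default miss action is not described above.
MISS = ("drop", {})

# (pipeline_id, table_id) -> list of (prefix, (action name, action params))
tables = {
    (1, 1): [
        (ipaddress.ip_network("10.0.1.2/32"), ("send_to_port", {"port": "eno1"})),
        (ipaddress.ip_network("10.1.0.0/16"), ("send_to_port", {"port": "eno10"})),
    ],
}


def lookup(tables, pipeline_id, table_id, dst_addr, miss=MISS):
    """Return the action of the longest matching prefix, or the miss action."""
    addr = ipaddress.ip_address(dst_addr)
    best = None
    for network, action in tables.get((pipeline_id, table_id), []):
        if addr in network and (best is None
                                or network.prefixlen > best[0].prefixlen):
            best = (network, action)
    return best[1] if best is not None else miss


hit = lookup(tables, 1, 1, "10.0.1.2")     # matches the /32 entry
missed = lookup(tables, 1, 1, "192.0.2.1")  # no prefix covers this address
other = lookup(tables, 2, 1, "10.0.1.2")    # different pipeline, no such table
```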
>
> __Testing__
>
> Speaking of testing - we have 200-300 tdc test cases (which will be in the
> second patchset).
> These tests are run on our CICD system on pull requests and after commits
> are approved. The CICD does a lot of other tests (more since v2, thanks to
> Simon's input) including:
> checkpatch, sparse, smatch, coccinelle, 32 bit and 64 bit builds tested on
> X86, ARM 64 and emulated BE via qemu s390. We trigger performance
> testing in the CICD to catch performance regressions (currently only on
> the control path, but in the future for the datapath).
> Syzkaller runs 24/7 on dedicated hardware, originally we focussed only on
> memory sanitizer but recently added support for concurrency sanitizer.
> Before main releases we ensure each patch will compile on its own to help
> in git bisect and run the xmas tree tool. We eventually put the code via
> coverity.
>
> In addition we are working on enabling a tool that will take a P4 program,
> run it through the compiler, and generate permutations of traffic patterns
> via symbolic execution that will test both positive and negative datapath
> code paths. The test generator tool integration is still work in progress.
> Also: We have other code that test parallelization etc which we are trying
> to find a fit for in the kernel tree's testing infra.
>
> __Restating Our Requirements__
>
> This code is not intrusive at all because it only touches TC.
> We would like to emphasize that we see eBPF as _infrastructure tooling
> available to us and not the end goal_. Please help us with technical input
> on for example how we can do better kfuncs, etc. If you want to critique,
> then our requirements should be your guide and please be considerate that
> this is about P4, not eBPF. IOW:
> We would appreciate technical commentary instead of bikeshedding on how
> _you_ would have implemented this probably with more eBPF or some other
> clever tricks. It is sad to see there was zero input from anyone in the eBPF
> world for 7 RFC postings (in a period of 9 months).
> If I am ranting here it is because we have spent over a year now on this
> topic - we have taken the initial input and have given you eBPF. So let's
> make progress please.
>
> The initial release was presented in October 2022[20] and RFC in January
> 2023 had a "scriptable" datapath (the idea built on the u32 classifier[17]
> and pedit action[18] approach). Post RFC V1, we made changes to fit the
> feedback to integrate eBPF to replace the "scriptable" software datapath.
> On our part, the goal for the change was to meet folks in the middle as a
> compromise.
> No regrets on the journey: after all the effort we ended up
> getting XDP which was not in the original picture. Some of our efforts are
> captured at [1][3] and in the patch history.
>
> In this section we review the original scriptable version against the
> current implementation which uses eBPF and in the process re-enumerate our
> requirements.
>
> To be very clear: Our intention for P4TC is to target _the TC crowd_.
> Essentially developers and ops people already familiar and deploying TC
> based infra.
> More importantly the original intent for P4TC was to enable _ops folks_
> more than devs (given code is being generated and doesn't need humans to
> write it).
>
> With TC, we gain the whole "familiar" package of match-action pipeline
> abstraction++, meaning from the control plane(see discussion above) all
> the way to the tooling infra, i.e iproute2/tc cli, netlink infra interface
> (request/response, event subscribe/multicast-publish, congestion control
> etc), s/w and h/w symbiosis, the autonomous kernel control, etc.
> The main advantage over vendor specific implementations(which is the current
> alternative) is: with P4TC we have a singular vendor-neutral interface via
> the kernel using well understood mechanisms that have gained learnings from
> deployment experience.
>
> So let's list some of these requirements and compare whether moving to eBPF
> affected us or gave us an advantage.
>
> 0) Understood Control Plane semantics
>
> This requirement is unaffected.
> The control plane remains as netlink and therefore we get the classical
> multi-user CRUD+Publish/subscribe APIs built in.
>
> 1) Must support SW/HW equivalence
>
> This requirement is unaffected. The control plane is netlink. Any semantics
> to select between sw and hw via skip_sw/hw semantics is maintained.
>
> 2) Supporting expressibility of the universe set of P4 progs
>
> It is a must to support 100% of all possible P4 programs. In the past the
> eBPF verifier, for example in [13], had to be worked around and even then
> there are cases where we couldn't avoid path explosion when branching is
> involved and failed to run. So we were skeptical about using eBPF to begin
> with.
> Kfuncs changed our minds. Note, there are still challenges running all
> potential P4 programs at the XDP level - but the pipeline could be split
> between XDP and TC in such cases. The compiler can be told to generate
> pieces that run on XDP and others on TC (see examples).
> Summary: This requirement is unaffected.
>
> 3) Operational usability
>
> By maintaining the TC control plane (even in presence of eBPF datapath)
> runtime aspects remain unchanged. So for our target audience of folks
> who have deployed tc, including offloads, the comfort zone is unchanged.
>
> There is some loss in operational usability because we now have more knobs:
> the extra compilation, loading and syncing of ebpf binaries, etc.
> IOW, I can no longer just ship someone a shell script(ascii) in an email
> and say "go run this and "myprog" will just work".
>
> 4) Operational and development Debuggability
>
> If something goes wrong, the tc craftsperson is now required to have
> additional knowledge of eBPF code and process.
> Our intent is to compensate for this challenge with debug tools that ease the
> craftsperson's debugging.
>
> 5) Opportunity for rapid prototyping of new ideas
>
> This is not exactly a requirement but something that became a useful
> feature during the P4TC development phase. When the compiler was lagging
> behind in features we would often handcode the template scripts.
> Then you would dump back the template from the kernel and do a diff to
> ensure the kernel didn't get something wrong. Essentially, this was a nice
> debug feature. During development, we wrote scripts that covered a range of
> P4 architectures(PSA, V1, etc) which required no kernel code changes.
>
> Over time the debug feature morphed into: a) start by handcoding scripts
> then b) read it back and then c) generate the P4 code.
> It means one could start with the template scripts outside of the
> constraints of a P4 architecture spec(PNA/PSA) or even within a P4
> architecture then test some ideas and eventually feed back the concepts to
> the compiler authors or modify or create a new P4 architecture and share
> with the P4 standards folks.
>
> To summarize in presence of eBPF: The debugging idea is probably still
> alive.  One could dump, with proper tooling(bpftool for example), the
> loaded eBPF code and be able to check for differences. But this is not the
> interesting part.
> The concept of going back from what's in the kernel to P4 is a lot more
> difficult to implement mostly due to scoping of DSL vs general purpose. It
> may be lost.  We have been discussing ways to use BTF and embedding
> annotations in the eBPF code and binary but more thought is required and we
> welcome suggestions.
>
> 6) Supporting per namespace program
>
> In P4TC every program and its associated objects have unique IDs which are
> generated by the compiler. Multiple or the same P4 program(s) can run
> independently in different namespaces alongside their appropriate state and
> object instance parameterization (despite name or ID collision).
> This requirement is still met (by virtue of keeping P4 program control
> objects within the TC domain and attaching to a netns).
>
> __References__
>
> [1]https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
> [2]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#historical-perspective-for-p4tc
> [3]https://2023p4workshop.sched.com/event/1KsAe/p4tc-linux-kernel-p4-implementation-approaches-and-evaluation
> [4]https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#so-why-p4-and-how-does-p4-help-here
> [5]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#mf59be7abc5df3473cff3879c8cc3e2369c0640a6
> [6]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#m783cfd79e9d755cf0e7afc1a7d5404635a5b1919
> [7]https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@mojatatu.com/T/#ma8c84df0f7043d17b98f3d67aab0f4904c600469
> [8]https://github.com/p4lang/p4c/tree/main/backends/tc
> [9]https://p4.org/
> [10]https://www.intel.com/content/www/us/en/products/details/network-io/ipu/e2000-asic.html
> [11]https://www.amd.com/en/accelerators/pensando
> [12]https://github.com/sonic-net/DASH/tree/main
> [13]https://github.com/p4lang/p4c/tree/main/backends/ebpf
> [14]https://netdevconf.info/0x17/sessions/talk/integrating-ebpf-into-the-p4tc-datapath.html
> [15]https://dl.acm.org/doi/10.1145/3630047.3630193
> [16]https://lore.kernel.org/netdev/20231216123001.1293639-1-jiri@resnulli.us/
> [17.a]https://netdevconf.info/0x13/session.html?talk-tc-u-classifier
> [17.b]man tc-u32
> [18]man tc-pedit
> [19]https://lore.kernel.org/netdev/20231219181623.3845083-6-victor@mojatatu.com/T/#m86e71743d1d83b728bb29d5b877797cb4942e835
> [20.a]https://netdevconf.info/0x16/sessions/talk/your-network-datapath-will-be-p4-scripted.html
> [20.b]https://netdevconf.info/0x16/sessions/workshop/p4tc-workshop.html
>
> --------
> HISTORY
> --------
>
> Changes in Version 14
> ----------------------
> 1) #UNDEF HWRITE/HREAD and remove unnecessary checks (Paolo)
> 2) Remove const cast added in v13 as a result of changes
>    suggested by Paolo (Marcelo)
> 3) Introduce type validate for s8 caught as a result of audit from #1
> 4) S/GFP_KERNEL/GFP_KERNEL_ACCOUNT for types and runtime objects (Paolo)
> 5) Syzkaller caught an invalid netlink attribute bug that has existed
>    since v5! As noted in patch0 we've been running syzkaller for months.
> 6) Add Marcelo's reviewed-by for patch 14 and Toke's ACK to the series.
>
> Changes in Version 13
> ----------------------
>
> 1) Remove ops->print() from p4 types (Paolo).
>
> 2) Use mutex instead of rwlock for dynamic actions since rwlock is
>    discouraged these days(Paolo).
>
> 3) Constify action init_ops() ops parameter (Paolo).
>
> 4) Use struct sk_buff in kfunc instead of struct __sk_buff (Martin)
>    Use struct xdp_buff in kfunc instead of struct xdp_md (Martin)
>
> 5) Replace BTF_SET8_START with BTF_KFUNCS_START and replace
>    BTF_SET8_END with BTF_KFUNCS_END (Martin)
>
> 6) Add params__sz argument to all kfuncs to guard against future change
>    to parameter structures being passed between bpf and tc. For kfunc
>    xdp/bpf_p4tc_entry_create() we already had the max(5) allowed number
>    of parameters. To work around this we had to merge two structs together
>    in order to maintain the number of params to 5 (Martin).
>
> 7) Add more info on commit log to explain the relation between the kfuncs
>    and TC for patch #14 (Martin).
>
> Changes in Version 12
> ----------------------
>
> 0) Introduce back 15 patches (v11 had 5)
>
> 1) From discussions with Daniel:
>    i) Remove the XDP programs association altogether. No refcounting. nothing.
>    ii) Remove prog type tc - everything is now an ebpf tc action.
>
> 2) s/PAD0/__pad0/g. Thanks to Marcelo.
>
> 3) Add extack to specify how many entries (N of M) specified in a batch for
>    any of requested Create/Update/Delete succeeded. Prior to this it would
>    only tell us the batch failed to complete without giving us details of
>    which of M failed. Added as a debug aid.
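
A small Python sketch of the "N of M" reporting idea in item 3. It assumes, purely for illustration, that a batch stops at the first failing entry and that creating a duplicate key is the failure; the message wording and the `apply_batch` helper are hypothetical and not the kernel's extack text.

```python
def apply_batch(table, entries):
    """Create entries in order; stop at the first failure and report N of M."""
    done = 0
    for key, action in entries:
        if key in table:  # duplicate key stands in for any per-entry failure
            break
        table[key] = action
        done += 1
    total = len(entries)
    return done == total, f"{done} of {total} entries created"


table = {}
first = apply_batch(table, [("10.1.1.2/32", "eno1"), ("10.1.10.2/32", "eno10")])
second = apply_batch(table, [("10.0.2.2/32", "eno2"),
                             ("10.1.1.2/32", "eno1"),   # already present
                             ("10.0.3.2/32", "eno3")])
```

With a count in the message, the control plane can tell how far a failed batch got instead of only learning that it did not complete.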
>
> Changes in Version 11
> ----------------------
> 1) Split the series into two. Original patches 1-5 in this patchset. The rest
>    will go out after this is merged.
>
> 2) Change any references of IFNAMSIZ in the action code when referencing the
>    action name size to ACTNAMSIZ. Thanks to Marcelo.
>
> Changes in Version 10
> ----------------------
> 1) A couple of patches from the earlier version were clean enough to submit,
>    so we did. This gave us room to split the two largest patches each into
>    two. Even though the split is not git-bisectable and really some of it didn't
>    make much sense (eg splitting a create, and update in one patch and delete and
>    get into another) we made sure each of the split patches compiled
>    independently. The idea is to reduce the number of lines of code to review
>    and when we get sufficient reviews we will put the splits together again.
>    See patch #12 and #13 as well as patches #7 and #8.
>
> 2) Add more context in patch 0. Please READ!
>
> 3) Added dump/delete filters back to the code - we had taken them out in the
>    earlier patches to reduce the amount of code for review - but in retrospect
>    we feel they are important enough to push earlier rather than later.
>
>
> Changes In version 9
> ---------------------
>
> 1) Remove the largest patch (externs) to ease review.
>
> 2) Break up action patches into two to ease review bringing down the patches
>    that need more scrutiny to 8 (the first 7 are almost trivial).
>
> 3) Fixup prefix naming convention to p4tc_xxx for uapi and p4a_xxx for actions
>    to provide consistency(Jiri).
>
> 4) Silence sparse warning "was not declared. Should it be static?" for kfuncs
>    by making them static. TBH, not sure if this is the right solution
>    but it makes sparse happy and hopefully someone will comment.
>
> Changes In Version 8
> ---------------------
>
> 1) Fix all the patchwork warnings and improve our ci to catch them in the future
>
> 2) Reduce the number of patches to basic max(15)  to ease review.
>
> Changes In Version 7
> -------------------------
>
> 0) First time removing the RFC tag!
>
> 1) Removed XDP cookie. It turns out as was pointed out by Toke(Thanks!) - that
> using bpf links was sufficient to protect us from someone replacing or deleting
> an eBPF program after it has been bound to a netdev.
>
> 2) Add some reviewed-bys from Vlad.
>
> 3) Small bug fixes from v6 based on testing for ebpf.
>
> 4) Added the counter extern as a sample extern. Illustrating this example because
>    it is slightly complex since it is possible to invoke it directly from
>    the P4TC domain (in case of direct counters) or from eBPF (indirect counters).
>    It is not exactly the most efficient implementation (a reasonable counter impl
>    should be per-cpu).
>
> Changes In RFC Version 6
> -------------------------
>
> 1) Completed integration from scriptable view to eBPF. Completed integration
>    of externs.
>
> 2) Small bug fixes from v5 based on testing.
>
> Changes In RFC Version 5
> -------------------------
>
> 1) More integration from scriptable view to eBPF. Small bug fixes from last
>    integration.
>
> 2) More streamlining support of externs via kfunc (create-on-miss, etc)
>
> 3) eBPF linking for XDP.
>
> There is more eBPF integration/streamlining coming (we are getting close to
> conversion from scriptable domain).
>
> Changes In RFC Version 4
> -------------------------
>
> 1) More integration from scriptable to eBPF. Small bug fixes.
>
> 2) More streamlining support of externs via kfunc (one additional kfunc).
>
> 3) Removed per-cpu scratchpad per Toke's suggestion and instead use XDP metadata.
>
> There is more eBPF integration coming. One thing we looked at but is not in this
> patchset but should be in the next is use of eBPF link in our loading (see
> "challenge #1" further below).
>
> Changes In RFC Version 3
> -------------------------
>
> These patches are still in a little bit of flux as we adjust to integrating
> eBPF. So there are small constructs that are used in V1 and 2 but no longer
> used in this version. We will make a V4 which will remove those.
> The changes from V2 are as follows:
>
> 1) Feedback we got in V2 is to try to stick to one of the two modes. In this version
> we are taking one more step and going the path of mode2 vs v2 where we had 2 modes.
>
> 2) The P4 Register extern is no longer standalone. Instead, as part of integrating
> into eBPF we introduce another kfunc which encapsulates Register as part of the
> extern interface.
>
> 3) We have improved our CICD to include tools pointed out to us by Simon. See
>    "Testing" further below. Thanks to Simon for that and other issues he caught.
>    Simon, we discussed issue [7] but decided to keep that log since we think
>    it is useful.
>
> 4) A lot of small cleanups. Thanks Marcelo. There are two things we need to
>    re-discuss though; see: [5], [6].
>
> 5) We removed the need for a range of IDs for dynamic actions. Thanks Jakub.
>
> 6) Clarify ambiguity caused by smatch in an if(A) else if(B) condition. We are
>    guaranteed that either A or B must exist; however, let's make smatch happy.
>    Thanks to Simon and Dan Carpenter.
>
> Changes In RFC Version 2
> -------------------------
>
> Version 2 is the initial integration of the eBPF datapath.
> We took into consideration suggestions provided to use eBPF and put effort into
> analyzing eBPF as a datapath, which involved extensive testing.
> We implemented 6 approaches with eBPF, ran performance analysis on each of the 6
> vs the scriptable P4TC, and presented our results at the P4 2023 workshop in
> Santa Clara [see: 1, 3]. We concluded that 2 of the approaches are sensible (4 if
> you account for XDP or TC separately).
>
> Conclusions from the exercise: We lose the simple operational model we had
> prior to integrating eBPF. We do gain performance in most cases when the
> datapath is less compute-bound.
> For more discussion on our requirements vs journeying the eBPF path please
> scroll down to "Restating Our Requirements" and "Challenges".
>
> This patch set presented two modes.
> mode1: the parser is entirely based on eBPF - whereas the rest of the
> SW datapath stays as _scriptable_ as in Version 1.
> mode2: All of the kernel s/w datapath (including parser) is in eBPF.
>
> The key ingredient for eBPF, that we did not have access to in the past, is
> kfunc (it made a big difference for us to reconsider eBPF).
>
> In V2 the two modes are mutually exclusive (IOW, you get to choose one
> or the other via Kconfig).
>
> Jamal Hadi Salim (15):
>   net: sched: act_api: Introduce P4 actions list
>   net/sched: act_api: increase action kind string length
>   net/sched: act_api: Update tc_action_ops to account for P4 actions
>   net/sched: act_api: add struct p4tc_action_ops as a parameter to
>     lookup callback
>   net: sched: act_api: Add support for preallocated P4 action instances
>   p4tc: add P4 data types
>   p4tc: add template API
>   p4tc: add template pipeline create, get, update, delete
>   p4tc: add template action create, update, delete, get, flush and dump
>   p4tc: add runtime action support
>   p4tc: add template table create, update, delete, get, flush and dump
>   p4tc: add runtime table entry create and update
>   p4tc: add runtime table entry get, delete, flush and dump
>   p4tc: add set of P4TC table kfuncs
>   p4tc: add P4 classifier
>
>  include/linux/bitops.h            |    1 +
>  include/net/act_api.h             |   23 +-
>  include/net/p4tc.h                |  714 +++++++
>  include/net/p4tc_types.h          |   89 +
>  include/net/tc_act/p4tc.h         |   79 +
>  include/uapi/linux/p4tc.h         |  465 +++++
>  include/uapi/linux/pkt_cls.h      |   15 +
>  include/uapi/linux/rtnetlink.h    |   18 +
>  include/uapi/linux/tc_act/tc_p4.h |   11 +
>  net/sched/Kconfig                 |   23 +
>  net/sched/Makefile                |    3 +
>  net/sched/act_api.c               |  192 +-
>  net/sched/cls_api.c               |    2 +-
>  net/sched/cls_p4.c                |  305 +++
>  net/sched/p4tc/Makefile           |    8 +
>  net/sched/p4tc/p4tc_action.c      | 2419 +++++++++++++++++++++++
>  net/sched/p4tc/p4tc_bpf.c         |  360 ++++
>  net/sched/p4tc/p4tc_filter.c      | 1012 ++++++++++
>  net/sched/p4tc/p4tc_pipeline.c    |  700 +++++++
>  net/sched/p4tc/p4tc_runtime_api.c |  145 ++
>  net/sched/p4tc/p4tc_table.c       | 1820 +++++++++++++++++
>  net/sched/p4tc/p4tc_tbl_entry.c   | 3071 +++++++++++++++++++++++++++++
>  net/sched/p4tc/p4tc_tmpl_api.c    |  440 +++++
>  net/sched/p4tc/p4tc_types.c       | 1213 ++++++++++++
>  net/sched/p4tc/trace.c            |   10 +
>  net/sched/p4tc/trace.h            |   44 +
>  security/selinux/nlmsgtab.c       |   10 +-
>  27 files changed, 13156 insertions(+), 36 deletions(-)
>  create mode 100644 include/net/p4tc.h
>  create mode 100644 include/net/p4tc_types.h
>  create mode 100644 include/net/tc_act/p4tc.h
>  create mode 100644 include/uapi/linux/p4tc.h
>  create mode 100644 include/uapi/linux/tc_act/tc_p4.h
>  create mode 100644 net/sched/cls_p4.c
>  create mode 100644 net/sched/p4tc/Makefile
>  create mode 100644 net/sched/p4tc/p4tc_action.c
>  create mode 100644 net/sched/p4tc/p4tc_bpf.c
>  create mode 100644 net/sched/p4tc/p4tc_filter.c
>  create mode 100644 net/sched/p4tc/p4tc_pipeline.c
>  create mode 100644 net/sched/p4tc/p4tc_runtime_api.c
>  create mode 100644 net/sched/p4tc/p4tc_table.c
>  create mode 100644 net/sched/p4tc/p4tc_tbl_entry.c
>  create mode 100644 net/sched/p4tc/p4tc_tmpl_api.c
>  create mode 100644 net/sched/p4tc/p4tc_types.c
>  create mode 100644 net/sched/p4tc/trace.c
>  create mode 100644 net/sched/p4tc/trace.h
>
> --
> 2.34.1
>
Marcelo Ricardo Leitner April 5, 2024, 1 p.m. UTC | #2
On Thu, Apr 04, 2024 at 08:44:29AM -0400, Jamal Hadi Salim wrote:
> On Thu, Apr 4, 2024 at 8:23 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> >
> > This is the first patchset of two. In this patchset we are submitting 15 patches
> > which cover the minimal viable P4 PNA architecture.
> > Please, if you want to discuss a slightly tangential subject like offload
> > or even your politics then start another thread with a different subject
> > line.  The way you do it is to change the subject line to for example
> > "<Your New Subject here> (WAS: <original subject line here>)".
> >
> > In this cover letter I am restoring text I took out in V10 which stated
> > "our requirements".
> >
> > Martin, please look at patch 14 again. The bpf selftests for kfuncs are
> > slotted for series 2. Paolo, please take a look at 1, 3, 6 for the changes
> > you suggested. Marcelo, because we made changes to patch 14, I have
> > removed your reviewed-by. Can you please take another look at that patch?
>
> Sorry, Marcelo - you already reviewed and we restored your reviewed-by.

Aye.

Cheers,
Marcelo