diff mbox series

[nf-next] netfilter: nf_tables: add ebpf expression

Message ID 20220831101617.22329-1-fw@strlen.de (mailing list archive)
State Awaiting Upstream
Delegated to: Netdev Maintainers
Headers show
Series [nf-next] netfilter: nf_tables: add ebpf expression | expand

Checks

Context Check Description
netdev/tree_selection success Guessed tree name to be net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix warning Target tree name not specified in the subject
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit fail Errors and warnings before: 841 this patch: 842
netdev/cc_maintainers warning 7 maintainers not CCed: pablo@netfilter.org edumazet@google.com pabeni@redhat.com coreteam@netfilter.org kuba@kernel.org kadlec@netfilter.org davem@davemloft.net
netdev/build_clang success Errors and warnings before: 227 this patch: 227
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 826 this patch: 827
netdev/checkpatch warning CHECK: Please use a blank line after function/struct/union/enum declarations WARNING: Missing or malformed SPDX-License-Identifier tag in line 1 WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns
netdev/kdoc fail Errors and warnings before: 162 this patch: 164
netdev/source_inline success Was 0 now: 0

Commit Message

Florian Westphal Aug. 31, 2022, 10:16 a.m. UTC
This expression is a native replacement for xtables 'bpf' match "pinned" mode.
Userspace needs to pass a file descriptor referencing the program (of socket
filter type).
Userspace should also pass the original pathname for that fd so userspace can
print the original filename again.

Tag and program id are dumped to userspace on 'list' to allow to see which
program is in use in case the filename isn't available/present.

cbpf bytecode isn't supported.

No new Kconfig option is added: Its included if BPF_SYSCALL is enabled.

Proposed nft userspace syntax is:

add rule ... ebpf pinned "/sys/fs/bpf/myprog"

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_tables_core.h   |   3 +
 include/uapi/linux/netfilter/nf_tables.h |  18 ++++
 net/netfilter/Makefile                   |   4 +
 net/netfilter/nf_tables_core.c           |   7 ++
 net/netfilter/nft_ebpf.c                 | 128 +++++++++++++++++++++++
 5 files changed, 160 insertions(+)
 create mode 100644 net/netfilter/nft_ebpf.c

Comments

Toke Høiland-Jørgensen Aug. 31, 2022, 12:13 p.m. UTC | #1
Florian Westphal <fw@strlen.de> writes:

> This expression is a native replacement for xtables 'bpf' match "pinned" mode.
> Userspace needs to pass a file descriptor referencing the program (of socket
> filter type).
> Userspace should also pass the original pathname for that fd so userspace can
> print the original filename again.
>
> Tag and program id are dumped to userspace on 'list' to allow to see which
> program is in use in case the filename isn't available/present.

It seems a bit odd to include the file path in the kernel as well. For
one thing, the same object can be pinned multiple times in different
paths (even in different mount namespaces), and there's also nothing
preventing a different program to have been substituted by the pinned
one by the time the value is echoed back.

Also, there's nothing checking that the path attribute actually contains
a path, so it's really just an arbitrary label that the kernel promises
to echo back. But doesn't NFT already have a per-rule comment feature,
so why add another specifically for BPF? Instead we could just teach the
userspace utility to extract metadata from the BPF program (based on the
ID) like bpftool does. This would include the program name, BTW, so it
does have a semantic identifier.

> cbpf bytecode isn't supported.
>
> No new Kconfig option is added: Its included if BPF_SYSCALL is enabled.
>
> Proposed nft userspace syntax is:
>
> add rule ... ebpf pinned "/sys/fs/bpf/myprog"

Any plan to also teach the nft binary to load a BPF program from an ELF
file (instead of relying on pinning)?

-Toke
Florian Westphal Aug. 31, 2022, 12:56 p.m. UTC | #2
Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> > Tag and program id are dumped to userspace on 'list' to allow to see which
> > program is in use in case the filename isn't available/present.
> 
> It seems a bit odd to include the file path in the kernel as well.

Its needed to be able to re-load the ruleset.

> For
> one thing, the same object can be pinned multiple times in different
> paths (even in different mount namespaces),

Sure.

> and there's also nothing
> preventing a different program to have been substituted by the pinned
> one by the time the value is echoed back.

Yes, but what would you expect it should do?

> Also, there's nothing checking that the path attribute actually contains
> a path, so it's really just an arbitrary label that the kernel promises
> to echo back

Yes exactly.

> But doesn't NFT already have a per-rule comment feature,
> so why add another specifically for BPF?

You can attach up to 256 bytes to a rule, yes.
Might not be enough for a longer path, and there could be multiple
expressions in the same rule.

This way was the most simple solution.

> Instead we could just teach the
> userspace utility to extract metadata from the BPF program (based on the
> ID) like bpftool does. This would include the program name, BTW, so it
> does have a semantic identifier.

Sure, I could change the grammar so it expects a tag or ID, e.g.
'ebpf id 42'

If thats preferred, I can change this, it avoids the need for storing
the name.

> > cbpf bytecode isn't supported.
> > add rule ... ebpf pinned "/sys/fs/bpf/myprog"
> 
> Any plan to also teach the nft binary to load a BPF program from an ELF
> file (instead of relying on pinning)?

I used pinning because that is what '-m bpf' uses.
Toke Høiland-Jørgensen Aug. 31, 2022, 1:41 p.m. UTC | #3
Florian Westphal <fw@strlen.de> writes:

> Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>> > Tag and program id are dumped to userspace on 'list' to allow to see which
>> > program is in use in case the filename isn't available/present.
>> 
>> It seems a bit odd to include the file path in the kernel as well.
>
> Its needed to be able to re-load the ruleset.

How does that work, exactly? Is this so that the userspace binary can
query the current ruleset, and feed it back to the kernel expecting it
to stay the same? Because in that case, if the pinned object goes away
in the meantime (or changes to a different program), this could lead to
some really hard to debug errors, where a reload subtly changes the
behaviour because the BPF program is not in fact the same.

Using IDs would avoid this ambiguity at least, so I think that's a
better solution. We'd have to make sure the BPF program is not released
completely until after the reload has finished, so that it doesn't
suddenly disappear.

>> But doesn't NFT already have a per-rule comment feature,
>> so why add another specifically for BPF?
>
> You can attach up to 256 bytes to a rule, yes.
> Might not be enough for a longer path, and there could be multiple
> expressions in the same rule.
>
> This way was the most simple solution.

My point here was more that if it's just a label for human consumption,
the comment field should be fine, didn't realise it was needed for the
tool operation (and see above re: that).

>> Instead we could just teach the
>> userspace utility to extract metadata from the BPF program (based on the
>> ID) like bpftool does. This would include the program name, BTW, so it
>> does have a semantic identifier.
>
> Sure, I could change the grammar so it expects a tag or ID, e.g.
> 'ebpf id 42'
>
> If thats preferred, I can change this, it avoids the need for storing
> the name.

I think for echoing back, just relying on the ID is better as that is at
least guaranteed to stay constant for the lifetime of the BPF program in
the kernel. We could still support the 'pinned <path>' syntax on the
command line so that the initial load could be done from a pinned file,
just as a user interface improvement...

>> > cbpf bytecode isn't supported.
>> > add rule ... ebpf pinned "/sys/fs/bpf/myprog"
>> 
>> Any plan to also teach the nft binary to load a BPF program from an ELF
>> file (instead of relying on pinning)?
>
> I used pinning because that is what '-m bpf' uses.

I'm not against supporting pinning, per se (except for the issues noted
above), but we could do multiple things, including supporting loading
the program from an object file. This is similar to how TC operates, for
instance...

-Toke
Florian Westphal Aug. 31, 2022, 1:44 p.m. UTC | #4
Florian Westphal <fw@strlen.de> wrote:
> Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> > > Tag and program id are dumped to userspace on 'list' to allow to see which
> > > program is in use in case the filename isn't available/present.
> > 
> > It seems a bit odd to include the file path in the kernel as well.
> 
> Its needed to be able to re-load the ruleset.

In particular, I can't find any better alternative.

load by id -> works, easy to echo back to userspace, but not stable
identifier across reboots or add/del operations of the program.

load by tag -> similar, except that this time the tag needs to be
adjusted whenever the program changes, so not ideal either.

load via ELF name -> same problems as the proposed 'pinned' mode, but perhaps a bit
easier to use?

It has the slight advantage that users don't need to load/pin the program first,
lifetime of the program would be tied to the nftables rule.

The downside is that nft needs to deal with possible rejection of the program
instead of 'outsourcing' this problem to bpftool (or another program).
Florian Westphal Aug. 31, 2022, 1:57 p.m. UTC | #5
Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >> It seems a bit odd to include the file path in the kernel as well.
> >
> > Its needed to be able to re-load the ruleset.
> 
> How does that work, exactly? Is this so that the userspace binary can
> query the current ruleset, and feed it back to the kernel expecting it
> to stay the same?

Yes.

> Because in that case, if the pinned object goes away
> in the meantime (or changes to a different program), this could lead to
> some really hard to debug errors, where a reload subtly changes the
> behaviour because the BPF program is not in fact the same.

Correct, but thats kind of expected when the user changes programs
logic.

Same with a 'nft list ruleset > /etc/nft.txt', reboot,
'nft -f /etc/nft.txt' fails because user forgot to load/pin the program
first.

> Using IDs would avoid this ambiguity at least, so I think that's a
> better solution. We'd have to make sure the BPF program is not released
> completely until after the reload has finished, so that it doesn't
> suddenly disappear.

This should be covered, the destructor runs after the ruleset has been
detached from the data plan (and after a synchronize_rcu).

> > This way was the most simple solution.
> 
> My point here was more that if it's just a label for human consumption,
> the comment field should be fine, didn't realise it was needed for the
> tool operation (and see above re: that).

Yes, this is unfortunate.  I would like to avoid introducing an
asymmetry between input and output (as in "... add rule ebpf pinned
bla', but 'nft list ruleset' showing 'ebpf id 42') or similar, UNLESS we
can somehow use that alternate output to reconstruct that was originally
intended.  And so far I can only see that happening with storing some
label in the kernel for userspace to consume (elf filename, pinned name,
program name ... ).

To give an example:

With 'ebpf id 42', we might be able to let this get echoed back as if
user would have said 'ebpf progname myfilter' (I am making this up!),
just to have a more 'stable' identifier.

This would make it necessary to also support load-by-program-name, of
course.

> > Sure, I could change the grammar so it expects a tag or ID, e.g.
> > 'ebpf id 42'
> >
> > If thats preferred, I can change this, it avoids the need for storing
> > the name.
> 
> I think for echoing back, just relying on the ID is better as that is at
> least guaranteed to stay constant for the lifetime of the BPF program in
> the kernel.

Yes, I realize that, this is why the id and tag are included in the
netlink dump, but on the userspace side this information is currently
hidden and only shown with --debug output.

> >> Any plan to also teach the nft binary to load a BPF program from an ELF
> >> file (instead of relying on pinning)?
> >
> > I used pinning because that is what '-m bpf' uses.
> 
> I'm not against supporting pinning, per se (except for the issues noted
> above),

Okay, thanks for clarifying.  -m bpf is a bit older so I was not sure if
pinning has been deprecated or something like that.

> But we could do multiple things, including supporting loading
> the program from an object file. This is similar to how TC operates, for
> instance...

Right, there is no need to restrict this to one method.
Toke Høiland-Jørgensen Aug. 31, 2022, 2:43 p.m. UTC | #6
Florian Westphal <fw@strlen.de> writes:

> Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>> >> It seems a bit odd to include the file path in the kernel as well.
>> >
>> > Its needed to be able to re-load the ruleset.
>> 
>> How does that work, exactly? Is this so that the userspace binary can
>> query the current ruleset, and feed it back to the kernel expecting it
>> to stay the same?
>
> Yes.
>
>> Because in that case, if the pinned object goes away
>> in the meantime (or changes to a different program), this could lead to
>> some really hard to debug errors, where a reload subtly changes the
>> behaviour because the BPF program is not in fact the same.
>
> Correct, but thats kind of expected when the user changes programs
> logic.
>
> Same with a 'nft list ruleset > /etc/nft.txt', reboot,
> 'nft -f /etc/nft.txt' fails because user forgot to load/pin the program
> first.

Right, so under what conditions is the identifier expected to survive,
exactly? It's okay if it fails after a reboot, but it should keep
working while the system is up?

>> > This way was the most simple solution.
>> 
>> My point here was more that if it's just a label for human consumption,
>> the comment field should be fine, didn't realise it was needed for the
>> tool operation (and see above re: that).
>
> Yes, this is unfortunate.  I would like to avoid introducing an
> asymmetry between input and output (as in "... add rule ebpf pinned
> bla', but 'nft list ruleset' showing 'ebpf id 42') or similar, UNLESS we
> can somehow use that alternate output to reconstruct that was originally
> intended.  And so far I can only see that happening with storing some
> label in the kernel for userspace to consume (elf filename, pinned name,
> program name ... ).
>
> To give an example:
>
> With 'ebpf id 42', we might be able to let this get echoed back as if
> user would have said 'ebpf progname myfilter' (I am making this up!),
> just to have a more 'stable' identifier.
>
> This would make it necessary to also support load-by-program-name, of
> course.

Seems like this kind of mapping can be done in userspace without
involving the kernel?

For example, the 'progname' thing could be implemented by defining an
nft-specific pinning location so that 'ebpf progname myfilter' is
equivalent to 'ebpf pinned /sys/bpf/nft/myfilter' and when nft receives
an ID from the kernel it goes looking in /sys/bpf/nft to see if it can
find the program with that ID and echoes it with the appropriate
progname if it does exist?

This could also be extended, so that if a user does '... add rule ebpf
file /usr/lib/bpf/myrule.o' the nft binary stashes the id -> .o file
mapping somewhere (in /run for instance) so that it can echo back where
it got it from later?

In either case I'm not really sure that there's much to be gained from
asking the kernel to store an additional label with the program rule?

-Toke
Pablo Neira Ayuso Aug. 31, 2022, 3:09 p.m. UTC | #7
On Wed, Aug 31, 2022 at 04:43:26PM +0200, Toke Høiland-Jørgensen wrote:
> Florian Westphal <fw@strlen.de> writes:
> 
> > Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >> >> It seems a bit odd to include the file path in the kernel as well.
> >> >
> >> > Its needed to be able to re-load the ruleset.
> >> 
> >> How does that work, exactly? Is this so that the userspace binary can
> >> query the current ruleset, and feed it back to the kernel expecting it
> >> to stay the same?
> >
> > Yes.
> >
> >> Because in that case, if the pinned object goes away
> >> in the meantime (or changes to a different program), this could lead to
> >> some really hard to debug errors, where a reload subtly changes the
> >> behaviour because the BPF program is not in fact the same.
> >
> > Correct, but thats kind of expected when the user changes programs
> > logic.
> >
> > Same with a 'nft list ruleset > /etc/nft.txt', reboot,
> > 'nft -f /etc/nft.txt' fails because user forgot to load/pin the program
> > first.
> 
> Right, so under what conditions is the identifier expected to survive,
> exactly? It's okay if it fails after a reboot, but it should keep
> working while the system is up?
> 
> >> > This way was the most simple solution.
> >> 
> >> My point here was more that if it's just a label for human consumption,
> >> the comment field should be fine, didn't realise it was needed for the
> >> tool operation (and see above re: that).
> >
> > Yes, this is unfortunate.  I would like to avoid introducing an
> > asymmetry between input and output (as in "... add rule ebpf pinned
> > bla', but 'nft list ruleset' showing 'ebpf id 42') or similar, UNLESS we
> > can somehow use that alternate output to reconstruct that was originally
> > intended.  And so far I can only see that happening with storing some
> > label in the kernel for userspace to consume (elf filename, pinned name,
> > program name ... ).
> >
> > To give an example:
> >
> > With 'ebpf id 42', we might be able to let this get echoed back as if
> > user would have said 'ebpf progname myfilter' (I am making this up!),
> > just to have a more 'stable' identifier.
> >
> > This would make it necessary to also support load-by-program-name, of
> > course.
> 
> Seems like this kind of mapping can be done in userspace without
> involving the kernel?
> 
> For example, the 'progname' thing could be implemented by defining an
> nft-specific pinning location so that 'ebpf progname myfilter' is
> equivalent to 'ebpf pinned /sys/bpf/nft/myfilter' and when nft receives
> an ID from the kernel it goes looking in /sys/bpf/nft to see if it can
> find the program with that ID and echoes it with the appropriate
> progname if it does exist?
> 
> This could also be extended, so that if a user does '... add rule ebpf
> file /usr/lib/bpf/myrule.o' the nft binary stashes the id -> .o file
> mapping somewhere (in /run for instance) so that it can echo back where
> it got it from later?
> 
> In either case I'm not really sure that there's much to be gained from
> asking the kernel to store an additional label with the program rule?

@Florian, could you probably use the object infrastructure to refer to
the program?

This might also allow you to refer to this new object type from
nf_tables maps.

It would be good to avoid linear rule-based matching to select what
program to run.

Maybe this also fits fine for your requirements?
Florian Westphal Aug. 31, 2022, 3:26 p.m. UTC | #8
Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> > Same with a 'nft list ruleset > /etc/nft.txt', reboot,
> > 'nft -f /etc/nft.txt' fails because user forgot to load/pin the program
> > first.
> 
> Right, so under what conditions is the identifier expected to survive,
> exactly? It's okay if it fails after a reboot, but it should keep
> working while the system is up?

Right, thats the question.  I think it boils down to 'least surprise',
which to me would mean useable labels are:

1. pinned name
2. elf filename
3. filter name

3) has the advantage that afaiu I can extend nft to use the dumped
id + program tag to query the name from the kernel, whereas 1+2 would
need to store the label.

1 and 2 have the upside that its easy to handle a 'file not found'
error.

> >> My point here was more that if it's just a label for human consumption,
> >> the comment field should be fine, didn't realise it was needed for the
> >> tool operation (and see above re: that).
> >
> > Yes, this is unfortunate.  I would like to avoid introducing an
> > asymmetry between input and output (as in "... add rule ebpf pinned
> > bla', but 'nft list ruleset' showing 'ebpf id 42') or similar, UNLESS we
> > can somehow use that alternate output to reconstruct that was originally
> > intended.  And so far I can only see that happening with storing some
> > label in the kernel for userspace to consume (elf filename, pinned name,
> > program name ... ).
> >
> > To give an example:
> >
> > With 'ebpf id 42', we might be able to let this get echoed back as if
> > user would have said 'ebpf progname myfilter' (I am making this up!),
> > just to have a more 'stable' identifier.
> >
> > This would make it necessary to also support load-by-program-name, of
> > course.
> 
> Seems like this kind of mapping can be done in userspace without
> involving the kernel?
>
> For example, the 'progname' thing could be implemented by defining an
> nft-specific pinning location so that 'ebpf progname myfilter' is
> equivalent to 'ebpf pinned /sys/bpf/nft/myfilter' and when nft receives
> an ID from the kernel it goes looking in /sys/bpf/nft to see if it can
> find the program with that ID and echoes it with the appropriate
> progname if it does exist?

Thats an interesting proposal.

I had not considered an nft specific namespace, just the raw pinned
filename.

> This could also be extended, so that if a user does '... add rule ebpf
> file /usr/lib/bpf/myrule.o' the nft binary stashes the id -> .o file
> mapping somewhere (in /run for instance) so that it can echo back where
> it got it from later?

Oh, I would prefer to abuse the kernel as a "database" directly for
that rather than keeping text files that have to be managed externally.

But, all things considered, what about this:

I'll respin, with the FILENAME attribute removed, so user says
'ebpf pinned bla', and on listing nft walks /sys/bpf/nft/ to see if
it can find the name again.

If it can't find it, print the id instead.

This would mean nft would also have to understand
'ebpf id 12' on input, but I think thats fine. We can document that
this is not the preferred method due to the difficulty of determining
the correct id to use.

With this, no 'extra' userspace-sake info needs to be stored.
We can revisit what do with 'ebpf file /bla/foo.o' once/if that gets
added.

What do you think?
Will take a while because I'll need to extend the nft side first to cope
with lack of 'FILENAME' attribute.
Florian Westphal Aug. 31, 2022, 3:35 p.m. UTC | #9
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > asking the kernel to store an additional label with the program rule?
> 
> @Florian, could you probably use the object infrastructure to refer to
> the program?

Yes, I would like to extend objref infra later once this is accepted.

> This might also allow you to refer to this new object type from
> nf_tables maps.

Yes, but first nft needs to be able to construct some meaningful output
again.  If we don't attach a specific label (such as filename), we need
to be able to reconstruct info based on what we can query via id/tag and
bpf syscall.

objref infra doesn't help here unless we'll force something like
'nft-defined-objref-name-must-match-elf-binary-name', and I find that
terrible.

> It would be good to avoid linear rule-based matching to select what
> program to run.

Hmmm, I did not consider it a huge deal, its an ebpf program so
users can dispatch to another program.

Objref is nice if the program to run should be selected from a criterion that isn't
readily available to a sk_filter program though.
Alexei Starovoitov Aug. 31, 2022, 3:39 p.m. UTC | #10
On Wed, Aug 31, 2022 at 8:31 AM Florian Westphal <fw@strlen.de> wrote:
>
> Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> > > Same with a 'nft list ruleset > /etc/nft.txt', reboot,
> > > 'nft -f /etc/nft.txt' fails because user forgot to load/pin the program
> > > first.
> >
> > Right, so under what conditions is the identifier expected to survive,
> > exactly? It's okay if it fails after a reboot, but it should keep
> > working while the system is up?
>
> Right, thats the question.  I think it boils down to 'least surprise',
> which to me would mean useable labels are:
>
> 1. pinned name
> 2. elf filename
> 3. filter name
>
> 3) has the advantage that afaiu I can extend nft to use the dumped
> id + program tag to query the name from the kernel, whereas 1+2 would
> need to store the label.
>
> 1 and 2 have the upside that its easy to handle a 'file not found'
> error.

I'm strongly against calling into bpf from the inner guts of nft.
Nack to all options discussed in this thread.
None of them make any sense.
Florian Westphal Aug. 31, 2022, 3:53 p.m. UTC | #11
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > 1 and 2 have the upside that its easy to handle a 'file not found'
> > error.
> 
> I'm strongly against calling into bpf from the inner guts of nft.
> Nack to all options discussed in this thread.
> None of them make any sense.

-v please.  I can just rework userspace to allow going via xt_bpf
but its brain damaged.

This helps gradually moving towards move epbf for those that
still heavily rely on the classic forwarding path.

If you are open to BPF_PROG_TYPE_NETFILTER I can go that route
as well, raw bpf program attachment via NF_HOOK and the bpf dispatcher,
but it will take significantly longer to get there.

It involves reviving
https://lore.kernel.org/netfilter-devel/20211014121046.29329-1-fw@strlen.de/

as a first stage/merge goal.
Alexei Starovoitov Aug. 31, 2022, 5:26 p.m. UTC | #12
On Wed, Aug 31, 2022 at 8:53 AM Florian Westphal <fw@strlen.de> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > > 1 and 2 have the upside that its easy to handle a 'file not found'
> > > error.
> >
> > I'm strongly against calling into bpf from the inner guts of nft.
> > Nack to all options discussed in this thread.
> > None of them make any sense.
>
> -v please.  I can just rework userspace to allow going via xt_bpf
> but its brain damaged.

Right. xt_bpf was a dead end from the start.
It's time to deprecate it and remove it.

> This helps gradually moving towards move epbf for those that
> still heavily rely on the classic forwarding path.

No one is using it.
If it was, we would have seen at least one bug report over
all these years. We've seen none.

tbh we had a fair share of wrong design decisions that look
very reasonable early on and turned out to be useless with
zero users.
BPF_PROG_TYPE_SCHED_ACT and BPF_PROG_TYPE_LWT*
are in this category.
All this code does is bit rot.
As a minimum we shouldn't step on the same rakes.
xt_ebpf would be the same dead code as xt_bpf.

> If you are open to BPF_PROG_TYPE_NETFILTER I can go that route
> as well, raw bpf program attachment via NF_HOOK and the bpf dispatcher,
> but it will take significantly longer to get there.
>
> It involves reviving
> https://lore.kernel.org/netfilter-devel/20211014121046.29329-1-fw@strlen.de/

I missed it earlier. What is the end goal ?
Optimize nft run-time with on the fly generation of bpf byte code ?
Pablo Neira Ayuso Aug. 31, 2022, 8:38 p.m. UTC | #13
On Wed, Aug 31, 2022 at 05:35:08PM +0200, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > asking the kernel to store an additional label with the program rule?
> > 
> > @Florian, could you probably use the object infrastructure to refer to
> > the program?
> 
> Yes, I would like to extend objref infra later once this is accepted.
> 
> > This might also allow you to refer to this new object type from
> > nf_tables maps.
> 
> Yes, but first nft needs to be able to construct some meaningful output
> again.  If we don't attach a specific label (such as filename), we need
> to be able to reconstruct info based on what we can query via id/tag and
> bpf syscall.
> 
> objref infra doesn't help here unless we'll force something like
> 'nft-defined-objref-name-must-match-elf-binary-name', and I find that
> terrible.

OK, you don't have to select such an ugly long name ;)

But I get your point: users need to declare explicitly the object.

> > It would be good to avoid linear rule-based matching to select what
> > program to run.
> 
> Hmmm, I did not consider it a huge deal, its an ebpf program so
> users can dispatch to another program.
> 
> Objref is nice if the program to run should be selected from a criterion that isn't
> readily available to a sk_filter program though.

You can also perform updates on the object without the need for
reloading your ruleset. And the declared object also allows for more
attributes to be added on it moving forward.

I think this approach would also allow you to maintain symmetry
between what you add and what it is shown in the listing?
Toke Høiland-Jørgensen Aug. 31, 2022, 8:44 p.m. UTC | #14
> But, all things considered, what about this:
>
> I'll respin, with the FILENAME attribute removed, so user says
> 'ebpf pinned bla', and on listing nft walks /sys/bpf/nft/ to see if
> it can find the name again.
>
> If it can't find it, print the id instead.
>
> This would mean nft would also have to understand
> 'ebpf id 12' on input, but I think thats fine. We can document that
> this is not the preferred method due to the difficulty of determining
> the correct id to use.
>
> With this, no 'extra' userspace-sake info needs to be stored.
> We can revisit what do with 'ebpf file /bla/foo.o' once/if that gets
> added.
>
> What do you think?
> Will take a while because I'll need to extend the nft side first to cope
> with lack of 'FILENAME' attribute.

To the extend it's still relevant, yeah, this seems like a reasonable
plan to me :)

-Toke
Daniel Borkmann Aug. 31, 2022, 9:49 p.m. UTC | #15
On 8/31/22 7:26 PM, Alexei Starovoitov wrote:
> On Wed, Aug 31, 2022 at 8:53 AM Florian Westphal <fw@strlen.de> wrote:
>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>> 1 and 2 have the upside that its easy to handle a 'file not found'
>>>> error.
>>>
>>> I'm strongly against calling into bpf from the inner guts of nft.
>>> Nack to all options discussed in this thread.
>>> None of them make any sense.
>>
>> -v please.  I can just rework userspace to allow going via xt_bpf
>> but its brain damaged.
> 
> Right. xt_bpf was a dead end from the start.
> It's time to deprecate it and remove it.
> 
>> This helps gradually moving towards move epbf for those that
>> still heavily rely on the classic forwarding path.
> 
> No one is using it.
> If it was, we would have seen at least one bug report over
> all these years. We've seen none.
> 
> tbh we had a fair share of wrong design decisions that look
> very reasonable early on and turned out to be useless with
> zero users.
> BPF_PROG_TYPE_SCHED_ACT and BPF_PROG_TYPE_LWT*
> are in this category. > All this code does is bit rot.

+1

> As a minimum we shouldn't step on the same rakes.
> xt_ebpf would be the same dead code as xt_bpf.

+1, and on top, the user experience will just be horrible. :(

>> If you are open to BPF_PROG_TYPE_NETFILTER I can go that route
>> as well, raw bpf program attachment via NF_HOOK and the bpf dispatcher,
>> but it will take significantly longer to get there.
>>
>> It involves reviving
>> https://lore.kernel.org/netfilter-devel/20211014121046.29329-1-fw@strlen.de/
> 
> I missed it earlier. What is the end goal ?
> Optimize nft run-time with on the fly generation of bpf byte code ?

Or rather to provide a pendant to nft given existence of xt_bpf, and the
latter will be removed at some point? (If so, can't we just deprecate the
old xt_bpf?)

Thanks,
Daniel
Florian Westphal Aug. 31, 2022, 9:57 p.m. UTC | #16
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > This helps gradually moving towards move epbf for those that
> > still heavily rely on the classic forwarding path.
> 
> No one is using it.
> If it was, we would have seen at least one bug report over
> all these years. We've seen none.

Err, it IS used, else I would not have sent this patch.

> very reasonable early on and turned out to be useless with
> zero users.
> BPF_PROG_TYPE_SCHED_ACT and BPF_PROG_TYPE_LWT*
> are in this category.

I doubt it had 0 users.  Those users probably moved to something
better?

> As a minimum we shouldn't step on the same rakes.
> xt_ebpf would be the same dead code as xt_bpf.

Its just 160 LOC or so, I don't see it has a huge technical debt.

> > If you are open to BPF_PROG_TYPE_NETFILTER I can go that route
> > as well, raw bpf program attachment via NF_HOOK and the bpf dispatcher,
> > but it will take significantly longer to get there.
> >
> > It involves reviving
> > https://lore.kernel.org/netfilter-devel/20211014121046.29329-1-fw@strlen.de/
> 
> I missed it earlier. What is the end goal ?

Immediate goal: get rid of all indirect calls from NF_HOOK()
invocations. Its about 2% speedup in my tests (with connection
tracking+defrag enabled).

This series changes prototype of the callbacks to int foo(struct *),
so I think it would be possible to build on this and allow attaching raw
bpf progs/implement what is now a netfilter kernel module as a bpf
program.

I have not spent time on this so far though, so I don't know yet
how the "please attach prog id 12345 at FORWARD with prio 42" should
be done.

> Optimize nft run-time with on the fly generation of bpf byte code ?

This could be done too, so far this JITs nf_hook_slow() only.
The big question for nft run-time would be how and where to do the JIT
translation.

I think that "nft run time jit" would be step 3, after allowing
(re)implementation of netfilter modules via bpf programs.
Eyal Birger Sept. 1, 2022, 5:18 a.m. UTC | #17
On Thu, Sep 1, 2022 at 1:16 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 8/31/22 7:26 PM, Alexei Starovoitov wrote:
> > On Wed, Aug 31, 2022 at 8:53 AM Florian Westphal <fw@strlen.de> wrote:
> >> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>>> 1 and 2 have the upside that its easy to handle a 'file not found'
> >>>> error.
> >>>
> >>> I'm strongly against calling into bpf from the inner guts of nft.
> >>> Nack to all options discussed in this thread.
> >>> None of them make any sense.
> >>
> >> -v please.  I can just rework userspace to allow going via xt_bpf
> >> but its brain damaged.
> >
> > Right. xt_bpf was a dead end from the start.
> > It's time to deprecate it and remove it.
> >
> >> This helps gradually moving towards move epbf for those that
> >> still heavily rely on the classic forwarding path.
> >
> > No one is using it.
> > If it was, we would have seen at least one bug report over
> > all these years. We've seen none.
> >
> > tbh we had a fair share of wrong design decisions that look
> > very reasonable early on and turned out to be useless with
> > zero users.
> > BPF_PROG_TYPE_SCHED_ACT and BPF_PROG_TYPE_LWT*
> > are in this category. > All this code does is bit rot.
>
> +1
>
> > As a minimum we shouldn't step on the same rakes.
> > xt_ebpf would be the same dead code as xt_bpf.
>
> +1, and on top, the user experience will just be horrible. :(
>
> >> If you are open to BPF_PROG_TYPE_NETFILTER I can go that route
> >> as well, raw bpf program attachment via NF_HOOK and the bpf dispatcher,
> >> but it will take significantly longer to get there.
> >>
> >> It involves reviving
> >> https://lore.kernel.org/netfilter-devel/20211014121046.29329-1-fw@strlen.de/
> >
> > I missed it earlier. What is the end goal ?
> > Optimize nft run-time with on the fly generation of bpf byte code ?
>
> Or rather to provide a pendant to nft given existence of xt_bpf, and the
> latter will be removed at some point? (If so, can't we just deprecate the
> old xt_bpf?)

FWIW we've been using both lwt bpf and xt_bpf on our production workloads
for a few years now.

xt_bpf allows us to apply custom sophisticated policy logic at connection
establishment - which is not really possible (or efficient) using
iptables/nft constructs - without needing to reinvent all the facilities that
nf provides like connection tracking, ALGs, and simple filtering.

As for lwt bpf, We use it for load balancing towards collect md tunnels.
While this can be done at tc egress for unfragmented packets, the lwt out hook -
when used in tandem with nf fragment reassembly - provides a hooking point
where a bpf program can see reassembled packets and load balance based on
their internals.

Eyal.
Jan Engelhardt Sept. 1, 2022, 8:08 a.m. UTC | #18
On Wednesday 2022-08-31 19:26, Alexei Starovoitov wrote:
>
>Right. xt_bpf was a dead end from the start.
>It's time to deprecate it and remove it.

So does that extend to xt_u32, which is like a subset of BPF?
Florian Westphal Sept. 1, 2022, 10:14 a.m. UTC | #19
Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 8/31/22 7:26 PM, Alexei Starovoitov wrote:
> > On Wed, Aug 31, 2022 at 8:53 AM Florian Westphal <fw@strlen.de> wrote:
> > As a minimum we shouldn't step on the same rakes.
> > xt_ebpf would be the same dead code as xt_bpf.
> 
> +1, and on top, the user experience will just be horrible. :(

Compared to what?

> > > If you are open to BPF_PROG_TYPE_NETFILTER I can go that route
> > > as well, raw bpf program attachment via NF_HOOK and the bpf dispatcher,
> > > but it will take significantly longer to get there.
> > > 
> > > It involves reviving
> > > https://lore.kernel.org/netfilter-devel/20211014121046.29329-1-fw@strlen.de/
> > 
> > I missed it earlier. What is the end goal ?
> > Optimize nft run-time with on the fly generation of bpf byte code ?
> 
> Or rather to provide a pendant to nft given existence of xt_bpf, and the
> latter will be removed at some point? (If so, can't we just deprecate the
> old xt_bpf?)

See my reply to Alexey, immediate goal was to get rid of the indirect
calls by providing a tailored/jitted equivalent of nf_hook_slow().

The next step could be to allow implementation of netfilter hooks
(i.e., kernel modules that call nf_register_net_hook()) in bpf
but AFAIU it requires addition of BPF_PROG_TYPE_NETFILTER etc.

After that, yes, one could think about how to jit nft_do_chain() and
all the rest of the nft machinery.
Alexei Starovoitov Sept. 2, 2022, 4:53 p.m. UTC | #20
On Wed, Aug 31, 2022 at 10:18 PM Eyal Birger <eyal.birger@gmail.com> wrote:
>
> On Thu, Sep 1, 2022 at 1:16 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >
> > On 8/31/22 7:26 PM, Alexei Starovoitov wrote:
> > > On Wed, Aug 31, 2022 at 8:53 AM Florian Westphal <fw@strlen.de> wrote:
> > >> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > >>>> 1 and 2 have the upside that its easy to handle a 'file not found'
> > >>>> error.
> > >>>
> > >>> I'm strongly against calling into bpf from the inner guts of nft.
> > >>> Nack to all options discussed in this thread.
> > >>> None of them make any sense.
> > >>
> > >> -v please.  I can just rework userspace to allow going via xt_bpf
> > >> but its brain damaged.
> > >
> > > Right. xt_bpf was a dead end from the start.
> > > It's time to deprecate it and remove it.
> > >
> > >> This helps gradually moving towards move epbf for those that
> > >> still heavily rely on the classic forwarding path.
> > >
> > > No one is using it.
> > > If it was, we would have seen at least one bug report over
> > > all these years. We've seen none.
> > >
> > > tbh we had a fair share of wrong design decisions that look
> > > very reasonable early on and turned out to be useless with
> > > zero users.
> > > BPF_PROG_TYPE_SCHED_ACT and BPF_PROG_TYPE_LWT*
> > > are in this category. > All this code does is bit rot.
> >
> > +1
> >
> > > As a minimum we shouldn't step on the same rakes.
> > > xt_ebpf would be the same dead code as xt_bpf.
> >
> > +1, and on top, the user experience will just be horrible. :(
> >
> > >> If you are open to BPF_PROG_TYPE_NETFILTER I can go that route
> > >> as well, raw bpf program attachment via NF_HOOK and the bpf dispatcher,
> > >> but it will take significantly longer to get there.
> > >>
> > >> It involves reviving
> > >> https://lore.kernel.org/netfilter-devel/20211014121046.29329-1-fw@strlen.de/
> > >
> > > I missed it earlier. What is the end goal ?
> > > Optimize nft run-time with on the fly generation of bpf byte code ?
> >
> > Or rather to provide a pendant to nft given existence of xt_bpf, and the
> > latter will be removed at some point? (If so, can't we just deprecate the
> > old xt_bpf?)
>
> FWIW we've been using both lwt bpf and xt_bpf on our production workloads
> for a few years now.
>
> xt_bpf allows us to apply custom sophisticated policy logic at connection
> establishment - which is not really possible (or efficient) using
> iptables/nft constructs - without needing to reinvent all the facilities that
> nf provides like connection tracking, ALGs, and simple filtering.
>
> As for lwt bpf, We use it for load balancing towards collect md tunnels.
> While this can be done at tc egress for unfragmented packets, the lwt out hook -
> when used in tandem with nf fragment reassembly - provides a hooking point
> where a bpf program can see reassembled packets and load balance based on
> their internals.

Sounds very interesting!
Any open source code to look at ?
Alexei Starovoitov Sept. 2, 2022, 5:06 p.m. UTC | #21
On Thu, Sep 1, 2022 at 3:14 AM Florian Westphal <fw@strlen.de> wrote:
>
> Daniel Borkmann <daniel@iogearbox.net> wrote:
> > On 8/31/22 7:26 PM, Alexei Starovoitov wrote:
> > > On Wed, Aug 31, 2022 at 8:53 AM Florian Westphal <fw@strlen.de> wrote:
> > > As a minimum we shouldn't step on the same rakes.
> > > xt_ebpf would be the same dead code as xt_bpf.
> >
> > +1, and on top, the user experience will just be horrible. :(
>
> Compared to what?
>
> > > > If you are open to BPF_PROG_TYPE_NETFILTER I can go that route
> > > > as well, raw bpf program attachment via NF_HOOK and the bpf dispatcher,
> > > > but it will take significantly longer to get there.
> > > >
> > > > It involves reviving
> > > > https://lore.kernel.org/netfilter-devel/20211014121046.29329-1-fw@strlen.de/
> > >
> > > I missed it earlier. What is the end goal ?
> > > Optimize nft run-time with on the fly generation of bpf byte code ?
> >
> > Or rather to provide a pendant to nft given existence of xt_bpf, and the
> > latter will be removed at some point? (If so, can't we just deprecate the
> > old xt_bpf?)
>
> See my reply to Alexey, immediate goal was to get rid of the indirect
> calls by providing a tailored/jitted equivalent of nf_hook_slow().
>
> The next step could be to allow implementation of netfilter hooks
> (i.e., kernel modules that call nf_register_net_hook()) in bpf
> but AFAIU it requires addition of BPF_PROG_TYPE_NETFILTER etc.

We were adding new prog and maps types in the past.
Now new features are being added differently.
All of the networking either works with sk_buff-s or xdp frames.
We try hard not to add any new uapi helpers.
Everything is moving to kfuncs.
Other sub-systems should be able to use bpf without touching
the bpf core. See hid-bpf as an example.
It needs several verifier improvements, but doesn't need
new prog types, helpers, etc.

> After that, yes, one could think about how to jit nft_do_chain() and
> all the rest of the nft machinery.

Sounds like a ton of work. All that just to accelerate nft a bit?
I think there are more impactful projects to work on.
For example, accelerating classic iptables with bpf would immediately
help a bunch of users.
Florian Westphal Sept. 2, 2022, 5:52 p.m. UTC | #22
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > See my reply to Alexey, immediate goal was to get rid of the indirect
> > calls by providing a tailored/jitted equivalent of nf_hook_slow().
> >
> > The next step could be to allow implementation of netfilter hooks
> > (i.e., kernel modules that call nf_register_net_hook()) in bpf
> > but AFAIU it requires addition of BPF_PROG_TYPE_NETFILTER etc.
> 
> We were adding new prog and maps types in the past.
> Now new features are being added differently.
> All of the networking either works with sk_buff-s or xdp frames.
> We try hard not to add any new uapi helpers.
> Everything is moving to kfuncs.
> Other sub-systems should be able to use bpf without touching
> the bpf core. See hid-bpf as an example.
> It needs several verifier improvements, but doesn't need
> new prog types, helpers, etc.

I don't see how it can be done without a new prog type, the bpf progs
would need access to "nf_hook_state" struct, passed as argument
to nf_hook_slow() (and down to the individual xt_foo modules...).

We can't change the existing netfilter hook prototype to go by
sk_buff * as that doesn't have all information, most prominent are
the input and output net_device, but also okfn is needed for async
reinject (nf_queue), the hook location and so on.

> > After that, yes, one could think about how to jit nft_do_chain() and
> > all the rest of the nft machinery.
> 
> Sounds like a ton of work. All that just to accelerate nft a bit?
> I think there are more impactful projects to work on.
> For example, accelerating classic iptables with bpf would immediately
> help a bunch of users.

Maybe, but from the problem points and the required effort it doesn't matter
if the chosen target is iptables or nftables; as far as the time/effort
needed I'd say they are identical.

The hard issues that need to be solved first are the same; they reside
in the netfilter core and not in the specific interpreter (nft_do_chain
vs. ipt_do_table and friends).

nf_tables might be *slightly* easier once that point would be reached
because the core functionality is more integrated with nf_tables whereas
in iptables there is more copypastry (ipt_do_table, ip6t_do_table,
ebt_do_table, ...).
Eyal Birger Sept. 5, 2022, 5:50 p.m. UTC | #23
On Fri, Sep 2, 2022 at 7:53 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Aug 31, 2022 at 10:18 PM Eyal Birger <eyal.birger@gmail.com> wrote:
> >
> > On Thu, Sep 1, 2022 at 1:16 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > >
> > > On 8/31/22 7:26 PM, Alexei Starovoitov wrote:
> > > > On Wed, Aug 31, 2022 at 8:53 AM Florian Westphal <fw@strlen.de> wrote:
> > > >> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > > >>>> 1 and 2 have the upside that its easy to handle a 'file not found'
> > > >>>> error.
> > > >>>
> > > >>> I'm strongly against calling into bpf from the inner guts of nft.
> > > >>> Nack to all options discussed in this thread.
> > > >>> None of them make any sense.
> > > >>
> > > >> -v please.  I can just rework userspace to allow going via xt_bpf
> > > >> but its brain damaged.
> > > >
> > > > Right. xt_bpf was a dead end from the start.
> > > > It's time to deprecate it and remove it.
> > > >
> > > >> This helps gradually moving towards move epbf for those that
> > > >> still heavily rely on the classic forwarding path.
> > > >
> > > > No one is using it.
> > > > If it was, we would have seen at least one bug report over
> > > > all these years. We've seen none.
> > > >
> > > > tbh we had a fair share of wrong design decisions that look
> > > > very reasonable early on and turned out to be useless with
> > > > zero users.
> > > > BPF_PROG_TYPE_SCHED_ACT and BPF_PROG_TYPE_LWT*
> > > > are in this category. > All this code does is bit rot.
> > >
> > > +1
> > >
> > > > As a minimum we shouldn't step on the same rakes.
> > > > xt_ebpf would be the same dead code as xt_bpf.
> > >
> > > +1, and on top, the user experience will just be horrible. :(
> > >
> > > >> If you are open to BPF_PROG_TYPE_NETFILTER I can go that route
> > > >> as well, raw bpf program attachment via NF_HOOK and the bpf dispatcher,
> > > >> but it will take significantly longer to get there.
> > > >>
> > > >> It involves reviving
> > > >> https://lore.kernel.org/netfilter-devel/20211014121046.29329-1-fw@strlen.de/
> > > >
> > > > I missed it earlier. What is the end goal ?
> > > > Optimize nft run-time with on the fly generation of bpf byte code ?
> > >
> > > Or rather to provide a pendant to nft given existence of xt_bpf, and the
> > > latter will be removed at some point? (If so, can't we just deprecate the
> > > old xt_bpf?)
> >
> > FWIW we've been using both lwt bpf and xt_bpf on our production workloads
> > for a few years now.
> >
> > xt_bpf allows us to apply custom sophisticated policy logic at connection
> > establishment - which is not really possible (or efficient) using
> > iptables/nft constructs - without needing to reinvent all the facilities that
> > nf provides like connection tracking, ALGs, and simple filtering.
> >
> > As for lwt bpf, We use it for load balancing towards collect md tunnels.
> > While this can be done at tc egress for unfragmented packets, the lwt out hook -
> > when used in tandem with nf fragment reassembly - provides a hooking point
> > where a bpf program can see reassembled packets and load balance based on
> > their internals.
>
> Sounds very interesting!
> Any open source code to look at ?

For these projects there isn't at this point. But some of the benefit in these
specific hooking points is that our custom logic is very scoped and integrates
well with the "classical" forwarding path.

In netfilter we have an identity based policy engine provisioning sets of
bpf maps. These maps are use used by policy programs invoked by xt_bpf on
connection establishment as part of a larger set of iptables rules.

In LWT this solved us a problem with fragmented traffic, as our load
balancing solution supports - among other things - IPsec stickiness based
on ESP-in-UDP SPI and as such needs to see unfragemented traffic.

Eyal.
Nicolas Dichtel Sept. 6, 2022, 6:57 a.m. UTC | #24
Le 31/08/2022 à 23:57, Florian Westphal a écrit :
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>> This helps gradually moving towards move epbf for those that
>>> still heavily rely on the classic forwarding path.
>>
>> No one is using it.
>> If it was, we would have seen at least one bug report over
>> all these years. We've seen none.
> 
> Err, it IS used, else I would not have sent this patch.
> 
>> very reasonable early on and turned out to be useless with
>> zero users.
>> BPF_PROG_TYPE_SCHED_ACT and BPF_PROG_TYPE_LWT*
>> are in this category.
> 
> I doubt it had 0 users.  Those users probably moved to something
> better?
We are using BPF_PROG_TYPE_SCHED_ACT to perform custom encapsulations.
What could we used to replace that?
Alexei Starovoitov Sept. 7, 2022, 3:04 a.m. UTC | #25
On Mon, Sep 5, 2022 at 11:57 PM Nicolas Dichtel
<nicolas.dichtel@6wind.com> wrote:
>
>
> Le 31/08/2022 à 23:57, Florian Westphal a écrit :
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>> This helps gradually moving towards move epbf for those that
> >>> still heavily rely on the classic forwarding path.
> >>
> >> No one is using it.
> >> If it was, we would have seen at least one bug report over
> >> all these years. We've seen none.
> >
> > Err, it IS used, else I would not have sent this patch.
> >
> >> very reasonable early on and turned out to be useless with
> >> zero users.
> >> BPF_PROG_TYPE_SCHED_ACT and BPF_PROG_TYPE_LWT*
> >> are in this category.
> >
> > I doubt it had 0 users.  Those users probably moved to something
> > better?
> We are using BPF_PROG_TYPE_SCHED_ACT to perform custom encapsulations.
> What could we used to replace that?

SCHED_CLS. It has all of the features of cls and act combined.
Nicolas Dichtel Sept. 7, 2022, 3:52 p.m. UTC | #26
Le 07/09/2022 à 05:04, Alexei Starovoitov a écrit :
> On Mon, Sep 5, 2022 at 11:57 PM Nicolas Dichtel
> <nicolas.dichtel@6wind.com> wrote:
>>
>>
>> Le 31/08/2022 à 23:57, Florian Westphal a écrit :
>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>>> This helps gradually moving towards move epbf for those that
>>>>> still heavily rely on the classic forwarding path.
>>>>
>>>> No one is using it.
>>>> If it was, we would have seen at least one bug report over
>>>> all these years. We've seen none.
>>>
>>> Err, it IS used, else I would not have sent this patch.
>>>
>>>> very reasonable early on and turned out to be useless with
>>>> zero users.
>>>> BPF_PROG_TYPE_SCHED_ACT and BPF_PROG_TYPE_LWT*
>>>> are in this category.
>>>
>>> I doubt it had 0 users.  Those users probably moved to something
>>> better?
>> We are using BPF_PROG_TYPE_SCHED_ACT to perform custom encapsulations.
>> What could we used to replace that?
> 
> SCHED_CLS. It has all of the features of cls and act combined.

Indeed, thanks.
diff mbox series

Patch

diff --git a/include/net/netfilter/nf_tables_core.h b/include/net/netfilter/nf_tables_core.h
index 1223af68cd9a..72ee4a6e2952 100644
--- a/include/net/netfilter/nf_tables_core.h
+++ b/include/net/netfilter/nf_tables_core.h
@@ -18,6 +18,7 @@  extern struct nft_expr_type nft_meta_type;
 extern struct nft_expr_type nft_rt_type;
 extern struct nft_expr_type nft_exthdr_type;
 extern struct nft_expr_type nft_last_type;
+extern struct nft_expr_type nft_ebpf_type;
 
 #ifdef CONFIG_NETWORK_SECMARK
 extern struct nft_object_type nft_secmark_obj_type;
@@ -148,4 +149,6 @@  void nft_rt_get_eval(const struct nft_expr *expr,
 		     struct nft_regs *regs, const struct nft_pktinfo *pkt);
 void nft_counter_eval(const struct nft_expr *expr, struct nft_regs *regs,
                       const struct nft_pktinfo *pkt);
+void nft_ebpf_eval(const struct nft_expr *expr, struct nft_regs *regs,
+		   const struct nft_pktinfo *pkt);
 #endif /* _NET_NF_TABLES_CORE_H */
diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
index 466fd3f4447c..39e9442e8c2a 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -805,6 +805,24 @@  enum nft_payload_attributes {
 };
 #define NFTA_PAYLOAD_MAX	(__NFTA_PAYLOAD_MAX - 1)
 
+/**
+ * enum nft_ebpf_attributes - nf_tables ebpf expression netlink attributes
+ *
+ * @NFTA_EBPF_FD: file descriptor holding ebpf program (NLA_U32)
+ * @NFTA_EBPF_FILENAME: file name, only for storage/printing (NLA_STRING)
+ * @NFTA_EBPF_ID: bpf program id (NLA_U32)
+ * @NFTA_EBPF_TAG: bpf tag (NLA_BINARY)
+ */
+enum nft_ebpf_attributes {
+	NFTA_EBPF_UNSPEC,
+	NFTA_EBPF_FD,
+	NFTA_EBPF_FILENAME,
+	NFTA_EBPF_ID,
+	NFTA_EBPF_TAG,
+	__NFTA_EBPF_MAX,
+};
+#define NFTA_EBPF_MAX	(__NFTA_EBPF_MAX - 1)
+
 enum nft_exthdr_flags {
 	NFT_EXTHDR_F_PRESENT = (1 << 0),
 };
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 06df49ea6329..f335a1ea26b9 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -90,6 +90,10 @@  nf_tables-objs += nft_set_pipapo_avx2.o
 endif
 endif
 
+ifdef CONFIG_BPF_SYSCALL
+nf_tables-objs += nft_ebpf.o
+endif
+
 obj-$(CONFIG_NF_TABLES)		+= nf_tables.o
 obj-$(CONFIG_NFT_COMPAT)	+= nft_compat.o
 obj-$(CONFIG_NFT_CONNLIMIT)	+= nft_connlimit.o
diff --git a/net/netfilter/nf_tables_core.c b/net/netfilter/nf_tables_core.c
index cee3e4e905ec..f33064959f6c 100644
--- a/net/netfilter/nf_tables_core.c
+++ b/net/netfilter/nf_tables_core.c
@@ -8,6 +8,7 @@ 
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/init.h>
+#include <linux/filter.h>
 #include <linux/list.h>
 #include <linux/rculist.h>
 #include <linux/skbuff.h>
@@ -209,6 +210,9 @@  static void expr_call_ops_eval(const struct nft_expr *expr,
 	X(e, nft_dynset_eval);
 	X(e, nft_rt_get_eval);
 	X(e, nft_bitwise_eval);
+#ifdef CONFIG_BPF_SYSCALL
+	X(e, nft_ebpf_eval);
+#endif
 #undef  X
 #endif /* CONFIG_RETPOLINE */
 	expr->ops->eval(expr, regs, pkt);
@@ -340,6 +344,9 @@  static struct nft_expr_type *nft_basic_types[] = {
 	&nft_exthdr_type,
 	&nft_last_type,
 	&nft_counter_type,
+#ifdef CONFIG_BPF_SYSCALL
+	&nft_ebpf_type,
+#endif
 };
 
 static struct nft_object_type *nft_basic_objects[] = {
diff --git a/net/netfilter/nft_ebpf.c b/net/netfilter/nft_ebpf.c
new file mode 100644
index 000000000000..f4945f4e7bc5
--- /dev/null
+++ b/net/netfilter/nft_ebpf.c
@@ -0,0 +1,128 @@ 
+#include <linux/bpf.h>
+#include <linux/filter.h>
+
+#include <net/netfilter/nf_tables.h>
+#include <net/netfilter/nf_tables_core.h>
+
+struct nft_ebpf {
+	struct bpf_prog *prog;
+	const char *name;
+};
+
+static const struct nla_policy nft_ebpf_policy[NFTA_EBPF_MAX + 1] = {
+	[NFTA_EBPF_FD] = { .type = NLA_U32 },
+	[NFTA_EBPF_FILENAME] = { .type = NLA_STRING,
+				 .len = PATH_MAX },
+	[NFTA_EBPF_ID] = { .type = NLA_U32 },
+	[NFTA_EBPF_TAG] = NLA_POLICY_EXACT_LEN(BPF_TAG_SIZE),
+};
+
+void nft_ebpf_eval(const struct nft_expr *expr, struct nft_regs *regs,
+		   const struct nft_pktinfo *pkt)
+{
+	const struct nft_ebpf *priv = nft_expr_priv(expr);
+
+	if (!bpf_prog_run_save_cb(priv->prog, pkt->skb))
+		regs->verdict.code = NFT_BREAK;
+}
+
+static int nft_ebpf_init(const struct nft_ctx *ctx,
+			 const struct nft_expr *expr,
+			 const struct nlattr * const tb[])
+{
+	struct nft_ebpf *priv = nft_expr_priv(expr);
+	struct bpf_prog *prog;
+	int fd;
+
+	if (!tb[NFTA_EBPF_FD])
+		return -EINVAL;
+
+	if (tb[NFTA_EBPF_ID] || tb[NFTA_EBPF_TAG])
+		return -EOPNOTSUPP;
+
+	fd = ntohl(nla_get_u32(tb[NFTA_EBPF_FD]));
+	if (fd < 0)
+		return -EBADF;
+
+	if (tb[NFTA_EBPF_FILENAME]) {
+		priv->name = nla_strdup(tb[NFTA_EBPF_FILENAME], GFP_KERNEL);
+		if (!priv->name)
+			return -ENOMEM;
+	}
+
+	prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER);
+	if (IS_ERR(prog)) {
+		kfree(priv->name);
+		return PTR_ERR(prog);
+	}
+
+	priv->prog = prog;
+	return 0;
+}
+
+static void nft_ebpf_destroy(const struct nft_ctx *ctx,
+			     const struct nft_expr *expr)
+{
+	struct nft_ebpf *priv = nft_expr_priv(expr);
+
+	bpf_prog_destroy(priv->prog);
+	kfree(priv->name);
+}
+
+static int nft_ebpf_validate(const struct nft_ctx *ctx,
+			     const struct nft_expr *expr,
+			     const struct nft_data **data)
+{
+	static const unsigned int supported_hooks = ((1 << NF_INET_PRE_ROUTING) |
+						     (1 << NF_INET_LOCAL_IN) |
+						     (1 << NF_INET_FORWARD) |
+						     (1 << NF_INET_LOCAL_OUT) |
+						     (1 << NF_INET_POST_ROUTING));
+
+	switch (ctx->family) {
+	case NFPROTO_IPV4:
+	case NFPROTO_IPV6:
+	case NFPROTO_INET:
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	return nft_chain_validate_hooks(ctx->chain, supported_hooks);
+}
+
+static int nft_ebpf_dump(struct sk_buff *skb, const struct nft_expr *expr)
+{
+	const struct nft_ebpf *priv = nft_expr_priv(expr);
+	const struct bpf_prog *prog = priv->prog;
+
+	if (nla_put_be32(skb, NFTA_EBPF_ID, htonl(prog->aux->id)))
+		return -1;
+
+	if (nla_put(skb, NFTA_EBPF_TAG, sizeof(prog->tag), prog->tag))
+		return -1;
+
+	if (priv->name && nla_put_string(skb, NFTA_EBPF_FILENAME, priv->name))
+		return -1;
+
+	return 0;
+}
+
+static const struct nft_expr_ops nft_ebpf_ops = {
+	.type		= &nft_ebpf_type,
+	.size		= NFT_EXPR_SIZE(sizeof(struct nft_ebpf)),
+	.init		= nft_ebpf_init,
+	.eval		= nft_ebpf_eval,
+	.reduce		= NFT_REDUCE_READONLY,
+	.destroy	= nft_ebpf_destroy,
+	.dump		= nft_ebpf_dump,
+	.validate	= nft_ebpf_validate,
+};
+
+struct nft_expr_type nft_ebpf_type __read_mostly = {
+	.name		= "ebpf",
+	.ops		= &nft_ebpf_ops,
+	.policy		= nft_ebpf_policy,
+	.maxattr	= NFTA_EBPF_MAX,
+	.owner		= THIS_MODULE,
+};