diff mbox series

[bpf-next,v7,1/3] bpf: Add "live packet" mode for XDP in bpf_prog_run()

Message ID 20220107215438.321922-2-toke@redhat.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Headers show
Series Add support for transmitting packets using XDP in bpf_prog_run() | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for bpf-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1797 this patch: 1797
netdev/cc_maintainers warning 1 maintainers not CCed: joe@cilium.io
netdev/build_clang success Errors and warnings before: 212 this patch: 212
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1811 this patch: 1811
netdev/checkpatch warning CHECK: Prefer using the BIT macro WARNING: line length of 81 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next success VM_Test
bpf/vmtest-bpf-next-PR success PR summary

Commit Message

Toke Høiland-Jørgensen Jan. 7, 2022, 9:54 p.m. UTC
This adds support for running XDP programs through bpf_prog_run() in a mode
that enables live packet processing of the resulting frames. Previous uses
of bpf_prog_run() for XDP returned the XDP program return code and the
modified packet data to userspace, which is useful for unit testing of XDP
programs.

The existing bpf_prog_run() for XDP allows userspace to set the ingress
ifindex and RXQ number as part of the context object being passed to the
kernel. This patch reuses that code, but adds a new mode with different
semantics, which can be selected with the new BPF_F_TEST_XDP_LIVE_FRAMES
flag.

When running bpf_prog_run() in this mode, the XDP program return codes will
be honoured: returning XDP_PASS will result in the frame being injected
into the networking stack as if it came from the selected networking
interface, while returning XDP_TX and XDP_REDIRECT will result in the frame
being transmitted out that interface. XDP_TX is translated into an
XDP_REDIRECT operation to the same interface, since the real XDP_TX action
is only possible from within the network drivers themselves, not from the
process context where bpf_prog_run() is executed.

Internally, this new mode of operation creates a page pool instance while
setting up the test run, and feeds pages from that into the XDP program.
The setup cost of this is amortised over the number of repetitions
specified by userspace.

To support the performance testing use case, we further optimise the setup
step so that all pages in the pool are pre-initialised with the packet
data, and pre-computed context and xdp_frame objects stored at the start of
each page. This makes it possible to entirely avoid touching the page
content on each XDP program invocation, and enables sending up to 12
Mpps/core on my test box.

Because the data pages are recycled by the page pool, and the test runner
doesn't re-initialise them for each run, subsequent invocations of the XDP
program will see the packet data in the state it was after the last time it
ran on that particular page. This means that an XDP program that modifies
the packet before redirecting it has to be careful about which assumptions
it makes about the packet content, but that is only an issue for the most
naively written programs.

Enabling the new flag is only allowed when not setting ctx_out and data_out
in the test specification, since using it means frames will be redirected
somewhere else, so they can't be returned.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 include/uapi/linux/bpf.h       |   2 +
 kernel/bpf/Kconfig             |   1 +
 net/bpf/test_run.c             | 290 ++++++++++++++++++++++++++++++++-
 tools/include/uapi/linux/bpf.h |   2 +
 4 files changed, 287 insertions(+), 8 deletions(-)

Comments

Alexei Starovoitov Jan. 8, 2022, 2:44 a.m. UTC | #1
On Fri, Jan 7, 2022 at 1:54 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Because the data pages are recycled by the page pool, and the test runner
> doesn't re-initialise them for each run, subsequent invocations of the XDP
> program will see the packet data in the state it was after the last time it
> ran on that particular page. This means that an XDP program that modifies
> the packet before redirecting it has to be careful about which assumptions
> it makes about the packet content, but that is only an issue for the most
> naively written programs.

This is too vague and partially incorrect.
The bpf program can do bpf_xdp_adjust_meta() and otherwise change
packet boundaries. These effects will be seen by subsequent
XDP_PASS/TX/REDIRECT, but on the next iteration the boundaries
will get reset to the original values.
So the test runner actually re-initializes some parts of the data,
but not the contents of the packet.
At least that's my understanding of the patch.
The users shouldn't need to dig into implementation to discover this.
Please document it.
The more I think about it the more I believe that it warrants
a little blurb in Documentation/bpf/ that describes what one can
do with this "xdp live mode".

Another question comes to mind:
What happens when a program modifies the packet?
Does it mean that the 2nd frame will see the modified data?
It will not, right?
It's the page pool size of packets that will be inited the same way
at the beginning. Which is NAPI_POLL_WEIGHT * 2 == 128 packets.
Why this number?
Should it be configurable?
Then the user can say: init N packets with this one pattern
and the program will know that exactly N invocation will be
with the same data, but N+1 it will see the 1st packet again
that potentially was modified by the program.
Is it accurate?
Toke Høiland-Jørgensen Jan. 8, 2022, 1:19 p.m. UTC | #2
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Fri, Jan 7, 2022 at 1:54 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Because the data pages are recycled by the page pool, and the test runner
>> doesn't re-initialise them for each run, subsequent invocations of the XDP
>> program will see the packet data in the state it was after the last time it
>> ran on that particular page. This means that an XDP program that modifies
>> the packet before redirecting it has to be careful about which assumptions
>> it makes about the packet content, but that is only an issue for the most
>> naively written programs.
>
> This is too vague and partially incorrect.
> The bpf program can do bpf_xdp_adjust_meta() and otherwise change
> packet boundaries. These effects will be seen by subsequent
> XDP_PASS/TX/REDIRECT, but on the next iteration the boundaries
> will get reset to the original values.
> So the test runner actually re-initializes some parts of the data,
> but not the contents of the packet.
> At least that's my understanding of the patch.

Yes, that's correct. Boundaries will be reset, data won't. The boundary
reset was added later, though, so guess I neglected to update the commit
message. Will fix.

> The users shouldn't need to dig into implementation to discover this.
> Please document it.
> The more I think about it the more I believe that it warrants
> a little blurb in Documentation/bpf/ that describes what one can
> do with this "xdp live mode".

Sure, can do. Doesn't look like BPF_PROG_RUN is documented in there at
all, so guess I can start such a document :)

> Another question comes to mind:
> What happens when a program modifies the packet?
> Does it mean that the 2nd frame will see the modified data?
> It will not, right?
> It's the page pool size of packets that will be inited the same way
> at the beginning. Which is NAPI_POLL_WEIGHT * 2 == 128 packets.
> Why this number?

Yes, you're right: the next run won't see the modified packet data. The
128 pages is because we run the program loop in batches of 64 (like NAPI
does, the fact that TEST_XDP_BATCH and NAPI_POLL_WEIGHT are the same is
not a coincidence).

We need 2x because we want enough pages so we can keep running without
allocating more, and the first batch can still be in flight on a
different CPU while we're processing batch 2.

I experimented with different values, and 128 was the minimum size that
didn't have a significant negative impact on performance, and above that
saw diminishing returns.

> Should it be configurable?
> Then the user can say: init N packets with this one pattern
> and the program will know that exactly N invocation will be
> with the same data, but N+1 it will see the 1st packet again
> that potentially was modified by the program.
> Is it accurate?

I thought about making it configurable, but the trouble is that it's not
quite as straight-forward as the first N packets being "pristine": it
depends on what happens to the packet afterwards:

On XDP_DROP, the page will be recycled immediately, whereas on
XDP_{TX,REDIRECT} it will go through the egress driver after sitting in
the bulk queue for a little while, so you can get reordering compared to
the original execution order.

On XDP_PASS the kernel will release the page entirely from the pool when
building an skb, so you'll never see that particular page again (and
eventually page_pool will allocate a new batch that will be
re-initialised to the original value).

If we do want to support a "pristine data" mode, I think the least
cumbersome way would be to add a flag that would make the kernel
re-initialise the packet data before every program invocation. The
reason I didn't do this was because I didn't have a use case for it. The
traffic generator use case only rewrites a tiny bit of the packet
header, and it's just as easy to just keep rewriting it without assuming
a particular previous value. And there's also the possibility of just
calling bpf_prog_run() multiple times from userspace with a lower number
of repetitions...

I'm not opposed to adding such a flag if you think it would be useful,
though. WDYT?

-Toke
Alexei Starovoitov Jan. 8, 2022, 7:28 p.m. UTC | #3
On Sat, Jan 8, 2022 at 5:19 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Sure, can do. Doesn't look like BPF_PROG_RUN is documented in there at
> all, so guess I can start such a document :)

prog_run was simple enough.
This live packet mode is a different level of complexity.
Just look at the length of this thread.
We keep finding implementation details that will be relevant
to anyone trying to use this interface.
They all will become part of uapi.

> > Another question comes to mind:
> > What happens when a program modifies the packet?
> > Does it mean that the 2nd frame will see the modified data?
> > It will not, right?
> > It's the page pool size of packets that will be inited the same way
> > at the beginning. Which is NAPI_POLL_WEIGHT * 2 == 128 packets.
> > Why this number?
>
> Yes, you're right: the next run won't see the modified packet data. The
> 128 pages is because we run the program loop in batches of 64 (like NAPI
> does, the fact that TEST_XDP_BATCH and NAPI_POLL_WEIGHT are the same is
> not a coincidence).
>
> We need 2x because we want enough pages so we can keep running without
> allocating more, and the first batch can still be in flight on a
> different CPU while we're processing batch 2.
>
> I experimented with different values, and 128 was the minimum size that
> didn't have a significant negative impact on performance, and above that
> saw diminishing returns.

I guess it's ok-ish to get stuck with 128.
It will be uapi that we cannot change though.
Are you comfortable with that?

> > Should it be configurable?
> > Then the user can say: init N packets with this one pattern
> > and the program will know that exactly N invocation will be
> > with the same data, but N+1 it will see the 1st packet again
> > that potentially was modified by the program.
> > Is it accurate?
>
> I thought about making it configurable, but the trouble is that it's not
> quite as straight-forward as the first N packets being "pristine": it
> depends on what happens to the packet afterwards:
>
> On XDP_DROP, the page will be recycled immediately, whereas on
> XDP_{TX,REDIRECT} it will go through the egress driver after sitting in
> the bulk queue for a little while, so you can get reordering compared to
> the original execution order.
>
> On XDP_PASS the kernel will release the page entirely from the pool when
> building an skb, so you'll never see that particular page again (and
> eventually page_pool will allocate a new batch that will be
> re-initialised to the original value).

That all makes sense. Thanks for explaining.
Please document it and update the selftest.
Looks like XDP_DROP is not tested.
Single packet TX and REDIRECT is imo too weak to give
confidence that the mechanism will not explode with millions of packets.

> If we do want to support a "pristine data" mode, I think the least
> cumbersome way would be to add a flag that would make the kernel
> re-initialise the packet data before every program invocation. The
> reason I didn't do this was because I didn't have a use case for it. The
> traffic generator use case only rewrites a tiny bit of the packet
> header, and it's just as easy to just keep rewriting it without assuming
> a particular previous value. And there's also the possibility of just
> calling bpf_prog_run() multiple times from userspace with a lower number
> of repetitions...
>
> I'm not opposed to adding such a flag if you think it would be useful,
> though. WDYT?

reinit doesn't feel necessary.
How one would use this interface to send N different packets?
The api provides an interface for only one.
It will be copied 128 times, but the prog_run call with repeat=1
will invoke bpf prog only once, right?
So technically doing N prog_run commands with different data
and repeat=1 will achieve the result, right?
But it's not efficient, since 128 pages and 128 copies will be
performed each time.
May be there is a use case for configurable page_pool size?
Toke Høiland-Jørgensen Jan. 8, 2022, 8:19 p.m. UTC | #4
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Sat, Jan 8, 2022 at 5:19 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Sure, can do. Doesn't look like BPF_PROG_RUN is documented in there at
>> all, so guess I can start such a document :)
>
> prog_run was simple enough.
> This live packet mode is a different level of complexity.
> Just look at the length of this thread.
> We keep finding implementation details that will be relevant
> to anyone trying to use this interface.
> They all will become part of uapi.

Sure, totally fine with documenting it. Just seems to me the most
obvious place to put this is in a new
Documentation/bpf/prog_test_run.rst file with a short introduction about
the general BPF_PROG_RUN mechanism, and then a subsection dedicated to
this facility.

Or would you rather I create something like
Documentation/bpf/xdp_live_packets.rst ?

>> > Another question comes to mind:
>> > What happens when a program modifies the packet?
>> > Does it mean that the 2nd frame will see the modified data?
>> > It will not, right?
>> > It's the page pool size of packets that will be inited the same way
>> > at the beginning. Which is NAPI_POLL_WEIGHT * 2 == 128 packets.
>> > Why this number?
>>
>> Yes, you're right: the next run won't see the modified packet data. The
>> 128 pages is because we run the program loop in batches of 64 (like NAPI
>> does, the fact that TEST_XDP_BATCH and NAPI_POLL_WEIGHT are the same is
>> not a coincidence).
>>
>> We need 2x because we want enough pages so we can keep running without
>> allocating more, and the first batch can still be in flight on a
>> different CPU while we're processing batch 2.
>>
>> I experimented with different values, and 128 was the minimum size that
>> didn't have a significant negative impact on performance, and above that
>> saw diminishing returns.
>
> I guess it's ok-ish to get stuck with 128.
> It will be uapi that we cannot change though.
> Are you comfortable with that?

UAPI in what sense? I'm thinking of documenting it like:

"The packet data being supplied as data_in to BPF_PROG_RUN will be used
 for the initial run of the XDP program. However, when running the
 program multiple times (with repeat > 1), only the packet *bounds*
 (i.e., the data, data_end and data_meta pointers) will be reset on each
 invocation, the packet data itself won't be rewritten. The pages
 backing the packets are recycled, but the order depends on the path the
 packet takes through the kernel, making it hard to predict when a
 particular modified page makes it back to the XDP program. In practice,
 this means that if the XDP program modifies the packet payload before
 sending out the packet, it has to be prepared to deal with subsequent
 invocations seeing either the initial data or the already-modified
 packet, in arbitrary order."

I don't think this makes any promises about any particular size of the
page pool, so how does it constitute UAPI?

>> > Should it be configurable?
>> > Then the user can say: init N packets with this one pattern
>> > and the program will know that exactly N invocation will be
>> > with the same data, but N+1 it will see the 1st packet again
>> > that potentially was modified by the program.
>> > Is it accurate?
>>
>> I thought about making it configurable, but the trouble is that it's not
>> quite as straight-forward as the first N packets being "pristine": it
>> depends on what happens to the packet afterwards:
>>
>> On XDP_DROP, the page will be recycled immediately, whereas on
>> XDP_{TX,REDIRECT} it will go through the egress driver after sitting in
>> the bulk queue for a little while, so you can get reordering compared to
>> the original execution order.
>>
>> On XDP_PASS the kernel will release the page entirely from the pool when
>> building an skb, so you'll never see that particular page again (and
>> eventually page_pool will allocate a new batch that will be
>> re-initialised to the original value).
>
> That all makes sense. Thanks for explaining.
> Please document it and update the selftest.
> Looks like XDP_DROP is not tested.
> Single packet TX and REDIRECT is imo too weak to give
> confidence that the mechanism will not explode with millions of
> packets.

OK, will do.

>> If we do want to support a "pristine data" mode, I think the least
>> cumbersome way would be to add a flag that would make the kernel
>> re-initialise the packet data before every program invocation. The
>> reason I didn't do this was because I didn't have a use case for it. The
>> traffic generator use case only rewrites a tiny bit of the packet
>> header, and it's just as easy to just keep rewriting it without assuming
>> a particular previous value. And there's also the possibility of just
>> calling bpf_prog_run() multiple times from userspace with a lower number
>> of repetitions...
>>
>> I'm not opposed to adding such a flag if you think it would be useful,
>> though. WDYT?
>
> reinit doesn't feel necessary.
> How one would use this interface to send N different packets?
> The api provides an interface for only one.

By having the XDP program react appropriately. E.g., here is the XDP
program used by the trafficgen tool to cycle through UDP ports when
sending out the packets - it just reads the current value and updates
based on that, so it doesn't matter if it sees the initial page or one
it already modified:

const volatile __u16 port_start;
const volatile __u16 port_range;
volatile __u16 next_port = 0;

SEC("xdp")
int xdp_redirect_update_port(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	__u16 cur_port, cksum_diff;
	struct udphdr *hdr;

	hdr = data + (sizeof(struct ethhdr) + sizeof(struct ipv6hdr));
	if (hdr + 1 > data_end)
		return XDP_ABORTED;

	cur_port = bpf_ntohs(hdr->dest);
	cksum_diff = next_port - cur_port;
	if (cksum_diff) {
		hdr->check = bpf_htons(~(~bpf_ntohs(hdr->check) + cksum_diff));
		hdr->dest = bpf_htons(next_port);
	}
	if (next_port++ >= port_start + port_range - 1)
		next_port = port_start;

	return bpf_redirect(ifindex_out, 0);
}

You could do something similar with a whole packet header or payload; or
you could even populate a map with the full-size packets and copy that
in based on a counter.

> It will be copied 128 times, but the prog_run call with repeat=1
> will invoke bpf prog only once, right?
> So technically doing N prog_run commands with different data
> and repeat=1 will achieve the result, right?
> But it's not efficient, since 128 pages and 128 copies will be
> performed each time.
> May be there is a use case for configurable page_pool size?

Hmm, we could size the page_pool as min(repeat, 128) to avoid the extra
copies when they won't be used?

Another question seeing as the merge window is imminent: How do you feel
about merging this before the merge window? I can resubmit before it
opens with the updated selftest and documentation, and we can deal with
any tweaks during the -rcs; or would you rather postpone the whole
thing until the next cycle?

-Toke
Alexei Starovoitov Jan. 9, 2022, 2:24 a.m. UTC | #5
On Sat, Jan 08, 2022 at 09:19:41PM +0100, Toke Høiland-Jørgensen wrote:
> 
> Sure, totally fine with documenting it. Just seems to me the most
> obvious place to put this is in a new
> Documentation/bpf/prog_test_run.rst file with a short introduction about
> the general BPF_PROG_RUN mechanism, and then a subsection dedicated to
> this facility.

sgtm

> > I guess it's ok-ish to get stuck with 128.
> > It will be uapi that we cannot change though.
> > Are you comfortable with that?
> 
> UAPI in what sense? I'm thinking of documenting it like:
> 
> "The packet data being supplied as data_in to BPF_PROG_RUN will be used
>  for the initial run of the XDP program. However, when running the
>  program multiple times (with repeat > 1), only the packet *bounds*
>  (i.e., the data, data_end and data_meta pointers) will be reset on each
>  invocation, the packet data itself won't be rewritten. The pages
>  backing the packets are recycled, but the order depends on the path the
>  packet takes through the kernel, making it hard to predict when a
>  particular modified page makes it back to the XDP program. In practice,
>  this means that if the XDP program modifies the packet payload before
>  sending out the packet, it has to be prepared to deal with subsequent
>  invocations seeing either the initial data or the already-modified
>  packet, in arbitrary order."
> 
> I don't think this makes any promises about any particular size of the
> page pool, so how does it constitute UAPI?

Could you explain out-of-order scanario again?
It's possible only if xdp_redirect is done into different netdevs.
Then they can xmit at different times and cycle pages back into
the loop in different order. But TX or REDIRECT into the same netdev
will keep the pages in the same order. So the program can rely on that.

> >
> > reinit doesn't feel necessary.
> > How one would use this interface to send N different packets?
> > The api provides an interface for only one.
> 
> By having the XDP program react appropriately. E.g., here is the XDP
> program used by the trafficgen tool to cycle through UDP ports when
> sending out the packets - it just reads the current value and updates
> based on that, so it doesn't matter if it sees the initial page or one
> it already modified:

Sure. I think there is an untapped potential here.
With this live packet prog_run anyone can buy 10G or 100G nic equipped
server and for free transform it into $300k+ IXIA beating machine.
It could be a game changer. pktgen doesn't come close.
I'm thinking about generating and consuming test TCP traffic.
TCP blaster would xmit 1M TCP connections through this live prog_run
into eth0 and consume the traffic returning from "server under test"
via a different XDP program attached to eth0.
The prog_run's xdp prog would need to send SYN, increment sequence number,
and keep sane data in the packets. It could be HTTP request, for example.

To achive this IXIA beating setup the TCP blaster would need a full
understanding of what page pool is doing with the packets.
Just saying "in arbitrary order" is a non starter. It diminishes
this live prog_run into pktgen equivalent which is still useful,
but lots of potential is lost.

> Another question seeing as the merge window is imminent: How do you feel
> about merging this before the merge window? I can resubmit before it
> opens with the updated selftest and documentation, and we can deal with
> any tweaks during the -rcs; or would you rather postpone the whole
> thing until the next cycle?

It's already too late for this merge window, but bpf-next is always open.
Just like it was open for the last year. So please resubmit as soon as
the tests are green and this discussion is over.
Toke Høiland-Jørgensen Jan. 9, 2022, 12:30 p.m. UTC | #6
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Sat, Jan 08, 2022 at 09:19:41PM +0100, Toke Høiland-Jørgensen wrote:
>> 
>> Sure, totally fine with documenting it. Just seems to me the most
>> obvious place to put this is in a new
>> Documentation/bpf/prog_test_run.rst file with a short introduction about
>> the general BPF_PROG_RUN mechanism, and then a subsection dedicated to
>> this facility.
>
> sgtm

Great!

>> > I guess it's ok-ish to get stuck with 128.
>> > It will be uapi that we cannot change though.
>> > Are you comfortable with that?
>> 
>> UAPI in what sense? I'm thinking of documenting it like:
>> 
>> "The packet data being supplied as data_in to BPF_PROG_RUN will be used
>>  for the initial run of the XDP program. However, when running the
>>  program multiple times (with repeat > 1), only the packet *bounds*
>>  (i.e., the data, data_end and data_meta pointers) will be reset on each
>>  invocation, the packet data itself won't be rewritten. The pages
>>  backing the packets are recycled, but the order depends on the path the
>>  packet takes through the kernel, making it hard to predict when a
>>  particular modified page makes it back to the XDP program. In practice,
>>  this means that if the XDP program modifies the packet payload before
>>  sending out the packet, it has to be prepared to deal with subsequent
>>  invocations seeing either the initial data or the already-modified
>>  packet, in arbitrary order."
>> 
>> I don't think this makes any promises about any particular size of the
>> page pool, so how does it constitute UAPI?
>
> Could you explain out-of-order scanario again?
> It's possible only if xdp_redirect is done into different netdevs.
> Then they can xmit at different times and cycle pages back into
> the loop in different order. But TX or REDIRECT into the same netdev
> will keep the pages in the same order. So the program can rely on
> that.

I left that out on purpose: I feel it's exposing an internal
implementation detail as UAPI (as you said). And I'm not convinced it
really needed (or helpful) - see below.

>> >
>> > reinit doesn't feel necessary.
>> > How one would use this interface to send N different packets?
>> > The api provides an interface for only one.
>> 
>> By having the XDP program react appropriately. E.g., here is the XDP
>> program used by the trafficgen tool to cycle through UDP ports when
>> sending out the packets - it just reads the current value and updates
>> based on that, so it doesn't matter if it sees the initial page or one
>> it already modified:
>
> Sure. I think there is an untapped potential here.
> With this live packet prog_run anyone can buy 10G or 100G nic equipped
> server and for free transform it into $300k+ IXIA beating machine.
> It could be a game changer. pktgen doesn't come close.
> I'm thinking about generating and consuming test TCP traffic.
> TCP blaster would xmit 1M TCP connections through this live prog_run
> into eth0 and consume the traffic returning from "server under test"
> via a different XDP program attached to eth0.
> The prog_run's xdp prog would need to send SYN, increment sequence number,
> and keep sane data in the packets. It could be HTTP request, for example.

I'm glad you see the potential :)

> To achive this IXIA beating setup the TCP blaster would need a full
> understanding of what page pool is doing with the packets.
> Just saying "in arbitrary order" is a non starter. It diminishes
> this live prog_run into pktgen equivalent which is still useful,
> but lots of potential is lost.

I don't think a detailed knowledge of how the pages are recycled is
needed to implement a TCP stream? Even if you just rely on the packets
being recycled with a fixed period of 128 pages, how does that make your
XDP program simpler? You'll still have to update the packet header for
each packet, with state kept in a map; so why is it helpful to know when
a particular page comes back?

I'll try implementing a TCP stream mode in xdp_trafficgen just to make
sure I'm not missing something. But I believe that sending out a stream
of packets that looks like a coherent TCP stream should be simple
enough, at least. Dealing with the full handshake + CWND control loop
will be harder, though, and right now I think it'll require multiple
trips back to userspace.

>> Another question seeing as the merge window is imminent: How do you feel
>> about merging this before the merge window? I can resubmit before it
>> opens with the updated selftest and documentation, and we can deal with
>> any tweaks during the -rcs; or would you rather postpone the whole
>> thing until the next cycle?
>
> It's already too late for this merge window, but bpf-next is always open.
> Just like it was open for the last year. So please resubmit as soon as
> the tests are green and this discussion is over.

Ah, OK. I was under the impression that the cutoff date was tomorrow;
has that changed? But no worries, I'll spend my Sunday outside instead
of coding, then, and come back to this tomorrow :)

-Toke
Alexei Starovoitov Jan. 13, 2022, 1:37 a.m. UTC | #7
On Sun, Jan 9, 2022 at 4:30 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> I left that out on purpose: I feel it's exposing an internal
> implementation detail as UAPI (as you said). And I'm not convinced it
> really needed (or helpful) - see below.

It's irrelevant whether it's documented or not.
Once this implementation detail is being relied upon
by user space it becomes an undocumented uapi that we cannot change.

> I'll try implementing a TCP stream mode in xdp_trafficgen just to make
> sure I'm not missing something. But I believe that sending out a stream
> of packets that looks like a coherent TCP stream should be simple
> enough, at least. Dealing with the full handshake + CWND control loop
> will be harder, though, and right now I think it'll require multiple
> trips back to userspace.

The patch set looks very close to being able to do such TCP streaming.
Let's make sure nothing is missing from API before we land it.
Martin KaFai Lau Feb. 11, 2022, 7:19 a.m. UTC | #8
On Wed, Jan 12, 2022 at 05:37:54PM -0800, Alexei Starovoitov wrote:
> On Sun, Jan 9, 2022 at 4:30 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >
> > I left that out on purpose: I feel it's exposing an internal
> > implementation detail as UAPI (as you said). And I'm not convinced it
> > really needed (or helpful) - see below.
> 
> It's irrelevant whether it's documented or not.
> Once this implementation detail is being relied upon
> by user space it becomes an undocumented uapi that we cannot change.
> 
> > I'll try implementing a TCP stream mode in xdp_trafficgen just to make
> > sure I'm not missing something. But I believe that sending out a stream
> > of packets that looks like a coherent TCP stream should be simple
> > enough, at least. Dealing with the full handshake + CWND control loop
> > will be harder, though, and right now I think it'll require multiple
> > trips back to userspace.
> 
> The patch set looks very close to being able to do such TCP streaming.
> Let's make sure nothing is missing from API before we land it.
Hi Toke,  I am also looking at ways to blast tcp packets by using
bpf to overcome the pktgen udp-only limitation.
Are you planning to respin with a TCP stream mode in xdp_trafficgen ?
Thanks !
Toke Høiland-Jørgensen Feb. 11, 2022, 4:03 p.m. UTC | #9
Martin KaFai Lau <kafai@fb.com> writes:

> On Wed, Jan 12, 2022 at 05:37:54PM -0800, Alexei Starovoitov wrote:
>> On Sun, Jan 9, 2022 at 4:30 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >
>> > I left that out on purpose: I feel it's exposing an internal
>> > implementation detail as UAPI (as you said). And I'm not convinced it
>> > really needed (or helpful) - see below.
>> 
>> It's irrelevant whether it's documented or not.
>> Once this implementation detail is being relied upon
>> by user space it becomes an undocumented uapi that we cannot change.
>> 
>> > I'll try implementing a TCP stream mode in xdp_trafficgen just to make
>> > sure I'm not missing something. But I believe that sending out a stream
>> > of packets that looks like a coherent TCP stream should be simple
>> > enough, at least. Dealing with the full handshake + CWND control loop
>> > will be harder, though, and right now I think it'll require multiple
>> > trips back to userspace.
>> 
>> The patch set looks very close to being able to do such TCP streaming.
>> Let's make sure nothing is missing from API before we land it.
> Hi Toke,  I am also looking at ways to blast tcp packets by using
> bpf to overcome the pktgen udp-only limitation.
> Are you planning to respin with a TCP stream mode in xdp_trafficgen ?
> Thanks !

Yes, working on it! Got sidetracked a bit, but hoping to have something
to show for my efforts sometime next week :)

-Toke
diff mbox series

Patch

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b0383d371b9a..5ef20deaf49f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1225,6 +1225,8 @@  enum {
 
 /* If set, run the test on the cpu specified by bpf_attr.test.cpu */
 #define BPF_F_TEST_RUN_ON_CPU	(1U << 0)
+/* If set, XDP frames will be transmitted after processing */
+#define BPF_F_TEST_XDP_LIVE_FRAMES	(1U << 1)
 
 /* type for BPF_ENABLE_STATS */
 enum bpf_stats_type {
diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
index d24d518ddd63..c8c920020d11 100644
--- a/kernel/bpf/Kconfig
+++ b/kernel/bpf/Kconfig
@@ -30,6 +30,7 @@  config BPF_SYSCALL
 	select TASKS_TRACE_RCU
 	select BINARY_PRINTF
 	select NET_SOCK_MSG if NET
+	select PAGE_POOL if NET
 	default n
 	help
 	  Enable the bpf() system call that allows to manipulate BPF programs
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 46dd95755967..c110daccd6e5 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -14,6 +14,7 @@ 
 #include <net/sock.h>
 #include <net/tcp.h>
 #include <net/net_namespace.h>
+#include <net/page_pool.h>
 #include <linux/error-injection.h>
 #include <linux/smp.h>
 #include <linux/sock_diag.h>
@@ -52,10 +53,11 @@  static void bpf_test_timer_leave(struct bpf_test_timer *t)
 	rcu_read_unlock();
 }
 
-static bool bpf_test_timer_continue(struct bpf_test_timer *t, u32 repeat, int *err, u32 *duration)
+static bool bpf_test_timer_continue(struct bpf_test_timer *t, int iterations,
+				    u32 repeat, int *err, u32 *duration)
 	__must_hold(rcu)
 {
-	t->i++;
+	t->i += iterations;
 	if (t->i >= repeat) {
 		/* We're done. */
 		t->time_spent += ktime_get_ns() - t->time_start;
@@ -87,6 +89,270 @@  static bool bpf_test_timer_continue(struct bpf_test_timer *t, u32 repeat, int *e
 	return false;
 }
 
+/* We put this struct at the head of each page with a context and frame
+ * initialised when the page is allocated, so we don't have to do this on each
+ * repetition of the test run.
+ */
+struct xdp_page_head {
+	struct xdp_buff orig_ctx;
+	struct xdp_buff ctx;
+	struct xdp_frame frm;
+	u8 data[];
+};
+
+struct xdp_test_data {
+	struct xdp_buff *orig_ctx;
+	struct xdp_rxq_info rxq;
+	struct net_device *dev;
+	struct page_pool *pp;
+	u32 frame_cnt;
+};
+
+#define TEST_XDP_FRAME_SIZE (PAGE_SIZE - sizeof(struct xdp_page_head)	\
+			     - sizeof(struct skb_shared_info))
+#define TEST_XDP_BATCH 64
+
+static void xdp_test_run_init_page(struct page *page, void *arg)
+{
+	struct xdp_page_head *head = phys_to_virt(page_to_phys(page));
+	struct xdp_buff *new_ctx, *orig_ctx;
+	u32 headroom = XDP_PACKET_HEADROOM;
+	struct xdp_test_data *xdp = arg;
+	size_t frm_len, meta_len;
+	struct xdp_frame *frm;
+	void *data;
+
+	orig_ctx = xdp->orig_ctx;
+	frm_len = orig_ctx->data_end - orig_ctx->data_meta;
+	meta_len = orig_ctx->data - orig_ctx->data_meta;
+	headroom -= meta_len;
+
+	new_ctx = &head->ctx;
+	frm = &head->frm;
+	data = &head->data;
+	memcpy(data + headroom, orig_ctx->data_meta, frm_len);
+
+	xdp_init_buff(new_ctx, TEST_XDP_FRAME_SIZE, &xdp->rxq);
+	xdp_prepare_buff(new_ctx, data, headroom, frm_len, true);
+	new_ctx->data = new_ctx->data_meta + meta_len;
+
+	xdp_update_frame_from_buff(new_ctx, frm);
+	frm->mem = new_ctx->rxq->mem;
+
+	memcpy(&head->orig_ctx, new_ctx, sizeof(head->orig_ctx));
+}
+
+static int xdp_test_run_setup(struct xdp_test_data *xdp, struct xdp_buff *orig_ctx)
+{
+	struct xdp_mem_info mem = {};
+	struct page_pool *pp;
+	int err;
+	struct page_pool_params pp_params = {
+		.order = 0,
+		.flags = 0,
+		.pool_size = NAPI_POLL_WEIGHT * 2,
+		.nid = NUMA_NO_NODE,
+		.max_len = TEST_XDP_FRAME_SIZE,
+		.init_callback = xdp_test_run_init_page,
+		.init_arg = xdp,
+	};
+
+	pp = page_pool_create(&pp_params);
+	if (IS_ERR(pp))
+		return PTR_ERR(pp);
+
+	/* will copy 'mem.id' into pp->xdp_mem_id */
+	err = xdp_reg_mem_model(&mem, MEM_TYPE_PAGE_POOL, pp);
+	if (err) {
+		page_pool_destroy(pp);
+		return err;
+	}
+	xdp->pp = pp;
+
+	/* We create a 'fake' RXQ referencing the original dev, but with an
+	 * xdp_mem_info pointing to our page_pool
+	 */
+	xdp_rxq_info_reg(&xdp->rxq, orig_ctx->rxq->dev, 0, 0);
+	xdp->rxq.mem.type = MEM_TYPE_PAGE_POOL;
+	xdp->rxq.mem.id = pp->xdp_mem_id;
+	xdp->dev = orig_ctx->rxq->dev;
+	xdp->orig_ctx = orig_ctx;
+
+	return 0;
+}
+
+static void xdp_test_run_teardown(struct xdp_test_data *xdp)
+{
+	struct xdp_mem_info mem = {
+		.id = xdp->pp->xdp_mem_id,
+		.type = MEM_TYPE_PAGE_POOL,
+	};
+
+	xdp_unreg_mem_model(&mem);
+}
+
+static bool ctx_was_changed(struct xdp_page_head *head)
+{
+	return head->orig_ctx.data != head->ctx.data ||
+		head->orig_ctx.data_meta != head->ctx.data_meta ||
+		head->orig_ctx.data_end != head->ctx.data_end;
+}
+
+static void reset_ctx(struct xdp_page_head *head)
+{
+	if (likely(!ctx_was_changed(head)))
+		return;
+
+	head->ctx.data = head->orig_ctx.data;
+	head->ctx.data_meta = head->orig_ctx.data_meta;
+	head->ctx.data_end = head->orig_ctx.data_end;
+	xdp_update_frame_from_buff(&head->ctx, &head->frm);
+}
+
+static int xdp_recv_frames(struct xdp_frame **frames, int nframes,
+			   struct net_device *dev)
+{
+	gfp_t gfp = __GFP_ZERO | GFP_ATOMIC;
+	void *skbs[TEST_XDP_BATCH];
+	int i, n;
+	LIST_HEAD(list);
+
+	n = kmem_cache_alloc_bulk(skbuff_head_cache, gfp, nframes, skbs);
+	if (unlikely(n == 0)) {
+		for (i = 0; i < nframes; i++)
+			xdp_return_frame(frames[i]);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < nframes; i++) {
+		struct xdp_frame *xdpf = frames[i];
+		struct sk_buff *skb = skbs[i];
+
+		skb = __xdp_build_skb_from_frame(xdpf, skb, dev);
+		if (!skb) {
+			xdp_return_frame(xdpf);
+			continue;
+		}
+
+		list_add_tail(&skb->list, &list);
+	}
+	netif_receive_skb_list(&list);
+
+	return 0;
+}
+
+static int xdp_test_run_batch(struct xdp_test_data *xdp, struct bpf_prog *prog,
+			      u32 repeat)
+{
+	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	int err = 0, act, ret, i, nframes = 0, batch_sz;
+	struct xdp_frame *frames[TEST_XDP_BATCH];
+	struct xdp_page_head *head;
+	struct xdp_frame *frm;
+	bool redirect = false;
+	struct xdp_buff *ctx;
+	struct page *page;
+
+	batch_sz = min_t(u32, repeat, TEST_XDP_BATCH);
+	local_bh_disable();
+	xdp_set_return_frame_no_direct();
+
+	for (i = 0; i < batch_sz; i++) {
+		page = page_pool_dev_alloc_pages(xdp->pp);
+		if (!page) {
+			err = -ENOMEM;
+			goto out;
+		}
+
+		head = phys_to_virt(page_to_phys(page));
+		reset_ctx(head);
+		ctx = &head->ctx;
+		frm = &head->frm;
+		xdp->frame_cnt++;
+
+		act = bpf_prog_run_xdp(prog, ctx);
+
+		/* if program changed pkt bounds we need to update the xdp_frame */
+		if (unlikely(ctx_was_changed(head))) {
+			err = xdp_update_frame_from_buff(ctx, frm);
+			if (err) {
+				xdp_return_buff(ctx);
+				goto out;
+			}
+		}
+
+		switch (act) {
+		case XDP_TX:
+			/* we can't do a real XDP_TX since we're not in the
+			 * driver, so turn it into a REDIRECT back to the same
+			 * index
+			 */
+			ri->tgt_index = xdp->dev->ifindex;
+			ri->map_id = INT_MAX;
+			ri->map_type = BPF_MAP_TYPE_UNSPEC;
+			fallthrough;
+		case XDP_REDIRECT:
+			redirect = true;
+			err = xdp_do_redirect_frame(xdp->dev, ctx, frm, prog);
+			if (err) {
+				xdp_return_buff(ctx);
+				goto out;
+			}
+			break;
+		case XDP_PASS:
+			frames[nframes++] = frm;
+			break;
+		default:
+			bpf_warn_invalid_xdp_action(NULL, prog, act);
+			fallthrough;
+		case XDP_DROP:
+			xdp_return_buff(ctx);
+			break;
+		}
+	}
+
+out:
+	if (redirect)
+		xdp_do_flush();
+	if (nframes) {
+		ret = xdp_recv_frames(frames, nframes, xdp->dev);
+		if (ret)
+			err = ret;
+	}
+
+	xdp_clear_return_frame_no_direct();
+	local_bh_enable();
+	return err;
+}
+
+static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx,
+				 u32 repeat, u32 *time)
+
+{
+	struct bpf_test_timer t = { .mode = NO_MIGRATE };
+	struct xdp_test_data xdp = {};
+	int ret;
+
+	if (!repeat)
+		repeat = 1;
+
+	ret = xdp_test_run_setup(&xdp, ctx);
+	if (ret)
+		return ret;
+
+	bpf_test_timer_enter(&t);
+	do {
+		xdp.frame_cnt = 0;
+		ret = xdp_test_run_batch(&xdp, prog, repeat - t.i);
+		if (unlikely(ret < 0))
+			break;
+	} while (bpf_test_timer_continue(&t, xdp.frame_cnt, repeat, &ret, time));
+	bpf_test_timer_leave(&t);
+
+	xdp_test_run_teardown(&xdp);
+	return ret;
+}
+
 static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
 			u32 *retval, u32 *time, bool xdp)
 {
@@ -118,7 +384,7 @@  static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
 			*retval = bpf_prog_run_xdp(prog, ctx);
 		else
 			*retval = bpf_prog_run(prog, ctx);
-	} while (bpf_test_timer_continue(&t, repeat, &ret, time));
+	} while (bpf_test_timer_continue(&t, 1, repeat, &ret, time));
 	bpf_reset_run_ctx(old_ctx);
 	bpf_test_timer_leave(&t);
 
@@ -757,13 +1023,14 @@  static void xdp_convert_buff_to_md(struct xdp_buff *xdp, struct xdp_md *xdp_md)
 int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
 			  union bpf_attr __user *uattr)
 {
+	bool do_live = (kattr->test.flags & BPF_F_TEST_XDP_LIVE_FRAMES);
 	u32 tailroom = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 	u32 headroom = XDP_PACKET_HEADROOM;
 	u32 size = kattr->test.data_size_in;
 	u32 repeat = kattr->test.repeat;
 	struct netdev_rx_queue *rxqueue;
 	struct xdp_buff xdp = {};
-	u32 retval, duration;
+	u32 retval = 0, duration;
 	struct xdp_md *ctx;
 	u32 max_data_sz;
 	void *data;
@@ -773,6 +1040,9 @@  int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
 	    prog->expected_attach_type == BPF_XDP_CPUMAP)
 		return -EINVAL;
 
+	if (kattr->test.flags & ~BPF_F_TEST_XDP_LIVE_FRAMES)
+		return -EINVAL;
+
 	ctx = bpf_ctx_init(kattr, sizeof(struct xdp_md));
 	if (IS_ERR(ctx))
 		return PTR_ERR(ctx);
@@ -781,7 +1051,8 @@  int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
 		/* There can't be user provided data before the meta data */
 		if (ctx->data_meta || ctx->data_end != size ||
 		    ctx->data > ctx->data_end ||
-		    unlikely(xdp_metalen_invalid(ctx->data)))
+		    unlikely(xdp_metalen_invalid(ctx->data)) ||
+		    (do_live && (kattr->test.data_out || kattr->test.ctx_out)))
 			goto free_ctx;
 		/* Meta data is allocated from the headroom */
 		headroom -= ctx->data;
@@ -807,7 +1078,10 @@  int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
 
 	if (repeat > 1)
 		bpf_prog_change_xdp(NULL, prog);
-	ret = bpf_test_run(prog, &xdp, repeat, &retval, &duration, true);
+	if (do_live)
+		ret = bpf_test_run_xdp_live(prog, &xdp, repeat, &duration);
+	else
+		ret = bpf_test_run(prog, &xdp, repeat, &retval, &duration, true);
 	/* We convert the xdp_buff back to an xdp_md before checking the return
 	 * code so the reference count of any held netdevice will be decremented
 	 * even if the test run failed.
@@ -905,7 +1179,7 @@  int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 	do {
 		retval = bpf_flow_dissect(prog, &ctx, eth->h_proto, ETH_HLEN,
 					  size, flags);
-	} while (bpf_test_timer_continue(&t, repeat, &ret, &duration));
+	} while (bpf_test_timer_continue(&t, 1, repeat, &ret, &duration));
 	bpf_test_timer_leave(&t);
 
 	if (ret < 0)
@@ -1000,7 +1274,7 @@  int bpf_prog_test_run_sk_lookup(struct bpf_prog *prog, const union bpf_attr *kat
 	do {
 		ctx.selected_sk = NULL;
 		retval = BPF_PROG_SK_LOOKUP_RUN_ARRAY(progs, ctx, bpf_prog_run);
-	} while (bpf_test_timer_continue(&t, repeat, &ret, &duration));
+	} while (bpf_test_timer_continue(&t, 1, repeat, &ret, &duration));
 	bpf_test_timer_leave(&t);
 
 	if (ret < 0)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index b0383d371b9a..5ef20deaf49f 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1225,6 +1225,8 @@  enum {
 
 /* If set, run the test on the cpu specified by bpf_attr.test.cpu */
 #define BPF_F_TEST_RUN_ON_CPU	(1U << 0)
+/* If set, XDP frames will be transmitted after processing */
+#define BPF_F_TEST_XDP_LIVE_FRAMES	(1U << 1)
 
 /* type for BPF_ENABLE_STATS */
 enum bpf_stats_type {