[net-next,v2] samples/bpf: fixup some tools to be able to support xdp multibuffer

Message ID 20220621175402.35327-1-gospo@broadcom.com (mailing list archive)
State Accepted
Commit 772251742262bd651529a3cea3ee756b71e95a29
Delegated to: BPF
Series [net-next,v2] samples/bpf: fixup some tools to be able to support xdp multibuffer

Checks

Context Check Description
bpf/vmtest-bpf-next-VM_Test-1 success Logs for Kernel LATEST on ubuntu-latest with gcc
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Kernel LATEST on ubuntu-latest with llvm-15
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-3 success Logs for Kernel LATEST on z15 with gcc
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers success CCed 13 of 13 maintainers
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch warning CHECK: spaces preferred around that '-' (ctx:VxV) WARNING: From:/Signed-off-by: email address mismatch: 'From: Andy Gospodarek <andrew.gospodarek@broadcom.com>' != 'Signed-off-by: Andy Gospodarek <gospo@broadcom.com>'
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Andy Gospodarek June 21, 2022, 5:54 p.m. UTC
This changes the section name for the bpf program embedded in these
files to "xdp.frags" to allow the programs to be loaded on drivers that
are using an MTU greater than PAGE_SIZE.  Rather than directly accessing
the buffers, the packet data is now accessed via xdp helper functions to
provide an example for those who may need to write more complex
programs.

v2: remove new unnecessary variable

Signed-off-by: Andy Gospodarek <gospo@broadcom.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 samples/bpf/xdp1_kern.c            | 11 ++++++++---
 samples/bpf/xdp2_kern.c            | 11 ++++++++---
 samples/bpf/xdp_tx_iptunnel_kern.c |  2 +-
 3 files changed, 17 insertions(+), 7 deletions(-)
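
For reference, the pattern the samples switch to looks roughly like the
following (a condensed paraphrase of the diffs below, not the literal
patch):

	#define XDPBUFSIZE	64

	SEC("xdp.frags")
	int xdp_prog(struct xdp_md *ctx)
	{
		__u8 pkt[XDPBUFSIZE] = {};	/* local copy on the BPF stack */

		/* copy the first bytes of the (possibly fragmented) packet
		 * instead of dereferencing ctx->data directly
		 */
		if (bpf_xdp_load_bytes(ctx, 0, pkt, sizeof(pkt)))
			return XDP_DROP;

		/* parse the Ethernet/IP headers from the stack copy as before */
		return XDP_PASS;
	}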

Comments

patchwork-bot+netdevbpf@kernel.org June 22, 2022, 2 a.m. UTC | #1
Hello:

This patch was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Tue, 21 Jun 2022 17:54:02 +0000 you wrote:
> This changes the section name for the bpf program embedded in these
> files to "xdp.frags" to allow the programs to be loaded on drivers that
> are using an MTU greater than PAGE_SIZE.  Rather than directly accessing
> the buffers, the packet data is now accessed via xdp helper functions to
> provide an example for those who may need to write more complex
> programs.
> 
> [...]

Here is the summary with links:
  - [net-next,v2] samples/bpf: fixup some tools to be able to support xdp multibuffer
    https://git.kernel.org/bpf/bpf-next/c/772251742262

You are awesome, thank you!
Tariq Toukan Jan. 3, 2023, 12:55 p.m. UTC | #2
On 21/06/2022 20:54, Andy Gospodarek wrote:
> This changes the section name for the bpf program embedded in these
> files to "xdp.frags" to allow the programs to be loaded on drivers that
> are using an MTU greater than PAGE_SIZE.  Rather than directly accessing
> the buffers, the packet data is now accessed via xdp helper functions to
> provide an example for those who may need to write more complex
> programs.
> 
> v2: remove new unnecessary variable
> 

Hi,

I'm trying to understand if there are any assumptions/requirements on 
the length of the xdp_buf linear part when passed to XDP multi-buf programs?
Can the linear part be empty, with all data residing in the fragments? 
Is it valid?

Per the proposed pattern below (calling bpf_xdp_load_bytes() to memcpy 
packet data into a local buffer), no such assumption is required, and an 
xdp_buf created by the driver with an empty linear part is valid.

However, in the _xdp_tx_iptunnel example program, it fails (returns 
XDP_DROP) in case the headers are not in the linear part.
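
(For reference, the failure mode is the usual direct-access bounds check,
roughly, paraphrasing the sample rather than quoting it:

	void *data_end = (void *)(long)xdp->data_end;
	void *data = (void *)(long)xdp->data;
	struct ethhdr *eth = data;

	if (data + sizeof(*eth) > data_end)	/* always true with an empty linear part */
		return XDP_DROP;

so a 0-length linear part gets dropped before any parsing happens.)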

Regards,
Tariq

> Signed-off-by: Andy Gospodarek <gospo@broadcom.com>
> Acked-by: John Fastabend <john.fastabend@gmail.com>
> Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
>   samples/bpf/xdp1_kern.c            | 11 ++++++++---
>   samples/bpf/xdp2_kern.c            | 11 ++++++++---
>   samples/bpf/xdp_tx_iptunnel_kern.c |  2 +-
>   3 files changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/samples/bpf/xdp1_kern.c b/samples/bpf/xdp1_kern.c
> index f0c5d95084de..0a5c704badd0 100644
> --- a/samples/bpf/xdp1_kern.c
> +++ b/samples/bpf/xdp1_kern.c
> @@ -39,11 +39,13 @@ static int parse_ipv6(void *data, u64 nh_off, void *data_end)
>   	return ip6h->nexthdr;
>   }
>   
> -SEC("xdp1")
> +#define XDPBUFSIZE	64
> +SEC("xdp.frags")
>   int xdp_prog1(struct xdp_md *ctx)
>   {
> -	void *data_end = (void *)(long)ctx->data_end;
> -	void *data = (void *)(long)ctx->data;
> +	__u8 pkt[XDPBUFSIZE] = {};
> +	void *data_end = &pkt[XDPBUFSIZE-1];
> +	void *data = pkt;
>   	struct ethhdr *eth = data;
>   	int rc = XDP_DROP;
>   	long *value;
> @@ -51,6 +53,9 @@ int xdp_prog1(struct xdp_md *ctx)
>   	u64 nh_off;
>   	u32 ipproto;
>   
> +	if (bpf_xdp_load_bytes(ctx, 0, pkt, sizeof(pkt)))
> +		return rc;
> +
>   	nh_off = sizeof(*eth);
>   	if (data + nh_off > data_end)
>   		return rc;
> diff --git a/samples/bpf/xdp2_kern.c b/samples/bpf/xdp2_kern.c
> index d8a64ab077b0..3332ba6bb95f 100644
> --- a/samples/bpf/xdp2_kern.c
> +++ b/samples/bpf/xdp2_kern.c
> @@ -55,11 +55,13 @@ static int parse_ipv6(void *data, u64 nh_off, void *data_end)
>   	return ip6h->nexthdr;
>   }
>   
> -SEC("xdp1")
> +#define XDPBUFSIZE	64
> +SEC("xdp.frags")
>   int xdp_prog1(struct xdp_md *ctx)
>   {
> -	void *data_end = (void *)(long)ctx->data_end;
> -	void *data = (void *)(long)ctx->data;
> +	__u8 pkt[XDPBUFSIZE] = {};
> +	void *data_end = &pkt[XDPBUFSIZE-1];
> +	void *data = pkt;
>   	struct ethhdr *eth = data;
>   	int rc = XDP_DROP;
>   	long *value;
> @@ -67,6 +69,9 @@ int xdp_prog1(struct xdp_md *ctx)
>   	u64 nh_off;
>   	u32 ipproto;
>   
> +	if (bpf_xdp_load_bytes(ctx, 0, pkt, sizeof(pkt)))
> +		return rc;
> +
>   	nh_off = sizeof(*eth);
>   	if (data + nh_off > data_end)
>   		return rc;
> diff --git a/samples/bpf/xdp_tx_iptunnel_kern.c b/samples/bpf/xdp_tx_iptunnel_kern.c
> index 575d57e4b8d6..0e2bca3a3fff 100644
> --- a/samples/bpf/xdp_tx_iptunnel_kern.c
> +++ b/samples/bpf/xdp_tx_iptunnel_kern.c
> @@ -212,7 +212,7 @@ static __always_inline int handle_ipv6(struct xdp_md *xdp)
>   	return XDP_TX;
>   }
>   
> -SEC("xdp_tx_iptunnel")
> +SEC("xdp.frags")
>   int _xdp_tx_iptunnel(struct xdp_md *xdp)
>   {
>   	void *data_end = (void *)(long)xdp->data_end;
Toke Høiland-Jørgensen Jan. 3, 2023, 3:19 p.m. UTC | #3
Tariq Toukan <ttoukan.linux@gmail.com> writes:

> On 21/06/2022 20:54, Andy Gospodarek wrote:
>> This changes the section name for the bpf program embedded in these
>> files to "xdp.frags" to allow the programs to be loaded on drivers that
>> are using an MTU greater than PAGE_SIZE.  Rather than directly accessing
>> the buffers, the packet data is now accessed via xdp helper functions to
>> provide an example for those who may need to write more complex
>> programs.
>> 
>> v2: remove new unnecessary variable
>> 
>
> Hi,
>
> I'm trying to understand if there are any assumptions/requirements on 
> the length of the xdp_buf linear part when passed to XDP multi-buf programs?
> Can the linear part be empty, with all data residing in the fragments? 
> Is it valid?
>
> Per the proposed pattern below (calling bpf_xdp_load_bytes() to memcpy 
> packet data into a local buffer), no such assumption is required, and an 
> xdp_buf created by the driver with an empty linear part is valid.
>
> However, in the _xdp_tx_iptunnel example program, it fails (returns 
> XDP_DROP) in case the headers are not in the linear part.

Hmm, good question! I don't think we've ever explicitly documented any
assumptions one way or the other. My own mental model has certainly
always assumed the first frag would continue to be the same size as in
non-multi-buf packets.

I do seem to recall there was some discussion around this when we were
discussing whether or not we needed programs to explicitly opt-in to
multi-buf support (what ended up being the "xdp.frags" section). The
reason we said it might *not* be necessary to do that was that most
programs would just continue working, and it would only be those that
either tried to access the end of the packet, or to compute the packet
length as data_end-data that would need any changes before enabling
the frags flag. Which also kinda implies that headers etc would continue
to be in the linear part.
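
(E.g., roughly: a program computing the packet length as

	int len = ctx->data_end - ctx->data;	/* linear part only */

only sees the first fragment of a multi-buf packet, and would need to
switch to something like

	int len = bpf_xdp_get_buff_len(ctx);	/* full length incl. frags */

instead. Just a sketch of the kind of change I mean.)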

This is all from memory, though, so maybe others have different
recollections. In any case this is probably something we should document
somewhere :)

-Toke
Jakub Kicinski Jan. 4, 2023, 1:21 a.m. UTC | #4
On Tue, 03 Jan 2023 16:19:49 +0100 Toke Høiland-Jørgensen wrote:
> Hmm, good question! I don't think we've ever explicitly documented any
> assumptions one way or the other. My own mental model has certainly
> always assumed the first frag would continue to be the same size as in
> non-multi-buf packets.

Interesting! :) My mental model was closer to GRO by frags 
so the linear part would have no data, just headers.

A random datapoint is that bpf_xdp_adjust_head() seems 
to enforce that there is at least ETH_HLEN.
Lorenzo Bianconi Jan. 4, 2023, 8:44 a.m. UTC | #5
> On Tue, 03 Jan 2023 16:19:49 +0100 Toke Høiland-Jørgensen wrote:
> > Hmm, good question! I don't think we've ever explicitly documented any
> > assumptions one way or the other. My own mental model has certainly
> > always assumed the first frag would continue to be the same size as in
> > non-multi-buf packets.
> 
> Interesting! :) My mental model was closer to GRO by frags 
> so the linear part would have no data, just headers.

That is my assumption as well.

Regards,
Lorenzo

> 
> A random datapoint is that bpf_xdp_adjust_head() seems 
> to enforce that there is at least ETH_HLEN.
Toke Høiland-Jørgensen Jan. 4, 2023, 12:28 p.m. UTC | #6
Lorenzo Bianconi <lorenzo@kernel.org> writes:

>> On Tue, 03 Jan 2023 16:19:49 +0100 Toke Høiland-Jørgensen wrote:
>> > Hmm, good question! I don't think we've ever explicitly documented any
>> > assumptions one way or the other. My own mental model has certainly
>> > always assumed the first frag would continue to be the same size as in
>> > non-multi-buf packets.
>> 
>> Interesting! :) My mental model was closer to GRO by frags 
>> so the linear part would have no data, just headers.
>
> That is my assumption as well.

Right, okay, so how many headers? Only Ethernet, or all the way up to
L4 (TCP/UDP)?

I do seem to recall a discussion around the header/data split for TCP
specifically, but I think I mentally put that down as "something people
may want to do at some point in the future", which is why it hasn't made
it into my own mental model (yet?) :)

-Toke
Jakub Kicinski Jan. 5, 2023, 1:17 a.m. UTC | #7
On Wed, 04 Jan 2023 13:28:57 +0100 Toke Høiland-Jørgensen wrote:
> >> Interesting! :) My mental model was closer to GRO by frags 
> >> so the linear part would have no data, just headers.  
> >
> > That is assumption as well.  
> 
> Right, okay, so how many headers? Only Ethernet, or all the way up to
> L4 (TCP/UDP)?

If we're speaking about guarantees or hard requirements - I think that
we can only require / guarantee the Ethernet header. Requiring more
will be defeated by tunnels (i.e. adjust_head() + redirect to a veth).

> I do seem to recall a discussion around the header/data split for TCP
> specifically, but I think I mentally put that down as "something people
> may want to do at some point in the future", which is why it hasn't made
> it into my own mental model (yet?) :)
Tariq Toukan Jan. 5, 2023, 7:20 a.m. UTC | #8
On 04/01/2023 14:28, Toke Høiland-Jørgensen wrote:
> Lorenzo Bianconi <lorenzo@kernel.org> writes:
> 
>>> On Tue, 03 Jan 2023 16:19:49 +0100 Toke Høiland-Jørgensen wrote:
>>>> Hmm, good question! I don't think we've ever explicitly documented any
>>>> assumptions one way or the other. My own mental model has certainly
>>>> always assumed the first frag would continue to be the same size as in
>>>> non-multi-buf packets.
>>>
>>> Interesting! :) My mental model was closer to GRO by frags
>>> so the linear part would have no data, just headers.
>>
>> That is my assumption as well.
> 
> Right, okay, so how many headers? Only Ethernet, or all the way up to
> L4 (TCP/UDP)?
> 
> I do seem to recall a discussion around the header/data split for TCP
> specifically, but I think I mentally put that down as "something people
> may want to do at some point in the future", which is why it hasn't made
> it into my own mental model (yet?) :)
> 
> -Toke
> 

I don't think that all the different GRO layers assume having their 
headers/data in the linear part. IMO they will just perform better if 
these parts are already there. Otherwise, the GRO flow manages, and 
pulls the needed amount into the linear part.
As examples, see calls to gro_pull_from_frag0 in net/core/gro.c, and the 
call to pskb_may_pull() from skb_gro_header_slow().

This resembles the bpf_xdp_load_bytes() API used here in the xdp prog.

The context of my questions is that I'm looking for the right memory 
scheme for adding xdp-mb support to mlx5e striding RQ.
In striding RQ, the RX buffer consists of "strides" of a fixed size set 
by the driver. An incoming packet is written to the buffer starting from 
the beginning of the next available stride, consuming as much strides as 
needed.

Due to the need for headroom and tailroom, there's no easy way of 
building the xdp_buf in place (around the packet), so it should go to a 
side buffer.

By using 0-length linear part in a side buffer, I can address two 
challenging issues: (1) save the in-driver headers memcpy (copy might 
still exist in the xdp program though), and (2) conform to the 
"fragments of the same size" requirement/assumption in xdp-mb. 
Otherwise, if we pull from frag[0] into the linear part, frag[0] becomes 
smaller than the next fragments.

Regards,
Tariq
Toke Høiland-Jørgensen Jan. 5, 2023, 3:43 p.m. UTC | #9
Tariq Toukan <ttoukan.linux@gmail.com> writes:

> On 04/01/2023 14:28, Toke Høiland-Jørgensen wrote:
>> Lorenzo Bianconi <lorenzo@kernel.org> writes:
>> 
>>>> On Tue, 03 Jan 2023 16:19:49 +0100 Toke Høiland-Jørgensen wrote:
>>>>> Hmm, good question! I don't think we've ever explicitly documented any
>>>>> assumptions one way or the other. My own mental model has certainly
>>>>> always assumed the first frag would continue to be the same size as in
>>>>> non-multi-buf packets.
>>>>
>>>> Interesting! :) My mental model was closer to GRO by frags
>>>> so the linear part would have no data, just headers.
>>>
>>> That is my assumption as well.
>> 
>> Right, okay, so how many headers? Only Ethernet, or all the way up to
>> L4 (TCP/UDP)?
>> 
>> I do seem to recall a discussion around the header/data split for TCP
>> specifically, but I think I mentally put that down as "something people
>> may want to do at some point in the future", which is why it hasn't made
>> it into my own mental model (yet?) :)
>> 
>> -Toke
>> 
>
> I don't think that all the different GRO layers assume having their 
> headers/data in the linear part. IMO they will just perform better if 
> these parts are already there. Otherwise, the GRO flow manages, and 
> pulls the needed amount into the linear part.
> As examples, see calls to gro_pull_from_frag0 in net/core/gro.c, and the 
> call to pskb_may_pull() from skb_gro_header_slow().
>
> This resembles the bpf_xdp_load_bytes() API used here in the xdp prog.

Right, but that is kernel code; what we end up doing with the API here
affects how many programs need to make significant changes to work with
multibuf, and how many can just set the frags flag and continue working.
Which also has a performance impact, see below.

> The context of my questions is that I'm looking for the right memory 
> scheme for adding xdp-mb support to mlx5e striding RQ.
> In striding RQ, the RX buffer consists of "strides" of a fixed size set 
> by the driver. An incoming packet is written to the buffer starting from 
> the beginning of the next available stride, consuming as much strides as 
> needed.
>
> Due to the need for headroom and tailroom, there's no easy way of 
> building the xdp_buf in place (around the packet), so it should go to a 
> side buffer.
>
> By using 0-length linear part in a side buffer, I can address two 
> challenging issues: (1) save the in-driver headers memcpy (copy might 
> still exist in the xdp program though), and (2) conform to the 
> "fragments of the same size" requirement/assumption in xdp-mb. 
> Otherwise, if we pull from frag[0] into the linear part, frag[0] becomes 
> smaller than the next fragments.

Right, I see.

So my main concern would be that if we "allow" this, the only way to
write an interoperable XDP program will be to use bpf_xdp_load_bytes()
for every packet access. Which will be slower than DPA, so we may end up
inadvertently slowing down all of the XDP ecosystem, because no one is
going to bother with writing two versions of their programs. Whereas if
you can rely on packet headers always being in the linear part, you can
write a lot of the "look at headers and make a decision" type programs
using just DPA, and they'll work for multibuf as well.
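
(Concretely, the difference is between direct packet access along the
lines of

	struct ethhdr *eth = (void *)(long)ctx->data;

	if ((void *)(eth + 1) > (void *)(long)ctx->data_end)
		return XDP_PASS;
	/* read eth->h_proto etc. in place */

and copying onto the stack first:

	struct ethhdr eth;

	if (bpf_xdp_load_bytes(ctx, 0, &eth, sizeof(eth)))
		return XDP_PASS;
	/* work on the local copy */

Sketches only, but that's the shape of the two program styles I mean.)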

But maybe I'm mistaken and people are just going to use the load_bytes
helper anyway because they want to go deeper than whatever "headers" bit
we'll end up guaranteeing is in the linear part?

-Toke
Andy Gospodarek Jan. 5, 2023, 4:18 p.m. UTC | #10
On Tue, Jan 03, 2023 at 02:55:22PM +0200, Tariq Toukan wrote:
> 
> 
> On 21/06/2022 20:54, Andy Gospodarek wrote:
> > This changes the section name for the bpf program embedded in these
> > files to "xdp.frags" to allow the programs to be loaded on drivers that
> > are using an MTU greater than PAGE_SIZE.  Rather than directly accessing
> > the buffers, the packet data is now accessed via xdp helper functions to
> > provide an example for those who may need to write more complex
> > programs.
> > 
> > v2: remove new unnecessary variable
> > 
> 
> Hi,
> 
> I'm trying to understand if there are any assumptions/requirements on the
> length of the xdp_buf linear part when passed to XDP multi-buf programs?
> Can the linear part be empty, with all data residing in the fragments? Is it
> valid?

That's a great question.  The implementation in bnxt_en was based on the
implementation as I understood it in mvneta where the linear area
contained approx the first 4k of data - xdp headroom - dma_offset.  This
means that you have something that looks like this with a 9k MTU:

skb->data	[~3.6k of packet data]
skb->frag[0]	[4k of packet data]
     frag[1]	[remainder of packet data]

At some point, I'd like to take the opportunity to test something like
this:

skb->data	[header only + space for header expansion]
skb->frag[0]	[first 4k of data]
     frag[1]	[second 4k of data]
     frag[2]	[remainder of packet data]

Though this will use a bit more memory, I think it will be much more
performant for data that is ultimately consumed rather than forwarded
by the host as the actual packet data will be aligned on page boundaries.

With the ability to have packets that are handled by an XDP program
span buffers, I would also like to test out whether or not it would be
worthwhile to have standard MTU packets also look like this:

skb->data	[header only + space for header expansion]
skb->frag[0]	[packet data]

I think the overall system performance would be better in the XDP_PASS
case, but until there is data to back this up, that's just speculation. 

> Per the proposed pattern below (calling bpf_xdp_load_bytes() to memcpy
> packet data into a local buffer), no such assumption is required, and an
> xdp_buf created by the driver with an empty linear part is valid.
> 
> However, in the _xdp_tx_iptunnel example program, it fails (returns
> XDP_DROP) in case the headers are not in the linear part.
> 
> Regards,
> Tariq
> 
> > Signed-off-by: Andy Gospodarek <gospo@broadcom.com>
> > Acked-by: John Fastabend <john.fastabend@gmail.com>
> > Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > ---
> >   samples/bpf/xdp1_kern.c            | 11 ++++++++---
> >   samples/bpf/xdp2_kern.c            | 11 ++++++++---
> >   samples/bpf/xdp_tx_iptunnel_kern.c |  2 +-
> >   3 files changed, 17 insertions(+), 7 deletions(-)
> > 
> > diff --git a/samples/bpf/xdp1_kern.c b/samples/bpf/xdp1_kern.c
> > index f0c5d95084de..0a5c704badd0 100644
> > --- a/samples/bpf/xdp1_kern.c
> > +++ b/samples/bpf/xdp1_kern.c
> > @@ -39,11 +39,13 @@ static int parse_ipv6(void *data, u64 nh_off, void *data_end)
> >   	return ip6h->nexthdr;
> >   }
> > -SEC("xdp1")
> > +#define XDPBUFSIZE	64
> > +SEC("xdp.frags")
> >   int xdp_prog1(struct xdp_md *ctx)
> >   {
> > -	void *data_end = (void *)(long)ctx->data_end;
> > -	void *data = (void *)(long)ctx->data;
> > +	__u8 pkt[XDPBUFSIZE] = {};
> > +	void *data_end = &pkt[XDPBUFSIZE-1];
> > +	void *data = pkt;
> >   	struct ethhdr *eth = data;
> >   	int rc = XDP_DROP;
> >   	long *value;
> > @@ -51,6 +53,9 @@ int xdp_prog1(struct xdp_md *ctx)
> >   	u64 nh_off;
> >   	u32 ipproto;
> > +	if (bpf_xdp_load_bytes(ctx, 0, pkt, sizeof(pkt)))
> > +		return rc;
> > +
> >   	nh_off = sizeof(*eth);
> >   	if (data + nh_off > data_end)
> >   		return rc;
> > diff --git a/samples/bpf/xdp2_kern.c b/samples/bpf/xdp2_kern.c
> > index d8a64ab077b0..3332ba6bb95f 100644
> > --- a/samples/bpf/xdp2_kern.c
> > +++ b/samples/bpf/xdp2_kern.c
> > @@ -55,11 +55,13 @@ static int parse_ipv6(void *data, u64 nh_off, void *data_end)
> >   	return ip6h->nexthdr;
> >   }
> > -SEC("xdp1")
> > +#define XDPBUFSIZE	64
> > +SEC("xdp.frags")
> >   int xdp_prog1(struct xdp_md *ctx)
> >   {
> > -	void *data_end = (void *)(long)ctx->data_end;
> > -	void *data = (void *)(long)ctx->data;
> > +	__u8 pkt[XDPBUFSIZE] = {};
> > +	void *data_end = &pkt[XDPBUFSIZE-1];
> > +	void *data = pkt;
> >   	struct ethhdr *eth = data;
> >   	int rc = XDP_DROP;
> >   	long *value;
> > @@ -67,6 +69,9 @@ int xdp_prog1(struct xdp_md *ctx)
> >   	u64 nh_off;
> >   	u32 ipproto;
> > +	if (bpf_xdp_load_bytes(ctx, 0, pkt, sizeof(pkt)))
> > +		return rc;
> > +
> >   	nh_off = sizeof(*eth);
> >   	if (data + nh_off > data_end)
> >   		return rc;
> > diff --git a/samples/bpf/xdp_tx_iptunnel_kern.c b/samples/bpf/xdp_tx_iptunnel_kern.c
> > index 575d57e4b8d6..0e2bca3a3fff 100644
> > --- a/samples/bpf/xdp_tx_iptunnel_kern.c
> > +++ b/samples/bpf/xdp_tx_iptunnel_kern.c
> > @@ -212,7 +212,7 @@ static __always_inline int handle_ipv6(struct xdp_md *xdp)
> >   	return XDP_TX;
> >   }
> > -SEC("xdp_tx_iptunnel")
> > +SEC("xdp.frags")
> >   int _xdp_tx_iptunnel(struct xdp_md *xdp)
> >   {
> >   	void *data_end = (void *)(long)xdp->data_end;
Andy Gospodarek Jan. 5, 2023, 4:22 p.m. UTC | #11
On Tue, Jan 03, 2023 at 05:21:53PM -0800, Jakub Kicinski wrote:
> On Tue, 03 Jan 2023 16:19:49 +0100 Toke Høiland-Jørgensen wrote:
> > Hmm, good question! I don't think we've ever explicitly documented any
> > assumptions one way or the other. My own mental model has certainly
> > always assumed the first frag would continue to be the same size as in
> > non-multi-buf packets.
> 
> Interesting! :) My mental model was closer to GRO by frags 
> so the linear part would have no data, just headers.

As I mentioned in my mail just a few mins ago, I think this would be a good
model to consider.  All headers (including potentially tunnel headers) could be
in the linear area with the actual packet data in frags.

> A random datapoint is that bpf_xdp_adjust_head() seems 
> to enforce that there is at least ETH_HLEN.
Andy Gospodarek Jan. 5, 2023, 4:57 p.m. UTC | #12
On Thu, Jan 05, 2023 at 04:43:28PM +0100, Toke Høiland-Jørgensen wrote:
> Tariq Toukan <ttoukan.linux@gmail.com> writes:
> 
> > On 04/01/2023 14:28, Toke Høiland-Jørgensen wrote:
> >> Lorenzo Bianconi <lorenzo@kernel.org> writes:
> >> 
> >>>> On Tue, 03 Jan 2023 16:19:49 +0100 Toke Høiland-Jørgensen wrote:
> >>>>> Hmm, good question! I don't think we've ever explicitly documented any
> >>>>> assumptions one way or the other. My own mental model has certainly
> >>>>> always assumed the first frag would continue to be the same size as in
> >>>>> non-multi-buf packets.
> >>>>
> >>>> Interesting! :) My mental model was closer to GRO by frags
> >>>> so the linear part would have no data, just headers.
> >>>
> >>> That is my assumption as well.
> >> 
> >> Right, okay, so how many headers? Only Ethernet, or all the way up to
> >> L4 (TCP/UDP)?
> >> 
> >> I do seem to recall a discussion around the header/data split for TCP
> >> specifically, but I think I mentally put that down as "something people
> >> may want to do at some point in the future", which is why it hasn't made
> >> it into my own mental model (yet?) :)
> >> 
> >> -Toke
> >> 
> >
> > I don't think that all the different GRO layers assume having their 
> > headers/data in the linear part. IMO they will just perform better if 
> > these parts are already there. Otherwise, the GRO flow manages, and 
> > pulls the needed amount into the linear part.
> > As examples, see calls to gro_pull_from_frag0 in net/core/gro.c, and the 
> > call to pskb_may_pull() from skb_gro_header_slow().
> >
> > This resembles the bpf_xdp_load_bytes() API used here in the xdp prog.
> 
> Right, but that is kernel code; what we end up doing with the API here
> affects how many programs need to make significant changes to work with
> multibuf, and how many can just set the frags flag and continue working.
> Which also has a performance impact, see below.
> 
> > The context of my questions is that I'm looking for the right memory 
> > scheme for adding xdp-mb support to mlx5e striding RQ.
> > In striding RQ, the RX buffer consists of "strides" of a fixed size set 
> > by the driver. An incoming packet is written to the buffer starting from 
> > the beginning of the next available stride, consuming as much strides as 
> > needed.
> >
> > Due to the need for headroom and tailroom, there's no easy way of 
> > building the xdp_buf in place (around the packet), so it should go to a 
> > side buffer.
> >
> > By using 0-length linear part in a side buffer, I can address two 
> > challenging issues: (1) save the in-driver headers memcpy (copy might 
> > still exist in the xdp program though), and (2) conform to the 
> > "fragments of the same size" requirement/assumption in xdp-mb. 
> > Otherwise, if we pull from frag[0] into the linear part, frag[0] becomes 
> > smaller than the next fragments.
> 
> Right, I see.
> 
> So my main concern would be that if we "allow" this, the only way to
> write an interoperable XDP program will be to use bpf_xdp_load_bytes()
> for every packet access. Which will be slower than DPA, so we may end up
> inadvertently slowing down all of the XDP ecosystem, because no one is
> going to bother with writing two versions of their programs. Whereas if
> you can rely on packet headers always being in the linear part, you can
> write a lot of the "look at headers and make a decision" type programs
> using just DPA, and they'll work for multibuf as well.

The question I would have is what is really the 'slow down' for
bpf_xdp_load_bytes() vs DPA?  I know you and Jesper can tell me how many
instructions each use. :)

Taking a step back...years ago Dave mentioned wanting to make XDP
programs easy to write and it feels like using these accessor APIs would
help accomplish that.  If the kernel examples use bpf_xdp_load_bytes()
accessors everywhere then that would accomplish that.

> But maybe I'm mistaken and people are just going to use the load_bytes
> helper anyway because they want to go deeper than whatever "headers" bit
> we'll end up guaranteeing is in the linear part?
Jakub Kicinski Jan. 5, 2023, 6:16 p.m. UTC | #13
On Thu, 5 Jan 2023 11:57:32 -0500 Andy Gospodarek wrote:
> > So my main concern would be that if we "allow" this, the only way to
> > write an interoperable XDP program will be to use bpf_xdp_load_bytes()
> > for every packet access. Which will be slower than DPA, so we may end up
> > inadvertently slowing down all of the XDP ecosystem, because no one is
> > going to bother with writing two versions of their programs. Whereas if
> > you can rely on packet headers always being in the linear part, you can
> > write a lot of the "look at headers and make a decision" type programs
> > using just DPA, and they'll work for multibuf as well.  
> 
> The question I would have is what is really the 'slow down' for
> bpf_xdp_load_bytes() vs DPA?  I know you and Jesper can tell me how many
> instructions each use. :)

Until we have an efficient and inlined DPA access to frags an
unconditional memcpy() of the first 2 cachelines-worth of headers
in the driver must be faster than a piece-by-piece bpf_xdp_load_bytes()
onto the stack, right?

> Taking a step back...years ago Dave mentioned wanting to make XDP
> programs easy to write and it feels like using these accessor APIs would
> help accomplish that.  If the kernel examples use bpf_xdp_load_bytes()
> accessors everywhere then that would accomplish that.

I've been pushing for an skb_header_pointer()-like helper but 
the semantics were not universally loved :)
Toke Høiland-Jørgensen Jan. 5, 2023, 10:07 p.m. UTC | #14
Andy Gospodarek <andrew.gospodarek@broadcom.com> writes:

> On Thu, Jan 05, 2023 at 04:43:28PM +0100, Toke Høiland-Jørgensen wrote:
>> Tariq Toukan <ttoukan.linux@gmail.com> writes:
>> 
>> > On 04/01/2023 14:28, Toke Høiland-Jørgensen wrote:
>> >> Lorenzo Bianconi <lorenzo@kernel.org> writes:
>> >> 
>> >>>> On Tue, 03 Jan 2023 16:19:49 +0100 Toke Høiland-Jørgensen wrote:
>> >>>>> Hmm, good question! I don't think we've ever explicitly documented any
>> >>>>> assumptions one way or the other. My own mental model has certainly
>> >>>>> always assumed the first frag would continue to be the same size as in
>> >>>>> non-multi-buf packets.
>> >>>>
>> >>>> Interesting! :) My mental model was closer to GRO by frags
>> >>>> so the linear part would have no data, just headers.
>> >>>
>> >>> That is my assumption as well.
>> >> 
>> >> Right, okay, so how many headers? Only Ethernet, or all the way up to
>> >> L4 (TCP/UDP)?
>> >> 
>> >> I do seem to recall a discussion around the header/data split for TCP
>> >> specifically, but I think I mentally put that down as "something people
>> >> may want to do at some point in the future", which is why it hasn't made
>> >> it into my own mental model (yet?) :)
>> >> 
>> >> -Toke
>> >> 
>> >
>> > I don't think that all the different GRO layers assume having their 
>> > headers/data in the linear part. IMO they will just perform better if 
>> > these parts are already there. Otherwise, the GRO flow manages, and 
>> > pulls the needed amount into the linear part.
>> > As examples, see calls to gro_pull_from_frag0 in net/core/gro.c, and the 
>> > call to pskb_may_pull() from skb_gro_header_slow().
>> >
>> > This resembles the bpf_xdp_load_bytes() API used here in the xdp prog.
>> 
>> Right, but that is kernel code; what we end up doing with the API here
>> affects how many programs need to make significant changes to work with
>> multibuf, and how many can just set the frags flag and continue working.
>> Which also has a performance impact, see below.
>> 
>> > The context of my questions is that I'm looking for the right memory 
>> > scheme for adding xdp-mb support to mlx5e striding RQ.
>> > In striding RQ, the RX buffer consists of "strides" of a fixed size set 
>> > by the driver. An incoming packet is written to the buffer starting from 
>> > the beginning of the next available stride, consuming as much strides as 
>> > needed.
>> >
>> > Due to the need for headroom and tailroom, there's no easy way of 
>> > building the xdp_buf in place (around the packet), so it should go to a 
>> > side buffer.
>> >
>> > By using 0-length linear part in a side buffer, I can address two 
>> > challenging issues: (1) save the in-driver headers memcpy (copy might 
>> > still exist in the xdp program though), and (2) conform to the 
>> > "fragments of the same size" requirement/assumption in xdp-mb. 
>> > Otherwise, if we pull from frag[0] into the linear part, frag[0] becomes 
>> > smaller than the next fragments.
>> 
>> Right, I see.
>> 
>> So my main concern would be that if we "allow" this, the only way to
>> write an interoperable XDP program will be to use bpf_xdp_load_bytes()
>> for every packet access. Which will be slower than DPA, so we may end up
>> inadvertently slowing down all of the XDP ecosystem, because no one is
>> going to bother with writing two versions of their programs. Whereas if
>> you can rely on packet headers always being in the linear part, you can
>> write a lot of the "look at headers and make a decision" type programs
>> using just DPA, and they'll work for multibuf as well.
>
> The question I would have is what is really the 'slow down' for
> bpf_xdp_load_bytes() vs DPA?  I know you and Jesper can tell me how many
> instructions each use. :)

I can try running some benchmarks to compare the two, sure!

-Toke
Andy Gospodarek Jan. 6, 2023, 1:56 p.m. UTC | #15
On Thu, Jan 05, 2023 at 10:16:42AM -0800, Jakub Kicinski wrote:
> On Thu, 5 Jan 2023 11:57:32 -0500 Andy Gospodarek wrote:
> > > So my main concern would be that if we "allow" this, the only way to
> > > write an interoperable XDP program will be to use bpf_xdp_load_bytes()
> > > for every packet access. Which will be slower than DPA, so we may end up
> > > inadvertently slowing down all of the XDP ecosystem, because no one is
> > > going to bother with writing two versions of their programs. Whereas if
> > > you can rely on packet headers always being in the linear part, you can
> > > write a lot of the "look at headers and make a decision" type programs
> > > using just DPA, and they'll work for multibuf as well.  
> > 
> > The question I would have is what is really the 'slow down' for
> > bpf_xdp_load_bytes() vs DPA?  I know you and Jesper can tell me how many
> > instructions each use. :)
> 
> Until we have an efficient and inlined DPA access to frags an
> unconditional memcpy() of the first 2 cachelines-worth of headers
> in the driver must be faster than a piece-by-piece bpf_xdp_load_bytes()
> onto the stack, right?

100% 

Seems like we are back to speed vs ease of use, then?

> > Taking a step back...years ago Dave mentioned wanting to make XDP
> > programs easy to write and it feels like using these accessor APIs would
> > help accomplish that.  If the kernel examples use bpf_xdp_load_bytes()
> > accessors everywhere then that would accomplish that.
> 
> I've been pushing for an skb_header_pointer()-like helper but 
> the semantics were not universally loved :)

I didn't recall that -- maybe I'll check the archives and see what I can
find.
Toke Høiland-Jørgensen Jan. 6, 2023, 5:54 p.m. UTC | #16
>>> So my main concern would be that if we "allow" this, the only way to
>>> write an interoperable XDP program will be to use bpf_xdp_load_bytes()
>>> for every packet access. Which will be slower than DPA, so we may end up
>>> inadvertently slowing down all of the XDP ecosystem, because no one is
>>> going to bother with writing two versions of their programs. Whereas if
>>> you can rely on packet headers always being in the linear part, you can
>>> write a lot of the "look at headers and make a decision" type programs
>>> using just DPA, and they'll work for multibuf as well.
>>
>> The question I would have is what is really the 'slow down' for
>> bpf_xdp_load_bytes() vs DPA?  I know you and Jesper can tell me how many
>> instructions each use. :)
>
> I can try running some benchmarks to compare the two, sure!

Okay, ran a simple test: a program that just parses the IP header, then
drops the packet. Results as follows:

Baseline (don't touch data):    26.5 Mpps / 37.8 ns/pkt
Touch data (ethernet hdr):      25.0 Mpps / 40.0 ns/pkt
Parse IP (DPA):                 24.1 Mpps / 41.5 ns/pkt
Parse IP (bpf_xdp_load_bytes):  15.3 Mpps / 65.3 ns/pkt

So 2.2 ns of overhead from reading the packet data, another 1.5 ns from
the parsing logic, and a whopping 23.8 ns extra from switching to
bpf_xdp_load_bytes(). This is with two calls to bpf_xdp_load_bytes(),
one to get the Ethernet header, and another to get the IP header.
Dropping one of them also drops the overhead in half, so it seems to fit
with ~12 ns of overhead from a single call to bpf_xdp_load_bytes().
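
(The load_bytes variant does roughly this; sketch only, the exact code is
in the branch linked below:

	struct ethhdr eth;
	struct iphdr iph;

	if (bpf_xdp_load_bytes(ctx, 0, &eth, sizeof(eth)))
		return XDP_DROP;
	if (eth.h_proto != bpf_htons(ETH_P_IP))
		return XDP_DROP;
	if (bpf_xdp_load_bytes(ctx, sizeof(eth), &iph, sizeof(iph)))
		return XDP_DROP;

i.e. two helper calls per packet in the numbers above.)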

I pushed the code I used for testing here, in case someone else wants to
play around with it:

https://github.com/xdp-project/xdp-tools/tree/xdp-load-bytes

It's part of the 'xdp-bench' utility. Run it as:

./xdp-bench drop <iface> -p parse-ip

for DPA parsing and

./xdp-bench drop <iface> -p parse-ip -l

to use bpf_xdp_load_bytes().

-Toke
Tariq Toukan Jan. 8, 2023, 12:33 p.m. UTC | #17
On 05/01/2023 20:16, Jakub Kicinski wrote:
> On Thu, 5 Jan 2023 11:57:32 -0500 Andy Gospodarek wrote:
>>> So my main concern would be that if we "allow" this, the only way to
>>> write an interoperable XDP program will be to use bpf_xdp_load_bytes()
>>> for every packet access. Which will be slower than DPA, so we may end up
>>> inadvertently slowing down all of the XDP ecosystem, because no one is
>>> going to bother with writing two versions of their programs. Whereas if
>>> you can rely on packet headers always being in the linear part, you can
>>> write a lot of the "look at headers and make a decision" type programs
>>> using just DPA, and they'll work for multibuf as well.
>>
>> The question I would have is what is really the 'slow down' for
>> bpf_xdp_load_bytes() vs DPA?  I know you and Jesper can tell me how many
>> instructions each use. :)
> 
> Until we have an efficient and inlined DPA access to frags an
> unconditional memcpy() of the first 2 cachelines-worth of headers
> in the driver must be faster than a piece-by-piece bpf_xdp_load_bytes()
> onto the stack, right?
> 
>> Taking a step back...years ago Dave mentioned wanting to make XDP
>> programs easy to write and it feels like using these accessor APIs would
>> help accomplish that.  If the kernel examples use bpf_xdp_load_bytes()
>> accessors everywhere then that would accomplish that.
> 
> I've been pushing for an skb_header_pointer()-like helper but
> the semantics were not universally loved :)

Maybe it's time to re-consider.

Is it something like an API that given an offset returns a pointer + 
allowed length to be accessed?

This sounds like a good direction to me, that avoids having any 
linear-part-length assumptions, while preserving good performance.

Maybe we can still require/guarantee that each single header (eth, ip, 
tcp, ...) does not cross a frag/page boundary. For otherwise, a prog 
needs to handle cases where headers span several fragments, so it has to 
reconstruct the header by copying the different parts into some local 
buffer.

This can be achieved by having another assumption that AFAIK already 
holds today: all fragments are of size PAGE_SIZE.

Regards,
Tariq
Tariq Toukan Jan. 8, 2023, 12:42 p.m. UTC | #18
On 08/01/2023 14:33, Tariq Toukan wrote:
> 
> 
> On 05/01/2023 20:16, Jakub Kicinski wrote:
>> On Thu, 5 Jan 2023 11:57:32 -0500 Andy Gospodarek wrote:
>>>> So my main concern would be that if we "allow" this, the only way to
>>>> write an interoperable XDP program will be to use bpf_xdp_load_bytes()
>>>> for every packet access. Which will be slower than DPA, so we may 
>>>> end up
>>>> inadvertently slowing down all of the XDP ecosystem, because no one is
>>>> going to bother with writing two versions of their programs. Whereas if
>>>> you can rely on packet headers always being in the linear part, you can
>>>> write a lot of the "look at headers and make a decision" type programs
>>>> using just DPA, and they'll work for multibuf as well.
>>>
>>> The question I would have is what is really the 'slow down' for
>>> bpf_xdp_load_bytes() vs DPA?  I know you and Jesper can tell me how many
>>> instructions each use. :)
>>
>> Until we have an efficient and inlined DPA access to frags an
>> unconditional memcpy() of the first 2 cachelines-worth of headers
>> in the driver must be faster than a piece-by-piece bpf_xdp_load_bytes()
>> onto the stack, right?
>>
>>> Taking a step back...years ago Dave mentioned wanting to make XDP
>>> programs easy to write and it feels like using these accessor APIs would
>>> help accomplish that.  If the kernel examples use bpf_xdp_load_bytes()
>>> accessors everywhere then that would accomplish that.
>>
>> I've been pushing for an skb_header_pointer()-like helper but
>> the semantics were not universally loved :)
> 
> Maybe it's time to re-consider.
> 
> Is it something like an API that given an offset returns a pointer + 
> allowed length to be accessed?
> 
> This sounds like a good direction to me, that avoids having any 
> linear-part-length assumptions, while preserving good performance.
> 
> Maybe we can still require/guarantee that each single header (eth, ip, 
> tcp, ...) does not cross a frag/page boundary. For otherwise, a prog 
> needs to handle cases where headers span several fragments, so it has to 
> reconstruct the header by copying the different parts into some local 
> buffer.
> 
> This can be achieved by having another assumption that AFAIK already 
> holds today: all fragments are of size PAGE_SIZE.
> 
> Regards,
> Tariq

This can be a good starting point:
static void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len)

It's currently not exposed as a bpf-helper, and it works a bit 
differently to what I mentioned earlier: It gets the desired length, and 
fails in case it's not contiguously accessible (i.e. this piece of data 
spans multiple frags).
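
(If it were exposed, usage could look something like this, purely
hypothetical since no such BPF helper exists today:

	struct iphdr *iph = bpf_xdp_pointer(ctx, sizeof(struct ethhdr), sizeof(*iph));

	if (!iph)
		return XDP_PASS;	/* range spans a frag boundary */

i.e. you either get a direct pointer or a NULL telling you to fall back
to a copy.)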
Toke Høiland-Jørgensen Jan. 9, 2023, 1:50 p.m. UTC | #19
Tariq Toukan <ttoukan.linux@gmail.com> writes:

> On 08/01/2023 14:33, Tariq Toukan wrote:
>> 
>> 
>> On 05/01/2023 20:16, Jakub Kicinski wrote:
>>> On Thu, 5 Jan 2023 11:57:32 -0500 Andy Gospodarek wrote:
>>>>> So my main concern would be that if we "allow" this, the only way to
>>>>> write an interoperable XDP program will be to use bpf_xdp_load_bytes()
>>>>> for every packet access. Which will be slower than DPA, so we may 
>>>>> end up
>>>>> inadvertently slowing down all of the XDP ecosystem, because no one is
>>>>> going to bother with writing two versions of their programs. Whereas if
>>>>> you can rely on packet headers always being in the linear part, you can
>>>>> write a lot of the "look at headers and make a decision" type programs
>>>>> using just DPA, and they'll work for multibuf as well.
>>>>
>>>> The question I would have is what is really the 'slow down' for
>>>> bpf_xdp_load_bytes() vs DPA?  I know you and Jesper can tell me how many
>>>> instructions each use. :)
>>>
>>> Until we have an efficient and inlined DPA access to frags an
>>> unconditional memcpy() of the first 2 cachelines-worth of headers
>>> in the driver must be faster than a piece-by-piece bpf_xdp_load_bytes()
>>> onto the stack, right?
>>>
>>>> Taking a step back...years ago Dave mentioned wanting to make XDP
>>>> programs easy to write and it feels like using these accessor APIs would
>>>> help accomplish that.  If the kernel examples use bpf_xdp_load_bytes()
>>>> accessors everywhere then that would accomplish that.
>>>
>>> I've been pushing for an skb_header_pointer()-like helper but
>>> the semantics were not universally loved :)
>> 
>> Maybe it's time to re-consider.
>> 
>> Is it something like an API that given an offset returns a pointer + 
>> allowed length to be accessed?
>> 
>> This sounds like a good direction to me, that avoids having any 
>> linear-part-length assumptions, while preserving good performance.
>> 
>> Maybe we can still require/guarantee that each single header (eth, ip, 
>> tcp, ...) does not cross a frag/page boundary. For otherwise, a prog 
>> needs to handle cases where headers span several fragments, so it has to 
>> reconstruct the header by copying the different parts into some local 
>> buffer.
>> 
>> This can be achieved by having another assumption that AFAIK already 
>> holds today: all fragments are of size PAGE_SIZE.
>> 
>> Regards,
>> Tariq
>
> This can be a good starting point:
> static void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len)
>
> It's currently not exposed as a bpf-helper, and it works a bit 
> differently to what I mentioned earlier: It gets the desired length, and 
> fails in case it's not contiguously accessible (i.e. this piece of data 
> spans multiple frags).

Did a bit of digging through the mail archives. Exposing
bpf_xdp_pointer() as a helper was proposed back in March last year:

https://lore.kernel.org/r/20220306234311.452206-1-memxor@gmail.com

The discussion of this seems to have ended on "let's use dynptrs
instead". There was a patch series posted for this as well, which seems
to have stalled out with this comment from Alexei in October:

https://lore.kernel.org/r/CAADnVQKhv2YBrUAQJq6UyqoZJ-FGNQbKenGoPySPNK+GaOjBOg@mail.gmail.com

-Toke
Maxim Mikityanskiy Jan. 10, 2023, 8:59 p.m. UTC | #20
On Tue, Jan 03, 2023 at 05:21:53PM -0800, Jakub Kicinski wrote:
> On Tue, 03 Jan 2023 16:19:49 +0100 Toke Høiland-Jørgensen wrote:
> > Hmm, good question! I don't think we've ever explicitly documented any
> > assumptions one way or the other. My own mental model has certainly
> > always assumed the first frag would continue to be the same size as in
> > non-multi-buf packets.
> 
> Interesting! :) My mental model was closer to GRO by frags 
> so the linear part would have no data, just headers.
> 
> A random datapoint is that bpf_xdp_adjust_head() seems 
> to enforce that there is at least ETH_HLEN.

Also bpf_xdp_frags_increase_tail has the following check:

	if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz)
		return -EOPNOTSUPP;

However, I can't seem to find where the `frag_size > frame_sz` part is
actually used. Maybe this condition can be dropped? Can someone shed
some light?

BTW, Tariq, we seem to have missed setting frag_size to a non-zero
value. Could you check that increasing the tail indeed doesn't work on
fragmented packets on mlx5e? I can send a oneliner to fix that.
Tariq Toukan Jan. 13, 2023, 9:07 p.m. UTC | #21
On 10/01/2023 22:59, Maxim Mikityanskiy wrote:
> On Tue, Jan 03, 2023 at 05:21:53PM -0800, Jakub Kicinski wrote:
>> On Tue, 03 Jan 2023 16:19:49 +0100 Toke Høiland-Jørgensen wrote:
>>> Hmm, good question! I don't think we've ever explicitly documented any
>>> assumptions one way or the other. My own mental model has certainly
>>> always assumed the first frag would continue to be the same size as in
>>> non-multi-buf packets.
>>
>> Interesting! :) My mental model was closer to GRO by frags
>> so the linear part would have no data, just headers.
>>
>> A random datapoint is that bpf_xdp_adjust_head() seems
>> to enforce that there is at least ETH_HLEN.
> 
> Also bpf_xdp_frags_increase_tail has the following check:
> 
> 	if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz)
> 		return -EOPNOTSUPP;
> 
> However, I can't seem to find where the `frag_size > frame_sz` part is
> actually used. Maybe this condition can be dropped? Can someone shed
> some light?
> 
> BTW, Tariq, we seem to have missed setting frag_size to a non-zero
> value.

Hey Maxim,
Indeed. We use xdp_rxq_info_reg, it passes 0 as frag_size.

> Could you check that increasing the tail indeed doesn't work on
> fragmented packets on mlx5e? I can send a oneliner to fix that.

I can test early next week, and update.
Tariq Toukan Jan. 25, 2023, 12:49 p.m. UTC | #22
On 13/01/2023 23:07, Tariq Toukan wrote:
> 
> 
> On 10/01/2023 22:59, Maxim Mikityanskiy wrote:
>> On Tue, Jan 03, 2023 at 05:21:53PM -0800, Jakub Kicinski wrote:
>>> On Tue, 03 Jan 2023 16:19:49 +0100 Toke Høiland-Jørgensen wrote:
>>>> Hmm, good question! I don't think we've ever explicitly documented any
>>>> assumptions one way or the other. My own mental model has certainly
>>>> always assumed the first frag would continue to be the same size as in
>>>> non-multi-buf packets.
>>>
>>> Interesting! :) My mental model was closer to GRO by frags
>>> so the linear part would have no data, just headers.
>>>
>>> A random datapoint is that bpf_xdp_adjust_head() seems
>>> to enforce that there is at least ETH_HLEN.
>>
>> Also bpf_xdp_frags_increase_tail has the following check:
>>
>>     if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz)
>>         return -EOPNOTSUPP;
>>
>> However, I can't seem to find where the `frag_size > frame_sz` part is
>> actually used. Maybe this condition can be dropped? Can someone shed
>> some light?
>>
>> BTW, Tariq, we seem to have missed setting frag_size to a non-zero
>> value.
> 
> Hey Maxim,
> Indeed. We use xdp_rxq_info_reg, it passes 0 as frag_size.
> 
>> Could you check that increasing the tail indeed doesn't work on
>> fragmented packets on mlx5e? I can send a oneliner to fix that.
> 
> I can test early next week, and update.

Hi Maxim,

Hacking the existing adjust_tail sample program, in between other high 
prio tasks, to make it work with frags took me some time.
But I can confirm the fix below (or equivalent) is needed. Please go ahead
and submit it.

Regards,
Tariq

--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -600,7 +600,7 @@ static int mlx5e_init_rxq_rq(struct mlx5e_channel *c, struct mlx5e_params *param
         if (err)
                 return err;

-       return xdp_rxq_info_reg(&rq->xdp_rxq, rq->netdev, rq->ix, c->napi.napi_id);
+       return __xdp_rxq_info_reg(&rq->xdp_rxq, rq->netdev, rq->ix, c->napi.napi_id, PAGE_SIZE);
  }

  static int mlx5_rq_shampo_alloc(struct mlx5_core_dev *mdev,

Patch

diff --git a/samples/bpf/xdp1_kern.c b/samples/bpf/xdp1_kern.c
index f0c5d95084de..0a5c704badd0 100644
--- a/samples/bpf/xdp1_kern.c
+++ b/samples/bpf/xdp1_kern.c
@@ -39,11 +39,13 @@  static int parse_ipv6(void *data, u64 nh_off, void *data_end)
 	return ip6h->nexthdr;
 }
 
-SEC("xdp1")
+#define XDPBUFSIZE	64
+SEC("xdp.frags")
 int xdp_prog1(struct xdp_md *ctx)
 {
-	void *data_end = (void *)(long)ctx->data_end;
-	void *data = (void *)(long)ctx->data;
+	__u8 pkt[XDPBUFSIZE] = {};
+	void *data_end = &pkt[XDPBUFSIZE-1];
+	void *data = pkt;
 	struct ethhdr *eth = data;
 	int rc = XDP_DROP;
 	long *value;
@@ -51,6 +53,9 @@  int xdp_prog1(struct xdp_md *ctx)
 	u64 nh_off;
 	u32 ipproto;
 
+	if (bpf_xdp_load_bytes(ctx, 0, pkt, sizeof(pkt)))
+		return rc;
+
 	nh_off = sizeof(*eth);
 	if (data + nh_off > data_end)
 		return rc;
diff --git a/samples/bpf/xdp2_kern.c b/samples/bpf/xdp2_kern.c
index d8a64ab077b0..3332ba6bb95f 100644
--- a/samples/bpf/xdp2_kern.c
+++ b/samples/bpf/xdp2_kern.c
@@ -55,11 +55,13 @@  static int parse_ipv6(void *data, u64 nh_off, void *data_end)
 	return ip6h->nexthdr;
 }
 
-SEC("xdp1")
+#define XDPBUFSIZE	64
+SEC("xdp.frags")
 int xdp_prog1(struct xdp_md *ctx)
 {
-	void *data_end = (void *)(long)ctx->data_end;
-	void *data = (void *)(long)ctx->data;
+	__u8 pkt[XDPBUFSIZE] = {};
+	void *data_end = &pkt[XDPBUFSIZE-1];
+	void *data = pkt;
 	struct ethhdr *eth = data;
 	int rc = XDP_DROP;
 	long *value;
@@ -67,6 +69,9 @@  int xdp_prog1(struct xdp_md *ctx)
 	u64 nh_off;
 	u32 ipproto;
 
+	if (bpf_xdp_load_bytes(ctx, 0, pkt, sizeof(pkt)))
+		return rc;
+
 	nh_off = sizeof(*eth);
 	if (data + nh_off > data_end)
 		return rc;
diff --git a/samples/bpf/xdp_tx_iptunnel_kern.c b/samples/bpf/xdp_tx_iptunnel_kern.c
index 575d57e4b8d6..0e2bca3a3fff 100644
--- a/samples/bpf/xdp_tx_iptunnel_kern.c
+++ b/samples/bpf/xdp_tx_iptunnel_kern.c
@@ -212,7 +212,7 @@  static __always_inline int handle_ipv6(struct xdp_md *xdp)
 	return XDP_TX;
 }
 
-SEC("xdp_tx_iptunnel")
+SEC("xdp.frags")
 int _xdp_tx_iptunnel(struct xdp_md *xdp)
 {
 	void *data_end = (void *)(long)xdp->data_end;