Message ID | 20220621175402.35327-1-gospo@broadcom.com
---|---
State | Accepted
Commit | 772251742262bd651529a3cea3ee756b71e95a29
Delegated to | BPF
Series | [net-next,v2] samples/bpf: fixup some tools to be able to support xdp multibuffer
Hello:

This patch was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Tue, 21 Jun 2022 17:54:02 +0000 you wrote:
> This changes the section name for the bpf program embedded in these
> files to "xdp.frags" to allow the programs to be loaded on drivers that
> are using an MTU greater than PAGE_SIZE. Rather than directly accessing
> the buffers, the packet data is now accessed via xdp helper functions to
> provide an example for those who may need to write more complex
> programs.
>
> [...]

Here is the summary with links:
  - [net-next,v2] samples/bpf: fixup some tools to be able to support xdp multibuffer
    https://git.kernel.org/bpf/bpf-next/c/772251742262

You are awesome, thank you!
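Condensed from the patch (quoted in full further down), the access pattern the samples switch to looks roughly like the standalone sketch below: copy the first bytes of the possibly fragmented frame onto the stack with bpf_xdp_load_bytes() and parse the copy. Note that this sketch uses `pkt + sizeof(pkt)` (one past the last copied byte) as `data_end`, whereas the posted patch uses `&pkt[XDPBUFSIZE-1]`, which points at the last byte and therefore leaves it outside the bounds-checked window.

```c
/* Condensed sketch of the pattern used by the patch, not the full sample. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define XDPBUFSIZE 64

SEC("xdp.frags")
int xdp_copy_parse(struct xdp_md *ctx)
{
	__u8 pkt[XDPBUFSIZE] = {};
	void *data = pkt;
	void *data_end = pkt + sizeof(pkt);	/* one past the last copied byte */
	struct ethhdr *eth = data;

	/* Pull the first XDPBUFSIZE bytes out of the xdp_buff, wherever they
	 * live (linear part and/or fragments), into the stack buffer. */
	if (bpf_xdp_load_bytes(ctx, 0, pkt, sizeof(pkt)))
		return XDP_DROP;

	if (data + sizeof(*eth) > data_end)
		return XDP_DROP;

	/* From here on, all parsing happens on the stack copy. */
	if (eth->h_proto == bpf_htons(ETH_P_IP))
		return XDP_PASS;
	return XDP_DROP;
}

char _license[] SEC("license") = "GPL";
```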
On 21/06/2022 20:54, Andy Gospodarek wrote: > This changes the section name for the bpf program embedded in these > files to "xdp.frags" to allow the programs to be loaded on drivers that > are using an MTU greater than PAGE_SIZE. Rather than directly accessing > the buffers, the packet data is now accessed via xdp helper functions to > provide an example for those who may need to write more complex > programs. > > v2: remove new unnecessary variable > Hi, I'm trying to understand if there are any assumptions/requirements on the length of the xdp_buf linear part when passed to XDP multi-buf programs? Can the linear part be empty, with all data residing in the fragments? Is it valid? Per the proposed pattern below (calling bpf_xdp_load_bytes() to memcpy packet data into a local buffer), no such assumption is required, and an xdp_buf created by the driver with an empty linear part is valid. However, in the _xdp_tx_iptunnel example program, it fails (returns XDP_DROP) in case the headers are not in the linear part. Regards, Tariq > Signed-off-by: Andy Gospodarek <gospo@broadcom.com> > Acked-by: John Fastabend <john.fastabend@gmail.com> > Acked-by: Lorenzo Bianconi <lorenzo@kernel.org> > --- > samples/bpf/xdp1_kern.c | 11 ++++++++--- > samples/bpf/xdp2_kern.c | 11 ++++++++--- > samples/bpf/xdp_tx_iptunnel_kern.c | 2 +- > 3 files changed, 17 insertions(+), 7 deletions(-) > > diff --git a/samples/bpf/xdp1_kern.c b/samples/bpf/xdp1_kern.c > index f0c5d95084de..0a5c704badd0 100644 > --- a/samples/bpf/xdp1_kern.c > +++ b/samples/bpf/xdp1_kern.c > @@ -39,11 +39,13 @@ static int parse_ipv6(void *data, u64 nh_off, void *data_end) > return ip6h->nexthdr; > } > > -SEC("xdp1") > +#define XDPBUFSIZE 64 > +SEC("xdp.frags") > int xdp_prog1(struct xdp_md *ctx) > { > - void *data_end = (void *)(long)ctx->data_end; > - void *data = (void *)(long)ctx->data; > + __u8 pkt[XDPBUFSIZE] = {}; > + void *data_end = &pkt[XDPBUFSIZE-1]; > + void *data = pkt; > struct ethhdr *eth = data; > int rc = XDP_DROP; > long *value; > @@ -51,6 +53,9 @@ int xdp_prog1(struct xdp_md *ctx) > u64 nh_off; > u32 ipproto; > > + if (bpf_xdp_load_bytes(ctx, 0, pkt, sizeof(pkt))) > + return rc; > + > nh_off = sizeof(*eth); > if (data + nh_off > data_end) > return rc; > diff --git a/samples/bpf/xdp2_kern.c b/samples/bpf/xdp2_kern.c > index d8a64ab077b0..3332ba6bb95f 100644 > --- a/samples/bpf/xdp2_kern.c > +++ b/samples/bpf/xdp2_kern.c > @@ -55,11 +55,13 @@ static int parse_ipv6(void *data, u64 nh_off, void *data_end) > return ip6h->nexthdr; > } > > -SEC("xdp1") > +#define XDPBUFSIZE 64 > +SEC("xdp.frags") > int xdp_prog1(struct xdp_md *ctx) > { > - void *data_end = (void *)(long)ctx->data_end; > - void *data = (void *)(long)ctx->data; > + __u8 pkt[XDPBUFSIZE] = {}; > + void *data_end = &pkt[XDPBUFSIZE-1]; > + void *data = pkt; > struct ethhdr *eth = data; > int rc = XDP_DROP; > long *value; > @@ -67,6 +69,9 @@ int xdp_prog1(struct xdp_md *ctx) > u64 nh_off; > u32 ipproto; > > + if (bpf_xdp_load_bytes(ctx, 0, pkt, sizeof(pkt))) > + return rc; > + > nh_off = sizeof(*eth); > if (data + nh_off > data_end) > return rc; > diff --git a/samples/bpf/xdp_tx_iptunnel_kern.c b/samples/bpf/xdp_tx_iptunnel_kern.c > index 575d57e4b8d6..0e2bca3a3fff 100644 > --- a/samples/bpf/xdp_tx_iptunnel_kern.c > +++ b/samples/bpf/xdp_tx_iptunnel_kern.c > @@ -212,7 +212,7 @@ static __always_inline int handle_ipv6(struct xdp_md *xdp) > return XDP_TX; > } > > -SEC("xdp_tx_iptunnel") > +SEC("xdp.frags") > int _xdp_tx_iptunnel(struct xdp_md *xdp) > { > void *data_end = 
(void *)(long)xdp->data_end;
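To make Tariq's last observation concrete, here is a minimal sketch (not the actual iptunnel program) of the direct-packet-access bounds check that trips when the driver builds an xdp_buff whose linear part does not contain the headers: `data_end - data` only covers the linear area, so the program drops even though the bytes exist in the fragments.

```c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp.frags")
int xdp_dpa_headers(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	struct ethhdr *eth = data;

	/* With an empty (or too short) linear part, data == data_end and
	 * this check fails even if the frame itself is perfectly valid. */
	if (data + sizeof(*eth) > data_end)
		return XDP_DROP;

	if (eth->h_proto != bpf_htons(ETH_P_IPV6))
		return XDP_PASS;

	/* Headers are in the linear part; the real program would encapsulate here. */
	return XDP_TX;
}

char _license[] SEC("license") = "GPL";
```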
Tariq Toukan <ttoukan.linux@gmail.com> writes: > On 21/06/2022 20:54, Andy Gospodarek wrote: >> This changes the section name for the bpf program embedded in these >> files to "xdp.frags" to allow the programs to be loaded on drivers that >> are using an MTU greater than PAGE_SIZE. Rather than directly accessing >> the buffers, the packet data is now accessed via xdp helper functions to >> provide an example for those who may need to write more complex >> programs. >> >> v2: remove new unnecessary variable >> > > Hi, > > I'm trying to understand if there are any assumptions/requirements on > the length of the xdp_buf linear part when passed to XDP multi-buf programs? > Can the linear part be empty, with all data residing in the fragments? > Is it valid? > > Per the proposed pattern below (calling bpf_xdp_load_bytes() to memcpy > packet data into a local buffer), no such assumption is required, and an > xdp_buf created by the driver with an empty linear part is valid. > > However, in the _xdp_tx_iptunnel example program, it fails (returns > XDP_DROP) in case the headers are not in the linear part. Hmm, good question! I don't think we've ever explicitly documented any assumptions one way or the other. My own mental model has certainly always assumed the first frag would continue to be the same size as in non-multi-buf packets. I do seem to recall there was some discussion around this when we were discussing whether or not we needed programs to explicitly opt-in to multi-buf support (what ended up being the "xdp.frags" section). The reason we said it might *not* be necessary to do that was that most programs would just continue working, and it would only be those that either tried to access the end of the packet, or to compute the packet length as data_end-data that would need any changes before enabling the frags flag. Which also kinda implies that headers etc would continue to be in the linear part. This is all from memory, though, so maybe others have different recollections. In any case this is probably something we should document somewhere :) -Toke
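A small sketch of the two "packet length" notions Toke mentions: on a frags-enabled program, `data_end - data` measures only the linear part, while the bpf_xdp_get_buff_len() helper (added alongside multi-buf support) returns the full frame length including fragments.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp.frags")
int xdp_length_check(struct xdp_md *ctx)
{
	__u64 linear_len = ctx->data_end - ctx->data;	/* linear part only */
	__u64 total_len = bpf_xdp_get_buff_len(ctx);	/* linear + fragments */

	/* A pre-multibuf program that used linear_len as "packet length"
	 * under-counts as soon as the frame carries fragments. */
	if (total_len != linear_len)
		bpf_printk("frame has %llu bytes in fragments",
			   total_len - linear_len);

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```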
On Tue, 03 Jan 2023 16:19:49 +0100 Toke Høiland-Jørgensen wrote: > Hmm, good question! I don't think we've ever explicitly documented any > assumptions one way or the other. My own mental model has certainly > always assumed the first frag would continue to be the same size as in > non-multi-buf packets. Interesting! :) My mental model was closer to GRO by frags so the linear part would have no data, just headers. A random datapoint is that bpf_xdp_adjust_head() seems to enforce that there is at least ETH_HLEN.
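A minimal program-side illustration of that datapoint (the in-kernel check is paraphrased from memory in the comment, so treat it as an assumption): asking bpf_xdp_adjust_head() to consume the whole linear area is expected to fail, since the helper refuses to leave fewer than ETH_HLEN bytes.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp.frags")
int xdp_head_limit(struct xdp_md *ctx)
{
	int linear_len = ctx->data_end - ctx->data;

	/* Try to move data all the way up to data_end. Per the datapoint
	 * above, the helper should reject this whenever fewer than
	 * ETH_HLEN bytes would remain in the linear area. */
	if (bpf_xdp_adjust_head(ctx, linear_len))
		bpf_printk("adjust_head refused the full-linear move");

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```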
> On Tue, 03 Jan 2023 16:19:49 +0100 Toke Høiland-Jørgensen wrote: > > Hmm, good question! I don't think we've ever explicitly documented any > > assumptions one way or the other. My own mental model has certainly > > always assumed the first frag would continue to be the same size as in > > non-multi-buf packets. > > Interesting! :) My mental model was closer to GRO by frags > so the linear part would have no data, just headers. That is assumption as well. Regards, Lorenzo > > A random datapoint is that bpf_xdp_adjust_head() seems > to enforce that there is at least ETH_HLEN.
Lorenzo Bianconi <lorenzo@kernel.org> writes: >> On Tue, 03 Jan 2023 16:19:49 +0100 Toke Høiland-Jørgensen wrote: >> > Hmm, good question! I don't think we've ever explicitly documented any >> > assumptions one way or the other. My own mental model has certainly >> > always assumed the first frag would continue to be the same size as in >> > non-multi-buf packets. >> >> Interesting! :) My mental model was closer to GRO by frags >> so the linear part would have no data, just headers. > > That is assumption as well. Right, okay, so how many headers? Only Ethernet, or all the way up to L4 (TCP/UDP)? I do seem to recall a discussion around the header/data split for TCP specifically, but I think I mentally put that down as "something people may way to do at some point in the future", which is why it hasn't made it into my own mental model (yet?) :) -Toke
On Wed, 04 Jan 2023 13:28:57 +0100 Toke Høiland-Jørgensen wrote: > >> Interesting! :) My mental model was closer to GRO by frags > >> so the linear part would have no data, just headers. > > > > That is assumption as well. > > Right, okay, so how many headers? Only Ethernet, or all the way up to > L4 (TCP/UDP)? If we're speaking about guarantees or hard requirements - I think that we can only require / guarantee the Ethernet header. Requiring more will be defeated by tunnels (i.e. adjust_head() + redirect to a veth). > I do seem to recall a discussion around the header/data split for TCP > specifically, but I think I mentally put that down as "something people > may way to do at some point in the future", which is why it hasn't made > it into my own mental model (yet?) :)
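A sketch of the case Jakub describes (the header length and ifindex below are placeholders, not from the thread): a program pops an outer header with bpf_xdp_adjust_head() and redirects into a veth, so whatever header layout the original NIC guaranteed in the linear part has shifted by the time the next program sees the frame. This is why anything beyond the Ethernet header is hard to guarantee.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define OUTER_HDR_LEN	8	/* hypothetical outer encapsulation header */
#define VETH_IFINDEX	5	/* hypothetical target ifindex */

SEC("xdp.frags")
int xdp_decap_redirect(struct xdp_md *ctx)
{
	/* Pop the outer header: data moves forward by OUTER_HDR_LEN. */
	if (bpf_xdp_adjust_head(ctx, OUTER_HDR_LEN))
		return XDP_DROP;

	/* The veth's XDP program now sees a linear part that starts
	 * OUTER_HDR_LEN bytes into what the NIC originally provided. */
	return bpf_redirect(VETH_IFINDEX, 0);
}

char _license[] SEC("license") = "GPL";
```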
On 04/01/2023 14:28, Toke Høiland-Jørgensen wrote: > Lorenzo Bianconi <lorenzo@kernel.org> writes: > >>> On Tue, 03 Jan 2023 16:19:49 +0100 Toke Høiland-Jørgensen wrote: >>>> Hmm, good question! I don't think we've ever explicitly documented any >>>> assumptions one way or the other. My own mental model has certainly >>>> always assumed the first frag would continue to be the same size as in >>>> non-multi-buf packets. >>> >>> Interesting! :) My mental model was closer to GRO by frags >>> so the linear part would have no data, just headers. >> >> That is assumption as well. > > Right, okay, so how many headers? Only Ethernet, or all the way up to > L4 (TCP/UDP)? > > I do seem to recall a discussion around the header/data split for TCP > specifically, but I think I mentally put that down as "something people > may way to do at some point in the future", which is why it hasn't made > it into my own mental model (yet?) :) > > -Toke > I don't think that all the different GRO layers assume having their headers/data in the linear part. IMO they will just perform better if these parts are already there. Otherwise, the GRO flow manages, and pulls the needed amount into the linear part. As examples, see calls to gro_pull_from_frag0 in net/core/gro.c, and the call to pskb_may_pull() from skb_gro_header_slow(). This resembles the bpf_xdp_load_bytes() API used here in the xdp prog. The context of my questions is that I'm looking for the right memory scheme for adding xdp-mb support to mlx5e striding RQ. In striding RQ, the RX buffer consists of "strides" of a fixed size set by the driver. An incoming packet is written to the buffer starting from the beginning of the next available stride, consuming as much strides as needed. Due to the need for headroom and tailroom, there's no easy way of building the xdp_buf in place (around the packet), so it should go to a side buffer. By using 0-length linear part in a side buffer, I can address two challenging issues: (1) save the in-driver headers memcpy (copy might still exist in the xdp program though), and (2) conform to the "fragments of the same size" requirement/assumption in xdp-mb. Otherwise, if we pull from frag[0] into the linear part, frag[0] becomes smaller than the next fragments. Regards, Tariq
Tariq Toukan <ttoukan.linux@gmail.com> writes: > On 04/01/2023 14:28, Toke Høiland-Jørgensen wrote: >> Lorenzo Bianconi <lorenzo@kernel.org> writes: >> >>>> On Tue, 03 Jan 2023 16:19:49 +0100 Toke Høiland-Jørgensen wrote: >>>>> Hmm, good question! I don't think we've ever explicitly documented any >>>>> assumptions one way or the other. My own mental model has certainly >>>>> always assumed the first frag would continue to be the same size as in >>>>> non-multi-buf packets. >>>> >>>> Interesting! :) My mental model was closer to GRO by frags >>>> so the linear part would have no data, just headers. >>> >>> That is assumption as well. >> >> Right, okay, so how many headers? Only Ethernet, or all the way up to >> L4 (TCP/UDP)? >> >> I do seem to recall a discussion around the header/data split for TCP >> specifically, but I think I mentally put that down as "something people >> may way to do at some point in the future", which is why it hasn't made >> it into my own mental model (yet?) :) >> >> -Toke >> > > I don't think that all the different GRO layers assume having their > headers/data in the linear part. IMO they will just perform better if > these parts are already there. Otherwise, the GRO flow manages, and > pulls the needed amount into the linear part. > As examples, see calls to gro_pull_from_frag0 in net/core/gro.c, and the > call to pskb_may_pull() from skb_gro_header_slow(). > > This resembles the bpf_xdp_load_bytes() API used here in the xdp prog. Right, but that is kernel code; what we end up doing with the API here affects how many programs need to make significant changes to work with multibuf, and how many can just set the frags flag and continue working. Which also has a performance impact, see below. > The context of my questions is that I'm looking for the right memory > scheme for adding xdp-mb support to mlx5e striding RQ. > In striding RQ, the RX buffer consists of "strides" of a fixed size set > by pthe driver. An incoming packet is written to the buffer starting from > the beginning of the next available stride, consuming as much strides as > needed. > > Due to the need for headroom and tailroom, there's no easy way of > building the xdp_buf in place (around the packet), so it should go to a > side buffer. > > By using 0-length linear part in a side buffer, I can address two > challenging issues: (1) save the in-driver headers memcpy (copy might > still exist in the xdp program though), and (2) conform to the > "fragments of the same size" requirement/assumption in xdp-mb. > Otherwise, if we pull from frag[0] into the linear part, frag[0] becomes > smaller than the next fragments. Right, I see. So my main concern would be that if we "allow" this, the only way to write an interoperable XDP program will be to use bpf_xdp_load_bytes() for every packet access. Which will be slower than DPA, so we may end up inadvertently slowing down all of the XDP ecosystem, because no one is going to bother with writing two versions of their programs. Whereas if you can rely on packet headers always being in the linear part, you can write a lot of the "look at headers and make a decision" type programs using just DPA, and they'll work for multibuf as well. But maybe I'm mistaken and people are just going to use the load_bytes helper anyway because they want to go deeper than whatever "headers" bit we'll end up guaranteeing is in the linear part? -Toke
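For reference, this is the kind of "look at headers and make a decision" program Toke means: plain direct packet access with no helpers, which keeps working under `xdp.frags` only if the headers are guaranteed to sit in the linear part. A sketch, not taken from any existing sample:

```c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp.frags")
int xdp_dpa_decision(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	struct ethhdr *eth = data;
	struct iphdr *iph;

	if (data + sizeof(*eth) > data_end)
		return XDP_PASS;	/* headers not linear: give up, let the stack decide */
	if (eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_PASS;

	iph = data + sizeof(*eth);
	if ((void *)(iph + 1) > data_end)
		return XDP_PASS;

	/* Decision based purely on headers in the linear part. */
	return iph->protocol == IPPROTO_UDP ? XDP_DROP : XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```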
On Tue, Jan 03, 2023 at 02:55:22PM +0200, Tariq Toukan wrote: > > > On 21/06/2022 20:54, Andy Gospodarek wrote: > > This changes the section name for the bpf program embedded in these > > files to "xdp.frags" to allow the programs to be loaded on drivers that > > are using an MTU greater than PAGE_SIZE. Rather than directly accessing > > the buffers, the packet data is now accessed via xdp helper functions to > > provide an example for those who may need to write more complex > > programs. > > > > v2: remove new unnecessary variable > > > > Hi, > > I'm trying to understand if there are any assumptions/requirements on the > length of the xdp_buf linear part when passed to XDP multi-buf programs? > Can the linear part be empty, with all data residing in the fragments? Is it > valid? That's a great question. The implementation in bnxt_en was based on the implementation as I understood it in mvneta where the linear area contained approx the first 4k of data - xdp headroom - dma_offset. This means that you have something that looks like this with a 9k MTU: skb->data [~3.6k of packet data] skb->frag[0] [4k of paket data] frag[1] [remainder of packet data] At some point, I'd like to take the opportunity to test something like this: skb->data [header only + space for header expansion] skb->frag[0] [first 4k of data] frag[1] [second 4k of data] frag[2] [remainder of packet data] Though this will use a bit more memory, I think it will be much more performant for data that is ultimately consumed rather than forwarded by the host as the actual packet data will be aligned on page boundaries. With the ability to have packets that are handled by an XDP program span buffers, I would also like to test out whether or not it would be worthwhile to have standard MTU packets also look like this: skb->data [header only + space for header expansion] skb->frag[0] [packet data] I think the overall system performance would be better in the XDP_PASS case, but until there is data to back this up, that's just speculation. > Per the proposed pattern below (calling bpf_xdp_load_bytes() to memcpy > packet data into a local buffer), no such assumption is required, and an > xdp_buf created by the driver with an empty linear part is valid. > > However, in the _xdp_tx_iptunnel example program, it fails (returns > XDP_DROP) in case the headers are not in the linear part. 
> > Regards, > Tariq > > > Signed-off-by: Andy Gospodarek <gospo@broadcom.com> > > Acked-by: John Fastabend <john.fastabend@gmail.com> > > Acked-by: Lorenzo Bianconi <lorenzo@kernel.org> > > --- > > samples/bpf/xdp1_kern.c | 11 ++++++++--- > > samples/bpf/xdp2_kern.c | 11 ++++++++--- > > samples/bpf/xdp_tx_iptunnel_kern.c | 2 +- > > 3 files changed, 17 insertions(+), 7 deletions(-) > > > > diff --git a/samples/bpf/xdp1_kern.c b/samples/bpf/xdp1_kern.c > > index f0c5d95084de..0a5c704badd0 100644 > > --- a/samples/bpf/xdp1_kern.c > > +++ b/samples/bpf/xdp1_kern.c > > @@ -39,11 +39,13 @@ static int parse_ipv6(void *data, u64 nh_off, void *data_end) > > return ip6h->nexthdr; > > } > > -SEC("xdp1") > > +#define XDPBUFSIZE 64 > > +SEC("xdp.frags") > > int xdp_prog1(struct xdp_md *ctx) > > { > > - void *data_end = (void *)(long)ctx->data_end; > > - void *data = (void *)(long)ctx->data; > > + __u8 pkt[XDPBUFSIZE] = {}; > > + void *data_end = &pkt[XDPBUFSIZE-1]; > > + void *data = pkt; > > struct ethhdr *eth = data; > > int rc = XDP_DROP; > > long *value; > > @@ -51,6 +53,9 @@ int xdp_prog1(struct xdp_md *ctx) > > u64 nh_off; > > u32 ipproto; > > + if (bpf_xdp_load_bytes(ctx, 0, pkt, sizeof(pkt))) > > + return rc; > > + > > nh_off = sizeof(*eth); > > if (data + nh_off > data_end) > > return rc; > > diff --git a/samples/bpf/xdp2_kern.c b/samples/bpf/xdp2_kern.c > > index d8a64ab077b0..3332ba6bb95f 100644 > > --- a/samples/bpf/xdp2_kern.c > > +++ b/samples/bpf/xdp2_kern.c > > @@ -55,11 +55,13 @@ static int parse_ipv6(void *data, u64 nh_off, void *data_end) > > return ip6h->nexthdr; > > } > > -SEC("xdp1") > > +#define XDPBUFSIZE 64 > > +SEC("xdp.frags") > > int xdp_prog1(struct xdp_md *ctx) > > { > > - void *data_end = (void *)(long)ctx->data_end; > > - void *data = (void *)(long)ctx->data; > > + __u8 pkt[XDPBUFSIZE] = {}; > > + void *data_end = &pkt[XDPBUFSIZE-1]; > > + void *data = pkt; > > struct ethhdr *eth = data; > > int rc = XDP_DROP; > > long *value; > > @@ -67,6 +69,9 @@ int xdp_prog1(struct xdp_md *ctx) > > u64 nh_off; > > u32 ipproto; > > + if (bpf_xdp_load_bytes(ctx, 0, pkt, sizeof(pkt))) > > + return rc; > > + > > nh_off = sizeof(*eth); > > if (data + nh_off > data_end) > > return rc; > > diff --git a/samples/bpf/xdp_tx_iptunnel_kern.c b/samples/bpf/xdp_tx_iptunnel_kern.c > > index 575d57e4b8d6..0e2bca3a3fff 100644 > > --- a/samples/bpf/xdp_tx_iptunnel_kern.c > > +++ b/samples/bpf/xdp_tx_iptunnel_kern.c > > @@ -212,7 +212,7 @@ static __always_inline int handle_ipv6(struct xdp_md *xdp) > > return XDP_TX; > > } > > -SEC("xdp_tx_iptunnel") > > +SEC("xdp.frags") > > int _xdp_tx_iptunnel(struct xdp_md *xdp) > > { > > void *data_end = (void *)(long)xdp->data_end;
On Tue, Jan 03, 2023 at 05:21:53PM -0800, Jakub Kicinski wrote: > On Tue, 03 Jan 2023 16:19:49 +0100 Toke Høiland-Jørgensen wrote: > > Hmm, good question! I don't think we've ever explicitly documented any > > assumptions one way or the other. My own mental model has certainly > > always assumed the first frag would continue to be the same size as in > > non-multi-buf packets. > > Interesting! :) My mental model was closer to GRO by frags > so the linear part would have no data, just headers. As I mentioned in my mail just a few mins ago, I think this would be a good model to consider. All headers (including potentially tunnel headers) could be in the linear area with the actual packet data in frags. > A random datapoint is that bpf_xdp_adjust_head() seems > to enforce that there is at least ETH_HLEN.
On Thu, Jan 05, 2023 at 04:43:28PM +0100, Toke Høiland-Jørgensen wrote: > Tariq Toukan <ttoukan.linux@gmail.com> writes: > > > On 04/01/2023 14:28, Toke Høiland-Jørgensen wrote: > >> Lorenzo Bianconi <lorenzo@kernel.org> writes: > >> > >>>> On Tue, 03 Jan 2023 16:19:49 +0100 Toke Høiland-Jørgensen wrote: > >>>>> Hmm, good question! I don't think we've ever explicitly documented any > >>>>> assumptions one way or the other. My own mental model has certainly > >>>>> always assumed the first frag would continue to be the same size as in > >>>>> non-multi-buf packets. > >>>> > >>>> Interesting! :) My mental model was closer to GRO by frags > >>>> so the linear part would have no data, just headers. > >>> > >>> That is assumption as well. > >> > >> Right, okay, so how many headers? Only Ethernet, or all the way up to > >> L4 (TCP/UDP)? > >> > >> I do seem to recall a discussion around the header/data split for TCP > >> specifically, but I think I mentally put that down as "something people > >> may way to do at some point in the future", which is why it hasn't made > >> it into my own mental model (yet?) :) > >> > >> -Toke > >> > > > > I don't think that all the different GRO layers assume having their > > headers/data in the linear part. IMO they will just perform better if > > these parts are already there. Otherwise, the GRO flow manages, and > > pulls the needed amount into the linear part. > > As examples, see calls to gro_pull_from_frag0 in net/core/gro.c, and the > > call to pskb_may_pull() from skb_gro_header_slow(). > > > > This resembles the bpf_xdp_load_bytes() API used here in the xdp prog. > > Right, but that is kernel code; what we end up doing with the API here > affects how many programs need to make significant changes to work with > multibuf, and how many can just set the frags flag and continue working. > Which also has a performance impact, see below. > > > The context of my questions is that I'm looking for the right memory > > scheme for adding xdp-mb support to mlx5e striding RQ. > > In striding RQ, the RX buffer consists of "strides" of a fixed size set > > by pthe driver. An incoming packet is written to the buffer starting from > > the beginning of the next available stride, consuming as much strides as > > needed. > > > > Due to the need for headroom and tailroom, there's no easy way of > > building the xdp_buf in place (around the packet), so it should go to a > > side buffer. > > > > By using 0-length linear part in a side buffer, I can address two > > challenging issues: (1) save the in-driver headers memcpy (copy might > > still exist in the xdp program though), and (2) conform to the > > "fragments of the same size" requirement/assumption in xdp-mb. > > Otherwise, if we pull from frag[0] into the linear part, frag[0] becomes > > smaller than the next fragments. > > Right, I see. > > So my main concern would be that if we "allow" this, the only way to > write an interoperable XDP program will be to use bpf_xdp_load_bytes() > for every packet access. Which will be slower than DPA, so we may end up > inadvertently slowing down all of the XDP ecosystem, because no one is > going to bother with writing two versions of their programs. Whereas if > you can rely on packet headers always being in the linear part, you can > write a lot of the "look at headers and make a decision" type programs > using just DPA, and they'll work for multibuf as well. The question I would have is what is really the 'slow down' for bpf_xdp_load_bytes() vs DPA? 
I know you and Jesper can tell me how many instructions each use. :) Taking a step back...years ago Dave mentioned wanting to make XDP programs easy to write and it feels like using these accessor APIs would help accomplish that. If the kernel examples use bpf_xdp_load_bytes() accessors everywhere then that would accomplish that. > But maybe I'm mistaken and people are just going to use the load_bytes > helper anyway because they want to go deeper than whatever "headers" bit > we'll end up guaranteeing is in the linear part?
On Thu, 5 Jan 2023 11:57:32 -0500 Andy Gospodarek wrote: > > So my main concern would be that if we "allow" this, the only way to > > write an interoperable XDP program will be to use bpf_xdp_load_bytes() > > for every packet access. Which will be slower than DPA, so we may end up > > inadvertently slowing down all of the XDP ecosystem, because no one is > > going to bother with writing two versions of their programs. Whereas if > > you can rely on packet headers always being in the linear part, you can > > write a lot of the "look at headers and make a decision" type programs > > using just DPA, and they'll work for multibuf as well. > > The question I would have is what is really the 'slow down' for > bpf_xdp_load_bytes() vs DPA? I know you and Jesper can tell me how many > instructions each use. :) Until we have an efficient and inlined DPA access to frags an unconditional memcpy() of the first 2 cachelines-worth of headers in the driver must be faster than a piece-by-piece bpf_xdp_load_bytes() onto the stack, right? > Taking a step back...years ago Dave mentioned wanting to make XDP > programs easy to write and it feels like using these accessor APIs would > help accomplish that. If the kernel examples use bpf_xdp_load_bytes() > accessors everywhere then that would accomplish that. I've been pushing for an skb_header_pointer()-like helper but the semantics were not universally loved :)
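A very rough driver-side sketch of Jakub's point; the function name and copy size are illustrative, not from any real driver. The idea is to copy a fixed two-cacheline header region into the linear area once, unconditionally, so DPA programs can read headers without helpers.

```c
#include <linux/minmax.h>
#include <linux/string.h>
#include <net/xdp.h>

#define HDR_COPY_LEN	128	/* two 64-byte cachelines, assumed */

static void example_fill_linear(struct xdp_buff *xdp, const void *pkt_va,
				unsigned int pkt_len)
{
	unsigned int copy = min_t(unsigned int, pkt_len, HDR_COPY_LEN);

	/* Unconditionally copy the likely-header bytes into the linear
	 * area; the remainder of the packet stays in the fragments. */
	memcpy(xdp->data, pkt_va, copy);
	xdp->data_end = xdp->data + copy;
}
```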
Andy Gospodarek <andrew.gospodarek@broadcom.com> writes: > On Thu, Jan 05, 2023 at 04:43:28PM +0100, Toke Høiland-Jørgensen wrote: >> Tariq Toukan <ttoukan.linux@gmail.com> writes: >> >> > On 04/01/2023 14:28, Toke Høiland-Jørgensen wrote: >> >> Lorenzo Bianconi <lorenzo@kernel.org> writes: >> >> >> >>>> On Tue, 03 Jan 2023 16:19:49 +0100 Toke Høiland-Jørgensen wrote: >> >>>>> Hmm, good question! I don't think we've ever explicitly documented any >> >>>>> assumptions one way or the other. My own mental model has certainly >> >>>>> always assumed the first frag would continue to be the same size as in >> >>>>> non-multi-buf packets. >> >>>> >> >>>> Interesting! :) My mental model was closer to GRO by frags >> >>>> so the linear part would have no data, just headers. >> >>> >> >>> That is assumption as well. >> >> >> >> Right, okay, so how many headers? Only Ethernet, or all the way up to >> >> L4 (TCP/UDP)? >> >> >> >> I do seem to recall a discussion around the header/data split for TCP >> >> specifically, but I think I mentally put that down as "something people >> >> may way to do at some point in the future", which is why it hasn't made >> >> it into my own mental model (yet?) :) >> >> >> >> -Toke >> >> >> > >> > I don't think that all the different GRO layers assume having their >> > headers/data in the linear part. IMO they will just perform better if >> > these parts are already there. Otherwise, the GRO flow manages, and >> > pulls the needed amount into the linear part. >> > As examples, see calls to gro_pull_from_frag0 in net/core/gro.c, and the >> > call to pskb_may_pull() from skb_gro_header_slow(). >> > >> > This resembles the bpf_xdp_load_bytes() API used here in the xdp prog. >> >> Right, but that is kernel code; what we end up doing with the API here >> affects how many programs need to make significant changes to work with >> multibuf, and how many can just set the frags flag and continue working. >> Which also has a performance impact, see below. >> >> > The context of my questions is that I'm looking for the right memory >> > scheme for adding xdp-mb support to mlx5e striding RQ. >> > In striding RQ, the RX buffer consists of "strides" of a fixed size set >> > by pthe driver. An incoming packet is written to the buffer starting from >> > the beginning of the next available stride, consuming as much strides as >> > needed. >> > >> > Due to the need for headroom and tailroom, there's no easy way of >> > building the xdp_buf in place (around the packet), so it should go to a >> > side buffer. >> > >> > By using 0-length linear part in a side buffer, I can address two >> > challenging issues: (1) save the in-driver headers memcpy (copy might >> > still exist in the xdp program though), and (2) conform to the >> > "fragments of the same size" requirement/assumption in xdp-mb. >> > Otherwise, if we pull from frag[0] into the linear part, frag[0] becomes >> > smaller than the next fragments. >> >> Right, I see. >> >> So my main concern would be that if we "allow" this, the only way to >> write an interoperable XDP program will be to use bpf_xdp_load_bytes() >> for every packet access. Which will be slower than DPA, so we may end up >> inadvertently slowing down all of the XDP ecosystem, because no one is >> going to bother with writing two versions of their programs. 
Whereas if >> you can rely on packet headers always being in the linear part, you can >> write a lot of the "look at headers and make a decision" type programs >> using just DPA, and they'll work for multibuf as well. > > The question I would have is what is really the 'slow down' for > bpf_xdp_load_bytes() vs DPA? I know you and Jesper can tell me how many > instructions each use. :) I can try running some benchmarks to compare the two, sure! -Toke
On Thu, Jan 05, 2023 at 10:16:42AM -0800, Jakub Kicinski wrote: > On Thu, 5 Jan 2023 11:57:32 -0500 Andy Gospodarek wrote: > > > So my main concern would be that if we "allow" this, the only way to > > > write an interoperable XDP program will be to use bpf_xdp_load_bytes() > > > for every packet access. Which will be slower than DPA, so we may end up > > > inadvertently slowing down all of the XDP ecosystem, because no one is > > > going to bother with writing two versions of their programs. Whereas if > > > you can rely on packet headers always being in the linear part, you can > > > write a lot of the "look at headers and make a decision" type programs > > > using just DPA, and they'll work for multibuf as well. > > > > The question I would have is what is really the 'slow down' for > > bpf_xdp_load_bytes() vs DPA? I know you and Jesper can tell me how many > > instructions each use. :) > > Until we have an efficient and inlined DPA access to frags an > unconditional memcpy() of the first 2 cachelines-worth of headers > in the driver must be faster than a piece-by-piece bpf_xdp_load_bytes() > onto the stack, right? 100% Seems like we are back to speed vs ease of use, then? > > Taking a step back...years ago Dave mentioned wanting to make XDP > > programs easy to write and it feels like using these accessor APIs would > > help accomplish that. If the kernel examples use bpf_xdp_load_bytes() > > accessors everywhere then that would accomplish that. > > I've been pushing for an skb_header_pointer()-like helper but > the semantics were not universally loved :) I didn't recall that -- maybe I'll check the archives and see what I can find.
>>> So my main concern would be that if we "allow" this, the only way to >>> write an interoperable XDP program will be to use bpf_xdp_load_bytes() >>> for every packet access. Which will be slower than DPA, so we may end up >>> inadvertently slowing down all of the XDP ecosystem, because no one is >>> going to bother with writing two versions of their programs. Whereas if >>> you can rely on packet headers always being in the linear part, you can >>> write a lot of the "look at headers and make a decision" type programs >>> using just DPA, and they'll work for multibuf as well. >> >> The question I would have is what is really the 'slow down' for >> bpf_xdp_load_bytes() vs DPA? I know you and Jesper can tell me how many >> instructions each use. :) > > I can try running some benchmarks to compare the two, sure! Okay, ran a simple test: a program that just parses the IP header, then drops the packet. Results as follows: Baseline (don't touch data): 26.5 Mpps / 37.8 ns/pkt Touch data (ethernet hdr): 25.0 Mpps / 40.0 ns/pkt Parse IP (DPA): 24.1 Mpps / 41.5 ns/pkt Parse IP (bpf_xdp_load_bytes): 15.3 Mpps / 65.3 ns/pkt So 2.2 ns of overhead from reading the packet data, another 1.5 ns from the parsing logic, and a whopping 23.8 ns extra from switching to bpf_xdp_load_bytes(). This is with two calls to bpf_xdp_load_bytes(), one to get the Ethernet header, and another to get the IP header. Dropping one of them also drops the overhead in half, so it seems to fit with ~12 ns of overhead from a single call to bpf_xdp_load_bytes(). I pushed the code I used for testing here, in case someone else wants to play around with it: https://github.com/xdp-project/xdp-tools/tree/xdp-load-bytes It's part of the 'xdp-bench' utility. Run it as: ./xdp-bench drop <iface> -p parse-ip for DPA parsing and ./xdp-bench drop <iface> -p parse-ip -l to use bpf_xdp_load_bytes(). -Toke
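For context, a sketch of what the bpf_xdp_load_bytes() flavour of such a parse looks like with the two calls mentioned above; this is an illustration, not the actual code from the linked xdp-bench branch.

```c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp.frags")
int xdp_parse_ip_load_bytes(struct xdp_md *ctx)
{
	struct ethhdr eth;
	struct iphdr iph;

	/* First call: Ethernet header onto the stack. */
	if (bpf_xdp_load_bytes(ctx, 0, &eth, sizeof(eth)))
		return XDP_DROP;
	if (eth.h_proto != bpf_htons(ETH_P_IP))
		return XDP_DROP;

	/* Second call: IP header; roughly the ~12 ns/call measured above. */
	if (bpf_xdp_load_bytes(ctx, sizeof(eth), &iph, sizeof(iph)))
		return XDP_DROP;

	return XDP_DROP;	/* the benchmark drops every packet */
}

char _license[] SEC("license") = "GPL";
```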
On 05/01/2023 20:16, Jakub Kicinski wrote: > On Thu, 5 Jan 2023 11:57:32 -0500 Andy Gospodarek wrote: >>> So my main concern would be that if we "allow" this, the only way to >>> write an interoperable XDP program will be to use bpf_xdp_load_bytes() >>> for every packet access. Which will be slower than DPA, so we may end up >>> inadvertently slowing down all of the XDP ecosystem, because no one is >>> going to bother with writing two versions of their programs. Whereas if >>> you can rely on packet headers always being in the linear part, you can >>> write a lot of the "look at headers and make a decision" type programs >>> using just DPA, and they'll work for multibuf as well. >> >> The question I would have is what is really the 'slow down' for >> bpf_xdp_load_bytes() vs DPA? I know you and Jesper can tell me how many >> instructions each use. :) > > Until we have an efficient and inlined DPA access to frags an > unconditional memcpy() of the first 2 cachelines-worth of headers > in the driver must be faster than a piece-by-piece bpf_xdp_load_bytes() > onto the stack, right? > >> Taking a step back...years ago Dave mentioned wanting to make XDP >> programs easy to write and it feels like using these accessor APIs would >> help accomplish that. If the kernel examples use bpf_xdp_load_bytes() >> accessors everywhere then that would accomplish that. > > I've been pushing for an skb_header_pointer()-like helper but > the semantics were not universally loved :) Maybe it's time to re-consider. Is it something like an API that given an offset returns a pointer + allowed length to be accessed? This sounds like a good direction to me, that avoids having any linear-part-length assumptions, while preserving good performance. Maybe we can still require/guarantee that each single header (eth, ip, tcp, ...) does not cross a frag/page boundary. For otherwise, a prog needs to handle cases where headers span several fragments, so it has to reconstruct the header by copying the different parts into some local buffer. This can be achieved by having another assumption that AFAIK already holds today: all fragments are of size PAGE_SIZE. Regards, Tariq
On 08/01/2023 14:33, Tariq Toukan wrote: > > > On 05/01/2023 20:16, Jakub Kicinski wrote: >> On Thu, 5 Jan 2023 11:57:32 -0500 Andy Gospodarek wrote: >>>> So my main concern would be that if we "allow" this, the only way to >>>> write an interoperable XDP program will be to use bpf_xdp_load_bytes() >>>> for every packet access. Which will be slower than DPA, so we may >>>> end up >>>> inadvertently slowing down all of the XDP ecosystem, because no one is >>>> going to bother with writing two versions of their programs. Whereas if >>>> you can rely on packet headers always being in the linear part, you can >>>> write a lot of the "look at headers and make a decision" type programs >>>> using just DPA, and they'll work for multibuf as well. >>> >>> The question I would have is what is really the 'slow down' for >>> bpf_xdp_load_bytes() vs DPA? I know you and Jesper can tell me how many >>> instructions each use. :) >> >> Until we have an efficient and inlined DPA access to frags an >> unconditional memcpy() of the first 2 cachelines-worth of headers >> in the driver must be faster than a piece-by-piece bpf_xdp_load_bytes() >> onto the stack, right? >> >>> Taking a step back...years ago Dave mentioned wanting to make XDP >>> programs easy to write and it feels like using these accessor APIs would >>> help accomplish that. If the kernel examples use bpf_xdp_load_bytes() >>> accessors everywhere then that would accomplish that. >> >> I've been pushing for an skb_header_pointer()-like helper but >> the semantics were not universally loved :) > > Maybe it's time to re-consider. > > Is it something like an API that given an offset returns a pointer + > allowed length to be accessed? > > This sounds like a good direction to me, that avoids having any > linear-part-length assumptions, while preserving good performance. > > Maybe we can still require/guarantee that each single header (eth, ip, > tcp, ...) does not cross a frag/page boundary. For otherwise, a prog > needs to handle cases where headers span several fragments, so it has to > reconstruct the header by copying the different parts into some local > buffer. > > This can be achieved by having another assumption that AFAIK already > holds today: all fragments are of size PAGE_SIZE. > > Regards, > Tariq This can be a good starting point: static void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len) It's currently not exposed as a bpf-helper, and it works a bit differently to what I mentioned earlier: It gets the desired length, and fails in case it's not continuously accessible (i.e. this piece of data spans multiple frags).
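Based on the description above, a hedged paraphrase (not a copy of the kernel code) of what such a pointer-style accessor does: return a direct pointer when the requested range is contiguous in the linear part or in a single fragment, and NULL otherwise so the caller falls back to copying.

```c
#include <linux/skbuff.h>
#include <net/xdp.h>

static void *example_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len)
{
	u32 linear = xdp->data_end - xdp->data;
	struct skb_shared_info *sinfo;
	int i;

	/* Fully inside the linear part: hand out a direct pointer. */
	if (offset + len <= linear)
		return xdp->data + offset;

	if (!xdp_buff_has_frags(xdp))
		return NULL;

	sinfo = xdp_get_shared_info_from_buff(xdp);
	offset -= linear;
	for (i = 0; i < sinfo->nr_frags; i++) {
		skb_frag_t *frag = &sinfo->frags[i];
		u32 size = skb_frag_size(frag);

		if (offset < size)
			/* Either contiguous inside this frag, or it spans
			 * into the next one and the caller must copy. */
			return offset + len <= size ?
			       skb_frag_address(frag) + offset : NULL;
		offset -= size;
	}
	return NULL;	/* past the end of the frame */
}
```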
Tariq Toukan <ttoukan.linux@gmail.com> writes: > On 08/01/2023 14:33, Tariq Toukan wrote: >> >> >> On 05/01/2023 20:16, Jakub Kicinski wrote: >>> On Thu, 5 Jan 2023 11:57:32 -0500 Andy Gospodarek wrote: >>>>> So my main concern would be that if we "allow" this, the only way to >>>>> write an interoperable XDP program will be to use bpf_xdp_load_bytes() >>>>> for every packet access. Which will be slower than DPA, so we may >>>>> end up >>>>> inadvertently slowing down all of the XDP ecosystem, because no one is >>>>> going to bother with writing two versions of their programs. Whereas if >>>>> you can rely on packet headers always being in the linear part, you can >>>>> write a lot of the "look at headers and make a decision" type programs >>>>> using just DPA, and they'll work for multibuf as well. >>>> >>>> The question I would have is what is really the 'slow down' for >>>> bpf_xdp_load_bytes() vs DPA? I know you and Jesper can tell me how many >>>> instructions each use. :) >>> >>> Until we have an efficient and inlined DPA access to frags an >>> unconditional memcpy() of the first 2 cachelines-worth of headers >>> in the driver must be faster than a piece-by-piece bpf_xdp_load_bytes() >>> onto the stack, right? >>> >>>> Taking a step back...years ago Dave mentioned wanting to make XDP >>>> programs easy to write and it feels like using these accessor APIs would >>>> help accomplish that. If the kernel examples use bpf_xdp_load_bytes() >>>> accessors everywhere then that would accomplish that. >>> >>> I've been pushing for an skb_header_pointer()-like helper but >>> the semantics were not universally loved :) >> >> Maybe it's time to re-consider. >> >> Is it something like an API that given an offset returns a pointer + >> allowed length to be accessed? >> >> This sounds like a good direction to me, that avoids having any >> linear-part-length assumptions, while preserving good performance. >> >> Maybe we can still require/guarantee that each single header (eth, ip, >> tcp, ...) does not cross a frag/page boundary. For otherwise, a prog >> needs to handle cases where headers span several fragments, so it has to >> reconstruct the header by copying the different parts into some local >> buffer. >> >> This can be achieved by having another assumption that AFAIK already >> holds today: all fragments are of size PAGE_SIZE. >> >> Regards, >> Tariq > > This can be a good starting point: > static void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len) > > It's currently not exposed as a bpf-helper, and it works a bit > differently to what I mentioned earlier: It gets the desired length, and > fails in case it's not continuously accessible (i.e. this piece of data > spans multiple frags). Did a bit of digging through the mail archives. Exposing bpf_xdp_pointer() as a helper was proposed back in March last year: https://lore.kernel.org/r/20220306234311.452206-1-memxor@gmail.com The discussion of this seems to have ended on "let's use dynptrs instead". There was a patch series posted for this as well, which seems to have stalled out with this comment from Alexei in October: https://lore.kernel.org/r/CAADnVQKhv2YBrUAQJq6UyqoZJ-FGNQbKenGoPySPNK+GaOjBOg@mail.gmail.com -Toke
On Tue, Jan 03, 2023 at 05:21:53PM -0800, Jakub Kicinski wrote: > On Tue, 03 Jan 2023 16:19:49 +0100 Toke Høiland-Jørgensen wrote: > > Hmm, good question! I don't think we've ever explicitly documented any > > assumptions one way or the other. My own mental model has certainly > > always assumed the first frag would continue to be the same size as in > > non-multi-buf packets. > > Interesting! :) My mental model was closer to GRO by frags > so the linear part would have no data, just headers. > > A random datapoint is that bpf_xdp_adjust_head() seems > to enforce that there is at least ETH_HLEN. Also bpf_xdp_frags_increase_tail has the following check: if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz) return -EOPNOTSUPP; However, I can't seem to find where the `frag_size > frame_sz` part is actually used. Maybe this condition can be dropped? Can someone shed some light? BTW, Tariq, we seem to have missed setting frag_size to a non-zero value. Could you check that increasing the tail indeed doesn't work on fragmented packets on mlx5e? I can send a oneliner to fix that.
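For reference, the program-side operation that lands in bpf_xdp_frags_increase_tail() is a tail grow on a fragmented frame; a minimal sketch (the 32-byte delta is arbitrary):

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp.frags")
int xdp_grow_tail(struct xdp_md *ctx)
{
	/* On a fragmented frame this goes through
	 * bpf_xdp_frags_increase_tail(), which bails out with -EOPNOTSUPP
	 * unless the driver registered a non-zero rxq->frag_size, the gap
	 * discussed above for mlx5e. */
	if (bpf_xdp_adjust_tail(ctx, 32))
		return XDP_DROP;

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```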
On 10/01/2023 22:59, Maxim Mikityanskiy wrote: > On Tue, Jan 03, 2023 at 05:21:53PM -0800, Jakub Kicinski wrote: >> On Tue, 03 Jan 2023 16:19:49 +0100 Toke Høiland-Jørgensen wrote: >>> Hmm, good question! I don't think we've ever explicitly documented any >>> assumptions one way or the other. My own mental model has certainly >>> always assumed the first frag would continue to be the same size as in >>> non-multi-buf packets. >> >> Interesting! :) My mental model was closer to GRO by frags >> so the linear part would have no data, just headers. >> >> A random datapoint is that bpf_xdp_adjust_head() seems >> to enforce that there is at least ETH_HLEN. > > Also bpf_xdp_frags_increase_tail has the following check: > > if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz) > return -EOPNOTSUPP; > > However, I can't seem to find where the `frag_size > frame_sz` part is > actually used. Maybe this condition can be dropped? Can someone shed > some light? > > BTW, Tariq, we seem to have missed setting frag_size to a non-zero > value. Hey Maxim, Indeed. We use xdp_rxq_info_reg, it passes 0 as frag_size. > Could you check that increasing the tail indeed doesn't work on > fragmented packets on mlx5e? I can send a oneliner to fix that. I can test early next week, and update.
On 13/01/2023 23:07, Tariq Toukan wrote: > > > On 10/01/2023 22:59, Maxim Mikityanskiy wrote: >> On Tue, Jan 03, 2023 at 05:21:53PM -0800, Jakub Kicinski wrote: >>> On Tue, 03 Jan 2023 16:19:49 +0100 Toke Høiland-Jørgensen wrote: >>>> Hmm, good question! I don't think we've ever explicitly documented any >>>> assumptions one way or the other. My own mental model has certainly >>>> always assumed the first frag would continue to be the same size as in >>>> non-multi-buf packets. >>> >>> Interesting! :) My mental model was closer to GRO by frags >>> so the linear part would have no data, just headers. >>> >>> A random datapoint is that bpf_xdp_adjust_head() seems >>> to enforce that there is at least ETH_HLEN. >> >> Also bpf_xdp_frags_increase_tail has the following check: >> >> if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz) >> return -EOPNOTSUPP; >> >> However, I can't seem to find where the `frag_size > frame_sz` part is >> actually used. Maybe this condition can be dropped? Can someone shed >> some light? >> >> BTW, Tariq, we seem to have missed setting frag_size to a non-zero >> value. > > Hey Maxim, > Indeed. We use xdp_rxq_info_reg, it passes 0 as frag_size. > >> Could you check that increasing the tail indeed doesn't work on >> fragmented packets on mlx5e? I can send a oneliner to fix that. > > I can test early next week, and update. Hi Maxim, Hacking the existing adjust_tail sample program, in between other high prio tasks, to make it work with frags took me some time. But I can approve the fix below (or equivalent) is needed. Please go on and submit it. Regards, Tariq --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -600,7 +600,7 @@ static int mlx5e_init_rxq_rq(struct mlx5e_channel *c, struct mlx5e_params *param if (err) return err; - return xdp_rxq_info_reg(&rq->xdp_rxq, rq->netdev, rq->ix, c->napi.napi_id); + return __xdp_rxq_info_reg(&rq->xdp_rxq, rq->netdev, rq->ix, c->napi.napi_id, PAGE_SIZE); } static int mlx5_rq_shampo_alloc(struct mlx5_core_dev *mdev,
diff --git a/samples/bpf/xdp1_kern.c b/samples/bpf/xdp1_kern.c
index f0c5d95084de..0a5c704badd0 100644
--- a/samples/bpf/xdp1_kern.c
+++ b/samples/bpf/xdp1_kern.c
@@ -39,11 +39,13 @@ static int parse_ipv6(void *data, u64 nh_off, void *data_end)
 	return ip6h->nexthdr;
 }
 
-SEC("xdp1")
+#define XDPBUFSIZE 64
+SEC("xdp.frags")
 int xdp_prog1(struct xdp_md *ctx)
 {
-	void *data_end = (void *)(long)ctx->data_end;
-	void *data = (void *)(long)ctx->data;
+	__u8 pkt[XDPBUFSIZE] = {};
+	void *data_end = &pkt[XDPBUFSIZE-1];
+	void *data = pkt;
 	struct ethhdr *eth = data;
 	int rc = XDP_DROP;
 	long *value;
@@ -51,6 +53,9 @@ int xdp_prog1(struct xdp_md *ctx)
 	u64 nh_off;
 	u32 ipproto;
 
+	if (bpf_xdp_load_bytes(ctx, 0, pkt, sizeof(pkt)))
+		return rc;
+
 	nh_off = sizeof(*eth);
 	if (data + nh_off > data_end)
 		return rc;
diff --git a/samples/bpf/xdp2_kern.c b/samples/bpf/xdp2_kern.c
index d8a64ab077b0..3332ba6bb95f 100644
--- a/samples/bpf/xdp2_kern.c
+++ b/samples/bpf/xdp2_kern.c
@@ -55,11 +55,13 @@ static int parse_ipv6(void *data, u64 nh_off, void *data_end)
 	return ip6h->nexthdr;
 }
 
-SEC("xdp1")
+#define XDPBUFSIZE 64
+SEC("xdp.frags")
 int xdp_prog1(struct xdp_md *ctx)
 {
-	void *data_end = (void *)(long)ctx->data_end;
-	void *data = (void *)(long)ctx->data;
+	__u8 pkt[XDPBUFSIZE] = {};
+	void *data_end = &pkt[XDPBUFSIZE-1];
+	void *data = pkt;
 	struct ethhdr *eth = data;
 	int rc = XDP_DROP;
 	long *value;
@@ -67,6 +69,9 @@ int xdp_prog1(struct xdp_md *ctx)
 	u64 nh_off;
 	u32 ipproto;
 
+	if (bpf_xdp_load_bytes(ctx, 0, pkt, sizeof(pkt)))
+		return rc;
+
 	nh_off = sizeof(*eth);
 	if (data + nh_off > data_end)
 		return rc;
diff --git a/samples/bpf/xdp_tx_iptunnel_kern.c b/samples/bpf/xdp_tx_iptunnel_kern.c
index 575d57e4b8d6..0e2bca3a3fff 100644
--- a/samples/bpf/xdp_tx_iptunnel_kern.c
+++ b/samples/bpf/xdp_tx_iptunnel_kern.c
@@ -212,7 +212,7 @@ static __always_inline int handle_ipv6(struct xdp_md *xdp)
 	return XDP_TX;
 }
 
-SEC("xdp_tx_iptunnel")
+SEC("xdp.frags")
 int _xdp_tx_iptunnel(struct xdp_md *xdp)
 {
 	void *data_end = (void *)(long)xdp->data_end;