
[v14,bpf-next,00/18] mvneta: introduce XDP multi-buffer support

Message ID cover.1631289870.git.lorenzo@kernel.org

Message

Lorenzo Bianconi Sept. 10, 2021, 4:14 p.m. UTC
This series introduces XDP multi-buffer support. The mvneta driver is
the first to support the new "non-linear" xdp_{buff,frame}. Reviewers,
please focus on how these new types of xdp_{buff,frame} packets
traverse the different layers, and on the layout design. The BPF
helpers are deliberately kept simple, since we don't want to expose
the internal layout, so that later changes remain possible.

The main idea for the new multi-buffer layout is to reuse the same
structure used for non-linear SKBs. This relies on the skb_shared_info
struct at the end of the first buffer to link together subsequent
buffers. Keeping the layout compatible with SKBs also eases and speeds
up creating an SKB from an xdp_{buff,frame}.
Converting an xdp_frame to an SKB and delivering it to the network
stack is shown in patch 05/18 (e.g. for cpumaps).

A multi-buffer bit (mb) has been introduced in the flags field of the
xdp_{buff,frame} structure to notify the bpf/network layer whether this
is an xdp multi-buffer frame (mb = 1) or not (mb = 0).
The mb bit will be set by an xdp multi-buffer capable driver only for
non-linear frames, maintaining the capability to receive linear frames
without any extra cost, since the skb_shared_info structure at the end
of the first buffer will be initialized only if mb is set.
Moreover, the flags field in xdp_{buff,frame} will also be reused for
xdp rx csum offloading in a future series.
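As a rough userspace model of the scheme above (struct and flag names here are illustrative, not the kernel's actual identifiers):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of the flags field added to xdp_{buff,frame};
 * the flag name below is assumed, not taken from kernel code. */
#define XDP_FLAGS_MULTI_BUFF (1U << 0)

struct xdp_buff_model {
	void *data;
	void *data_end;
	uint32_t flags;	/* u32 flags field introduced by this series */
};

/* mb = 1: non-linear frame, the skb_shared_info at the end of the
 * first buffer is valid; mb = 0: linear frame, no extra cost. */
static int xdp_buff_is_mb(const struct xdp_buff_model *xdp)
{
	return (xdp->flags & XDP_FLAGS_MULTI_BUFF) != 0;
}
```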

Typical use cases for this series are:
- Jumbo-frames
- Packet header split (please see Google’s use-case @ NetDevConf 0x14, [0])
- TSO/GRO for XDP_REDIRECT

The following two eBPF helpers (and related selftests) have been introduced:
- bpf_xdp_adjust_data:
  Move the xdp_md->data and xdp_md->data_end pointers into subsequent
  fragments according to the offset provided by the eBPF program. This
  helper can be used to read/write values in the frame payload.
- bpf_xdp_get_buff_len:
  Return the total frame size (linear + paged parts)

The bpf_xdp_adjust_tail and bpf_xdp_copy helpers have been modified to
take xdp multi-buffer frames into account.
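A userspace sketch of the bpf_xdp_adjust_data semantics described above (following the v10 change noted below, where xdp->data moves to the requested payload offset rather than to the start of the fragment; the data structures here are a model, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FRAGS 4

struct frag { unsigned char *base; size_t len; };

struct xdp_model {
	struct frag frags[MAX_FRAGS];
	size_t nr_frags;
	unsigned char *data;     /* models xdp_md->data */
	unsigned char *data_end; /* models xdp_md->data_end */
};

/* Model of bpf_xdp_adjust_data(): point data at `offset` within the
 * fragment that contains it, and data_end at that fragment's end. */
static int xdp_adjust_data_model(struct xdp_model *x, size_t offset)
{
	size_t start = 0;

	for (size_t i = 0; i < x->nr_frags; i++) {
		size_t len = x->frags[i].len;

		if (offset < start + len) {
			x->data = x->frags[i].base + (offset - start);
			x->data_end = x->frags[i].base + len;
			return 0;
		}
		start += len;
	}
	return -1; /* offset beyond the total frame size */
}
```

After a call, direct reads/writes are only valid between data and data_end, i.e. within the current fragment.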

A multi-buffer enabled NIC may receive XDP frames with multiple frags.
If a BPF program does not understand mb layouts, it's possible to
contrive a BPF program that incorrectly views data_end as the end of
data when there is more data in the payload. Note that helpers will
generally do the correct thing; for example, perf_output will consume
the entire payload. But it is still possible that some programs could
do the wrong thing, even if only in an edge case. Although we expect
most BPF programs not to be impacted, we can't rule it out; you've
been warned.
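A sketch of that pitfall as a userspace model (the helper name is the one proposed in this series; the structures are illustrative): a program that treats data_end as the end of the packet under-counts a multi-buffer frame, while a bpf_xdp_get_buff_len-style total length does not.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of a multi-buffer frame: linear part plus frags. */
struct frag_model { size_t len; };

struct mb_frame {
	size_t linear_len;	/* what data_end - data would cover */
	struct frag_model frags[2];
	size_t nr_frags;
};

/* What an mb-unaware program computes from data/data_end. */
static size_t visible_len(const struct mb_frame *f)
{
	return f->linear_len;
}

/* What a bpf_xdp_get_buff_len-style helper would return. */
static size_t total_len(const struct mb_frame *f)
{
	size_t len = f->linear_len;

	for (size_t i = 0; i < f->nr_frags; i++)
		len += f->frags[i].len;
	return len;
}
```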

More info about the main idea behind this approach can be found here [1][2].

Changes since v13:
- use u32 for xdp_buff/xdp_frame flags field
- rename xdp_frags_tsize to xdp_frags_truesize
- fixed comments

Changes since v12:
- fix bpf_xdp_adjust_data helper for single-buffer use case
- return -EFAULT in bpf_xdp_adjust_{head,tail} in case the data pointers are not
  properly reset
- collect ACKs from John

Changes since v11:
- add missing static to bpf_xdp_get_buff_len_proto structure
- fix bpf_xdp_adjust_data helper when offset is smaller than linear area length.

Changes since v10:
- move xdp->data to the requested payload offset instead of to the beginning of
  the fragment in bpf_xdp_adjust_data()

Changes since v9:
- introduce bpf_xdp_adjust_data helper and related selftest
- add xdp_frags_size and xdp_frags_tsize fields in skb_shared_info
- introduce xdp_update_skb_shared_info utility routine in order to not reset
  the frags array in skb_shared_info when converting from an xdp_buff/xdp_frame
  to an skb
- simplify bpf_xdp_copy routine

Changes since v8:
- add proper dma unmapping if XDP_TX fails on mvneta for a xdp multi-buff
- switch back to skb_shared_info implementation from previous xdp_shared_info
  one
- avoid using a bitfield in xdp_buff/xdp_frame since it introduces performance
  regressions. Tested on a 10G NIC (ixgbe) to verify there are no performance
  penalties for the regular codebase
- add bpf_xdp_get_buff_len helper and remove frame_length field in xdp ctx
- add data_len field in skb_shared_info struct
- introduce XDP_FLAGS_FRAGS_PF_MEMALLOC flag

Changes since v7:
- rebase on top of bpf-next
- fix sparse warnings
- improve comments for frame_length in include/net/xdp.h

Changes since v6:
- the main difference with respect to previous versions is the new approach,
  proposed by Eelco, of passing the full length of the packet to the eBPF
  layer in the XDP context
- reintroduce multi-buff support to eBPF kselftests
- reintroduce multi-buff support to bpf_xdp_adjust_tail helper
- introduce multi-buffer support to bpf_xdp_copy helper
- rebase on top of bpf-next

Changes since v5:
- rebase on top of bpf-next
- initialize mb bit in xdp_init_buff() and drop per-driver initialization
- drop xdp->mb initialization in xdp_convert_zc_to_xdp_frame()
- postpone introduction of frame_length field in XDP ctx to another series
- minor changes

Changes since v4:
- rebase on top of bpf-next
- introduce xdp_shared_info to build xdp multi-buff instead of using the
  skb_shared_info struct
- introduce frame_length in xdp ctx
- drop previous bpf helpers
- fix bpf_xdp_adjust_tail for xdp multi-buff
- introduce xdp multi-buff self-tests for bpf_xdp_adjust_tail
- fix xdp_return_frame_bulk for xdp multi-buff

Changes since v3:
- rebase on top of bpf-next
- add patch 10/13 to copy back paged data from a xdp multi-buff frame to
  userspace buffer for xdp multi-buff selftests

Changes since v2:
- add throughput measurements
- drop bpf_xdp_adjust_mb_header bpf helper
- introduce selftest for xdp multibuffer
- addressed comments on bpf_xdp_get_frags_count
- introduce xdp multi-buff support to cpumaps

Changes since v1:
- Fix use-after-free in xdp_return_{buff/frame}
- Introduce bpf helpers
- Introduce xdp_mb sample program
- access skb_shared_info->nr_frags only on the last fragment

Changes since RFC:
- squash multi-buffer bit initialization in a single patch
- add mvneta non-linear XDP buff support for tx side

[0] https://netdevconf.info/0x14/session.html?talk-the-path-to-tcp-4k-mtu-and-rx-zerocopy
[1] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org
[2] https://netdevconf.info/0x14/session.html?tutorial-add-XDP-support-to-a-NIC-driver (XDP multi-buffers section)

Eelco Chaudron (3):
  bpf: add multi-buff support to the bpf_xdp_adjust_tail() API
  bpf: add multi-buffer support to xdp copy helpers
  bpf: update xdp_adjust_tail selftest to include multi-buffer

Lorenzo Bianconi (15):
  net: skbuff: add size metadata to skb_shared_info for xdp
  xdp: introduce flags field in xdp_buff/xdp_frame
  net: mvneta: update mb bit before passing the xdp buffer to eBPF layer
  net: mvneta: simplify mvneta_swbm_add_rx_fragment management
  net: xdp: add xdp_update_skb_shared_info utility routine
  net: marvell: rely on xdp_update_skb_shared_info utility routine
  xdp: add multi-buff support to xdp_return_{buff/frame}
  net: mvneta: add multi buffer support to XDP_TX
  net: mvneta: enable jumbo frames for XDP
  bpf: introduce bpf_xdp_get_buff_len helper
  bpf: move user_size out of bpf_test_init
  bpf: introduce multibuff support to bpf_prog_test_run_xdp()
  bpf: test_run: add xdp_shared_info pointer in bpf_test_finish
    signature
  net: xdp: introduce bpf_xdp_adjust_data helper
  bpf: add bpf_xdp_adjust_data selftest

 drivers/net/ethernet/marvell/mvneta.c         | 204 ++++++++++-------
 include/linux/skbuff.h                        |   6 +-
 include/net/xdp.h                             |  95 +++++++-
 include/uapi/linux/bpf.h                      |  39 ++++
 kernel/trace/bpf_trace.c                      |   3 +
 net/bpf/test_run.c                            | 117 ++++++++--
 net/core/filter.c                             | 216 +++++++++++++++++-
 net/core/xdp.c                                |  76 +++++-
 tools/include/uapi/linux/bpf.h                |  39 ++++
 .../bpf/prog_tests/xdp_adjust_data.c          |  63 +++++
 .../bpf/prog_tests/xdp_adjust_tail.c          | 118 ++++++++++
 .../selftests/bpf/prog_tests/xdp_bpf2bpf.c    | 151 ++++++++----
 .../bpf/progs/test_xdp_adjust_tail_grow.c     |  10 +-
 .../bpf/progs/test_xdp_adjust_tail_shrink.c   |  32 ++-
 .../selftests/bpf/progs/test_xdp_bpf2bpf.c    |   2 +-
 .../bpf/progs/test_xdp_update_frags.c         |  45 ++++
 16 files changed, 1051 insertions(+), 165 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_adjust_data.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_update_frags.c

Comments

Jakub Kicinski Sept. 16, 2021, 4:55 p.m. UTC | #1
On Fri, 10 Sep 2021 18:14:06 +0200 Lorenzo Bianconi wrote:
> The two following ebpf helpers (and related selftests) has been introduced:
> - bpf_xdp_adjust_data:
>   Move xdp_md->data and xdp_md->data_end pointers in subsequent fragments
>   according to the offset provided by the ebpf program. This helper can be
>   used to read/write values in frame payload.
> - bpf_xdp_get_buff_len:
>   Return the total frame size (linear + paged parts)

> More info about the main idea behind this approach can be found here [1][2].

Is there much critique of the skb helpers we have? My intuition would
be to follow a similar paradigm from the API perspective. It may seem
trivial to us to switch between the two but "normal" users could easily
be confused.

By skb paradigm I mean skb_pull_data() and bpf_skb_load/store_bytes().

Alternatively how about we produce a variation on skb_header_pointer()
(use on-stack buffer or direct access if the entire region is in one
frag).

bpf_xdp_adjust_data() seems to add cost to helpers and TBH I'm not sure
how practical it would be to applications. My understanding is that the
application is not supposed to make assumptions about the fragment
geometry, meaning data can be split at any point. Parsing data
arbitrarily split into buffers is hard if pull() is not an option, let
alone making such parsing provably correct.

Won't applications end up building something like skb_header_pointer()
based on bpf_xdp_adjust_data(), anyway? In which case why don't we
provide them what they need?

say: 

void *xdp_mb_pointer(struct xdp_buff *xdp_md, u32 flags, 
                     u32 offset, u32 len, void *stack_buf)

flags and offset can be squashed into one u64 as needed. Helper returns
pointer to packet data, either real one or stack_buf. Verifier has to
be taught that the return value is NULL or a pointer which is safe with
offsets up to @len.

If the reason for access is write we'd also need:

void *xdp_mb_pointer_flush(struct xdp_buff *xdp_md, u32 flags, 
                           u32 offset, u32 len, void *stack_buf)

Same inputs, if stack buffer was used it does write back, otherwise nop.
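A minimal userspace model of the pointer/flush pair sketched above (the behavior is inferred from the description here, not from any kernel implementation): return a direct pointer when the requested region sits in one fragment, otherwise gather into the caller's buffer; flush scatters the buffer back only in the gather case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct frag { unsigned char *base; size_t len; };
struct pkt  { struct frag frags[4]; size_t nr_frags; };

/* Locate the fragment containing absolute offset `off`. */
static struct frag *find_frag(struct pkt *p, size_t off, size_t *frag_off)
{
	for (size_t i = 0; i < p->nr_frags; i++) {
		if (off < p->frags[i].len) {
			*frag_off = off;
			return &p->frags[i];
		}
		off -= p->frags[i].len;
	}
	return NULL;
}

static void *mb_pointer(struct pkt *p, size_t off, size_t len,
			void *stack_buf)
{
	size_t frag_off;
	struct frag *f = find_frag(p, off, &frag_off);

	if (!f)
		return NULL;
	if (frag_off + len <= f->len)	/* fast path: one fragment */
		return f->base + frag_off;

	/* slow path: gather byte by byte across fragments */
	for (size_t i = 0; i < len; i++) {
		f = find_frag(p, off + i, &frag_off);
		if (!f)
			return NULL;
		((unsigned char *)stack_buf)[i] = f->base[frag_off];
	}
	return stack_buf;
}

static void mb_pointer_flush(struct pkt *p, size_t off, size_t len,
			     void *stack_buf)
{
	size_t frag_off;
	struct frag *f = find_frag(p, off, &frag_off);

	if (!f || frag_off + len <= f->len)
		return;			/* direct pointer was used: nop */
	for (size_t i = 0; i < len; i++) {	/* scatter back */
		f = find_frag(p, off + i, &frag_off);
		f->base[frag_off] = ((unsigned char *)stack_buf)[i];
	}
}
```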

Sorry for the longish email if I'm missing something obvious and/or
discussed earlier.


The other thing I wanted to double check - was the decision on program
compatibility made? Is a new program type an option? It'd be extremely
useful operationally to be able to depend on kernel enforcement.
Lorenzo Bianconi Sept. 17, 2021, 2:51 p.m. UTC | #2
> On Fri, 10 Sep 2021 18:14:06 +0200 Lorenzo Bianconi wrote:
> > The two following ebpf helpers (and related selftests) has been introduced:
> > - bpf_xdp_adjust_data:
> >   Move xdp_md->data and xdp_md->data_end pointers in subsequent fragments
> >   according to the offset provided by the ebpf program. This helper can be
> >   used to read/write values in frame payload.
> > - bpf_xdp_get_buff_len:
> >   Return the total frame size (linear + paged parts)
> 
> > More info about the main idea behind this approach can be found here [1][2].
> 
> Is there much critique of the skb helpers we have? My intuition would
> be to follow a similar paradigm from the API perspective. It may seem
> trivial to us to switch between the two but "normal" users could easily
> be confused.
> 
> By skb paradigm I mean skb_pull_data() and bpf_skb_load/store_bytes().
> 
> Alternatively how about we produce a variation on skb_header_pointer()
> (use on-stack buffer or direct access if the entire region is in one
> frag).
> 
> bpf_xdp_adjust_data() seems to add cost to helpers and TBH I'm not sure
> how practical it would be to applications. My understanding is that the
> application is not supposed to make assumptions about the fragment
> geometry, meaning data can be split at any point. Parsing data
> arbitrarily split into buffers is hard if pull() is not an option, let
> alone making such parsing provably correct.
> 
> Won't applications end up building something like skb_header_pointer()
> based on bpf_xdp_adjust_data(), anyway? In which case why don't we
> provide them what they need?

Please correct me if I am wrong: here you mean that with
bpf_xdp_adjust_data() we are moving the logic to read/write data across
fragment boundaries to the caller, right?
I do not have a clear view of what a real use-case for the helper could
be (maybe John can help on this), but similar to what you are
suggesting, what about doing something like bpf_skb_load/store_bytes()?

- bpf_xdp_load_bytes(struct xdp_buff *xdp_md, u32 offset, u32 len,
		     void *data)

- bpf_xdp_store_bytes(struct xdp_buff *xdp_md, u32 offset, u32 len,
		      void *data)

the helper can take care of reading/writing across fragment boundaries
and hide any layout info from the caller. The only downside here
(as for bpf_skb_load/store_bytes()) is that we need to copy. But in a
real application, is it actually an issue? (since we have much lower
pps for xdp multi-buff).
Moreover, I do not know if this solution will require some verifier
changes.
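A userspace sketch of the proposed pair (names and semantics are assumed from the signatures above): a plain linear copy in or out of the frame, crossing fragment boundaries transparently, so the caller sees no layout at all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct frag { unsigned char *base; size_t len; };
struct pkt  { struct frag frags[4]; size_t nr_frags; };

/* Shared walk for the load (store = 0) and store (store = 1) cases. */
static int xdp_copy_model(struct pkt *p, size_t off, size_t len,
			  unsigned char *buf, int store)
{
	for (size_t i = 0; i < p->nr_frags && len; i++) {
		struct frag *f = &p->frags[i];

		if (off >= f->len) {	/* region starts past this frag */
			off -= f->len;
			continue;
		}
		size_t n = f->len - off < len ? f->len - off : len;

		if (store)
			memcpy(f->base + off, buf, n);
		else
			memcpy(buf, f->base + off, n);
		buf += n;
		len -= n;
		off = 0;
	}
	return len ? -1 : 0;	/* -1: request ran past the frame */
}

static int xdp_load_bytes_model(struct pkt *p, size_t off, size_t len,
				void *dst)
{
	return xdp_copy_model(p, off, len, dst, 0);
}

static int xdp_store_bytes_model(struct pkt *p, size_t off, size_t len,
				 void *src)
{
	return xdp_copy_model(p, off, len, src, 1);
}
```

Unlike the header-pointer approach, the copy always happens, but the caller never has to reason about fragment geometry.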

@John: can this approach work in your use-case?

Anyway I think we should try to get everyone on the same page here since the
helper can change according to specific use-case. Since this series is on the
agenda for LPC next week, I hope you and others who have an opinion about this
will find the time to come and discuss it during the conference :)

Regards,
Lorenzo
> 
> say: 
> 
> void *xdp_mb_pointer(struct xdp_buff *xdp_md, u32 flags, 
>                      u32 offset, u32 len, void *stack_buf)
> 
> flags and offset can be squashed into one u64 as needed. Helper returns
> pointer to packet data, either real one or stack_buf. Verifier has to
> be taught that the return value is NULL or a pointer which is safe with
> offsets up to @len.
> 
> If the reason for access is write we'd also need:
> 
> void *xdp_mb_pointer_flush(struct xdp_buff *xdp_md, u32 flags, 
>                            u32 offset, u32 len, void *stack_buf)
> 
> Same inputs, if stack buffer was used it does write back, otherwise nop.
> 
> Sorry for the longish email if I'm missing something obvious and/or
> discussed earlier.
> 
> 
> The other thing I wanted to double check - was the decision on program
> compatibility made? Is a new program type an option? It'd be extremely
> useful operationally to be able to depend on kernel enforcement.
Jakub Kicinski Sept. 17, 2021, 6:33 p.m. UTC | #3
On Fri, 17 Sep 2021 16:51:06 +0200 Lorenzo Bianconi wrote:
> > Is there much critique of the skb helpers we have? My intuition would
> > be to follow a similar paradigm from the API perspective. It may seem
> > trivial to us to switch between the two but "normal" users could easily
> > be confused.
> > 
> > By skb paradigm I mean skb_pull_data() and bpf_skb_load/store_bytes().
> > 
> > Alternatively how about we produce a variation on skb_header_pointer()
> > (use on-stack buffer or direct access if the entire region is in one
> > frag).
> > 
> > bpf_xdp_adjust_data() seems to add cost to helpers and TBH I'm not sure
> > how practical it would be to applications. My understanding is that the
> > application is not supposed to make assumptions about the fragment
> > geometry, meaning data can be split at any point. Parsing data
> > arbitrarily split into buffers is hard if pull() is not an option, let
> > alone making such parsing provably correct.
> > 
> > Won't applications end up building something like skb_header_pointer()
> > based on bpf_xdp_adjust_data(), anyway? In which case why don't we
> > provide them what they need?  
> 
> Please correct me if I am wrong, here you mean in bpf_xdp_adjust_data()
> we are moving the logic to read/write data across fragment boundaries
> to the caller. Right.
> I do not have a clear view about what could be a real use-case for the helper
> (maybe John can help on this), but similar to what you are suggesting, what
> about doing something like bpf_skb_load/store_bytes()?
> 
> - bpf_xdp_load_bytes(struct xdp_buff *xdp_md, u32 offset, u32 len,
> 		     void *data)
> 
> - bpf_xdp_store_bytes(struct xdp_buff *xdp_md, u32 offset, u32 len,
> 		      void *data)
> 
> the helper can take care of reading/writing across fragment boundaries
> and remove any layout info from the caller. The only downside here
> (as for bpf_skb_load/store_bytes()) is we need to copy.

If bpf_xdp_load_bytes() / bpf_xdp_store_bytes() works for most we
can start with that. In all honesty I don't know what the exact
use cases for looking at data are, either. I'm primarily worried 
about exposing the kernel internals too early.

What I'm imagining is that data access mostly works around bad
header/data split or loading some simple >L4 headers. On one hand
such headers will fall into one frag, so 99.9% of the time the copy
isn't really required. But as I said I'm happy with load/store, to
begin with.

> But in a real application, is it actually an issue? (since we have
> much less pps for xdp multi-buff).
> Moreover I do not know if this solution will requires some verifier
> changes.
> 
> @John: can this approach works in your use-case?
> 
> Anyway I think we should try to get everyone on the same page here
> since the helper can change according to specific use-case. Since
> this series is on the agenda for LPC next week, I hope you and others
> who have an opinion about this will find the time to come and discuss
> it during the conference :)

Yup, I'll be there. I'm not good at thinking on my feet tho, hence
sharing my comments now.
Alexei Starovoitov Sept. 17, 2021, 6:43 p.m. UTC | #4
On Fri, Sep 17, 2021 at 11:33 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> > >
> > > Alternatively how about we produce a variation on skb_header_pointer()
> > > (use on-stack buffer or direct access if the entire region is in one
> > > frag).
>
> If bpf_xdp_load_bytes() / bpf_xdp_store_bytes() works for most we
> can start with that. In all honesty I don't know what the exact
> use cases for looking at data are, either. I'm primarily worried
> about exposing the kernel internals too early.

I don't mind the xdp equivalent of skb_load_bytes,
but skb_header_pointer() idea is superior.
When we did xdp with data/data_end there was no refine_retval_range
concept in the verifier (iirc or we just missed that opportunity).
We'd need something more advanced: a pointer with valid range
refined by input argument 'len' or NULL.
The verifier doesn't have such thing yet, but it fits as a combination of
value_or_null plus refine_retval_range.
The bpf_xdp_header_pointer() and bpf_skb_header_pointer()
would probably simplify bpf programs as well.
There would be no need to deal with data/data_end.
Jakub Kicinski Sept. 17, 2021, 7 p.m. UTC | #5
On Fri, 17 Sep 2021 11:43:07 -0700 Alexei Starovoitov wrote:
> > If bpf_xdp_load_bytes() / bpf_xdp_store_bytes() works for most we
> > can start with that. In all honesty I don't know what the exact
> > use cases for looking at data are, either. I'm primarily worried
> > about exposing the kernel internals too early.  
> 
> I don't mind the xdp equivalent of skb_load_bytes,
> but skb_header_pointer() idea is superior.
> When we did xdp with data/data_end there was no refine_retval_range
> concept in the verifier (iirc or we just missed that opportunity).
> We'd need something more advanced: a pointer with valid range
> refined by input argument 'len' or NULL.
> The verifier doesn't have such thing yet, but it fits as a combination of
> value_or_null plus refine_retval_range.
> The bpf_xdp_header_pointer() and bpf_skb_header_pointer()
> would probably simplify bpf programs as well.
> There would be no need to deal with data/data_end.

What are your thoughts on inlining? Can we inline the common case 
of the header being in the "head"? Otherwise data/end comparisons 
would be faster.
Alexei Starovoitov Sept. 17, 2021, 7:10 p.m. UTC | #6
On Fri, Sep 17, 2021 at 12:00 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 17 Sep 2021 11:43:07 -0700 Alexei Starovoitov wrote:
> > > If bpf_xdp_load_bytes() / bpf_xdp_store_bytes() works for most we
> > > can start with that. In all honesty I don't know what the exact
> > > use cases for looking at data are, either. I'm primarily worried
> > > about exposing the kernel internals too early.
> >
> > I don't mind the xdp equivalent of skb_load_bytes,
> > but skb_header_pointer() idea is superior.
> > When we did xdp with data/data_end there was no refine_retval_range
> > concept in the verifier (iirc or we just missed that opportunity).
> > We'd need something more advanced: a pointer with valid range
> > refined by input argument 'len' or NULL.
> > The verifier doesn't have such thing yet, but it fits as a combination of
> > value_or_null plus refine_retval_range.
> > The bpf_xdp_header_pointer() and bpf_skb_header_pointer()
> > would probably simplify bpf programs as well.
> > There would be no need to deal with data/data_end.
>
> What are your thoughts on inlining? Can we inline the common case
> of the header being in the "head"? Otherwise data/end comparisons
> would be faster.

Yeah. It can be inlined by the verifier.
It would still look like a call from bpf prog pov with llvm doing spill/fill
of scratched regs, but it's minor.

Also we can use the same bpf_header_pointer(ctx, ...)
helper for both xdp and skb program types. They will have different
implementation underneath, but this might make possible writing bpf
programs that could work in both xdp and skb context.
I believe cilium has fancy macros to achieve that.
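A toy userspace illustration of that idea (purely a sketch; the real cilium macros and any future bpf_header_pointer() look different): one name dispatching to a per-context implementation, so the same program body can compile against either context.

```c
#include <assert.h>
#include <stddef.h>

/* Two distinct context types standing in for xdp_md and __sk_buff. */
struct xdp_ctx { const unsigned char *pkt; size_t len; };
struct skb_ctx { const unsigned char *pkt; size_t len; };

static const void *xdp_hdr_ptr(struct xdp_ctx *c, size_t off, size_t len)
{
	return off + len <= c->len ? c->pkt + off : NULL;
}

static const void *skb_hdr_ptr(struct skb_ctx *c, size_t off, size_t len)
{
	return off + len <= c->len ? c->pkt + off : NULL;
}

/* One helper name, a different implementation underneath per context. */
#define header_pointer(ctx, off, len) _Generic((ctx),		\
	struct xdp_ctx *: xdp_hdr_ptr,				\
	struct skb_ctx *: skb_hdr_ptr)(ctx, off, len)
```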
John Fastabend Sept. 17, 2021, 10:07 p.m. UTC | #7
Alexei Starovoitov wrote:
> On Fri, Sep 17, 2021 at 12:00 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Fri, 17 Sep 2021 11:43:07 -0700 Alexei Starovoitov wrote:
> > > > If bpf_xdp_load_bytes() / bpf_xdp_store_bytes() works for most we
> > > > can start with that. In all honesty I don't know what the exact
> > > > use cases for looking at data are, either. I'm primarily worried
> > > > about exposing the kernel internals too early.
> > >
> > > I don't mind the xdp equivalent of skb_load_bytes,
> > > but skb_header_pointer() idea is superior.
> > > When we did xdp with data/data_end there was no refine_retval_range
> > > concept in the verifier (iirc or we just missed that opportunity).
> > > We'd need something more advanced: a pointer with valid range
> > > refined by input argument 'len' or NULL.
> > > The verifier doesn't have such thing yet, but it fits as a combination of
> > > value_or_null plus refine_retval_range.
> > > The bpf_xdp_header_pointer() and bpf_skb_header_pointer()
> > > would probably simplify bpf programs as well.
> > > There would be no need to deal with data/data_end.
> >
> > What are your thoughts on inlining? Can we inline the common case
> > of the header being in the "head"? Otherwise data/end comparisons
> > would be faster.
> 
> Yeah. It can be inlined by the verifier.
> It would still look like a call from bpf prog pov with llvm doing spill/fill
> of scratched regs, but it's minor.
> 
> Also we can use the same bpf_header_pointer(ctx, ...)
> helper for both xdp and skb program types. They will have different
> implementation underneath, but this might make possible writing bpf
> programs that could work in both xdp and skb context.
> I believe cilium has fancy macros to achieve that.

Hi,

First, a header_pointer() logic that works across skb and xdp seems like
a great idea to me. I wonder, though, if instead of doing the copy
into a new buffer for an offset past the initial frag, like what is done
in skb_header_pointer(), we could just walk the frags and point at the
new offset. This is what we do on the socket side with
bpf_msg_pull_data(), for example. For XDP it should also work. The skb
case would depend on clone state and such, so it might be a bit more
tricky there.

This has the advantage of only doing the copy when it's necessary. This
can be useful for example when reading the tail of an IPsec packet. With
a blind copy most packets will get hit with a copy. By just updating
pkt->data and pkt->data_end we can avoid this case.

Lorenz originally implemented something similar earlier, and we had the
refine-retval logic. It failed on no-alu32 for some reason we could
revisit. I didn't mind the current helper returning with the data
pointer set to the start of the frag, so we stopped following up on it.

I agree, though, that the current implementation puts a lot on the BPF
writer. So getting both cases covered (I want to take pains in my BPF
prog to avoid copies, and I just want these bytes handled behind a
single helper) seems good to me.

Thanks,
John
Toke Høiland-Jørgensen Sept. 18, 2021, 11:53 a.m. UTC | #8
John Fastabend <john.fastabend@gmail.com> writes:

> Alexei Starovoitov wrote:
>> On Fri, Sep 17, 2021 at 12:00 PM Jakub Kicinski <kuba@kernel.org> wrote:
>> >
>> > On Fri, 17 Sep 2021 11:43:07 -0700 Alexei Starovoitov wrote:
>> > > > If bpf_xdp_load_bytes() / bpf_xdp_store_bytes() works for most we
>> > > > can start with that. In all honesty I don't know what the exact
>> > > > use cases for looking at data are, either. I'm primarily worried
>> > > > about exposing the kernel internals too early.
>> > >
>> > > I don't mind the xdp equivalent of skb_load_bytes,
>> > > but skb_header_pointer() idea is superior.
>> > > When we did xdp with data/data_end there was no refine_retval_range
>> > > concept in the verifier (iirc or we just missed that opportunity).
>> > > We'd need something more advanced: a pointer with valid range
>> > > refined by input argument 'len' or NULL.
>> > > The verifier doesn't have such thing yet, but it fits as a combination of
>> > > value_or_null plus refine_retval_range.
>> > > The bpf_xdp_header_pointer() and bpf_skb_header_pointer()
>> > > would probably simplify bpf programs as well.
>> > > There would be no need to deal with data/data_end.
>> >
>> > What are your thoughts on inlining? Can we inline the common case
>> > of the header being in the "head"? Otherwise data/end comparisons
>> > would be faster.
>> 
>> Yeah. It can be inlined by the verifier.
>> It would still look like a call from bpf prog pov with llvm doing spill/fill
>> of scratched regs, but it's minor.
>> 
>> Also we can use the same bpf_header_pointer(ctx, ...)
>> helper for both xdp and skb program types. They will have different
>> implementation underneath, but this might make possible writing bpf
>> programs that could work in both xdp and skb context.
>> I believe cilium has fancy macros to achieve that.
>
> Hi,
>
> First a header_pointer() logic that works across skb and xdp seems like
> a great idea to me. I wonder though if instead of doing the copy
> into a new buffer for offset past the initial frag like what is done in
> skb_header_pointer could we just walk the frags and point at the new offset.
> This is what we do on the socket side with bpf_msg_pull-data() for example.
> For XDP it should also work. The skb case would depend on clone state
> and things so might be a bit more tricky there.
>
> This has the advantage of only doing the copy when its necessary. This
> can be useful for example when reading the tail of an IPsec packet. With
> blind copy most packets will get hit with a copy. By just writing the
> pkt->data and pkt->data_end we can avoid this case.
>
> Lorenz originally implemented something similar earlier and we had the
> refine retval logic. It failed on no-alu32 for some reason we could
> revisit. I didn't mind the current help returning with data pointer set
> to the start of the frag so we stopped following up on it.
>
> I agree though the current implementation puts a lot on the BPF writer.
> So getting both cases covered, I want to take pains in my BPF prog
> to avoid copies and I just want these bytes handled behind a single
> helper seems good to me.

I'm OK with a bpf_header_pointer()-type helper - I quite like the
in-kernel version of this for SKBs, so replicating it as a BPF helper
would be great. But I'm a little worried about taking a performance hit.

I.e., if you do:

ptr = bpf_header_pointer(pkt, offset, len, stack_ptr)
*ptr = xxx;

then, if the helper ended up copying the data into the stack pointer,
you didn't actually change anything in the packet, so you need to do a
writeback.

Jakub suggested up-thread that this should be done with some kind of
flush() helper. But you don't know whether the header_pointer()-helper
copied the data, so you always need to call the flush() helper, which
will incur overhead. If the verifier can in-line the helpers that will
lower it, but will it be enough to make it negligible?

-Toke
Jakub Kicinski Sept. 20, 2021, 6:02 p.m. UTC | #9
On Sat, 18 Sep 2021 13:53:35 +0200 Toke Høiland-Jørgensen wrote:
> I'm OK with a bpf_header_pointer()-type helper - I quite like the
> in-kernel version of this for SKBs, so replicating it as a BPF helper
> would be great. But I'm a little worried about taking a performance hit.
> 
> I.e., if you do:
> 
> ptr = bpf_header_pointer(pkt, offset, len, stack_ptr)
> *ptr = xxx;
> 
> then, if the helper ended up copying the data into the stack pointer,
> you didn't actually change anything in the packet, so you need to do a
> writeback.
> 
> Jakub suggested up-thread that this should be done with some kind of
> flush() helper. But you don't know whether the header_pointer()-helper
> copied the data, so you always need to call the flush() helper, which
> will incur overhead. If the verifier can in-line the helpers that will
> lower it, but will it be enough to make it negligible?

Depends on the assumptions the program otherwise makes, right?

For reading I'd expect a *layout-independent* TC program would 
replace approximately:

ptr = <some_ptr>;
if (ptr + CONST >= md->ptr_end) {
	if (bpf_pull_data(md, off + CONST))
		return DROP;
	ptr = <some_ptr>;
	if (ptr + CONST >= md->ptr_end)
		return DROP; /* da hell? */
}

With this (pre-inlining):

ptr = bpf_header_pointer(md, offset, len, stack);
if (!ptr)
	return DROP;

Post-inlining (assuming static validation of args to prevent wraps):

if (md->ptr + args->off + args->len < md->ptr_end)
	ptr = md->ptr + args->off;
else
	ptr = __bpf_header_pointer(md, offset, len, stack);
if (!ptr)
	return DROP;

But that's based on guesswork so perhaps I'm off base.


Regarding the flush() I was expecting that most progs will not modify
the packet (or at least won't modify most headers they load) so no
point paying the price of tracking changes auto-magically.

In fact I don't think there is anything infra can do better for
flushing than the prog itself:

	bool mod = false;

	ptr = bpf_header_pointer(...);
	...
	if (some_cond(...)) {
		change_packet(...);
		mod = true;
	}
	...
	if (mod)
		bpf_header_pointer_flush();


is simple enough.. to me.
Toke Høiland-Jørgensen Sept. 20, 2021, 9:01 p.m. UTC | #10
Jakub Kicinski <kuba@kernel.org> writes:

> On Sat, 18 Sep 2021 13:53:35 +0200 Toke Høiland-Jørgensen wrote:
>> I'm OK with a bpf_header_pointer()-type helper - I quite like the
>> in-kernel version of this for SKBs, so replicating it as a BPF helper
>> would be great. But I'm a little worried about taking a performance hit.
>> 
>> I.e., if you do:
>> 
>> ptr = bpf_header_pointer(pkt, offset, len, stack_ptr)
>> *ptr = xxx;
>> 
>> then, if the helper ended up copying the data into the stack pointer,
>> you didn't actually change anything in the packet, so you need to do a
>> writeback.
>> 
>> Jakub suggested up-thread that this should be done with some kind of
>> flush() helper. But you don't know whether the header_pointer()-helper
>> copied the data, so you always need to call the flush() helper, which
>> will incur overhead. If the verifier can in-line the helpers that will
>> lower it, but will it be enough to make it negligible?
>
> Depends on the assumptions the program otherwise makes, right?
>
> For reading I'd expect a *layout-independent* TC program would 
> replace approximately:
>
> ptr = <some_ptr>;
> if (ptr + CONST >= md->ptr_end) {
> 	if (bpf_pull_data(md, off + CONST))
> 		return DROP;
> 	ptr = <some_ptr>;
> 	if (ptr + CONST >= md->ptr_end)
> 		return DROP; /* da hell? */
> }
>
> With this (pre-inlining):
>
> ptr = bpf_header_pointer(md, offset, len, stack);
> if (!ptr)
> 	return DROP;
>
> Post-inlining (assuming static validation of args to prevent wraps):
>
> if (md->ptr + args->off + args->len < md->ptr_end)
> 	ptr = md->ptr + args->off;
> else
> 	ptr = __bpf_header_pointer(md, offset, len, stack);
> if (!ptr)
> 	return DROP;
>
> But that's based on guesswork so perhaps I'm off base.

Yeah, that's more or less what I was thinking...

> Regarding the flush() I was expecting that most progs will not modify
> the packet (or at least won't modify most headers they load) so no
> point paying the price of tracking changes auto-magically.

... but I guess my mental model assumed that packet writes would be just
as frequent as reads, in which case it would be a win if you could amend
your example like:

> In fact I don't think there is anything infra can do better for
> flushing than the prog itself:
>
> 	bool mod = false;
>
> 	ptr = bpf_header_pointer(...);
> 	...
> 	if (some_cond(...)) {
> 		change_packet(...);
> 		mod = true;
> 	}
> 	...
> 	if (mod)

to have an additional check like:

if (mod && ptr == stack)

(or something to that effect). No?
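[Editor's sketch] The check suggested above reduces to a two-condition predicate: a writeback is only needed when the program actually modified the data *and* the helper fell back to the caller's stack buffer. A minimal sketch, with `flush_needed()` as a hypothetical name:

```c
#include <assert.h>
#include <stdbool.h>

/* Writes through a direct packet pointer land in the packet already;
 * only a modified stack-buffer copy needs to be flushed back.
 */
static bool flush_needed(bool modified, const void *ret_ptr,
			 const void *stack_buf)
{
	return modified && ret_ptr == stack_buf;
}
```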

-Toke

> 		bpf_header_pointer_flush();
>
>
> is simple enough.. to me.
Jakub Kicinski Sept. 20, 2021, 9:25 p.m. UTC | #11
On Mon, 20 Sep 2021 23:01:48 +0200 Toke Høiland-Jørgensen wrote:
> > In fact I don't think there is anything infra can do better for
> > flushing than the prog itself:
> >
> > 	bool mod = false;
> >
> > 	ptr = bpf_header_pointer(...);
> > 	...
> > 	if (some_cond(...)) {
> > 		change_packet(...);
> > 		mod = true;
> > 	}
> > 	...
> > 	if (mod)  
> 
> to have an additional check like:
> 
> if (mod && ptr == stack)
> 
> (or something to that effect). No?

Good point. Do you think we should have the kernel add/inline this
optimization or have the user do it explicitly?

The draft API was:

void *xdp_mb_pointer_flush(struct xdp_buff *xdp_md, u32 flags, 
                           u32 offset, u32 len, void *stack_buf)

Which does not take the ptr returned by header_pointer(), but that's
easy to add (well, easy other than the fact it'd be the 6th arg).

BTW I drafted the API this way to cater to the case where flush()
is called without a prior call to header_pointer(). For when packet
trailer or header is populated directly from a map value. Dunno if
that's actually useful, either.
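[Editor's sketch] A flush() that does not assume a prior header_pointer() call is essentially a scatter-write: copy len bytes from some source (a stack buffer, or a map value) into the fragmented packet at a given offset. A userspace sketch of that semantic, with illustrative names only:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct mb_frag {
	unsigned char *data;
	size_t len;
};

/* Scatter len bytes from src into the fragments starting at off.
 * Returns 0 on success, -1 if the region runs past the last fragment.
 */
static int store_bytes(struct mb_frag *frags, int nr_frags,
		       size_t off, const void *src, size_t len)
{
	const unsigned char *s = src;
	int i;

	for (i = 0; i < nr_frags && len; i++) {
		size_t chunk;

		if (off >= frags[i].len) {
			off -= frags[i].len;
			continue;
		}
		chunk = frags[i].len - off;
		if (chunk > len)
			chunk = len;
		memcpy(frags[i].data + off, s, chunk);
		s += chunk;
		len -= chunk;
		off = 0;
	}
	return len ? -1 : 0;
}
```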
Toke Høiland-Jørgensen Sept. 20, 2021, 10:44 p.m. UTC | #12
Jakub Kicinski <kuba@kernel.org> writes:

> On Mon, 20 Sep 2021 23:01:48 +0200 Toke Høiland-Jørgensen wrote:
>> > In fact I don't think there is anything infra can do better for
>> > flushing than the prog itself:
>> >
>> > 	bool mod = false;
>> >
>> > 	ptr = bpf_header_pointer(...);
>> > 	...
>> > 	if (some_cond(...)) {
>> > 		change_packet(...);
>> > 		mod = true;
>> > 	}
>> > 	...
>> > 	if (mod)  
>> 
>> to have an additional check like:
>> 
>> if (mod && ptr == stack)
>> 
>> (or something to that effect). No?
>
> Good point. Do you think we should have the kernel add/inline this
> optimization or have the user do it explicitly.

Hmm, good question. On the one hand it seems like an easy optimisation
to add, but on the other hand maybe the caller has other logic that can
better know how/when to omit the check.

Hmm, but the helper needs to check it anyway, doesn't it? At least it
can't just blindly memcpy() if the source and destination would be the
same...

> The draft API was:
>
> void *xdp_mb_pointer_flush(struct xdp_buff *xdp_md, u32 flags, 
>                            u32 offset, u32 len, void *stack_buf)
>
> Which does not take the ptr returned by header_pointer(), but that's
> easy to add (well, easy other than the fact it'd be the 6th arg).

I guess we could play some trickery with stuffing offset/len/flags into
one or two u64s to save an argument or two?
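[Editor's sketch] The u64-stuffing idea is just bit packing at the call boundary; for instance offset and len could share one u64 argument. The layout below is purely illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Pack two u32 helper arguments into one u64 (offset in the high
 * half, len in the low half), to be unpacked on the kernel side.
 */
static inline uint64_t pack_off_len(uint32_t off, uint32_t len)
{
	return ((uint64_t)off << 32) | len;
}

static inline uint32_t unpack_off(uint64_t v) { return v >> 32; }
static inline uint32_t unpack_len(uint64_t v) { return (uint32_t)v; }
```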

> BTW I drafted the API this way to cater to the case where flush()
> is called without a prior call to header_pointer(). For when packet
> trailer or header is populated directly from a map value. Dunno if
> that's actually useful, either.

Ah, didn't think of that; so then it really becomes a generic
xdp_store_bytes()-type helper? Might be useful, I suppose. Adding
headers is certainly a fairly common occurrence, but dunno to what
extent they'd be copied wholesale from a map (hadn't thought about doing
that before either).

-Toke
Eelco Chaudron Sept. 21, 2021, 10:03 a.m. UTC | #13
On 21 Sep 2021, at 0:44, Toke Høiland-Jørgensen wrote:

> Jakub Kicinski <kuba@kernel.org> writes:
>
>> On Mon, 20 Sep 2021 23:01:48 +0200 Toke Høiland-Jørgensen wrote:
>>>> In fact I don't think there is anything infra can do better for
>>>> flushing than the prog itself:
>>>>
>>>> 	bool mod = false;
>>>>
>>>> 	ptr = bpf_header_pointer(...);
>>>> 	...
>>>> 	if (some_cond(...)) {
>>>> 		change_packet(...);
>>>> 		mod = true;
>>>> 	}
>>>> 	...
>>>> 	if (mod)
>>>
>>> to have an additional check like:
>>>
>>> if (mod && ptr == stack)
>>>
>>> (or something to that effect). No?
>>
>> Good point. Do you think we should have the kernel add/inline this
>> optimization or have the user do it explicitly.
>
> Hmm, good question. On the one hand it seems like an easy optimisation
> to add, but on the other hand maybe the caller has other logic that can
> better know how/when to omit the check.
>
> Hmm, but the helper needs to check it anyway, doesn't it? At least it
> can't just blindly memcpy() if the source and destination would be the
> same...
>
>> The draft API was:
>>
>> void *xdp_mb_pointer_flush(struct xdp_buff *xdp_md, u32 flags,
>>                            u32 offset, u32 len, void *stack_buf)
>>
>> Which does not take the ptr returned by header_pointer(), but that's
>> easy to add (well, easy other than the fact it'd be the 6th arg).
>
> I guess we could play some trickery with stuffing offset/len/flags into
> one or two u64s to save an argument or two?
>
>> BTW I drafted the API this way to cater to the case where flush()
>> is called without a prior call to header_pointer(). For when packet
>> trailer or header is populated directly from a map value. Dunno if
>> that's actually useful, either.
>
> Ah, didn't think of that; so then it really becomes a generic
> xdp_store_bytes()-type helper? Might be useful, I suppose. Adding
> headers is certainly a fairly common occurrence, but dunno to what
> extent they'd be copied wholesale from a map (hadn't thought about doing
> that before either).


Sorry for commenting late but I was busy and had to catch up on emails...

I like the idea, as these APIs are exactly what I proposed in April, https://lore.kernel.org/bpf/FD3E6E08-DE78-4FBA-96F6-646C93E88631@redhat.com/

I did not call it flush, as it can be used as a general function to copy data to a specific location.


//Eelco
Magnus Karlsson Sept. 28, 2021, 11:48 a.m. UTC | #14
On Tue, Sep 21, 2021 at 12:04 PM Eelco Chaudron <echaudro@redhat.com> wrote:
>
>
>
> On 21 Sep 2021, at 0:44, Toke Høiland-Jørgensen wrote:
>
> > Jakub Kicinski <kuba@kernel.org> writes:
> >
> >> On Mon, 20 Sep 2021 23:01:48 +0200 Toke Høiland-Jørgensen wrote:
> >>>> In fact I don't think there is anything infra can do better for
> >>>> flushing than the prog itself:
> >>>>
> >>>>    bool mod = false;
> >>>>
> >>>>    ptr = bpf_header_pointer(...);
> >>>>    ...
> >>>>    if (some_cond(...)) {
> >>>>            change_packet(...);
> >>>>            mod = true;
> >>>>    }
> >>>>    ...
> >>>>    if (mod)
> >>>
> >>> to have an additional check like:
> >>>
> >>> if (mod && ptr == stack)
> >>>
> >>> (or something to that effect). No?
> >>
> >> Good point. Do you think we should have the kernel add/inline this
> >> optimization or have the user do it explicitly.
> >
> > Hmm, good question. On the one hand it seems like an easy optimisation
> > to add, but on the other hand maybe the caller has other logic that can
> > better know how/when to omit the check.
> >
> > Hmm, but the helper needs to check it anyway, doesn't it? At least it
> > can't just blindly memcpy() if the source and destination would be the
> > same...
> >
> >> The draft API was:
> >>
> >> void *xdp_mb_pointer_flush(struct xdp_buff *xdp_md, u32 flags,
> >>                            u32 offset, u32 len, void *stack_buf)
> >>
> >> Which does not take the ptr returned by header_pointer(), but that's
> >> easy to add (well, easy other than the fact it'd be the 6th arg).
> >
> > I guess we could play some trickery with stuffing offset/len/flags into
> > one or two u64s to save an argument or two?
> >
> >> BTW I drafted the API this way to cater to the case where flush()
> >> is called without a prior call to header_pointer(). For when packet
> >> trailer or header is populated directly from a map value. Dunno if
> >> that's actually useful, either.
> >
> > Ah, didn't think of that; so then it really becomes a generic
> > xdp_store_bytes()-type helper? Might be useful, I suppose. Adding
> > headers is certainly a fairly common occurrence, but dunno to what
> > extent they'd be copied wholesale from a map (hadn't thought about doing
> > that before either).
>
>
> Sorry for commenting late but I was busy and had to catch up on emails...
>
> I like the idea, as these APIs are exactly what I proposed in April, https://lore.kernel.org/bpf/FD3E6E08-DE78-4FBA-96F6-646C93E88631@redhat.com/
>
> I did not call it flush, as it can be used as a general function to copy data to a specific location.

Here is some performance data (throughput) for this patch set on i40e
(40 Gbit/s NIC). All using the xdp_rxq_info sample and NO multi-buffer
packets.

With v14 only:

XDP_DROP: +4%
XDP_TX: +1%
XDP_PASS: -1%

With v14 plus multi-buffer support implemented in i40e courtesy of Tirtha:

XDP_DROP: +3%
XDP_TX: -1%
XDP_PASS: -2%

/Magnus

>
> //Eelco
>
Lorenz Bauer Sept. 29, 2021, 10:36 a.m. UTC | #15
On Mon, 20 Sept 2021 at 23:46, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > The draft API was:
> >
> > void *xdp_mb_pointer_flush(struct xdp_buff *xdp_md, u32 flags,
> >                            u32 offset, u32 len, void *stack_buf)
> >
> > Which does not take the ptr returned by header_pointer(), but that's
> > easy to add (well, easy other than the fact it'd be the 6th arg).
>
> I guess we could play some trickery with stuffing offset/len/flags into
> one or two u64s to save an argument or two?

Adding another pointer arg seems really hard to explain as an API.
What happens if I pass the "wrong" ptr? What happens if I pass NULL?

How about this: instead of taking stack_ptr, take the return value
from header_pointer as an argument. Then use the already existing
(right ;P) inlining to do the following:

   if (md->ptr + args->off != ret_ptr)
     __pointer_flush(...)

This means that __pointer_flush has to deal with aliased memory
though, so it would always have to memmove. Probably OK for the "slow"
path?

Lorenz
Lorenz Bauer Sept. 29, 2021, 10:41 a.m. UTC | #16
On Thu, 16 Sept 2021 at 18:47, Jakub Kicinski <kuba@kernel.org> wrote:
>
> Won't applications end up building something like skb_header_pointer()
> based on bpf_xdp_adjust_data(), anyway? In which case why don't we
> provide them what they need?
>
> say:
>
> void *xdp_mb_pointer(struct xdp_buff *xdp_md, u32 flags,
>                      u32 offset, u32 len, void *stack_buf)
>
> flags and offset can be squashed into one u64 as needed. Helper returns
> pointer to packet data, either real one or stack_buf. Verifier has to
> be taught that the return value is NULL or a pointer which is safe with
> offsets up to @len.
>
> If the reason for access is write we'd also need:
>
> void *xdp_mb_pointer_flush(struct xdp_buff *xdp_md, u32 flags,
>                            u32 offset, u32 len, void *stack_buf)

Yes! This would be so much better than bpf_skb_load/store_bytes(),
especially if we can use it for both XDP and skb contexts as stated
elsewhere in this thread.
Toke Høiland-Jørgensen Sept. 29, 2021, 12:10 p.m. UTC | #17
Lorenz Bauer <lmb@cloudflare.com> writes:

> On Thu, 16 Sept 2021 at 18:47, Jakub Kicinski <kuba@kernel.org> wrote:
>>
>> Won't applications end up building something like skb_header_pointer()
>> based on bpf_xdp_adjust_data(), anyway? In which case why don't we
>> provide them what they need?
>>
>> say:
>>
>> void *xdp_mb_pointer(struct xdp_buff *xdp_md, u32 flags,
>>                      u32 offset, u32 len, void *stack_buf)
>>
>> flags and offset can be squashed into one u64 as needed. Helper returns
>> pointer to packet data, either real one or stack_buf. Verifier has to
>> be taught that the return value is NULL or a pointer which is safe with
>> offsets up to @len.
>>
>> If the reason for access is write we'd also need:
>>
>> void *xdp_mb_pointer_flush(struct xdp_buff *xdp_md, u32 flags,
>>                            u32 offset, u32 len, void *stack_buf)
>
> Yes! This would be so much better than bpf_skb_load/store_bytes(),
> especially if we can use it for both XDP and skb contexts as stated
> elsewhere in this thread.

Alright. Let's see if we can go this route, then :)

-Toke
Toke Høiland-Jørgensen Sept. 29, 2021, 12:25 p.m. UTC | #18
Lorenz Bauer <lmb@cloudflare.com> writes:

> On Mon, 20 Sept 2021 at 23:46, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> > The draft API was:
>> >
>> > void *xdp_mb_pointer_flush(struct xdp_buff *xdp_md, u32 flags,
>> >                            u32 offset, u32 len, void *stack_buf)
>> >
>> > Which does not take the ptr returned by header_pointer(), but that's
>> > easy to add (well, easy other than the fact it'd be the 6th arg).
>>
>> I guess we could play some trickery with stuffing offset/len/flags into
>> one or two u64s to save an argument or two?
>
> Adding another pointer arg seems really hard to explain as an API.
> What happens if I pass the "wrong" ptr? What happens if I pass NULL?
>
> How about this: instead of taking stack_ptr, take the return value
> from header_pointer as an argument.

Hmm, that's a good point; I do think that passing the return value from
header pointer is more natural as well (you're flushing pointer you just
wrote to, after all).

> Then use the already existing (right ;P) inlining to do the following:
>
>    if (md->ptr + args->off != ret_ptr)
>      __pointer_flush(...)

The inlining is orthogonal, though, right? The helper can do this check
whether or not it's a proper CALL or not (although obviously for
performance reasons we do want it to inline, at least eventually). In
particular, I believe we can make progress on this patch series without
working out the inlining :)

> This means that __pointer_flush has to deal with aliased memory
> though, so it would always have to memmove. Probably OK for the "slow"
> path?

Erm, not sure what you mean here? Yeah, flushing is going to take longer
if you ended up using the stack pointer instead of writing directly to
the packet. That's kinda intrinsic? Or am I misunderstanding you?

-Toke
Lorenz Bauer Sept. 29, 2021, 12:32 p.m. UTC | #19
On Wed, 29 Sept 2021 at 13:25, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> > Then use the already existing (right ;P) inlining to do the following:
> >
> >    if (md->ptr + args->off != ret_ptr)
> >      __pointer_flush(...)
>
> The inlining is orthogonal, though, right? The helper can do this check
> whether or not it's a proper CALL or not (although obviously for
> performance reasons we do want it to inline, at least eventually). In
> particular, I believe we can make progress on this patch series without
> working out the inlining :)

Yes, I was just worried that your answer would be "it's too expensive" ;)

> > This means that __pointer_flush has to deal with aliased memory
> > though, so it would always have to memmove. Probably OK for the "slow"
> > path?
>
> Erm, not sure what you mean here? Yeah, flushing is going to take longer
> if you ended up using the stack pointer instead of writing directly to
> the packet. That's kinda intrinsic? Or am I misunderstanding you?

I think I misunderstood your comment about memcpy to mean "want to
avoid aliased memory for perf reasons". Never mind!
Lorenz Bauer Sept. 29, 2021, 12:38 p.m. UTC | #20
On Wed, 29 Sept 2021 at 13:10, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Lorenz Bauer <lmb@cloudflare.com> writes:
>
> > On Thu, 16 Sept 2021 at 18:47, Jakub Kicinski <kuba@kernel.org> wrote:
> >>
> >> Won't applications end up building something like skb_header_pointer()
> >> based on bpf_xdp_adjust_data(), anyway? In which case why don't we
> >> provide them what they need?
> >>
> >> say:
> >>
> >> void *xdp_mb_pointer(struct xdp_buff *xdp_md, u32 flags,
> >>                      u32 offset, u32 len, void *stack_buf)
> >>
> >> flags and offset can be squashed into one u64 as needed. Helper returns
> >> pointer to packet data, either real one or stack_buf. Verifier has to
> >> be taught that the return value is NULL or a pointer which is safe with
> >> offsets up to @len.
> >>
> >> If the reason for access is write we'd also need:
> >>
> >> void *xdp_mb_pointer_flush(struct xdp_buff *xdp_md, u32 flags,
> >>                            u32 offset, u32 len, void *stack_buf)
> >
> > Yes! This would be so much better than bpf_skb_load/store_bytes(),
> > especially if we can use it for both XDP and skb contexts as stated
> > elsewhere in this thread.
>
> Alright. Let's see if we can go this route, then :)

Something I forgot to mention: you could infer that an XDP program is
mb-aware if it only does packet access via the helpers. Put another
way, it might be nice if ctx->data wasn't accessible in mb XDP. That
way I know that all packet access has to handle mb-aware ctx (think
pulling in functions via headers or even pre-compiled bpf libraries).
Jakub Kicinski Sept. 29, 2021, 5:46 p.m. UTC | #21
On Wed, 29 Sep 2021 11:36:33 +0100 Lorenz Bauer wrote:
> On Mon, 20 Sept 2021 at 23:46, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > > The draft API was:
> > >
> > > void *xdp_mb_pointer_flush(struct xdp_buff *xdp_md, u32 flags,
> > >                            u32 offset, u32 len, void *stack_buf)
> > >
> > > Which does not take the ptr returned by header_pointer(), but that's
> > > easy to add (well, easy other than the fact it'd be the 6th arg).  
> >
> > I guess we could play some trickery with stuffing offset/len/flags into
> > one or two u64s to save an argument or two?  
> 
> Adding another pointer arg seems really hard to explain as an API.
> What happens if I pass the "wrong" ptr? What happens if I pass NULL?

Sure. We can leave the checking to the program then, but that ties
our hands for the implementation changes later on.

Not sure which pointer type will be chosen for the ret value but it 
may allow error checking at verification.

> How about this: instead of taking stack_ptr, take the return value
> from header_pointer as an argument. Then use the already existing
> (right ;P) inlining to do the following:
> 
>    if (md->ptr + args->off != ret_ptr)
>      __pointer_flush(...)

That only checks for the case where pointer is in the "head" frag,
and is not generally correct. You need to check whether the length of the
first frag is smaller than off. Otherwise the BPF stack may "happen"
to follow the head page and the math will work out.

It would also be slower than Lorenzo's current code, which allows
access to tail pages without copying.
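[Editor's sketch] In other words, the pointer-equality test is only sound once the region is known to lie entirely inside the head fragment; a bounds check along these lines (names illustrative) has to come first:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True iff [off, off+len) lies within the head fragment, so that a
 * direct-pointer comparison against md->ptr + off is meaningful.
 * Written to avoid overflow in off + len.
 */
static bool in_head(size_t head_len, size_t off, size_t len)
{
	return off < head_len && len <= head_len - off;
}
```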
Jakub Kicinski Sept. 29, 2021, 5:48 p.m. UTC | #22
On Wed, 29 Sep 2021 14:25:09 +0200 Toke Høiland-Jørgensen wrote:
> >> I guess we could play some trickery with stuffing offset/len/flags into
> >> one or two u64s to save an argument or two?  
> >
> > Adding another pointer arg seems really hard to explain as an API.
> > What happens if I pass the "wrong" ptr? What happens if I pass NULL?
> >
> > How about this: instead of taking stack_ptr, take the return value
> > from header_pointer as an argument.  
> 
> Hmm, that's a good point; I do think that passing the return value from
> header pointer is more natural as well (you're flushing pointer you just
> wrote to, after all).

It is more natural but doesn't allow for the same level of optimization.
Wouldn't consistent naming obviate the semantics?
Alexei Starovoitov Sept. 29, 2021, 6:54 p.m. UTC | #23
On Wed, Sep 29, 2021 at 5:38 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
>
> On Wed, 29 Sept 2021 at 13:10, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >
> > Lorenz Bauer <lmb@cloudflare.com> writes:
> >
> > > On Thu, 16 Sept 2021 at 18:47, Jakub Kicinski <kuba@kernel.org> wrote:
> > >>
> > >> Won't applications end up building something like skb_header_pointer()
> > >> based on bpf_xdp_adjust_data(), anyway? In which case why don't we
> > >> provide them what they need?
> > >>
> > >> say:
> > >>
> > >> void *xdp_mb_pointer(struct xdp_buff *xdp_md, u32 flags,
> > >>                      u32 offset, u32 len, void *stack_buf)
> > >>
> > >> flags and offset can be squashed into one u64 as needed. Helper returns
> > >> pointer to packet data, either real one or stack_buf. Verifier has to
> > >> be taught that the return value is NULL or a pointer which is safe with
> > >> offsets up to @len.
> > >>
> > >> If the reason for access is write we'd also need:
> > >>
> > >> void *xdp_mb_pointer_flush(struct xdp_buff *xdp_md, u32 flags,
> > >>                            u32 offset, u32 len, void *stack_buf)

I'm missing something. Why do we need a separate flush() helper?
Can't we do:
char buf[64], *p;
p = xdp_mb_pointer(ctx, flags, off, len, buf);
read/write p[]
if (p == buf)
    xdp_store_bytes(ctx, off, buf, len, flags);
Jakub Kicinski Sept. 29, 2021, 7:22 p.m. UTC | #24
On Wed, 29 Sep 2021 11:54:46 -0700 Alexei Starovoitov wrote:
> I'm missing something. Why do we need a separate flush() helper?
> Can't we do:
> char buf[64], *p;
> p = xdp_mb_pointer(ctx, flags, off, len, buf);
> read/write p[]
> if (p == buf)
>     xdp_store_bytes(ctx, off, buf, len, flags);

Sure we can. That's what I meant by "leave the checking to the program".
It's bike shedding at this point.
Toke Høiland-Jørgensen Sept. 29, 2021, 8:39 p.m. UTC | #25
Jakub Kicinski <kuba@kernel.org> writes:

> On Wed, 29 Sep 2021 11:54:46 -0700 Alexei Starovoitov wrote:
>> I'm missing something. Why do we need a separate flush() helper?
>> Can't we do:
>> char buf[64], *p;
>> p = xdp_mb_pointer(ctx, flags, off, len, buf);
>> read/write p[]
>> if (p == buf)
>>     xdp_store_bytes(ctx, off, buf, len, flags);
>
> Sure we can. That's what I meant by "leave the checking to the program".
> It's bike shedding at this point.

Yeah, let's discuss the details once we have a patch :)

-Toke
Lorenzo Bianconi Oct. 1, 2021, 9:03 a.m. UTC | #26
> Jakub Kicinski <kuba@kernel.org> writes:
> 
> > On Wed, 29 Sep 2021 11:54:46 -0700 Alexei Starovoitov wrote:
> >> I'm missing something. Why do we need a separate flush() helper?
> >> Can't we do:
> >> char buf[64], *p;
> >> p = xdp_mb_pointer(ctx, flags, off, len, buf);
> >> read/write p[]
> >> if (p == buf)
> >>     xdp_store_bytes(ctx, off, buf, len, flags);
> >
> > Sure we can. That's what I meant by "leave the checking to the program".
> > It's bike shedding at this point.
> 
> Yeah, let's discuss the details once we have a patch :)
> 
> -Toke
> 

Hi,

I implemented the xdp_mb_pointer/xdp_mb_pointer_flush logic here (according to
current discussion):
https://github.com/LorenzoBianconi/bpf-next/commit/a5c61c0fa6cb05bab8caebd96aca5fbbd9510867

For the moment I have only defined two utility routines and have not exported
them as eBPF helpers, since I need to check what bits are missing in the verifier
code (but AFAIK this is orthogonal to the "helper code"):
- bpf_xdp_pointer --> xdp_mb_pointer
- bpf_xdp_copy_buf --> xdp_mb_pointer_flush

In order to test them I have defined two new ebpf helpers (they use
bpf_xdp_pointer/bpf_xdp_copy_buf internally):
- bpf_xdp_load_bytes
- bpf_xdp_store_bytes

In order to test bpf_xdp_load_bytes/bpf_xdp_store_bytes +
bpf_xdp_pointer/bpf_xdp_copy_buf I added some selftests here:
https://github.com/LorenzoBianconi/bpf-next/commit/5661a491a890c00db744f2884b7ee3a6d0319384

Can you please check if the code above is aligned to current requirements or if
it is missing something?
If this code is fine, I guess we have two options here:
- integrate the commits above in xdp multi-buff series (posting v15) and work on
  the verifier code in parallel (if xdp_mb_pointer helper is not required from day0)
- integrate verifier changes in xdp multi-buff series, drop bpf_xdp_load_bytes
  helper (probably we will still need bpf_xdp_store_bytes) and introduce
  bpf_xdp_pointer as new ebpf helper.

I am fine both ways. If we decide for the second option I would need some
guidance on verifier changes since I am not so familiar with that code.

Regards,
Lorenzo
Jakub Kicinski Oct. 1, 2021, 6:35 p.m. UTC | #27
On Fri, 1 Oct 2021 11:03:58 +0200 Lorenzo Bianconi wrote:
> Can you please check if the code above is aligned to current requirements or if
> it is missing something?
> If this code is fine, I guess we have two options here:
> - integrate the commits above in xdp multi-buff series (posting v15) and work on
>   the verifier code in parallel (if xdp_mb_pointer helper is not required from day0)
> - integrate verifier changes in xdp multi-buff series, drop bpf_xdp_load_bytes
>   helper (probably we will still need bpf_xdp_store_bytes) and introduce
>   bpf_xdp_pointer as new ebpf helper.

It wasn't clear to me that we wanted bpf_xdp_load_bytes() to exist.
But FWIW no preference here.
Lorenzo Bianconi Oct. 6, 2021, 9:32 a.m. UTC | #28
> On Fri, 1 Oct 2021 11:03:58 +0200 Lorenzo Bianconi wrote:
> > Can you please check if the code above is aligned to current requirements or if
> > it is missing something?
> > If this code is fine, I guess we have two options here:
> > - integrate the commits above in xdp multi-buff series (posting v15) and work on
> >   the verifier code in parallel (if xdp_mb_pointer helper is not required from day0)
> > - integrate verifier changes in xdp multi-buff series, drop bpf_xdp_load_bytes
> >   helper (probably we will still need bpf_xdp_store_bytes) and introduce
> >   bpf_xdp_pointer as new ebpf helper.
> 
> It wasn't clear to me that we wanted bpf_xdp_load_bytes() to exist.
> But FWIW no preference here.
> 

ack, same here. Any other opinion about it?

Regards,
Lorenzo
Eelco Chaudron Oct. 6, 2021, 10:08 a.m. UTC | #29
On 6 Oct 2021, at 11:32, Lorenzo Bianconi wrote:

>> On Fri, 1 Oct 2021 11:03:58 +0200 Lorenzo Bianconi wrote:
>>> Can you please check if the code above is aligned to current requirements or if
>>> it is missing something?
>>> If this code is fine, I guess we have two options here:
>>> - integrate the commits above in xdp multi-buff series (posting v15) and work on
>>>   the verifier code in parallel (if xdp_mb_pointer helper is not required from day0)
>>> - integrate verifier changes in xdp multi-buff series, drop bpf_xdp_load_bytes
>>>   helper (probably we will still need bpf_xdp_store_bytes) and introduce
>>>   bpf_xdp_pointer as new ebpf helper.
>>
>> It wasn't clear to me that we wanted bpf_xdp_load_bytes() to exist.
>> But FWIW no preference here.
>>
>
> ack, same here. Any other opinion about it?

I was under the impression getting a pointer might be enough. But playing with the bpf ring buffers for a bit, it still might be handy to extract some data to be sent to userspace. So I would not mind keeping it.
Lorenzo Bianconi Oct. 6, 2021, 12:15 p.m. UTC | #30
> 
> 
> On 6 Oct 2021, at 11:32, Lorenzo Bianconi wrote:
> 
> >> On Fri, 1 Oct 2021 11:03:58 +0200 Lorenzo Bianconi wrote:
> >>> Can you please check if the code above is aligned to current requirements or if
> >>> it is missing something?
> >>> If this code is fine, I guess we have two options here:
> >>> - integrate the commits above in xdp multi-buff series (posting v15) and work on
> >>>   the verifier code in parallel (if xdp_mb_pointer helper is not required from day0)
> >>> - integrate verifier changes in xdp multi-buff series, drop bpf_xdp_load_bytes
> >>>   helper (probably we will still need bpf_xdp_store_bytes) and introduce
> >>>   bpf_xdp_pointer as new ebpf helper.
> >>
> >> It wasn't clear to me that we wanted bpf_xdp_load_bytes() to exist.
> >> But FWIW no preference here.
> >>
> >
> > ack, same here. Any other opinion about it?
> 
> I was under the impression getting a pointer might be enough. But playing with the bpf ring buffers for a bit, it still might be handy to extract some data to be sent to userspace. So I would not mind keeping it.
> 

ok, so it seems we have a use-case for bpf_xdp_load_bytes(). If everybody
agrees, I will post v15 with them included and then we can work in parallel
on a bpf_xdp_pointer eBPF helper.

Regards,
Lorenzo