diff mbox series

[bpf-next,v5,02/13] xsk: Add TX timestamp and TX checksum offload support

Message ID 20231102225837.1141915-3-sdf@google.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Headers show
Series xsk: TX metadata | expand

Checks

Context Check Description
bpf/vmtest-bpf-next-PR fail PR summary
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 6419 this patch: 6419
netdev/cc_maintainers warning 6 maintainers not CCed: tariqt@nvidia.com pabeni@redhat.com jonathan.lemon@gmail.com davem@davemloft.net edumazet@google.com lorenzo@kernel.org
netdev/build_clang success Errors and warnings before: 2663 this patch: 2663
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 6777 this patch: 6777
netdev/checkpatch warning CHECK: Prefer using the BIT macro WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns WARNING: line length of 85 exceeds 80 columns WARNING: line length of 86 exceeds 80 columns WARNING: line length of 91 exceeds 80 columns WARNING: line length of 97 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Validate matrix.py
bpf/vmtest-bpf-next-VM_Test-8 success Logs for aarch64-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-3 success Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 success Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 success Logs for aarch64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-5 success Logs for aarch64-gcc / test (test_progs, false, 360) / test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for s390x-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-15 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-16 success Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-19 success Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for x86_64-gcc / veristat / veristat on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for x86_64-llvm-16 / build / build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-25 success Logs for x86_64-llvm-16 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-26 success Logs for x86_64-llvm-16 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-27 success Logs for x86_64-llvm-16 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-28 success Logs for x86_64-llvm-16 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-29 success Logs for x86_64-llvm-16 / veristat
bpf/vmtest-bpf-next-VM_Test-13 success Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-12 fail Logs for s390x-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-11 fail Logs for s390x-gcc / test (test_progs, false, 360) / test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for s390x-gcc / test (test_maps, false, 360) / test_maps on s390x with gcc

Commit Message

Stanislav Fomichev Nov. 2, 2023, 10:58 p.m. UTC
This change actually defines the (initial) metadata layout
that should be used by AF_XDP userspace (xsk_tx_metadata).
The first field is flags which requests appropriate offloads,
followed by the offload-specific fields. The supported per-device
offloads are exported via netlink (new xsk-flags).

The offloads themselves are still implemented in a bit of a
framework-y fashion that's left from my initial kfunc attempt.
I'm introducing new xsk_tx_metadata_ops which drivers are
supposed to implement. The drivers are also supposed
to call xsk_tx_metadata_request/xsk_tx_metadata_complete in
the right places. Since xsk_tx_metadata_{request,_complete}
are static inline, we don't incur any extra overhead doing
indirect calls.

The benefit of this scheme is as follows:
- keeps all metadata layout parsing away from driver code
- makes it easy to grep and see which drivers implement what
- don't need any extra flags to maintain to keep track of what
  offloads are implemented; if the callback is implemented - the offload
  is supported (used by netlink reporting code)

Two offloads are defined right now:
1. XDP_TXMD_FLAGS_CHECKSUM: skb-style csum_start+csum_offset
2. XDP_TXMD_FLAGS_TIMESTAMP: writes TX timestamp back into metadata
   area upon completion (tx_timestamp field)

XDP_TXMD_FLAGS_TIMESTAMP is also implemented for XDP_COPY mode: it writes
SW timestamp from the skb destructor (note I'm reusing hwtstamps to pass
metadata pointer).

The struct is forward-compatible and can be extended in the future
by appending more fields.

Reviewed-by: Song Yoong Siang <yoong.siang.song@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 Documentation/netlink/specs/netdev.yaml |  19 +++-
 include/linux/netdevice.h               |   2 +
 include/linux/skbuff.h                  |  14 ++-
 include/net/xdp_sock.h                  | 110 ++++++++++++++++++++++++
 include/net/xdp_sock_drv.h              |  13 +++
 include/net/xsk_buff_pool.h             |   6 ++
 include/uapi/linux/if_xdp.h             |  38 ++++++++
 include/uapi/linux/netdev.h             |  16 ++++
 net/core/netdev-genl.c                  |  13 ++-
 net/xdp/xsk.c                           |  34 ++++++++
 net/xdp/xsk_queue.h                     |   2 +-
 tools/include/uapi/linux/if_xdp.h       |  52 +++++++++--
 tools/include/uapi/linux/netdev.h       |  16 ++++
 tools/net/ynl/generated/netdev-user.c   |  19 ++++
 tools/net/ynl/generated/netdev-user.h   |   3 +
 15 files changed, 348 insertions(+), 9 deletions(-)

Comments

Jesper Dangaard Brouer Nov. 13, 2023, 1:16 p.m. UTC | #1
On 11/2/23 23:58, Stanislav Fomichev wrote:
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index 2ecf79282c26..b0ee7ad19b51 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -106,6 +106,41 @@ struct xdp_options {
>   #define XSK_UNALIGNED_BUF_ADDR_MASK \
>   	((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
>   
> +/* Request transmit timestamp. Upon completion, put it into tx_timestamp
> + * field of struct xsk_tx_metadata.
> + */
> +#define XDP_TXMD_FLAGS_TIMESTAMP		(1 << 0)
> +
> +/* Request transmit checksum offload. Checksum start position and offset
> + * are communicated via csum_start and csum_offset fields of struct
> + * xsk_tx_metadata.
> + */
> +#define XDP_TXMD_FLAGS_CHECKSUM			(1 << 1)
> +
> +/* AF_XDP offloads request. 'request' union member is consumed by the driver
> + * when the packet is being transmitted. 'completion' union member is
> + * filled by the driver when the transmit completion arrives.
> + */
> +struct xsk_tx_metadata {
> +	union {
> +		struct {
> +			__u32 flags;
> +
> +			/* XDP_TXMD_FLAGS_CHECKSUM */
> +
> +			/* Offset from desc->addr where checksumming should start. */
> +			__u16 csum_start;
> +			/* Offset from csum_start where checksum should be stored. */
> +			__u16 csum_offset;
> +		} request;
> +
> +		struct {
> +			/* XDP_TXMD_FLAGS_TIMESTAMP */
> +			__u64 tx_timestamp;
> +		} completion;
> +	};
> +};

This looks wrong to me. It looks like member @flags is not avail at
completion time.  At completion time, I assume we also want to know if
someone requested to get the timestamp for this packet (else we could
read garbage).

Another thing (I've raised this before): It would be really practical to
store an u64 opaque value at TX and then read it at Completion time.
One use-case is a forwarding application storing HW RX-time and
comparing this to TX completion time to deduce the time spend processing
the packet.

--Jesper
Stanislav Fomichev Nov. 13, 2023, 2:10 p.m. UTC | #2
On Mon, Nov 13, 2023 at 5:16 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
>
>
> On 11/2/23 23:58, Stanislav Fomichev wrote:
> > diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> > index 2ecf79282c26..b0ee7ad19b51 100644
> > --- a/include/uapi/linux/if_xdp.h
> > +++ b/include/uapi/linux/if_xdp.h
> > @@ -106,6 +106,41 @@ struct xdp_options {
> >   #define XSK_UNALIGNED_BUF_ADDR_MASK \
> >       ((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
> >
> > +/* Request transmit timestamp. Upon completion, put it into tx_timestamp
> > + * field of struct xsk_tx_metadata.
> > + */
> > +#define XDP_TXMD_FLAGS_TIMESTAMP             (1 << 0)
> > +
> > +/* Request transmit checksum offload. Checksum start position and offset
> > + * are communicated via csum_start and csum_offset fields of struct
> > + * xsk_tx_metadata.
> > + */
> > +#define XDP_TXMD_FLAGS_CHECKSUM                      (1 << 1)
> > +
> > +/* AF_XDP offloads request. 'request' union member is consumed by the driver
> > + * when the packet is being transmitted. 'completion' union member is
> > + * filled by the driver when the transmit completion arrives.
> > + */
> > +struct xsk_tx_metadata {
> > +     union {
> > +             struct {
> > +                     __u32 flags;
> > +
> > +                     /* XDP_TXMD_FLAGS_CHECKSUM */
> > +
> > +                     /* Offset from desc->addr where checksumming should start. */
> > +                     __u16 csum_start;
> > +                     /* Offset from csum_start where checksum should be stored. */
> > +                     __u16 csum_offset;
> > +             } request;
> > +
> > +             struct {
> > +                     /* XDP_TXMD_FLAGS_TIMESTAMP */
> > +                     __u64 tx_timestamp;
> > +             } completion;
> > +     };
> > +};
>
> This looks wrong to me. It looks like member @flags is not avail at
> completion time.  At completion time, I assume we also want to know if
> someone requested to get the timestamp for this packet (else we could
> read garbage).

I've moved the parts that are preserved across tx and tx completion
into xsk_tx_metadata_compl.
This is to address Magnus/Maciej feedback where userspace might race
with the kernel.
See: https://lore.kernel.org/bpf/ZNoJenzKXW5QSR3E@boxer/

> Another thing (I've raised this before): It would be really practical to
> store an u64 opaque value at TX and then read it at Completion time.
> One use-case is a forwarding application storing HW RX-time and
> comparing this to TX completion time to deduce the time spend processing
> the packet.

This can be another member, right? But note that extending
xsk_tx_metadata_compl might be a bit complicated because drivers have
to carry this info somewhere. So we have to balance the amount of
passed data between the tx and the completion.
Jesper Dangaard Brouer Nov. 13, 2023, 4:29 p.m. UTC | #3
On 11/13/23 15:10, Stanislav Fomichev wrote:
> On Mon, Nov 13, 2023 at 5:16 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>>
>>
>> On 11/2/23 23:58, Stanislav Fomichev wrote:
>>> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
>>> index 2ecf79282c26..b0ee7ad19b51 100644
>>> --- a/include/uapi/linux/if_xdp.h
>>> +++ b/include/uapi/linux/if_xdp.h
>>> @@ -106,6 +106,41 @@ struct xdp_options {
>>>    #define XSK_UNALIGNED_BUF_ADDR_MASK \
>>>        ((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
>>>
>>> +/* Request transmit timestamp. Upon completion, put it into tx_timestamp
>>> + * field of struct xsk_tx_metadata.
>>> + */
>>> +#define XDP_TXMD_FLAGS_TIMESTAMP             (1 << 0)
>>> +
>>> +/* Request transmit checksum offload. Checksum start position and offset
>>> + * are communicated via csum_start and csum_offset fields of struct
>>> + * xsk_tx_metadata.
>>> + */
>>> +#define XDP_TXMD_FLAGS_CHECKSUM                      (1 << 1)
>>> +
>>> +/* AF_XDP offloads request. 'request' union member is consumed by the driver
>>> + * when the packet is being transmitted. 'completion' union member is
>>> + * filled by the driver when the transmit completion arrives.
>>> + */
>>> +struct xsk_tx_metadata {
>>> +     union {
>>> +             struct {
>>> +                     __u32 flags;
>>> +
>>> +                     /* XDP_TXMD_FLAGS_CHECKSUM */
>>> +
>>> +                     /* Offset from desc->addr where checksumming should start. */
>>> +                     __u16 csum_start;
>>> +                     /* Offset from csum_start where checksum should be stored. */
>>> +                     __u16 csum_offset;
>>> +             } request;
>>> +
>>> +             struct {
>>> +                     /* XDP_TXMD_FLAGS_TIMESTAMP */
>>> +                     __u64 tx_timestamp;
>>> +             } completion;
>>> +     };
>>> +};
>>
>> This looks wrong to me. It looks like member @flags is not avail at
>> completion time.  At completion time, I assume we also want to know if
>> someone requested to get the timestamp for this packet (else we could
>> read garbage).
> 
> I've moved the parts that are preserved across tx and tx completion
> into xsk_tx_metadata_compl.
> This is to address Magnus/Maciej feedback where userspace might race
> with the kernel.
> See: https://lore.kernel.org/bpf/ZNoJenzKXW5QSR3E@boxer/
> 

Does this mean that every driver have to extend their TX-desc ring with
sizeof(struct xsk_tx_metadata_compl)?
Won't this affect the performance of this V5?

  $ pahole -C xsk_tx_metadata_compl 
./drivers/net/ethernet/stmicro/stmmac/stmmac.ko
  struct xsk_tx_metadata_compl {
	__u64 *              tx_timestamp;         /*     0     8 */

	/* size: 8, cachelines: 1, members: 1 */
	/* last cacheline: 8 bytes */
  };

Guess, I must be misunderstanding, as I was expecting to see the @flags
member being preserved across, as I get the race there.

Looking at stmmac driver, it does look like this xsk_tx_metadata_compl
is part of the TX-ring for completion (tx_skbuff_dma) and the
tx_timestamp data is getting stored here.  How is userspace AF_XDP
application getting access to the tx_timestamp data?
I though this was suppose to get stored in metadata data area (umem)?

Also looking at the code, the kernel would not have a "crash" race on
the flags member (if we preserve in struct), because the code checks the
driver HW-TS config-state + TX-descriptor for the availability of a
HW-TS in the descriptor.


>> Another thing (I've raised this before): It would be really practical to
>> store an u64 opaque value at TX and then read it at Completion time.
>> One use-case is a forwarding application storing HW RX-time and
>> comparing this to TX completion time to deduce the time spend processing
>> the packet.
> 
> This can be another member, right? But note that extending
> xsk_tx_metadata_compl might be a bit complicated because drivers have
> to carry this info somewhere. So we have to balance the amount of
> passed data between the tx and the completion.

I don't think my opaque value proposal is subject to same race problem.
I think this can be stores in metadata area and across tx and tx
completion, because any race on a flags change is the userspace
programmers problem, as it cannot cause any kernel crash (given kernel
have no need to read this).

--Jesper
Stanislav Fomichev Nov. 13, 2023, 5:02 p.m. UTC | #4
On 11/13, Jesper Dangaard Brouer wrote:
> 
> 
> On 11/13/23 15:10, Stanislav Fomichev wrote:
> > On Mon, Nov 13, 2023 at 5:16 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
> > > 
> > > 
> > > On 11/2/23 23:58, Stanislav Fomichev wrote:
> > > > diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> > > > index 2ecf79282c26..b0ee7ad19b51 100644
> > > > --- a/include/uapi/linux/if_xdp.h
> > > > +++ b/include/uapi/linux/if_xdp.h
> > > > @@ -106,6 +106,41 @@ struct xdp_options {
> > > >    #define XSK_UNALIGNED_BUF_ADDR_MASK \
> > > >        ((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
> > > > 
> > > > +/* Request transmit timestamp. Upon completion, put it into tx_timestamp
> > > > + * field of struct xsk_tx_metadata.
> > > > + */
> > > > +#define XDP_TXMD_FLAGS_TIMESTAMP             (1 << 0)
> > > > +
> > > > +/* Request transmit checksum offload. Checksum start position and offset
> > > > + * are communicated via csum_start and csum_offset fields of struct
> > > > + * xsk_tx_metadata.
> > > > + */
> > > > +#define XDP_TXMD_FLAGS_CHECKSUM                      (1 << 1)
> > > > +
> > > > +/* AF_XDP offloads request. 'request' union member is consumed by the driver
> > > > + * when the packet is being transmitted. 'completion' union member is
> > > > + * filled by the driver when the transmit completion arrives.
> > > > + */
> > > > +struct xsk_tx_metadata {
> > > > +     union {
> > > > +             struct {
> > > > +                     __u32 flags;
> > > > +
> > > > +                     /* XDP_TXMD_FLAGS_CHECKSUM */
> > > > +
> > > > +                     /* Offset from desc->addr where checksumming should start. */
> > > > +                     __u16 csum_start;
> > > > +                     /* Offset from csum_start where checksum should be stored. */
> > > > +                     __u16 csum_offset;
> > > > +             } request;
> > > > +
> > > > +             struct {
> > > > +                     /* XDP_TXMD_FLAGS_TIMESTAMP */
> > > > +                     __u64 tx_timestamp;
> > > > +             } completion;
> > > > +     };
> > > > +};
> > > 
> > > This looks wrong to me. It looks like member @flags is not avail at
> > > completion time.  At completion time, I assume we also want to know if
> > > someone requested to get the timestamp for this packet (else we could
> > > read garbage).
> > 
> > I've moved the parts that are preserved across tx and tx completion
> > into xsk_tx_metadata_compl.
> > This is to address Magnus/Maciej feedback where userspace might race
> > with the kernel.
> > See: https://lore.kernel.org/bpf/ZNoJenzKXW5QSR3E@boxer/
> > 
> 
> Does this mean that every driver have to extend their TX-desc ring with
> sizeof(struct xsk_tx_metadata_compl)?
> Won't this affect the performance of this V5?

Yes, but it doesn't have to be a descriptor. Might be some internal
driver completion queue (as in the case of mlx5). And definitely does
affect performance :-( (see all the static branches to disable it)
 
>  $ pahole -C xsk_tx_metadata_compl
> ./drivers/net/ethernet/stmicro/stmmac/stmmac.ko
>  struct xsk_tx_metadata_compl {
> 	__u64 *              tx_timestamp;         /*     0     8 */
> 
> 	/* size: 8, cachelines: 1, members: 1 */
> 	/* last cacheline: 8 bytes */
>  };
> 
> Guess, I must be misunderstanding, as I was expecting to see the @flags
> member being preserved across, as I get the race there.
>
> Looking at stmmac driver, it does look like this xsk_tx_metadata_compl
> is part of the TX-ring for completion (tx_skbuff_dma) and the
> tx_timestamp data is getting stored here.  How is userspace AF_XDP
> application getting access to the tx_timestamp data?
> I though this was suppose to get stored in metadata data area (umem)?
>
> Also looking at the code, the kernel would not have a "crash" race on
> the flags member (if we preserve in struct), because the code checks the
> driver HW-TS config-state + TX-descriptor for the availability of a
> HW-TS in the descriptor.

xsk_tx_metadata_compl stores a pointer to the completion timestamp
in the umem, so everything still arrives via the metadata area.

We want to make sure the flags are not changing across tx and tx completion.
Instead of saving the flags, we just use that xsk_tx_metadata_compl to
signal to the completion that "I know that I've requested the tx
completion timestamp, please put it at this address in umem".

I store the pointer instead of flags to avoid doing pointer math again
at completion. But it's an implementation detail and somewhat abstracted
from the drivers (besides the fact that it's probably has to fit in 8
bytes).

> > > Another thing (I've raised this before): It would be really practical to
> > > store an u64 opaque value at TX and then read it at Completion time.
> > > One use-case is a forwarding application storing HW RX-time and
> > > comparing this to TX completion time to deduce the time spend processing
> > > the packet.
> > 
> > This can be another member, right? But note that extending
> > xsk_tx_metadata_compl might be a bit complicated because drivers have
> > to carry this info somewhere. So we have to balance the amount of
> > passed data between the tx and the completion.
> 
> I don't think my opaque value proposal is subject to same race problem.
> I think this can be stores in metadata area and across tx and tx
> completion, because any race on a flags change is the userspace
> programmers problem, as it cannot cause any kernel crash (given kernel
> have no need to read this).

Thinking about it, I don't think this needs any special handing?
You can request sizeof(struct xsk_tx_metadata) + sizeof(opaque data)
as metadata. The kernel won't touch the 'opaque data' part. Or am I missing
something?
Jesper Dangaard Brouer Nov. 15, 2023, 10:05 a.m. UTC | #5
On 11/13/23 18:02, Stanislav Fomichev wrote:
> On 11/13, Jesper Dangaard Brouer wrote:
>>
>>
>> On 11/13/23 15:10, Stanislav Fomichev wrote:
>>> On Mon, Nov 13, 2023 at 5:16 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>>>>
>>>>
>>>> On 11/2/23 23:58, Stanislav Fomichev wrote:
>>>>> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
>>>>> index 2ecf79282c26..b0ee7ad19b51 100644
>>>>> --- a/include/uapi/linux/if_xdp.h
>>>>> +++ b/include/uapi/linux/if_xdp.h
>>>>> @@ -106,6 +106,41 @@ struct xdp_options {
>>>>>     #define XSK_UNALIGNED_BUF_ADDR_MASK \
>>>>>         ((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
>>>>>
>>>>> +/* Request transmit timestamp. Upon completion, put it into tx_timestamp
>>>>> + * field of struct xsk_tx_metadata.
>>>>> + */
>>>>> +#define XDP_TXMD_FLAGS_TIMESTAMP             (1 << 0)
>>>>> +
>>>>> +/* Request transmit checksum offload. Checksum start position and offset
>>>>> + * are communicated via csum_start and csum_offset fields of struct
>>>>> + * xsk_tx_metadata.
>>>>> + */
>>>>> +#define XDP_TXMD_FLAGS_CHECKSUM                      (1 << 1)
>>>>> +
>>>>> +/* AF_XDP offloads request. 'request' union member is consumed by the driver
>>>>> + * when the packet is being transmitted. 'completion' union member is
>>>>> + * filled by the driver when the transmit completion arrives.
>>>>> + */
>>>>> +struct xsk_tx_metadata {
>>>>> +     union {
>>>>> +             struct {
>>>>> +                     __u32 flags;
>>>>> +
>>>>> +                     /* XDP_TXMD_FLAGS_CHECKSUM */
>>>>> +
>>>>> +                     /* Offset from desc->addr where checksumming should start. */
>>>>> +                     __u16 csum_start;
>>>>> +                     /* Offset from csum_start where checksum should be stored. */
>>>>> +                     __u16 csum_offset;
>>>>> +             } request;
>>>>> +
>>>>> +             struct {
>>>>> +                     /* XDP_TXMD_FLAGS_TIMESTAMP */
>>>>> +                     __u64 tx_timestamp;
>>>>> +             } completion;
>>>>> +     };
>>>>> +};
>>>>
>>>> This looks wrong to me. It looks like member @flags is not avail at
>>>> completion time.  At completion time, I assume we also want to know if
>>>> someone requested to get the timestamp for this packet (else we could
>>>> read garbage).
>>>
>>> I've moved the parts that are preserved across tx and tx completion
>>> into xsk_tx_metadata_compl.
>>> This is to address Magnus/Maciej feedback where userspace might race
>>> with the kernel.
>>> See: https://lore.kernel.org/bpf/ZNoJenzKXW5QSR3E@boxer/
>>>
>>
>> Does this mean that every driver have to extend their TX-desc ring with
>> sizeof(struct xsk_tx_metadata_compl)?
>> Won't this affect the performance of this V5?
> 
> Yes, but it doesn't have to be a descriptor. Might be some internal
> driver completion queue (as in the case of mlx5). And definitely does
> affect performance :-( (see all the static branches to disable it)
>   
>>   $ pahole -C xsk_tx_metadata_compl
>> ./drivers/net/ethernet/stmicro/stmmac/stmmac.ko
>>   struct xsk_tx_metadata_compl {
>> 	__u64 *              tx_timestamp;         /*     0     8 */
>>
>> 	/* size: 8, cachelines: 1, members: 1 */
>> 	/* last cacheline: 8 bytes */
>>   };
>>
>> Guess, I must be misunderstanding, as I was expecting to see the @flags
>> member being preserved across, as I get the race there.
>>
>> Looking at stmmac driver, it does look like this xsk_tx_metadata_compl
>> is part of the TX-ring for completion (tx_skbuff_dma) and the
>> tx_timestamp data is getting stored here.  How is userspace AF_XDP
>> application getting access to the tx_timestamp data?
>> I though this was suppose to get stored in metadata data area (umem)?
>>
>> Also looking at the code, the kernel would not have a "crash" race on
>> the flags member (if we preserve in struct), because the code checks the
>> driver HW-TS config-state + TX-descriptor for the availability of a
>> HW-TS in the descriptor.
> 
> xsk_tx_metadata_compl stores a pointer to the completion timestamp
> in the umem, so everything still arrives via the metadata area.
> 
> We want to make sure the flags are not changing across tx and tx completion.
> Instead of saving the flags, we just use that xsk_tx_metadata_compl to
> signal to the completion that "I know that I've requested the tx
> completion timestamp, please put it at this address in umem".
> 
> I store the pointer instead of flags to avoid doing pointer math again
> at completion. But it's an implementation detail and somewhat abstracted
> from the drivers (besides the fact that it's probably has to fit in 8
> bytes).

I see it now (what I missed). At TX time you are storing a pointer where
to (later) write the TS at completion time.  It just seems overkill to
store 8 byte (pointer) to signal (via NULL) if the HWTS was requested.
Space in the drivers TX-ring is performance critical, and I think driver
developers would prefer to find a bit to indicate HWTS requested.

If HWTS was *NOT* requested, then the metadata area will not be updated
(right, correct?). Then memory area is basically garbage that survived.
How does the AF_XDP application know this packet contains a HWTS or not?

 From an UAPI PoV wouldn't it be easier to use (and extend) via keeping
the @flags member (in struct xsk_tx_metadata), but (as you already do)
not let kernel checks depend on it (to avoid the races).

> 
>>>> Another thing (I've raised this before): It would be really practical to
>>>> store an u64 opaque value at TX and then read it at Completion time.
>>>> One use-case is a forwarding application storing HW RX-time and
>>>> comparing this to TX completion time to deduce the time spend processing
>>>> the packet.
>>>
>>> This can be another member, right? But note that extending
>>> xsk_tx_metadata_compl might be a bit complicated because drivers have
>>> to carry this info somewhere. So we have to balance the amount of
>>> passed data between the tx and the completion.
>>
>> I don't think my opaque value proposal is subject to same race problem.
>> I think this can be stores in metadata area and across tx and tx
>> completion, because any race on a flags change is the userspace
>> programmers problem, as it cannot cause any kernel crash (given kernel
>> have no need to read this).
> 
> Thinking about it, I don't think this needs any special handing?
> You can request sizeof(struct xsk_tx_metadata) + sizeof(opaque data)
> as metadata. The kernel won't touch the 'opaque data' part. Or am I missing
> something?

Sure, I can just create some room after struct xsk_tx_metadata, via 
setting something larger than sizeof(struct xsk_tx_metadata).  I'm 
buying that.

--Jesper
Stanislav Fomichev Nov. 15, 2023, 1:37 p.m. UTC | #6
On 11/15, Jesper Dangaard Brouer wrote:
> 
> 
> On 11/13/23 18:02, Stanislav Fomichev wrote:
> > On 11/13, Jesper Dangaard Brouer wrote:
> > > 
> > > 
> > > On 11/13/23 15:10, Stanislav Fomichev wrote:
> > > > On Mon, Nov 13, 2023 at 5:16 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
> > > > > 
> > > > > 
> > > > > On 11/2/23 23:58, Stanislav Fomichev wrote:
> > > > > > diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> > > > > > index 2ecf79282c26..b0ee7ad19b51 100644
> > > > > > --- a/include/uapi/linux/if_xdp.h
> > > > > > +++ b/include/uapi/linux/if_xdp.h
> > > > > > @@ -106,6 +106,41 @@ struct xdp_options {
> > > > > >     #define XSK_UNALIGNED_BUF_ADDR_MASK \
> > > > > >         ((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
> > > > > > 
> > > > > > +/* Request transmit timestamp. Upon completion, put it into tx_timestamp
> > > > > > + * field of struct xsk_tx_metadata.
> > > > > > + */
> > > > > > +#define XDP_TXMD_FLAGS_TIMESTAMP             (1 << 0)
> > > > > > +
> > > > > > +/* Request transmit checksum offload. Checksum start position and offset
> > > > > > + * are communicated via csum_start and csum_offset fields of struct
> > > > > > + * xsk_tx_metadata.
> > > > > > + */
> > > > > > +#define XDP_TXMD_FLAGS_CHECKSUM                      (1 << 1)
> > > > > > +
> > > > > > +/* AF_XDP offloads request. 'request' union member is consumed by the driver
> > > > > > + * when the packet is being transmitted. 'completion' union member is
> > > > > > + * filled by the driver when the transmit completion arrives.
> > > > > > + */
> > > > > > +struct xsk_tx_metadata {
> > > > > > +     union {
> > > > > > +             struct {
> > > > > > +                     __u32 flags;
> > > > > > +
> > > > > > +                     /* XDP_TXMD_FLAGS_CHECKSUM */
> > > > > > +
> > > > > > +                     /* Offset from desc->addr where checksumming should start. */
> > > > > > +                     __u16 csum_start;
> > > > > > +                     /* Offset from csum_start where checksum should be stored. */
> > > > > > +                     __u16 csum_offset;
> > > > > > +             } request;
> > > > > > +
> > > > > > +             struct {
> > > > > > +                     /* XDP_TXMD_FLAGS_TIMESTAMP */
> > > > > > +                     __u64 tx_timestamp;
> > > > > > +             } completion;
> > > > > > +     };
> > > > > > +};
> > > > > 
> > > > > This looks wrong to me. It looks like member @flags is not avail at
> > > > > completion time.  At completion time, I assume we also want to know if
> > > > > someone requested to get the timestamp for this packet (else we could
> > > > > read garbage).
> > > > 
> > > > I've moved the parts that are preserved across tx and tx completion
> > > > into xsk_tx_metadata_compl.
> > > > This is to address Magnus/Maciej feedback where userspace might race
> > > > with the kernel.
> > > > See: https://lore.kernel.org/bpf/ZNoJenzKXW5QSR3E@boxer/
> > > > 
> > > 
> > > Does this mean that every driver have to extend their TX-desc ring with
> > > sizeof(struct xsk_tx_metadata_compl)?
> > > Won't this affect the performance of this V5?
> > 
> > Yes, but it doesn't have to be a descriptor. Might be some internal
> > driver completion queue (as in the case of mlx5). And definitely does
> > affect performance :-( (see all the static branches to disable it)
> > >   $ pahole -C xsk_tx_metadata_compl
> > > ./drivers/net/ethernet/stmicro/stmmac/stmmac.ko
> > >   struct xsk_tx_metadata_compl {
> > > 	__u64 *              tx_timestamp;         /*     0     8 */
> > > 
> > > 	/* size: 8, cachelines: 1, members: 1 */
> > > 	/* last cacheline: 8 bytes */
> > >   };
> > > 
> > > Guess, I must be misunderstanding, as I was expecting to see the @flags
> > > member being preserved across, as I get the race there.
> > > 
> > > Looking at stmmac driver, it does look like this xsk_tx_metadata_compl
> > > is part of the TX-ring for completion (tx_skbuff_dma) and the
> > > tx_timestamp data is getting stored here.  How is userspace AF_XDP
> > > application getting access to the tx_timestamp data?
> > > I though this was suppose to get stored in metadata data area (umem)?
> > > 
> > > Also looking at the code, the kernel would not have a "crash" race on
> > > the flags member (if we preserve in struct), because the code checks the
> > > driver HW-TS config-state + TX-descriptor for the availability of a
> > > HW-TS in the descriptor.
> > 
> > xsk_tx_metadata_compl stores a pointer to the completion timestamp
> > in the umem, so everything still arrives via the metadata area.
> > 
> > We want to make sure the flags are not changing across tx and tx completion.
> > Instead of saving the flags, we just use that xsk_tx_metadata_compl to
> > signal to the completion that "I know that I've requested the tx
> > completion timestamp, please put it at this address in umem".
> > 
> > I store the pointer instead of flags to avoid doing pointer math again
> > at completion. But it's an implementation detail and somewhat abstracted
> > from the drivers (besides the fact that it's probably has to fit in 8
> > bytes).
> 
> I see it now (what I missed). At TX time you are storing a pointer where
> to (later) write the TS at completion time.  It just seems overkill to
> store 8 byte (pointer) to signal (via NULL) if the HWTS was requested.
> Space in the drivers TX-ring is performance critical, and I think driver
> developers would prefer to find a bit to indicate HWTS requested.
> 
> If HWTS was *NOT* requested, then the metadata area will not be updated
> (right, correct?). Then memory area is basically garbage that survived.
> How does the AF_XDP application know this packet contains a HWTS or not?
> 
> From an UAPI PoV wouldn't it be easier to use (and extend) via keeping
> the @flags member (in struct xsk_tx_metadata), but (as you already do)
> not let kernel checks depend on it (to avoid the races).

I was assuming the userspace can keep this signal out of band or use
the same idea as suggested with padding struct xsk_tx_metadata to keep
some data around. But I see your point, it might be convenient to
keep the original flags around during completion on the uapi side.

I think I can just move flags from the request union member to the outer
struct. So the struct xsk_tx_metadata would look like:

struct xsk_tx_metadata {
	__u32 flags; /* maybe can even make this u64? */

	union {
		__u16 csum_start;
		__u16 csum_offset;
	} request;

	union {
		__u64 tx_timestamp;
	} completion;

	__u32 padding; /* to drop this padding */
};

But I'd also keep the existing xsk_tx_metadata_compl to carry the
pointer+signal around. As I mentioned previously, it's completely
opaque to the driver and we can change the internals in the future.

IOW, we won't override the flags from the kernel side and as long
as the userspace consumer doesn't mess them up it should receive
the original value at completion.

Would that work for you?
Jesper Dangaard Brouer Nov. 16, 2023, 2:30 p.m. UTC | #7
On 11/15/23 14:37, Stanislav Fomichev wrote:
> On 11/15, Jesper Dangaard Brouer wrote:
>>
>>
>> On 11/13/23 18:02, Stanislav Fomichev wrote:
>>> On 11/13, Jesper Dangaard Brouer wrote:
>>>>
>>>>
>>>> On 11/13/23 15:10, Stanislav Fomichev wrote:
>>>>> On Mon, Nov 13, 2023 at 5:16 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>>>>>>
>>>>>>
>>>>>> On 11/2/23 23:58, Stanislav Fomichev wrote:
>>>>>>> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
>>>>>>> index 2ecf79282c26..b0ee7ad19b51 100644
>>>>>>> --- a/include/uapi/linux/if_xdp.h
>>>>>>> +++ b/include/uapi/linux/if_xdp.h
>>>>>>> @@ -106,6 +106,41 @@ struct xdp_options {
>>>>>>>      #define XSK_UNALIGNED_BUF_ADDR_MASK \
>>>>>>>          ((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
>>>>>>>
>>>>>>> +/* Request transmit timestamp. Upon completion, put it into tx_timestamp
>>>>>>> + * field of struct xsk_tx_metadata.
>>>>>>> + */
>>>>>>> +#define XDP_TXMD_FLAGS_TIMESTAMP             (1 << 0)
>>>>>>> +
>>>>>>> +/* Request transmit checksum offload. Checksum start position and offset
>>>>>>> + * are communicated via csum_start and csum_offset fields of struct
>>>>>>> + * xsk_tx_metadata.
>>>>>>> + */
>>>>>>> +#define XDP_TXMD_FLAGS_CHECKSUM                      (1 << 1)
>>>>>>> +
>>>>>>> +/* AF_XDP offloads request. 'request' union member is consumed by the driver
>>>>>>> + * when the packet is being transmitted. 'completion' union member is
>>>>>>> + * filled by the driver when the transmit completion arrives.
>>>>>>> + */
>>>>>>> +struct xsk_tx_metadata {
>>>>>>> +     union {
>>>>>>> +             struct {
>>>>>>> +                     __u32 flags;
>>>>>>> +
>>>>>>> +                     /* XDP_TXMD_FLAGS_CHECKSUM */
>>>>>>> +
>>>>>>> +                     /* Offset from desc->addr where checksumming should start. */
>>>>>>> +                     __u16 csum_start;
>>>>>>> +                     /* Offset from csum_start where checksum should be stored. */
>>>>>>> +                     __u16 csum_offset;
>>>>>>> +             } request;
>>>>>>> +
>>>>>>> +             struct {
>>>>>>> +                     /* XDP_TXMD_FLAGS_TIMESTAMP */
>>>>>>> +                     __u64 tx_timestamp;
>>>>>>> +             } completion;
>>>>>>> +     };
>>>>>>> +};
>>>>>>
>>>>>> This looks wrong to me. It looks like member @flags is not avail at
>>>>>> completion time.  At completion time, I assume we also want to know if
>>>>>> someone requested to get the timestamp for this packet (else we could
>>>>>> read garbage).
>>>>>
>>>>> I've moved the parts that are preserved across tx and tx completion
>>>>> into xsk_tx_metadata_compl.
>>>>> This is to address Magnus/Maciej feedback where userspace might race
>>>>> with the kernel.
>>>>> See: https://lore.kernel.org/bpf/ZNoJenzKXW5QSR3E@boxer/
>>>>>
>>>>
>>>> Does this mean that every driver have to extend their TX-desc ring with
>>>> sizeof(struct xsk_tx_metadata_compl)?
>>>> Won't this affect the performance of this V5?
>>>
>>> Yes, but it doesn't have to be a descriptor. Might be some internal
>>> driver completion queue (as in the case of mlx5). And definitely does
>>> affect performance :-( (see all the static branches to disable it)
>>>>    $ pahole -C xsk_tx_metadata_compl
>>>> ./drivers/net/ethernet/stmicro/stmmac/stmmac.ko
>>>>    struct xsk_tx_metadata_compl {
>>>> 	__u64 *              tx_timestamp;         /*     0     8 */
>>>>
>>>> 	/* size: 8, cachelines: 1, members: 1 */
>>>> 	/* last cacheline: 8 bytes */
>>>>    };
>>>>
>>>> Guess, I must be misunderstanding, as I was expecting to see the @flags
>>>> member being preserved across, as I get the race there.
>>>>
>>>> Looking at stmmac driver, it does look like this xsk_tx_metadata_compl
>>>> is part of the TX-ring for completion (tx_skbuff_dma) and the
>>>> tx_timestamp data is getting stored here.  How is userspace AF_XDP
>>>> application getting access to the tx_timestamp data?
>>>> I though this was suppose to get stored in metadata data area (umem)?
>>>>
>>>> Also looking at the code, the kernel would not have a "crash" race on
>>>> the flags member (if we preserve in struct), because the code checks the
>>>> driver HW-TS config-state + TX-descriptor for the availability of a
>>>> HW-TS in the descriptor.
>>>
>>> xsk_tx_metadata_compl stores a pointer to the completion timestamp
>>> in the umem, so everything still arrives via the metadata area.
>>>
>>> We want to make sure the flags are not changing across tx and tx completion.
>>> Instead of saving the flags, we just use that xsk_tx_metadata_compl to
>>> signal to the completion that "I know that I've requested the tx
>>> completion timestamp, please put it at this address in umem".
>>>
>>> I store the pointer instead of flags to avoid doing pointer math again
>>> at completion. But it's an implementation detail and somewhat abstracted
>>> from the drivers (besides the fact that it's probably has to fit in 8
>>> bytes).
>>
>> I see it now (what I missed). At TX time you are storing a pointer where
>> to (later) write the TS at completion time.  It just seems overkill to
>> store 8 byte (pointer) to signal (via NULL) if the HWTS was requested.
>> Space in the drivers TX-ring is performance critical, and I think driver
>> developers would prefer to find a bit to indicate HWTS requested.
>>
>> If HWTS was *NOT* requested, then the metadata area will not be updated
>> (right, correct?). Then memory area is basically garbage that survived.
>> How does the AF_XDP application know this packet contains a HWTS or not?
>>
>>  From an UAPI PoV wouldn't it be easier to use (and extend) via keeping
>> the @flags member (in struct xsk_tx_metadata), but (as you already do)
>> not let kernel checks depend on it (to avoid the races).
> 
> I was assuming the userspace can keep this signal out of band or use
> the same idea as suggested with padding struct xsk_tx_metadata to keep
> some data around. But I see your point, it might be convenient to
> keep the original flags around during completion on the uapi side.
> 
> I think I can just move flags from the request union member to the outer
> struct. So the struct xsk_tx_metadata would look like:
> 
> struct xsk_tx_metadata {
> 	__u32 flags; /* maybe can even make this u64? */
> 

Yes to u64 for two reasons (1) this becomes UAPI and
(2) better alignment for tx_timestamp.
But I'm open to keeping it u32.

> 	union {
> 		__u16 csum_start;
> 		__u16 csum_offset;
> 	} request;
> 
> 	union {
> 		__u64 tx_timestamp;
> 	} completion;
> 
> 	__u32 padding; /* to drop this padding */
> };
> 
> But I'd also keep the existing xsk_tx_metadata_compl to carry the
> pointer+signal around. As I mentioned previously, it's completely
> opaque to the driver and we can change the internals in the future.
> 

Sure, it is an implementation detail and my objections are mostly that I
don't find it as a pretty code approach that can be hard to follow.
Maybe driver developer will object and change this later if it cost too
much to increase the element size in their TX-ring queues.

> IOW, we won't override the flags from the kernel side and as long
> as the userspace consumer doesn't mess them up it should receive
> the original value at completion.
> 
> Would that work for you?

Yes, that will work for me, thanks!

--Jesper
Stanislav Fomichev Nov. 20, 2023, 8:45 p.m. UTC | #8
On 11/16, Jesper Dangaard Brouer wrote:
> 
> 
> On 11/15/23 14:37, Stanislav Fomichev wrote:
> > On 11/15, Jesper Dangaard Brouer wrote:
> > > 
> > > 
> > > On 11/13/23 18:02, Stanislav Fomichev wrote:
> > > > On 11/13, Jesper Dangaard Brouer wrote:
> > > > > 
> > > > > 
> > > > > On 11/13/23 15:10, Stanislav Fomichev wrote:
> > > > > > On Mon, Nov 13, 2023 at 5:16 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > On 11/2/23 23:58, Stanislav Fomichev wrote:
> > > > > > > > diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> > > > > > > > index 2ecf79282c26..b0ee7ad19b51 100644
> > > > > > > > --- a/include/uapi/linux/if_xdp.h
> > > > > > > > +++ b/include/uapi/linux/if_xdp.h
> > > > > > > > @@ -106,6 +106,41 @@ struct xdp_options {
> > > > > > > >      #define XSK_UNALIGNED_BUF_ADDR_MASK \
> > > > > > > >          ((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
> > > > > > > > 
> > > > > > > > +/* Request transmit timestamp. Upon completion, put it into tx_timestamp
> > > > > > > > + * field of struct xsk_tx_metadata.
> > > > > > > > + */
> > > > > > > > +#define XDP_TXMD_FLAGS_TIMESTAMP             (1 << 0)
> > > > > > > > +
> > > > > > > > +/* Request transmit checksum offload. Checksum start position and offset
> > > > > > > > + * are communicated via csum_start and csum_offset fields of struct
> > > > > > > > + * xsk_tx_metadata.
> > > > > > > > + */
> > > > > > > > +#define XDP_TXMD_FLAGS_CHECKSUM                      (1 << 1)
> > > > > > > > +
> > > > > > > > +/* AF_XDP offloads request. 'request' union member is consumed by the driver
> > > > > > > > + * when the packet is being transmitted. 'completion' union member is
> > > > > > > > + * filled by the driver when the transmit completion arrives.
> > > > > > > > + */
> > > > > > > > +struct xsk_tx_metadata {
> > > > > > > > +     union {
> > > > > > > > +             struct {
> > > > > > > > +                     __u32 flags;
> > > > > > > > +
> > > > > > > > +                     /* XDP_TXMD_FLAGS_CHECKSUM */
> > > > > > > > +
> > > > > > > > +                     /* Offset from desc->addr where checksumming should start. */
> > > > > > > > +                     __u16 csum_start;
> > > > > > > > +                     /* Offset from csum_start where checksum should be stored. */
> > > > > > > > +                     __u16 csum_offset;
> > > > > > > > +             } request;
> > > > > > > > +
> > > > > > > > +             struct {
> > > > > > > > +                     /* XDP_TXMD_FLAGS_TIMESTAMP */
> > > > > > > > +                     __u64 tx_timestamp;
> > > > > > > > +             } completion;
> > > > > > > > +     };
> > > > > > > > +};
> > > > > > > 
> > > > > > > This looks wrong to me. It looks like member @flags is not avail at
> > > > > > > completion time.  At completion time, I assume we also want to know if
> > > > > > > someone requested to get the timestamp for this packet (else we could
> > > > > > > read garbage).
> > > > > > 
> > > > > > I've moved the parts that are preserved across tx and tx completion
> > > > > > into xsk_tx_metadata_compl.
> > > > > > This is to address Magnus/Maciej feedback where userspace might race
> > > > > > with the kernel.
> > > > > > See: https://lore.kernel.org/bpf/ZNoJenzKXW5QSR3E@boxer/
> > > > > > 
> > > > > 
> > > > > Does this mean that every driver have to extend their TX-desc ring with
> > > > > sizeof(struct xsk_tx_metadata_compl)?
> > > > > Won't this affect the performance of this V5?
> > > > 
> > > > Yes, but it doesn't have to be a descriptor. Might be some internal
> > > > driver completion queue (as in the case of mlx5). And definitely does
> > > > affect performance :-( (see all the static branches to disable it)
> > > > >    $ pahole -C xsk_tx_metadata_compl
> > > > > ./drivers/net/ethernet/stmicro/stmmac/stmmac.ko
> > > > >    struct xsk_tx_metadata_compl {
> > > > > 	__u64 *              tx_timestamp;         /*     0     8 */
> > > > > 
> > > > > 	/* size: 8, cachelines: 1, members: 1 */
> > > > > 	/* last cacheline: 8 bytes */
> > > > >    };
> > > > > 
> > > > > Guess, I must be misunderstanding, as I was expecting to see the @flags
> > > > > member being preserved across, as I get the race there.
> > > > > 
> > > > > Looking at stmmac driver, it does look like this xsk_tx_metadata_compl
> > > > > is part of the TX-ring for completion (tx_skbuff_dma) and the
> > > > > tx_timestamp data is getting stored here.  How is userspace AF_XDP
> > > > > application getting access to the tx_timestamp data?
> > > > > I though this was suppose to get stored in metadata data area (umem)?
> > > > > 
> > > > > Also looking at the code, the kernel would not have a "crash" race on
> > > > > the flags member (if we preserve in struct), because the code checks the
> > > > > driver HW-TS config-state + TX-descriptor for the availability of a
> > > > > HW-TS in the descriptor.
> > > > 
> > > > xsk_tx_metadata_compl stores a pointer to the completion timestamp
> > > > in the umem, so everything still arrives via the metadata area.
> > > > 
> > > > We want to make sure the flags are not changing across tx and tx completion.
> > > > Instead of saving the flags, we just use that xsk_tx_metadata_compl to
> > > > signal to the completion that "I know that I've requested the tx
> > > > completion timestamp, please put it at this address in umem".
> > > > 
> > > > I store the pointer instead of flags to avoid doing pointer math again
> > > > at completion. But it's an implementation detail and somewhat abstracted
> > > > from the drivers (besides the fact that it's probably has to fit in 8
> > > > bytes).
> > > 
> > > I see it now (what I missed). At TX time you are storing a pointer where
> > > to (later) write the TS at completion time.  It just seems overkill to
> > > store 8 byte (pointer) to signal (via NULL) if the HWTS was requested.
> > > Space in the drivers TX-ring is performance critical, and I think driver
> > > developers would prefer to find a bit to indicate HWTS requested.
> > > 
> > > If HWTS was *NOT* requested, then the metadata area will not be updated
> > > (right, correct?). Then memory area is basically garbage that survived.
> > > How does the AF_XDP application know this packet contains a HWTS or not?
> > > 
> > >  From an UAPI PoV wouldn't it be easier to use (and extend) via keeping
> > > the @flags member (in struct xsk_tx_metadata), but (as you already do)
> > > not let kernel checks depend on it (to avoid the races).
> > 
> > I was assuming the userspace can keep this signal out of band or use
> > the same idea as suggested with padding struct xsk_tx_metadata to keep
> > some data around. But I see your point, it might be convenient to
> > keep the original flags around during completion on the uapi side.
> > 
> > I think I can just move flags from the request union member to the outer
> > struct. So the struct xsk_tx_metadata would look like:
> > 
> > struct xsk_tx_metadata {
> > 	__u32 flags; /* maybe can even make this u64? */
> > 
> 
> Yes to u64 for two reasons (1) this becomes UAPI and
> (2) better alignment for tx_timestamp.
> But I'm open to keeping it u32.
> 
> > 	union {
> > 		__u16 csum_start;
> > 		__u16 csum_offset;
> > 	} request;
> > 
> > 	union {
> > 		__u64 tx_timestamp;
> > 	} completion;
> > 
> > 	__u32 padding; /* to drop this padding */
> > };
> > 
> > But I'd also keep the existing xsk_tx_metadata_compl to carry the
> > pointer+signal around. As I mentioned previously, it's completely
> > opaque to the driver and we can change the internals in the future.
> > 
> 
> Sure, it is an implementation detail and my objections are mostly that I
> don't find it as a pretty code approach that can be hard to follow.
> Maybe driver developer will object and change this later if it cost too
> much to increase the element size in their TX-ring queues.

To make sure I understand, your preference is to save the flags, right?
A potential problem with that approach might be that we'd also have to
carry the pointer to the original umem chunk (blowing the overhead by extra
8 bytes) or pulling it of the tx completion descriptors in the drivers (extra
complexity). Pulling it out of the tx completion also might be
problematic because because we store iova/dma addresses in the
descriptors?
diff mbox series

Patch

diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 14511b13f305..00439bcbd2e3 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -45,7 +45,6 @@  name: netdev
   -
     type: flags
     name: xdp-rx-metadata
-    render-max: true
     entries:
       -
         name: timestamp
@@ -55,6 +54,18 @@  name: netdev
         name: hash
         doc:
           Device is capable of exposing receive packet hash via bpf_xdp_metadata_rx_hash().
+  -
+    type: flags
+    name: xsk-flags
+    entries:
+      -
+        name: tx-timestamp
+        doc:
+          HW timestamping egress packets is supported by the driver.
+      -
+        name: tx-checksum
+        doc:
+          L3 checksum HW offload is supported by the driver.
 
 attribute-sets:
   -
@@ -86,6 +97,11 @@  name: netdev
              See Documentation/networking/xdp-rx-metadata.rst for more details.
         type: u64
         enum: xdp-rx-metadata
+      -
+        name: xsk-features
+        doc: Bitmask of enabled AF_XDP features.
+        type: u64
+        enum: xsk-flags
 
 operations:
   list:
@@ -103,6 +119,7 @@  name: netdev
             - xdp-features
             - xdp-zc-max-segs
             - xdp-rx-metadata-features
+            - xsk-features
       dump:
         reply: *dev-all
     -
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a16c9cc063fe..fe915fa334cc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1858,6 +1858,7 @@  enum netdev_ml_priv_type {
  *	@netdev_ops:	Includes several pointers to callbacks,
  *			if one wants to override the ndo_*() functions
  *	@xdp_metadata_ops:	Includes pointers to XDP metadata callbacks.
+ *	@xsk_tx_metadata_ops:	Includes pointers to AF_XDP TX metadata callbacks.
  *	@ethtool_ops:	Management operations
  *	@l3mdev_ops:	Layer 3 master device operations
  *	@ndisc_ops:	Includes callbacks for different IPv6 neighbour
@@ -2117,6 +2118,7 @@  struct net_device {
 	unsigned long long	priv_flags;
 	const struct net_device_ops *netdev_ops;
 	const struct xdp_metadata_ops *xdp_metadata_ops;
+	const struct xsk_tx_metadata_ops *xsk_tx_metadata_ops;
 	int			ifindex;
 	unsigned short		gflags;
 	unsigned short		hard_header_len;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 27998f73183e..b370eb8d70f7 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -566,6 +566,15 @@  struct ubuf_info_msgzc {
 int mm_account_pinned_pages(struct mmpin *mmp, size_t size);
 void mm_unaccount_pinned_pages(struct mmpin *mmp);
 
+/* Preserve some data across TX submission and completion.
+ *
+ * Note, this state is stored in the driver. Extending the layout
+ * might need some special care.
+ */
+struct xsk_tx_metadata_compl {
+	__u64 *tx_timestamp;
+};
+
 /* This data is invariant across clones and lives at
  * the end of the header data, ie. at skb->end.
  */
@@ -578,7 +587,10 @@  struct skb_shared_info {
 	/* Warning: this field is not always filled in (UFO)! */
 	unsigned short	gso_segs;
 	struct sk_buff	*frag_list;
-	struct skb_shared_hwtstamps hwtstamps;
+	union {
+		struct skb_shared_hwtstamps hwtstamps;
+		struct xsk_tx_metadata_compl xsk_meta;
+	};
 	unsigned int	gso_type;
 	u32		tskey;
 
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index bcf765124f72..52afde24596b 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -93,12 +93,105 @@  struct xdp_sock {
 	struct xsk_queue *cq_tmp; /* Only as tmp storage before bind */
 };
 
+/*
+ * AF_XDP TX metadata hooks for network devices.
+ * The following hooks can be defined; unless noted otherwise, they are
+ * optional and can be filled with a null pointer.
+ *
+ * void (*tmo_request_timestamp)(void *priv)
+ *     Called when AF_XDP frame requested egress timestamp.
+ *
+ * u64 (*tmo_fill_timestamp)(void *priv)
+ *     Called when AF_XDP frame, that had requested egress timestamp,
+ *     received a completion. The hook needs to return the actual HW timestamp.
+ *
+ * void (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv)
+ *     Called when AF_XDP frame requested HW checksum offload. csum_start
+ *     indicates position where checksumming should start.
+ *     csum_offset indicates position where checksum should be stored.
+ *
+ */
+struct xsk_tx_metadata_ops {
+	void	(*tmo_request_timestamp)(void *priv);
+	u64	(*tmo_fill_timestamp)(void *priv);
+	void	(*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv);
+};
+
 #ifdef CONFIG_XDP_SOCKETS
 
 int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp);
 void __xsk_map_flush(void);
 
+/**
+ *  xsk_tx_metadata_to_compl - Save enough relevant metadata information
+ *  to perform tx completion in the future.
+ *  @meta: pointer to AF_XDP metadata area
+ *  @compl: pointer to output struct xsk_tx_metadata_to_compl
+ *
+ *  This function should be called by the networking device when
+ *  it prepares AF_XDP egress packet. The value of @compl should be stored
+ *  and passed to xsk_tx_metadata_complete upon TX completion.
+ */
+static inline void xsk_tx_metadata_to_compl(struct xsk_tx_metadata *meta,
+					    struct xsk_tx_metadata_compl *compl)
+{
+	if (!meta)
+		return;
+
+	if (meta->request.flags & XDP_TXMD_FLAGS_TIMESTAMP)
+		compl->tx_timestamp = &meta->completion.tx_timestamp;
+	else
+		compl->tx_timestamp = NULL;
+}
+
+/**
+ *  xsk_tx_metadata_request - Evaluate AF_XDP TX metadata at submission
+ *  and call appropriate xsk_tx_metadata_ops operation.
+ *  @meta: pointer to AF_XDP metadata area
+ *  @ops: pointer to struct xsk_tx_metadata_ops
+ *  @priv: pointer to driver-private aread
+ *
+ *  This function should be called by the networking device when
+ *  it prepares AF_XDP egress packet.
+ */
+static inline void xsk_tx_metadata_request(const struct xsk_tx_metadata *meta,
+					   const struct xsk_tx_metadata_ops *ops,
+					   void *priv)
+{
+	if (!meta)
+		return;
+
+	if (ops->tmo_request_timestamp)
+		if (meta->request.flags & XDP_TXMD_FLAGS_TIMESTAMP)
+			ops->tmo_request_timestamp(priv);
+
+	if (ops->tmo_request_checksum)
+		if (meta->request.flags & XDP_TXMD_FLAGS_CHECKSUM)
+			ops->tmo_request_checksum(meta->request.csum_start,
+						  meta->request.csum_offset, priv);
+}
+
+/**
+ *  xsk_tx_metadata_complete - Evaluate AF_XDP TX metadata at completion
+ *  and call appropriate xsk_tx_metadata_ops operation.
+ *  @compl: pointer to completion metadata produced from xsk_tx_metadata_to_compl
+ *  @ops: pointer to struct xsk_tx_metadata_ops
+ *  @priv: pointer to driver-private aread
+ *
+ *  This function should be called by the networking device upon
+ *  AF_XDP egress completion.
+ */
+static inline void xsk_tx_metadata_complete(struct xsk_tx_metadata_compl *compl,
+					    const struct xsk_tx_metadata_ops *ops,
+					    void *priv)
+{
+	if (!compl)
+		return;
+
+	*compl->tx_timestamp = ops->tmo_fill_timestamp(priv);
+}
+
 #else
 
 static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
@@ -115,6 +208,23 @@  static inline void __xsk_map_flush(void)
 {
 }
 
+static inline void xsk_tx_metadata_to_compl(struct xsk_tx_metadata *meta,
+					    struct xsk_tx_metadata_compl *compl)
+{
+}
+
+static inline void xsk_tx_metadata_request(struct xsk_tx_metadata *meta,
+					   const struct xsk_tx_metadata_ops *ops,
+					   void *priv)
+{
+}
+
+static inline void xsk_tx_metadata_complete(struct xsk_tx_metadata_compl *compl,
+					    const struct xsk_tx_metadata_ops *ops,
+					    void *priv)
+{
+}
+
 #endif /* CONFIG_XDP_SOCKETS */
 
 #if defined(CONFIG_XDP_SOCKETS) && defined(CONFIG_DEBUG_NET)
diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 1f6fc8c7a84c..e2558ac3e195 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -165,6 +165,14 @@  static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
 	return xp_raw_get_data(pool, addr);
 }
 
+static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
+{
+	if (!pool->tx_metadata_len)
+		return NULL;
+
+	return xp_raw_get_data(pool, addr) - pool->tx_metadata_len;
+}
+
 static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
 {
 	struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp);
@@ -324,6 +332,11 @@  static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
 	return NULL;
 }
 
+static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
+{
+	return NULL;
+}
+
 static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
 {
 }
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index 1985ffaf9b0c..97f5cc10d79e 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -33,6 +33,7 @@  struct xdp_buff_xsk {
 };
 
 #define XSK_CHECK_PRIV_TYPE(t) BUILD_BUG_ON(sizeof(t) > offsetofend(struct xdp_buff_xsk, cb))
+#define XSK_TX_COMPL_FITS(t) BUILD_BUG_ON(sizeof(struct xsk_tx_metadata_compl) > sizeof(t))
 
 struct xsk_dma_map {
 	dma_addr_t *dma_pages;
@@ -234,4 +235,9 @@  static inline u64 xp_get_handle(struct xdp_buff_xsk *xskb)
 	return xskb->orig_addr + (offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT);
 }
 
+static inline bool xp_tx_metadata_enabled(const struct xsk_buff_pool *pool)
+{
+	return pool->tx_metadata_len > 0;
+}
+
 #endif /* XSK_BUFF_POOL_H_ */
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 2ecf79282c26..b0ee7ad19b51 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -106,6 +106,41 @@  struct xdp_options {
 #define XSK_UNALIGNED_BUF_ADDR_MASK \
 	((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
 
+/* Request transmit timestamp. Upon completion, put it into tx_timestamp
+ * field of struct xsk_tx_metadata.
+ */
+#define XDP_TXMD_FLAGS_TIMESTAMP		(1 << 0)
+
+/* Request transmit checksum offload. Checksum start position and offset
+ * are communicated via csum_start and csum_offset fields of struct
+ * xsk_tx_metadata.
+ */
+#define XDP_TXMD_FLAGS_CHECKSUM			(1 << 1)
+
+/* AF_XDP offloads request. 'request' union member is consumed by the driver
+ * when the packet is being transmitted. 'completion' union member is
+ * filled by the driver when the transmit completion arrives.
+ */
+struct xsk_tx_metadata {
+	union {
+		struct {
+			__u32 flags;
+
+			/* XDP_TXMD_FLAGS_CHECKSUM */
+
+			/* Offset from desc->addr where checksumming should start. */
+			__u16 csum_start;
+			/* Offset from csum_start where checksum should be stored. */
+			__u16 csum_offset;
+		} request;
+
+		struct {
+			/* XDP_TXMD_FLAGS_TIMESTAMP */
+			__u64 tx_timestamp;
+		} completion;
+	};
+};
+
 /* Rx/Tx descriptor */
 struct xdp_desc {
 	__u64 addr;
@@ -122,4 +157,7 @@  struct xdp_desc {
  */
 #define XDP_PKT_CONTD (1 << 0)
 
+/* TX packet carries valid metadata. */
+#define XDP_TX_METADATA (1 << 1)
+
 #endif /* _LINUX_IF_XDP_H */
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 2943a151d4f1..48d5477a668c 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -53,12 +53,28 @@  enum netdev_xdp_rx_metadata {
 	NETDEV_XDP_RX_METADATA_MASK = 3,
 };
 
+/**
+ * enum netdev_xsk_flags
+ * @NETDEV_XSK_FLAGS_TX_TIMESTAMP: HW timestamping egress packets is supported
+ *   by the driver.
+ * @NETDEV_XSK_FLAGS_TX_CHECKSUM: L3 checksum HW offload is supported by the
+ *   driver.
+ */
+enum netdev_xsk_flags {
+	NETDEV_XSK_FLAGS_TX_TIMESTAMP = 1,
+	NETDEV_XSK_FLAGS_TX_CHECKSUM = 2,
+
+	/* private: */
+	NETDEV_XSK_FLAGS_MASK = 3,
+};
+
 enum {
 	NETDEV_A_DEV_IFINDEX = 1,
 	NETDEV_A_DEV_PAD,
 	NETDEV_A_DEV_XDP_FEATURES,
 	NETDEV_A_DEV_XDP_ZC_MAX_SEGS,
 	NETDEV_A_DEV_XDP_RX_METADATA_FEATURES,
+	NETDEV_A_DEV_XSK_FEATURES,
 
 	__NETDEV_A_DEV_MAX,
 	NETDEV_A_DEV_MAX = (__NETDEV_A_DEV_MAX - 1)
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index fe61f85bcf33..10f2124e9e23 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -6,6 +6,7 @@ 
 #include <net/net_namespace.h>
 #include <net/sock.h>
 #include <net/xdp.h>
+#include <net/xdp_sock.h>
 
 #include "netdev-genl-gen.h"
 
@@ -13,6 +14,7 @@  static int
 netdev_nl_dev_fill(struct net_device *netdev, struct sk_buff *rsp,
 		   const struct genl_info *info)
 {
+	u64 xsk_features = 0;
 	u64 xdp_rx_meta = 0;
 	void *hdr;
 
@@ -26,11 +28,20 @@  netdev_nl_dev_fill(struct net_device *netdev, struct sk_buff *rsp,
 XDP_METADATA_KFUNC_xxx
 #undef XDP_METADATA_KFUNC
 
+	if (netdev->xsk_tx_metadata_ops) {
+		if (netdev->xsk_tx_metadata_ops->tmo_fill_timestamp)
+			xsk_features |= NETDEV_XSK_FLAGS_TX_TIMESTAMP;
+		if (netdev->xsk_tx_metadata_ops->tmo_request_checksum)
+			xsk_features |= NETDEV_XSK_FLAGS_TX_CHECKSUM;
+	}
+
 	if (nla_put_u32(rsp, NETDEV_A_DEV_IFINDEX, netdev->ifindex) ||
 	    nla_put_u64_64bit(rsp, NETDEV_A_DEV_XDP_FEATURES,
 			      netdev->xdp_features, NETDEV_A_DEV_PAD) ||
 	    nla_put_u64_64bit(rsp, NETDEV_A_DEV_XDP_RX_METADATA_FEATURES,
-			      xdp_rx_meta, NETDEV_A_DEV_PAD)) {
+			      xdp_rx_meta, NETDEV_A_DEV_PAD) ||
+	    nla_put_u64_64bit(rsp, NETDEV_A_DEV_XSK_FEATURES,
+			      xsk_features, NETDEV_A_DEV_PAD)) {
 		genlmsg_cancel(rsp, hdr);
 		return -EINVAL;
 	}
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index c904356e2800..84fd10201f2a 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -571,6 +571,13 @@  static u32 xsk_get_num_desc(struct sk_buff *skb)
 
 static void xsk_destruct_skb(struct sk_buff *skb)
 {
+	struct xsk_tx_metadata_compl *compl = &skb_shinfo(skb)->xsk_meta;
+
+	if (compl->tx_timestamp) {
+		/* sw completion timestamp, not a real one */
+		*compl->tx_timestamp = ktime_get_tai_fast_ns();
+	}
+
 	xsk_cq_submit_locked(xdp_sk(skb->sk), xsk_get_num_desc(skb));
 	sock_wfree(skb);
 }
@@ -655,8 +662,10 @@  static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
 static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 				     struct xdp_desc *desc)
 {
+	struct xsk_tx_metadata *meta = NULL;
 	struct net_device *dev = xs->dev;
 	struct sk_buff *skb = xs->skb;
+	bool first_frag = false;
 	int err;
 
 	if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
@@ -687,6 +696,8 @@  static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 				kfree_skb(skb);
 				goto free_err;
 			}
+
+			first_frag = true;
 		} else {
 			int nr_frags = skb_shinfo(skb)->nr_frags;
 			struct page *page;
@@ -709,12 +720,35 @@  static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 
 			skb_add_rx_frag(skb, nr_frags, page, 0, len, 0);
 		}
+
+		if (first_frag && desc->options & XDP_TX_METADATA) {
+			if (unlikely(xs->pool->tx_metadata_len == 0)) {
+				err = -EINVAL;
+				goto free_err;
+			}
+
+			meta = buffer - xs->pool->tx_metadata_len;
+
+			if (meta->request.flags & XDP_TXMD_FLAGS_CHECKSUM) {
+				if (unlikely(meta->request.csum_start +
+					     meta->request.csum_offset +
+					     sizeof(__sum16) > len)) {
+					err = -EINVAL;
+					goto free_err;
+				}
+
+				skb->csum_start = hr + meta->request.csum_start;
+				skb->csum_offset = meta->request.csum_offset;
+				skb->ip_summed = CHECKSUM_PARTIAL;
+			}
+		}
 	}
 
 	skb->dev = dev;
 	skb->priority = READ_ONCE(xs->sk.sk_priority);
 	skb->mark = READ_ONCE(xs->sk.sk_mark);
 	skb->destructor = xsk_destruct_skb;
+	xsk_tx_metadata_to_compl(meta, &skb_shinfo(skb)->xsk_meta);
 	xsk_set_destructor_arg(skb);
 
 	return skb;
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index c74a1372bcb9..6f2d1621c992 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -137,7 +137,7 @@  static inline bool xskq_cons_read_addr_unchecked(struct xsk_queue *q, u64 *addr)
 
 static inline bool xp_unused_options_set(u32 options)
 {
-	return options & ~XDP_PKT_CONTD;
+	return options & ~(XDP_PKT_CONTD | XDP_TX_METADATA);
 }
 
 static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
index 34411a2e5b6c..29942c2c32dc 100644
--- a/tools/include/uapi/linux/if_xdp.h
+++ b/tools/include/uapi/linux/if_xdp.h
@@ -26,11 +26,11 @@ 
  */
 #define XDP_USE_NEED_WAKEUP (1 << 3)
 /* By setting this option, userspace application indicates that it can
- * handle multiple descriptors per packet thus enabling xsk core to split
+ * handle multiple descriptors per packet thus enabling AF_XDP to split
  * multi-buffer XDP frames into multiple Rx descriptors. Without this set
- * such frames will be dropped by xsk.
+ * such frames will be dropped.
  */
-#define XDP_USE_SG     (1 << 4)
+#define XDP_USE_SG	(1 << 4)
 
 /* Flags for xsk_umem_config flags */
 #define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
@@ -106,6 +106,41 @@  struct xdp_options {
 #define XSK_UNALIGNED_BUF_ADDR_MASK \
 	((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
 
+/* Request transmit timestamp. Upon completion, put it into tx_timestamp
+ * field of union xsk_tx_metadata.
+ */
+#define XDP_TXMD_FLAGS_TIMESTAMP		(1 << 0)
+
+/* Request transmit checksum offload. Checksum start position and offset
+ * are communicated via csum_start and csum_offset fields of union
+ * xsk_tx_metadata.
+ */
+#define XDP_TXMD_FLAGS_CHECKSUM			(1 << 1)
+
+/* AF_XDP offloads request. 'request' union member is consumed by the driver
+ * when the packet is being transmitted. 'completion' union member is
+ * filled by the driver when the transmit completion arrives.
+ */
+struct xsk_tx_metadata {
+	union {
+		struct {
+			__u32 flags;
+
+			/* XDP_TXMD_FLAGS_CHECKSUM */
+
+			/* Offset from desc->addr where checksumming should start. */
+			__u16 csum_start;
+			/* Offset from csum_start where checksum should be stored. */
+			__u16 csum_offset;
+		} request;
+
+		struct {
+			/* XDP_TXMD_FLAGS_TIMESTAMP */
+			__u64 tx_timestamp;
+		} completion;
+	};
+};
+
 /* Rx/Tx descriptor */
 struct xdp_desc {
 	__u64 addr;
@@ -113,9 +148,16 @@  struct xdp_desc {
 	__u32 options;
 };
 
-/* Flag indicating packet constitutes of multiple buffers*/
+/* UMEM descriptor is __u64 */
+
+/* Flag indicating that the packet continues with the buffer pointed out by the
+ * next frame in the ring. The end of the packet is signalled by setting this
+ * bit to zero. For single buffer packets, every descriptor has 'options' set
+ * to 0 and this maintains backward compatibility.
+ */
 #define XDP_PKT_CONTD (1 << 0)
 
-/* UMEM descriptor is __u64 */
+/* TX packet carries valid metadata. */
+#define XDP_TX_METADATA (1 << 1)
 
 #endif /* _LINUX_IF_XDP_H */
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 2943a151d4f1..48d5477a668c 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -53,12 +53,28 @@  enum netdev_xdp_rx_metadata {
 	NETDEV_XDP_RX_METADATA_MASK = 3,
 };
 
+/**
+ * enum netdev_xsk_flags
+ * @NETDEV_XSK_FLAGS_TX_TIMESTAMP: HW timestamping egress packets is supported
+ *   by the driver.
+ * @NETDEV_XSK_FLAGS_TX_CHECKSUM: L3 checksum HW offload is supported by the
+ *   driver.
+ */
+enum netdev_xsk_flags {
+	NETDEV_XSK_FLAGS_TX_TIMESTAMP = 1,
+	NETDEV_XSK_FLAGS_TX_CHECKSUM = 2,
+
+	/* private: */
+	NETDEV_XSK_FLAGS_MASK = 3,
+};
+
 enum {
 	NETDEV_A_DEV_IFINDEX = 1,
 	NETDEV_A_DEV_PAD,
 	NETDEV_A_DEV_XDP_FEATURES,
 	NETDEV_A_DEV_XDP_ZC_MAX_SEGS,
 	NETDEV_A_DEV_XDP_RX_METADATA_FEATURES,
+	NETDEV_A_DEV_XSK_FEATURES,
 
 	__NETDEV_A_DEV_MAX,
 	NETDEV_A_DEV_MAX = (__NETDEV_A_DEV_MAX - 1)
diff --git a/tools/net/ynl/generated/netdev-user.c b/tools/net/ynl/generated/netdev-user.c
index b5ffe8cd1144..6283d87dad37 100644
--- a/tools/net/ynl/generated/netdev-user.c
+++ b/tools/net/ynl/generated/netdev-user.c
@@ -58,6 +58,19 @@  const char *netdev_xdp_rx_metadata_str(enum netdev_xdp_rx_metadata value)
 	return netdev_xdp_rx_metadata_strmap[value];
 }
 
+static const char * const netdev_xsk_flags_strmap[] = {
+	[0] = "tx-timestamp",
+	[1] = "tx-checksum",
+};
+
+const char *netdev_xsk_flags_str(enum netdev_xsk_flags value)
+{
+	value = ffs(value) - 1;
+	if (value < 0 || value >= (int)MNL_ARRAY_SIZE(netdev_xsk_flags_strmap))
+		return NULL;
+	return netdev_xsk_flags_strmap[value];
+}
+
 /* Policies */
 struct ynl_policy_attr netdev_dev_policy[NETDEV_A_DEV_MAX + 1] = {
 	[NETDEV_A_DEV_IFINDEX] = { .name = "ifindex", .type = YNL_PT_U32, },
@@ -65,6 +78,7 @@  struct ynl_policy_attr netdev_dev_policy[NETDEV_A_DEV_MAX + 1] = {
 	[NETDEV_A_DEV_XDP_FEATURES] = { .name = "xdp-features", .type = YNL_PT_U64, },
 	[NETDEV_A_DEV_XDP_ZC_MAX_SEGS] = { .name = "xdp-zc-max-segs", .type = YNL_PT_U32, },
 	[NETDEV_A_DEV_XDP_RX_METADATA_FEATURES] = { .name = "xdp-rx-metadata-features", .type = YNL_PT_U64, },
+	[NETDEV_A_DEV_XSK_FEATURES] = { .name = "xsk-features", .type = YNL_PT_U64, },
 };
 
 struct ynl_policy_nest netdev_dev_nest = {
@@ -116,6 +130,11 @@  int netdev_dev_get_rsp_parse(const struct nlmsghdr *nlh, void *data)
 				return MNL_CB_ERROR;
 			dst->_present.xdp_rx_metadata_features = 1;
 			dst->xdp_rx_metadata_features = mnl_attr_get_u64(attr);
+		} else if (type == NETDEV_A_DEV_XSK_FEATURES) {
+			if (ynl_attr_validate(yarg, attr))
+				return MNL_CB_ERROR;
+			dst->_present.xsk_features = 1;
+			dst->xsk_features = mnl_attr_get_u64(attr);
 		}
 	}
 
diff --git a/tools/net/ynl/generated/netdev-user.h b/tools/net/ynl/generated/netdev-user.h
index 4fafac879df3..39af1908444b 100644
--- a/tools/net/ynl/generated/netdev-user.h
+++ b/tools/net/ynl/generated/netdev-user.h
@@ -19,6 +19,7 @@  extern const struct ynl_family ynl_netdev_family;
 const char *netdev_op_str(int op);
 const char *netdev_xdp_act_str(enum netdev_xdp_act value);
 const char *netdev_xdp_rx_metadata_str(enum netdev_xdp_rx_metadata value);
+const char *netdev_xsk_flags_str(enum netdev_xsk_flags value);
 
 /* Common nested types */
 /* ============== NETDEV_CMD_DEV_GET ============== */
@@ -50,12 +51,14 @@  struct netdev_dev_get_rsp {
 		__u32 xdp_features:1;
 		__u32 xdp_zc_max_segs:1;
 		__u32 xdp_rx_metadata_features:1;
+		__u32 xsk_features:1;
 	} _present;
 
 	__u32 ifindex;
 	__u64 xdp_features;
 	__u32 xdp_zc_max_segs;
 	__u64 xdp_rx_metadata_features;
+	__u64 xsk_features;
 };
 
 void netdev_dev_get_rsp_free(struct netdev_dev_get_rsp *rsp);