diff mbox series

[net-next] net_tstamp: add SOF_TIMESTAMPING_OPT_ID_TCP

Message ID 20221205230925.3002558-1-willemdebruijn.kernel@gmail.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [net-next] net_tstamp: add SOF_TIMESTAMPING_OPT_ID_TCP | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 2669 this patch: 2669
netdev/cc_maintainers warning 10 maintainers not CCed: liuhangbin@gmail.com amcohen@nvidia.com linux@rempel-privat.de gal@nvidia.com andrew@lunn.ch alexandru.tachici@analog.com richardcochran@gmail.com linux-doc@vger.kernel.org corbet@lwn.net martin.lau@kernel.org
netdev/build_clang success Errors and warnings before: 567 this patch: 567
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 2799 this patch: 2799
netdev/checkpatch warning WARNING: line length of 81 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Willem de Bruijn Dec. 5, 2022, 11:09 p.m. UTC
From: Willem de Bruijn <willemb@google.com>

Add an option to initialize SOF_TIMESTAMPING_OPT_ID for TCP from
write_seq sockets instead of snd_una.

Intuitively the contract is that the counter is zero after the
setsockopt, so that the next write N results in a notification for
last byte N - 1.

On idle sockets snd_una == write_seq so this holds for both. But on
sockets with data in transmission, snd_una depends on the ACK response
from the peer. A process cannot learn this in a race free manner
(ioctl SIOCOUTQ is one racy approach).

write_seq is a better starting point because based on the seqno of
data written by the process only.

But the existing behavior may already be relied upon. So make the new
behavior optional behind a flag.

The new timestamp flag necessitates increasing sk_tsflags to 32 bits.
Move the field in struct sock to avoid growing the socket (for some
common CONFIG variants). The UAPI interface so_timestamping.flags is
already int, so 32 bits wide.

Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Willem de Bruijn <willemb@google.com>

---

Alternative solutions are

* make the change unconditionally: a one line change.
* make the condition a (per netns) sysctl instead of flag
* make SOF_TIMESTAMPING_OPT_ID_TCP not a modifier of, but alternative
  to SOF_TIMESTAMPING_OPT_ID. That requires also updating all existing
  code that now tests OPT_ID to test a new OPT_ID_MASK.

Weighing the variants, this seemed the best option to me.
---
 Documentation/networking/timestamping.rst | 19 +++++++++++++++++++
 include/net/sock.h                        |  6 +++---
 include/uapi/linux/net_tstamp.h           |  3 ++-
 net/core/sock.c                           |  9 ++++++++-
 net/ethtool/common.c                      |  1 +
 5 files changed, 33 insertions(+), 5 deletions(-)

Comments

Soheil Hassas Yeganeh Dec. 6, 2022, 12:34 a.m. UTC | #1
On Mon, Dec 5, 2022 at 6:09 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> From: Willem de Bruijn <willemb@google.com>
>
> Add an option to initialize SOF_TIMESTAMPING_OPT_ID for TCP from
> write_seq sockets instead of snd_una.
>
> Intuitively the contract is that the counter is zero after the
> setsockopt, so that the next write N results in a notification for
> last byte N - 1.
>
> On idle sockets snd_una == write_seq so this holds for both. But on
> sockets with data in transmission, snd_una depends on the ACK response
> from the peer. A process cannot learn this in a race free manner
> (ioctl SIOCOUTQ is one racy approach).
>
> write_seq is a better starting point because based on the seqno of
> data written by the process only.
>
> But the existing behavior may already be relied upon. So make the new
> behavior optional behind a flag.
>
> The new timestamp flag necessitates increasing sk_tsflags to 32 bits.
> Move the field in struct sock to avoid growing the socket (for some
> common CONFIG variants). The UAPI interface so_timestamping.flags is
> already int, so 32 bits wide.
>
> Reported-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Willem de Bruijn <willemb@google.com>

Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

Thanks for the fix!

> ---
>
> Alternative solutions are
>
> * make the change unconditionally: a one line change.
> * make the condition a (per netns) sysctl instead of flag
> * make SOF_TIMESTAMPING_OPT_ID_TCP not a modifier of, but alternative
>   to SOF_TIMESTAMPING_OPT_ID. That requires also updating all existing
>   code that now tests OPT_ID to test a new OPT_ID_MASK.
>
> Weighing the variants, this seemed the best option to me.
> ---
>  Documentation/networking/timestamping.rst | 19 +++++++++++++++++++
>  include/net/sock.h                        |  6 +++---
>  include/uapi/linux/net_tstamp.h           |  3 ++-
>  net/core/sock.c                           |  9 ++++++++-
>  net/ethtool/common.c                      |  1 +
>  5 files changed, 33 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/networking/timestamping.rst b/Documentation/networking/timestamping.rst
> index be4eb1242057..578f24731be5 100644
> --- a/Documentation/networking/timestamping.rst
> +++ b/Documentation/networking/timestamping.rst
> @@ -192,6 +192,25 @@ SOF_TIMESTAMPING_OPT_ID:
>    among all possibly concurrently outstanding timestamp requests for
>    that socket.
>
> +SOF_TIMESTAMPING_OPT_ID_TCP:
> +  Pass this modifier along with SOF_TIMESTAMPING_OPT_ID for new TCP
> +  timestamping applications. SOF_TIMESTAMPING_OPT_ID defines how the
> +  counter increments for stream sockets, but its starting point is
> +  not entirely trivial. This modifier option changes that point.
> +
> +  A reasonable expectation is that the counter is reset to zero with
> +  the system call, so that a subsequent write() of N bytes generates
> +  a timestamp with counter N-1. SOF_TIMESTAMPING_OPT_ID_TCP
> +  implements this behavior under all conditions.
> +
> +  SOF_TIMESTAMPING_OPT_ID without modifier often reports the same,
> +  especially when the socket option is set when no data is in
> +  transmission. If data is being transmitted, it may be off by the
> +  length of the output queue (SIOCOUTQ) due to being based on snd_una
> +  rather than write_seq. That offset depends on factors outside of
> +  process control, including network RTT and peer response time. The
> +  difference is subtle and unlikely to be noticed when confiugred at
> +  initial socket creation. But .._OPT_ID behavior is more predictable.
>
>  SOF_TIMESTAMPING_OPT_CMSG:
>    Support recv() cmsg for all timestamped packets. Control messages
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 6d207e7c4ad0..ecea3dcc2217 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -503,10 +503,10 @@ struct sock {
>  #if BITS_PER_LONG==32
>         seqlock_t               sk_stamp_seq;
>  #endif
> -       u16                     sk_tsflags;
> -       u8                      sk_shutdown;
>         atomic_t                sk_tskey;
>         atomic_t                sk_zckey;
> +       u32                     sk_tsflags;
> +       u8                      sk_shutdown;
>
>         u8                      sk_clockid;
>         u8                      sk_txtime_deadline_mode : 1,
> @@ -1899,7 +1899,7 @@ static inline void sock_replace_proto(struct sock *sk, struct proto *proto)
>  struct sockcm_cookie {
>         u64 transmit_time;
>         u32 mark;
> -       u16 tsflags;
> +       u32 tsflags;
>  };
>
>  static inline void sockcm_init(struct sockcm_cookie *sockc,
> diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
> index 55501e5e7ac8..a2c66b3d7f0f 100644
> --- a/include/uapi/linux/net_tstamp.h
> +++ b/include/uapi/linux/net_tstamp.h
> @@ -31,8 +31,9 @@ enum {
>         SOF_TIMESTAMPING_OPT_PKTINFO = (1<<13),
>         SOF_TIMESTAMPING_OPT_TX_SWHW = (1<<14),
>         SOF_TIMESTAMPING_BIND_PHC = (1 << 15),
> +       SOF_TIMESTAMPING_OPT_ID_TCP = (1 << 16),
>
> -       SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_BIND_PHC,
> +       SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_OPT_ID_TCP,
>         SOF_TIMESTAMPING_MASK = (SOF_TIMESTAMPING_LAST - 1) |
>                                  SOF_TIMESTAMPING_LAST
>  };
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 4571914a4aa8..b0ab841e0aed 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -901,13 +901,20 @@ int sock_set_timestamping(struct sock *sk, int optname,
>         if (val & ~SOF_TIMESTAMPING_MASK)
>                 return -EINVAL;
>
> +       if (val & SOF_TIMESTAMPING_OPT_ID_TCP &&
> +           !(val & SOF_TIMESTAMPING_OPT_ID))
> +               return -EINVAL;
> +
>         if (val & SOF_TIMESTAMPING_OPT_ID &&
>             !(sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
>                 if (sk_is_tcp(sk)) {
>                         if ((1 << sk->sk_state) &
>                             (TCPF_CLOSE | TCPF_LISTEN))
>                                 return -EINVAL;
> -                       atomic_set(&sk->sk_tskey, tcp_sk(sk)->snd_una);
> +                       if (val & SOF_TIMESTAMPING_OPT_ID_TCP)
> +                               atomic_set(&sk->sk_tskey, tcp_sk(sk)->write_seq);
> +                       else
> +                               atomic_set(&sk->sk_tskey, tcp_sk(sk)->snd_una);
>                 } else {
>                         atomic_set(&sk->sk_tskey, 0);
>                 }
> diff --git a/net/ethtool/common.c b/net/ethtool/common.c
> index 21cfe8557205..6f399afc2ff2 100644
> --- a/net/ethtool/common.c
> +++ b/net/ethtool/common.c
> @@ -417,6 +417,7 @@ const char sof_timestamping_names[][ETH_GSTRING_LEN] = {
>         [const_ilog2(SOF_TIMESTAMPING_OPT_PKTINFO)]  = "option-pktinfo",
>         [const_ilog2(SOF_TIMESTAMPING_OPT_TX_SWHW)]  = "option-tx-swhw",
>         [const_ilog2(SOF_TIMESTAMPING_BIND_PHC)]     = "bind-phc",
> +       [const_ilog2(SOF_TIMESTAMPING_OPT_ID_TCP)]   = "option-id-tcp",
>  };
>  static_assert(ARRAY_SIZE(sof_timestamping_names) == __SOF_TIMESTAMPING_CNT);
>
> --
> 2.39.0.rc0.267.gcb52ba06e7-goog
>
Jakub Kicinski Dec. 6, 2022, 8:22 p.m. UTC | #2
On Mon,  5 Dec 2022 18:09:25 -0500 Willem de Bruijn wrote:
> Add an option to initialize SOF_TIMESTAMPING_OPT_ID for TCP from
> write_seq sockets instead of snd_una.
> 
> Intuitively the contract is that the counter is zero after the
> setsockopt, so that the next write N results in a notification for
> last byte N - 1.
> 
> On idle sockets snd_una == write_seq so this holds for both. But on
> sockets with data in transmission, snd_una depends on the ACK response
> from the peer. A process cannot learn this in a race free manner
> (ioctl SIOCOUTQ is one racy approach).

We can't just copy back the value of 

	tcp_sk(sk)->snd_una - tcp_sk(sk)->write_seq

to the user if the input of setsockopt is large enough (ie. extend the
struct, if len >= sizeof(new struct) -> user is asking to get this?
Or even add a bit somewhere that requests a copy back?

Highly unlikely to break anything, I reckon? But whether setsockopt()
can copy back is not 100% clear to me...

> write_seq is a better starting point because based on the seqno of
> data written by the process only.
> 
> But the existing behavior may already be relied upon. So make the new
> behavior optional behind a flag.
> 
> The new timestamp flag necessitates increasing sk_tsflags to 32 bits.
> Move the field in struct sock to avoid growing the socket (for some
> common CONFIG variants). The UAPI interface so_timestamping.flags is
> already int, so 32 bits wide.
> 
> Reported-by: Jakub Kicinski <kuba@kernel.org>

Reported-by: Sotirios Delimanolis <sotodel@meta.com>

I'm just a bad human information router.

> Signed-off-by: Willem de Bruijn <willemb@google.com>
> 
> ---
> 
> Alternative solutions are
> 
> * make the change unconditionally: a one line change.
> * make the condition a (per netns) sysctl instead of flag
> * make SOF_TIMESTAMPING_OPT_ID_TCP not a modifier of, but alternative
>   to SOF_TIMESTAMPING_OPT_ID. That requires also updating all existing
>   code that now tests OPT_ID to test a new OPT_ID_MASK.

 * copy back the SIOCOUTQ

;)

> Weighing the variants, this seemed the best option to me.
> ---
>  Documentation/networking/timestamping.rst | 19 +++++++++++++++++++
>  include/net/sock.h                        |  6 +++---
>  include/uapi/linux/net_tstamp.h           |  3 ++-
>  net/core/sock.c                           |  9 ++++++++-
>  net/ethtool/common.c                      |  1 +
>  5 files changed, 33 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/networking/timestamping.rst b/Documentation/networking/timestamping.rst
> index be4eb1242057..578f24731be5 100644
> --- a/Documentation/networking/timestamping.rst
> +++ b/Documentation/networking/timestamping.rst
> @@ -192,6 +192,25 @@ SOF_TIMESTAMPING_OPT_ID:
>    among all possibly concurrently outstanding timestamp requests for
>    that socket.
>  
> +SOF_TIMESTAMPING_OPT_ID_TCP:
> +  Pass this modifier along with SOF_TIMESTAMPING_OPT_ID for new TCP
> +  timestamping applications. SOF_TIMESTAMPING_OPT_ID defines how the
> +  counter increments for stream sockets, but its starting point is
> +  not entirely trivial. This modifier option changes that point.
> +
> +  A reasonable expectation is that the counter is reset to zero with
> +  the system call, so that a subsequent write() of N bytes generates
> +  a timestamp with counter N-1. SOF_TIMESTAMPING_OPT_ID_TCP
> +  implements this behavior under all conditions.
> +
> +  SOF_TIMESTAMPING_OPT_ID without modifier often reports the same,
> +  especially when the socket option is set when no data is in
> +  transmission. If data is being transmitted, it may be off by the
> +  length of the output queue (SIOCOUTQ) due to being based on snd_una
> +  rather than write_seq. That offset depends on factors outside of
> +  process control, including network RTT and peer response time. The
> +  difference is subtle and unlikely to be noticed when confiugred at
> +  initial socket creation. But .._OPT_ID behavior is more predictable.

I reckon this needs to be more informative. Say how exactly they differ
(written vs queued for transmission). And I'd add to
SOF_TIMESTAMPING_OPT_ID docs a note to "see also .._OPT_ID_TCP version".
Willem de Bruijn Dec. 6, 2022, 8:46 p.m. UTC | #3
On Tue, Dec 6, 2022 at 3:22 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon,  5 Dec 2022 18:09:25 -0500 Willem de Bruijn wrote:
> > Add an option to initialize SOF_TIMESTAMPING_OPT_ID for TCP from
> > write_seq sockets instead of snd_una.
> >
> > Intuitively the contract is that the counter is zero after the
> > setsockopt, so that the next write N results in a notification for
> > last byte N - 1.
> >
> > On idle sockets snd_una == write_seq so this holds for both. But on
> > sockets with data in transmission, snd_una depends on the ACK response
> > from the peer. A process cannot learn this in a race free manner
> > (ioctl SIOCOUTQ is one racy approach).
>
> We can't just copy back the value of
>
>         tcp_sk(sk)->snd_una - tcp_sk(sk)->write_seq
>
> to the user if the input of setsockopt is large enough (ie. extend the
> struct, if len >= sizeof(new struct) -> user is asking to get this?
> Or even add a bit somewhere that requests a copy back?

We could, but indeed then we first need a way to signal that the
kernel is new enough to actually write something meaningful back that
is safe to read.

And if we change the kernel API and applications, I find this a
somewhat hacky approach: why program the slightly wrong thing and
return the offset to patch it up in userspace, if we can just program
the right thing to begin with?

> Highly unlikely to break anything, I reckon? But whether setsockopt()
> can copy back is not 100% clear to me...
>
> > write_seq is a better starting point because based on the seqno of
> > data written by the process only.
> >
> > But the existing behavior may already be relied upon. So make the new
> > behavior optional behind a flag.
> >
> > The new timestamp flag necessitates increasing sk_tsflags to 32 bits.
> > Move the field in struct sock to avoid growing the socket (for some
> > common CONFIG variants). The UAPI interface so_timestamping.flags is
> > already int, so 32 bits wide.
> >
> > Reported-by: Jakub Kicinski <kuba@kernel.org>
>
> Reported-by: Sotirios Delimanolis <sotodel@meta.com>
>
> I'm just a bad human information router.
>
> > Signed-off-by: Willem de Bruijn <willemb@google.com>
> >
> > ---
> >
> > Alternative solutions are
> >
> > * make the change unconditionally: a one line change.
> > * make the condition a (per netns) sysctl instead of flag
> > * make SOF_TIMESTAMPING_OPT_ID_TCP not a modifier of, but alternative
> >   to SOF_TIMESTAMPING_OPT_ID. That requires also updating all existing
> >   code that now tests OPT_ID to test a new OPT_ID_MASK.
>
>  * copy back the SIOCOUTQ
>
> ;)
>
> > Weighing the variants, this seemed the best option to me.
> > ---
> >  Documentation/networking/timestamping.rst | 19 +++++++++++++++++++
> >  include/net/sock.h                        |  6 +++---
> >  include/uapi/linux/net_tstamp.h           |  3 ++-
> >  net/core/sock.c                           |  9 ++++++++-
> >  net/ethtool/common.c                      |  1 +
> >  5 files changed, 33 insertions(+), 5 deletions(-)
> >
> > diff --git a/Documentation/networking/timestamping.rst b/Documentation/networking/timestamping.rst
> > index be4eb1242057..578f24731be5 100644
> > --- a/Documentation/networking/timestamping.rst
> > +++ b/Documentation/networking/timestamping.rst
> > @@ -192,6 +192,25 @@ SOF_TIMESTAMPING_OPT_ID:
> >    among all possibly concurrently outstanding timestamp requests for
> >    that socket.
> >
> > +SOF_TIMESTAMPING_OPT_ID_TCP:
> > +  Pass this modifier along with SOF_TIMESTAMPING_OPT_ID for new TCP
> > +  timestamping applications. SOF_TIMESTAMPING_OPT_ID defines how the
> > +  counter increments for stream sockets, but its starting point is
> > +  not entirely trivial. This modifier option changes that point.
> > +
> > +  A reasonable expectation is that the counter is reset to zero with
> > +  the system call, so that a subsequent write() of N bytes generates
> > +  a timestamp with counter N-1. SOF_TIMESTAMPING_OPT_ID_TCP
> > +  implements this behavior under all conditions.
> > +
> > +  SOF_TIMESTAMPING_OPT_ID without modifier often reports the same,
> > +  especially when the socket option is set when no data is in
> > +  transmission. If data is being transmitted, it may be off by the
> > +  length of the output queue (SIOCOUTQ) due to being based on snd_una
> > +  rather than write_seq. That offset depends on factors outside of
> > +  process control, including network RTT and peer response time. The
> > +  difference is subtle and unlikely to be noticed when confiugred at

note to self: confiugred -> configured

> > +  initial socket creation. But .._OPT_ID behavior is more predictable.
>
> I reckon this needs to be more informative. Say how exactly they differ
> (written vs queued for transmission). And I'd add to
> SOF_TIMESTAMPING_OPT_ID docs a note to "see also .._OPT_ID_TCP version".

Will do. Assuming we're good with this approach.
Jakub Kicinski Dec. 6, 2022, 8:58 p.m. UTC | #4
On Tue, 6 Dec 2022 15:46:25 -0500 Willem de Bruijn wrote:
> > We can't just copy back the value of
> >
> >         tcp_sk(sk)->snd_una - tcp_sk(sk)->write_seq
> >
> > to the user if the input of setsockopt is large enough (ie. extend the
> > struct, if len >= sizeof(new struct) -> user is asking to get this?
> > Or even add a bit somewhere that requests a copy back?  
> 
> We could, but indeed then we first need a way to signal that the
> kernel is new enough to actually write something meaningful back that
> is safe to read.

It should be sufficient to init the memory to -1. 
(I guess I'm not helping my own "this is less hacky" argument? :)

> And if we change the kernel API and applications, I find this a
> somewhat hacky approach: why program the slightly wrong thing and
> return the offset to patch it up in userspace, if we can just program
> the right thing to begin with?

Ah, so you'd also switch all your apps to use this new bit?

What wasn't clear to me whether this is a
 1 - we clearly did the wrong thing
or
 2 - there is a legit use case for un-packetized(?) data not being
     counted

In case of (1) we should make it clearer that the new bit is in fact
a "fixed" version of the functionality.
For (2) we can view this as an extension of the existing functionality
so combining in the same bit with write back seems natural (and TBH 
I like the single syscall probing path more than try-one-then-the-other,
but that's 100% subjective).

Anyway, don't wanna waste too much of your time. If you prefer to keep
as is the doc change is good enough for me.
Willem de Bruijn Dec. 6, 2022, 9:21 p.m. UTC | #5
On Tue, Dec 6, 2022 at 3:58 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 6 Dec 2022 15:46:25 -0500 Willem de Bruijn wrote:
> > > We can't just copy back the value of
> > >
> > >         tcp_sk(sk)->snd_una - tcp_sk(sk)->write_seq
> > >
> > > to the user if the input of setsockopt is large enough (ie. extend the
> > > struct, if len >= sizeof(new struct) -> user is asking to get this?
> > > Or even add a bit somewhere that requests a copy back?
> >
> > We could, but indeed then we first need a way to signal that the
> > kernel is new enough to actually write something meaningful back that
> > is safe to read.
>
> It should be sufficient to init the memory to -1.
> (I guess I'm not helping my own "this is less hacky" argument? :)
>
> > And if we change the kernel API and applications, I find this a
> > somewhat hacky approach: why program the slightly wrong thing and
> > return the offset to patch it up in userspace, if we can just program
> > the right thing to begin with?
>
> Ah, so you'd also switch all your apps to use this new bit?
>
> What wasn't clear to me whether this is a
>  1 - we clearly did the wrong thing
> or
>  2 - there is a legit use case for un-packetized(?) data not being
>      counted
>
> In case of (1) we should make it clearer that the new bit is in fact
> a "fixed" version of the functionality.
> For (2) we can view this as an extension of the existing functionality
> so combining in the same bit with write back seems natural (and TBH
> I like the single syscall probing path more than try-one-then-the-other,
> but that's 100% subjective).
>
> Anyway, don't wanna waste too much of your time. If you prefer to keep
> as is the doc change is good enough for me.

It's definitely 1. I'll be more explicit in the documentation and
commit message about that.

I would have just made the one line 's/snd_una/write_seq/' change if I
could be certain that no existing code relies on the current behavior.
But I already came across it in one test and production. It's too
risky.
diff mbox series

Patch

diff --git a/Documentation/networking/timestamping.rst b/Documentation/networking/timestamping.rst
index be4eb1242057..578f24731be5 100644
--- a/Documentation/networking/timestamping.rst
+++ b/Documentation/networking/timestamping.rst
@@ -192,6 +192,25 @@  SOF_TIMESTAMPING_OPT_ID:
   among all possibly concurrently outstanding timestamp requests for
   that socket.
 
+SOF_TIMESTAMPING_OPT_ID_TCP:
+  Pass this modifier along with SOF_TIMESTAMPING_OPT_ID for new TCP
+  timestamping applications. SOF_TIMESTAMPING_OPT_ID defines how the
+  counter increments for stream sockets, but its starting point is
+  not entirely trivial. This modifier option changes that point.
+
+  A reasonable expectation is that the counter is reset to zero with
+  the system call, so that a subsequent write() of N bytes generates
+  a timestamp with counter N-1. SOF_TIMESTAMPING_OPT_ID_TCP
+  implements this behavior under all conditions.
+
+  SOF_TIMESTAMPING_OPT_ID without modifier often reports the same,
+  especially when the socket option is set when no data is in
+  transmission. If data is being transmitted, it may be off by the
+  length of the output queue (SIOCOUTQ) due to being based on snd_una
+  rather than write_seq. That offset depends on factors outside of
+  process control, including network RTT and peer response time. The
+  difference is subtle and unlikely to be noticed when confiugred at
+  initial socket creation. But .._OPT_ID behavior is more predictable.
 
 SOF_TIMESTAMPING_OPT_CMSG:
   Support recv() cmsg for all timestamped packets. Control messages
diff --git a/include/net/sock.h b/include/net/sock.h
index 6d207e7c4ad0..ecea3dcc2217 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -503,10 +503,10 @@  struct sock {
 #if BITS_PER_LONG==32
 	seqlock_t		sk_stamp_seq;
 #endif
-	u16			sk_tsflags;
-	u8			sk_shutdown;
 	atomic_t		sk_tskey;
 	atomic_t		sk_zckey;
+	u32			sk_tsflags;
+	u8			sk_shutdown;
 
 	u8			sk_clockid;
 	u8			sk_txtime_deadline_mode : 1,
@@ -1899,7 +1899,7 @@  static inline void sock_replace_proto(struct sock *sk, struct proto *proto)
 struct sockcm_cookie {
 	u64 transmit_time;
 	u32 mark;
-	u16 tsflags;
+	u32 tsflags;
 };
 
 static inline void sockcm_init(struct sockcm_cookie *sockc,
diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
index 55501e5e7ac8..a2c66b3d7f0f 100644
--- a/include/uapi/linux/net_tstamp.h
+++ b/include/uapi/linux/net_tstamp.h
@@ -31,8 +31,9 @@  enum {
 	SOF_TIMESTAMPING_OPT_PKTINFO = (1<<13),
 	SOF_TIMESTAMPING_OPT_TX_SWHW = (1<<14),
 	SOF_TIMESTAMPING_BIND_PHC = (1 << 15),
+	SOF_TIMESTAMPING_OPT_ID_TCP = (1 << 16),
 
-	SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_BIND_PHC,
+	SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_OPT_ID_TCP,
 	SOF_TIMESTAMPING_MASK = (SOF_TIMESTAMPING_LAST - 1) |
 				 SOF_TIMESTAMPING_LAST
 };
diff --git a/net/core/sock.c b/net/core/sock.c
index 4571914a4aa8..b0ab841e0aed 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -901,13 +901,20 @@  int sock_set_timestamping(struct sock *sk, int optname,
 	if (val & ~SOF_TIMESTAMPING_MASK)
 		return -EINVAL;
 
+	if (val & SOF_TIMESTAMPING_OPT_ID_TCP &&
+	    !(val & SOF_TIMESTAMPING_OPT_ID))
+		return -EINVAL;
+
 	if (val & SOF_TIMESTAMPING_OPT_ID &&
 	    !(sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
 		if (sk_is_tcp(sk)) {
 			if ((1 << sk->sk_state) &
 			    (TCPF_CLOSE | TCPF_LISTEN))
 				return -EINVAL;
-			atomic_set(&sk->sk_tskey, tcp_sk(sk)->snd_una);
+			if (val & SOF_TIMESTAMPING_OPT_ID_TCP)
+				atomic_set(&sk->sk_tskey, tcp_sk(sk)->write_seq);
+			else
+				atomic_set(&sk->sk_tskey, tcp_sk(sk)->snd_una);
 		} else {
 			atomic_set(&sk->sk_tskey, 0);
 		}
diff --git a/net/ethtool/common.c b/net/ethtool/common.c
index 21cfe8557205..6f399afc2ff2 100644
--- a/net/ethtool/common.c
+++ b/net/ethtool/common.c
@@ -417,6 +417,7 @@  const char sof_timestamping_names[][ETH_GSTRING_LEN] = {
 	[const_ilog2(SOF_TIMESTAMPING_OPT_PKTINFO)]  = "option-pktinfo",
 	[const_ilog2(SOF_TIMESTAMPING_OPT_TX_SWHW)]  = "option-tx-swhw",
 	[const_ilog2(SOF_TIMESTAMPING_BIND_PHC)]     = "bind-phc",
+	[const_ilog2(SOF_TIMESTAMPING_OPT_ID_TCP)]   = "option-id-tcp",
 };
 static_assert(ARRAY_SIZE(sof_timestamping_names) == __SOF_TIMESTAMPING_CNT);