diff mbox series

[net-next] tcp: Add tracepoint for rxtstamp coalescing

Message ID 20240611045830.67640-1-lulie@linux.alibaba.com (mailing list archive)
State Rejected
Delegated to: Netdev Maintainers
Headers show
Series [net-next] tcp: Add tracepoint for rxtstamp coalescing | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 874 this patch: 874
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 1 maintainers not CCed: linux-trace-kernel@vger.kernel.org
netdev/build_clang success Errors and warnings before: 856 this patch: 856
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 878 this patch: 878
netdev/checkpatch warning CHECK: Alignment should match open parenthesis CHECK: Lines should not end with a '(' CHECK: No space is necessary after a cast WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 94 exceeds 80 columns WARNING: line length of 97 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 1 this patch: 1
netdev/source_inline success Was 0 now: 0

Commit Message

Philo Lu June 11, 2024, 4:58 a.m. UTC
During tcp coalescence, rx timestamps of the former skb ("to" in
tcp_try_coalesce), will be lost. This may lead to inaccurate
timestamping results if skbs come out of order.

Here is an example.
Assume a message consists of 3 skbs, namely A, B, and C. And these skbs
are processed by tcp in the following order:
A -(1us)-> C -(1ms)-> B
If C is coalesced to B, the final rx timestamps of the message will be
those of C. That is, the timestamps show that we received the message
when C came (including hardware and software). However, we actually
received it 1ms later (when B came).

With the added tracepoint, we can recognize such cases and report them
if we want.

Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
---
 include/trace/events/tcp.h | 61 ++++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp_input.c       |  2 ++
 2 files changed, 63 insertions(+)

Comments

Paolo Abeni June 14, 2024, 8:09 a.m. UTC | #1
On Tue, 2024-06-11 at 12:58 +0800, Philo Lu wrote:
> During tcp coalescence, rx timestamps of the former skb ("to" in
> tcp_try_coalesce), will be lost. This may lead to inaccurate
> timestamping results if skbs come out of order.
> 
> Here is an example.
> Assume a message consists of 3 skbs, namely A, B, and C. And these skbs
> are processed by tcp in the following order:
> A -(1us)-> C -(1ms)-> B

IMHO the above order makes the changelog confusing

> If C is coalesced to B, the final rx timestamps of the message will be
> those of C. That is, the timestamps show that we received the message
> when C came (including hardware and software). However, we actually
> received it 1ms later (when B came).
> 
> With the added tracepoint, we can recognize such cases and report them
> if we want.

We really need very good reasons to add new tracepoints to TCP. I'm
unsure if the above example match such requirement. The reported
timestamp actually matches the first byte in the aggregate segment,
inferring anything more is IMHO stretching too far the API semantic.

Cheers,

Paolo
Eric Dumazet June 14, 2024, 8:25 a.m. UTC | #2
On Fri, Jun 14, 2024 at 10:09 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Tue, 2024-06-11 at 12:58 +0800, Philo Lu wrote:
> > During tcp coalescence, rx timestamps of the former skb ("to" in
> > tcp_try_coalesce), will be lost. This may lead to inaccurate
> > timestamping results if skbs come out of order.
> >
> > Here is an example.
> > Assume a message consists of 3 skbs, namely A, B, and C. And these skbs
> > are processed by tcp in the following order:
> > A -(1us)-> C -(1ms)-> B
>
> IMHO the above order makes the changelog confusing
>
> > If C is coalesced to B, the final rx timestamps of the message will be
> > those of C. That is, the timestamps show that we received the message
> > when C came (including hardware and software). However, we actually
> > received it 1ms later (when B came).
> >
> > With the added tracepoint, we can recognize such cases and report them
> > if we want.
>
> We really need very good reasons to add new tracepoints to TCP. I'm
> unsure if the above example match such requirement. The reported
> timestamp actually matches the first byte in the aggregate segment,
> inferring anything more is IMHO stretching too far the API semantic.
>

Note the current behavior was a conscious choice, see
commit 98aaa913b4ed2503244 ("tcp: Extend SOF_TIMESTAMPING_RX_SOFTWARE
to TCP recvmsg")
for the rationale.

Perhaps another application would need to add a new timestamp to report
both the oldest and newest timestamps.

Or add a socket flag to prevent coalescing for applications needing
precise timestamps.

Willem might know better about this.

I agree the tracepoint seems not needed. What about solving the issue instead ?
Philo Lu June 14, 2024, 11:27 a.m. UTC | #3
On 2024/6/14 16:09, Paolo Abeni wrote:
> On Tue, 2024-06-11 at 12:58 +0800, Philo Lu wrote:
>> During tcp coalescence, rx timestamps of the former skb ("to" in
>> tcp_try_coalesce), will be lost. This may lead to inaccurate
>> timestamping results if skbs come out of order.
>>
>> Here is an example.
>> Assume a message consists of 3 skbs, namely A, B, and C. And these skbs
>> are processed by tcp in the following order:
>> A -(1us)-> C -(1ms)-> B
> 
> IMHO the above order makes the changelog confusing
>
Thank you for pointing out this, I'll try to explain it in another way.

Here it means these packets come in order of A-C-B, and B comes much 
later than C, such as 1ms later. Currently, the 1ms delay would be lost 
because of tcp coalescing.
Philo Lu June 14, 2024, 11:27 a.m. UTC | #4
On 2024/6/14 16:25, Eric Dumazet wrote:
> On Fri, Jun 14, 2024 at 10:09 AM Paolo Abeni <pabeni@redhat.com> wrote:
>>
>> On Tue, 2024-06-11 at 12:58 +0800, Philo Lu wrote:
>>> During tcp coalescence, rx timestamps of the former skb ("to" in
>>> tcp_try_coalesce), will be lost. This may lead to inaccurate
>>> timestamping results if skbs come out of order.
>>>
>>> Here is an example.
>>> Assume a message consists of 3 skbs, namely A, B, and C. And these skbs
>>> are processed by tcp in the following order:
>>> A -(1us)-> C -(1ms)-> B
>>
>> IMHO the above order makes the changelog confusing
>>
>>> If C is coalesced to B, the final rx timestamps of the message will be
>>> those of C. That is, the timestamps show that we received the message
>>> when C came (including hardware and software). However, we actually
>>> received it 1ms later (when B came).
>>>
>>> With the added tracepoint, we can recognize such cases and report them
>>> if we want.
>>
>> We really need very good reasons to add new tracepoints to TCP. I'm
>> unsure if the above example match such requirement. The reported
>> timestamp actually matches the first byte in the aggregate segment,
>> inferring anything more is IMHO stretching too far the API semantic.
>>
> 
> Note the current behavior was a conscious choice, see
> commit 98aaa913b4ed2503244 ("tcp: Extend SOF_TIMESTAMPING_RX_SOFTWARE
> to TCP recvmsg")
> for the rationale.
> 

IIUC, the behavior of returning the timestamp of the skb with highest 
sequence number works well without disorder. But once disorder occurs, 
tcp coalescence can cause this issue.

> Perhaps another application would need to add a new timestamp to report
> both the oldest and newest timestamps.

I prefer this way, we do need both oldest and newest timestamps of a 
message to find if any packet is unexpected delayed after sending.
But given there can be both hardware and software timestamps, we may 
need more fields in sk_buff to carry these new timestamps.

> 
> Or add a socket flag to prevent coalescing for applications needing
> precise timestamps.
> 
> Willem might know better about this.
> 
> I agree the tracepoint seems not needed. What about solving the issue instead ?
Thanks.
Willem de Bruijn June 14, 2024, 12:02 p.m. UTC | #5
> >> On Tue, 2024-06-11 at 12:58 +0800, Philo Lu wrote:
> >>> During tcp coalescence, rx timestamps of the former skb ("to" in
> >>> tcp_try_coalesce), will be lost. This may lead to inaccurate
> >>> timestamping results if skbs come out of order.
> >>>
> >>> Here is an example.
> >>> Assume a message consists of 3 skbs, namely A, B, and C. And these skbs
> >>> are processed by tcp in the following order:
> >>> A -(1us)-> C -(1ms)-> B
> >>
> >> IMHO the above order makes the changelog confusing
> >>
> >>> If C is coalesced to B, the final rx timestamps of the message will be
> >>> those of C. That is, the timestamps show that we received the message
> >>> when C came (including hardware and software). However, we actually
> >>> received it 1ms later (when B came).
> >>>
> >>> With the added tracepoint, we can recognize such cases and report them
> >>> if we want.
> >>
> >> We really need very good reasons to add new tracepoints to TCP. I'm
> >> unsure if the above example match such requirement. The reported
> >> timestamp actually matches the first byte in the aggregate segment,
> >> inferring anything more is IMHO stretching too far the API semantic.
> >>
> >
> > Note the current behavior was a conscious choice, see
> > commit 98aaa913b4ed2503244 ("tcp: Extend SOF_TIMESTAMPING_RX_SOFTWARE
> > to TCP recvmsg")
> > for the rationale.
> >
>
> IIUC, the behavior of returning the timestamp of the skb with highest
> sequence number works well without disorder. But once disorder occurs,
> tcp coalescence can cause this issue.
>
> > Perhaps another application would need to add a new timestamp to report
> > both the oldest and newest timestamps.
>
> I prefer this way, we do need both oldest and newest timestamps of a
> message to find if any packet is unexpected delayed after sending.
> But given there can be both hardware and software timestamps, we may
> need more fields in sk_buff to carry these new timestamps.

Unfortunately returning multiple timestamps in tcp_recv_timestamp
requires a new extended struct scm_timestamping, and likely an extra
field to store both after coalescing.

FWIW, I maintain a patch that also changes semantics, by returning not
the timestamp associated with the last byte in the message (which is
the current defined behavior), but the first byte that makes the
socket readable. Usually just the first byte, unless SO_RCVLOWAT is
set.

It is definitely easier to define a flag like SOF_TIMESTAMPING_POLLIN
that changes behavior of the one timestamp returned, than to return
two timestamps.


> >
> > Or add a socket flag to prevent coalescing for applications needing
> > precise timestamps.
> >
> > Willem might know better about this.
> >
> > I agree the tracepoint seems not needed. What about solving the issue instead ?
> Thanks.

A tracepoint is also not needed as a bpftrace program with kfunc on
tcp_try_coalesce should be able to access this information already
without kernel modifications. Or if it has to be at this line, a
program with kprobe at offset, but that requires manual register
reading.
Philo Lu June 18, 2024, 12:11 p.m. UTC | #6
On 2024/6/14 20:02, Willem de Bruijn wrote:
>>>> On Tue, 2024-06-11 at 12:58 +0800, Philo Lu wrote:
>>>>> During tcp coalescence, rx timestamps of the former skb ("to" in
>>>>> tcp_try_coalesce), will be lost. This may lead to inaccurate
>>>>> timestamping results if skbs come out of order.
>>>>>
>>>>> Here is an example.
>>>>> Assume a message consists of 3 skbs, namely A, B, and C. And these skbs
>>>>> are processed by tcp in the following order:
>>>>> A -(1us)-> C -(1ms)-> B
>>>>
>>>> IMHO the above order makes the changelog confusing
>>>>
>>>>> If C is coalesced to B, the final rx timestamps of the message will be
>>>>> those of C. That is, the timestamps show that we received the message
>>>>> when C came (including hardware and software). However, we actually
>>>>> received it 1ms later (when B came).
>>>>>
>>>>> With the added tracepoint, we can recognize such cases and report them
>>>>> if we want.
>>>>
>>>> We really need very good reasons to add new tracepoints to TCP. I'm
>>>> unsure if the above example match such requirement. The reported
>>>> timestamp actually matches the first byte in the aggregate segment,
>>>> inferring anything more is IMHO stretching too far the API semantic.
>>>>
>>>
>>> Note the current behavior was a conscious choice, see
>>> commit 98aaa913b4ed2503244 ("tcp: Extend SOF_TIMESTAMPING_RX_SOFTWARE
>>> to TCP recvmsg")
>>> for the rationale.
>>>
>>
>> IIUC, the behavior of returning the timestamp of the skb with highest
>> sequence number works well without disorder. But once disorder occurs,
>> tcp coalescence can cause this issue.
>>
>>> Perhaps another application would need to add a new timestamp to report
>>> both the oldest and newest timestamps.
>>
>> I prefer this way, we do need both oldest and newest timestamps of a
>> message to find if any packet is unexpected delayed after sending.
>> But given there can be both hardware and software timestamps, we may
>> need more fields in sk_buff to carry these new timestamps.
> 
> Unfortunately returning multiple timestamps in tcp_recv_timestamp
> requires a new extended struct scm_timestamping, and likely an extra
> field to store both after coalescing.
> 
> FWIW, I maintain a patch that also changes semantics, by returning not
> the timestamp associated with the last byte in the message (which is
> the current defined behavior), but the first byte that makes the
> socket readable. Usually just the first byte, unless SO_RCVLOWAT is
> set.
> 
> It is definitely easier to define a flag like SOF_TIMESTAMPING_POLLIN
> that changes behavior of the one timestamp returned, than to return
> two timestamps.

I believe this is a step forward because now we can choose to get the 
oldest or newest timestamps. However, even with this option, it seems 
still unclear whether and how long an skb gets stuck in the out-of-order 
queue.

>>>
>>> Or add a socket flag to prevent coalescing for applications needing
>>> precise timestamps.
>>>
>>> Willem might know better about this.
>>>
>>> I agree the tracepoint seems not needed. What about solving the issue instead ?
>> Thanks.
> 
> A tracepoint is also not needed as a bpftrace program with kfunc on
> tcp_try_coalesce should be able to access this information already
> without kernel modifications. Or if it has to be at this line, a
> program with kprobe at offset, but that requires manual register
> reading.

Though using bpf fentry/kprobe could be feasible, I wonder if there are 
better solutions with lower cost.
(And bpf fentry/kprobe may fail because tcp_try_coalesce is static)

I mean, usually we use timestamping to reason where a jitter occurs, so 
it is expected to keep working in background (with the target 
application). In this case, kprobe without offset introduces much 
overhead in such a hot function, while kprobe with offset is of low 
compatibility for production.

The tracepoint can solve the problem, which doesn't bother the receiver 
with TIMESTAMPING disabled, and is stable enough. And I'm looking 
forward to any suggestions.

Thanks.
diff mbox series

Patch

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 49b5ee091cf6..c4219ca2bcf0 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -411,6 +411,67 @@  TRACE_EVENT(tcp_cong_state_set,
 		  __entry->cong_state)
 );
 
+/*
+ * When called, TCP_SKB_CB(from)->has_rxtstamp must be true, but TCP_SKB_CB(to)->has_rxtstamp may
+ * not. So has_rxtstamp is checked before reading timestamps of skb "to".
+ */
+TRACE_EVENT(tcp_rxtstamp_coalesce,
+
+	TP_PROTO(const struct sock *sk, const struct sk_buff *to, const struct sk_buff *from),
+
+	TP_ARGS(sk, to, from),
+
+	TP_STRUCT__entry(
+		__field(__u16, sport)
+		__field(__u16, dport)
+		__field(__u16, family)
+		__array(__u8, saddr, 4)
+		__array(__u8, daddr, 4)
+		__array(__u8, saddr_v6, 16)
+		__array(__u8, daddr_v6, 16)
+		__field(__u64, to_tstamp)
+		__field(__u64, to_hwtstamp)
+		__field(__u64, from_tstamp)
+		__field(__u64, from_hwtstamp)
+	),
+
+	TP_fast_assign(
+		const struct inet_sock *inet = inet_sk(sk);
+		__be32 *p32;
+
+		__entry->sport = ntohs(inet->inet_sport);
+		__entry->dport = ntohs(inet->inet_dport);
+		__entry->family = sk->sk_family;
+
+		p32 = (__be32 *) __entry->saddr;
+		*p32 = inet->inet_saddr;
+
+		p32 = (__be32 *) __entry->daddr;
+		*p32 = inet->inet_daddr;
+
+		TP_STORE_ADDRS(__entry, inet->inet_saddr, inet->inet_daddr,
+			       sk->sk_v6_rcv_saddr, sk->sk_v6_daddr);
+
+		if (TCP_SKB_CB(to)->has_rxtstamp) {
+			__entry->to_tstamp = to->tstamp;
+			__entry->to_hwtstamp = skb_shinfo(to)->hwtstamps.hwtstamp;
+		} else {
+			__entry->to_tstamp = 0;
+			__entry->to_hwtstamp = 0;
+		}
+
+		__entry->from_tstamp = from->tstamp;
+		__entry->from_hwtstamp = skb_shinfo(from)->hwtstamps.hwtstamp;
+	),
+
+	TP_printk("family=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c to_tstamp=%llu to_hwtstamp=%llu from_tstamp=%llu from_hwtstamp=%llu",
+		  show_family_name(__entry->family),
+		  __entry->sport, __entry->dport, __entry->saddr, __entry->daddr,
+		  __entry->saddr_v6, __entry->daddr_v6,
+		  __entry->to_tstamp, __entry->to_hwtstamp,
+		  __entry->from_tstamp, __entry->from_hwtstamp)
+);
+
 #endif /* _TRACE_TCP_H */
 
 /* This part must be outside protection */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index eb187450e4d7..7024c6ba20ae 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4827,6 +4827,8 @@  static bool tcp_try_coalesce(struct sock *sk,
 	TCP_SKB_CB(to)->tcp_flags |= TCP_SKB_CB(from)->tcp_flags;
 
 	if (TCP_SKB_CB(from)->has_rxtstamp) {
+		trace_tcp_rxtstamp_coalesce(sk, to, from);
+
 		TCP_SKB_CB(to)->has_rxtstamp = true;
 		to->tstamp = from->tstamp;
 		skb_hwtstamps(to)->hwtstamp = skb_hwtstamps(from)->hwtstamp;