Message ID | 20240611045830.67640-1-lulie@linux.alibaba.com (mailing list archive) |
---|---|
State | Rejected |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net-next] tcp: Add tracepoint for rxtstamp coalescing | expand |
On Tue, 2024-06-11 at 12:58 +0800, Philo Lu wrote: > During tcp coalescence, rx timestamps of the former skb ("to" in > tcp_try_coalesce), will be lost. This may lead to inaccurate > timestamping results if skbs come out of order. > > Here is an example. > Assume a message consists of 3 skbs, namely A, B, and C. And these skbs > are processed by tcp in the following order: > A -(1us)-> C -(1ms)-> B IMHO the above order makes the changelog confusing > If C is coalesced to B, the final rx timestamps of the message will be > those of C. That is, the timestamps show that we received the message > when C came (including hardware and software). However, we actually > received it 1ms later (when B came). > > With the added tracepoint, we can recognize such cases and report them > if we want. We really need very good reasons to add new tracepoints to TCP. I'm unsure if the above example match such requirement. The reported timestamp actually matches the first byte in the aggregate segment, inferring anything more is IMHO stretching too far the API semantic. Cheers, Paolo
On Fri, Jun 14, 2024 at 10:09 AM Paolo Abeni <pabeni@redhat.com> wrote: > > On Tue, 2024-06-11 at 12:58 +0800, Philo Lu wrote: > > During tcp coalescence, rx timestamps of the former skb ("to" in > > tcp_try_coalesce), will be lost. This may lead to inaccurate > > timestamping results if skbs come out of order. > > > > Here is an example. > > Assume a message consists of 3 skbs, namely A, B, and C. And these skbs > > are processed by tcp in the following order: > > A -(1us)-> C -(1ms)-> B > > IMHO the above order makes the changelog confusing > > > If C is coalesced to B, the final rx timestamps of the message will be > > those of C. That is, the timestamps show that we received the message > > when C came (including hardware and software). However, we actually > > received it 1ms later (when B came). > > > > With the added tracepoint, we can recognize such cases and report them > > if we want. > > We really need very good reasons to add new tracepoints to TCP. I'm > unsure if the above example match such requirement. The reported > timestamp actually matches the first byte in the aggregate segment, > inferring anything more is IMHO stretching too far the API semantic. > Note the current behavior was a conscious choice, see commit 98aaa913b4ed2503244 ("tcp: Extend SOF_TIMESTAMPING_RX_SOFTWARE to TCP recvmsg") for the rationale. Perhaps another application would need to add a new timestamp to report both the oldest and newest timestamps. Or add a socket flag to prevent coalescing for applications needing precise timestamps. Willem might know better about this. I agree the tracepoint seems not needed. What about solving the issue instead ?
On 2024/6/14 16:09, Paolo Abeni wrote: > On Tue, 2024-06-11 at 12:58 +0800, Philo Lu wrote: >> During tcp coalescence, rx timestamps of the former skb ("to" in >> tcp_try_coalesce), will be lost. This may lead to inaccurate >> timestamping results if skbs come out of order. >> >> Here is an example. >> Assume a message consists of 3 skbs, namely A, B, and C. And these skbs >> are processed by tcp in the following order: >> A -(1us)-> C -(1ms)-> B > > IMHO the above order makes the changelog confusing > Thank you for pointing out this, I'll try to explain it in another way. Here it means these packets come in order of A-C-B, and B comes much later than C, such as 1ms later. Currently, the 1ms delay would be lost because of tcp coalescing.
On 2024/6/14 16:25, Eric Dumazet wrote: > On Fri, Jun 14, 2024 at 10:09 AM Paolo Abeni <pabeni@redhat.com> wrote: >> >> On Tue, 2024-06-11 at 12:58 +0800, Philo Lu wrote: >>> During tcp coalescence, rx timestamps of the former skb ("to" in >>> tcp_try_coalesce), will be lost. This may lead to inaccurate >>> timestamping results if skbs come out of order. >>> >>> Here is an example. >>> Assume a message consists of 3 skbs, namely A, B, and C. And these skbs >>> are processed by tcp in the following order: >>> A -(1us)-> C -(1ms)-> B >> >> IMHO the above order makes the changelog confusing >> >>> If C is coalesced to B, the final rx timestamps of the message will be >>> those of C. That is, the timestamps show that we received the message >>> when C came (including hardware and software). However, we actually >>> received it 1ms later (when B came). >>> >>> With the added tracepoint, we can recognize such cases and report them >>> if we want. >> >> We really need very good reasons to add new tracepoints to TCP. I'm >> unsure if the above example match such requirement. The reported >> timestamp actually matches the first byte in the aggregate segment, >> inferring anything more is IMHO stretching too far the API semantic. >> > > Note the current behavior was a conscious choice, see > commit 98aaa913b4ed2503244 ("tcp: Extend SOF_TIMESTAMPING_RX_SOFTWARE > to TCP recvmsg") > for the rationale. > IIUC, the behavior of returning the timestamp of the skb with highest sequence number works well without disorder. But once disorder occurs, tcp coalescence can cause this issue. > Perhaps another application would need to add a new timestamp to report > both the oldest and newest timestamps. I prefer this way, we do need both oldest and newest timestamps of a message to find if any packet is unexpected delayed after sending. But given there can be both hardware and software timestamps, we may need more fields in sk_buff to carry these new timestamps. > > Or add a socket flag to prevent coalescing for applications needing > precise timestamps. > > Willem might know better about this. > > I agree the tracepoint seems not needed. What about solving the issue instead ? Thanks.
> >> On Tue, 2024-06-11 at 12:58 +0800, Philo Lu wrote: > >>> During tcp coalescence, rx timestamps of the former skb ("to" in > >>> tcp_try_coalesce), will be lost. This may lead to inaccurate > >>> timestamping results if skbs come out of order. > >>> > >>> Here is an example. > >>> Assume a message consists of 3 skbs, namely A, B, and C. And these skbs > >>> are processed by tcp in the following order: > >>> A -(1us)-> C -(1ms)-> B > >> > >> IMHO the above order makes the changelog confusing > >> > >>> If C is coalesced to B, the final rx timestamps of the message will be > >>> those of C. That is, the timestamps show that we received the message > >>> when C came (including hardware and software). However, we actually > >>> received it 1ms later (when B came). > >>> > >>> With the added tracepoint, we can recognize such cases and report them > >>> if we want. > >> > >> We really need very good reasons to add new tracepoints to TCP. I'm > >> unsure if the above example match such requirement. The reported > >> timestamp actually matches the first byte in the aggregate segment, > >> inferring anything more is IMHO stretching too far the API semantic. > >> > > > > Note the current behavior was a conscious choice, see > > commit 98aaa913b4ed2503244 ("tcp: Extend SOF_TIMESTAMPING_RX_SOFTWARE > > to TCP recvmsg") > > for the rationale. > > > > IIUC, the behavior of returning the timestamp of the skb with highest > sequence number works well without disorder. But once disorder occurs, > tcp coalescence can cause this issue. > > > Perhaps another application would need to add a new timestamp to report > > both the oldest and newest timestamps. > > I prefer this way, we do need both oldest and newest timestamps of a > message to find if any packet is unexpected delayed after sending. > But given there can be both hardware and software timestamps, we may > need more fields in sk_buff to carry these new timestamps. Unfortunately returning multiple timestamps in tcp_recv_timestamp requires a new extended struct scm_timestamping, and likely an extra field to store both after coalescing. FWIW, I maintain a patch that also changes semantics, by returning not the timestamp associated with the last byte in the message (which is the current defined behavior), but the first byte that makes the socket readable. Usually just the first byte, unless SO_RCVLOWAT is set. It is definitely easier to define a flag like SOF_TIMESTAMPING_POLLIN that changes behavior of the one timestamp returned, than to return two timestamps. > > > > Or add a socket flag to prevent coalescing for applications needing > > precise timestamps. > > > > Willem might know better about this. > > > > I agree the tracepoint seems not needed. What about solving the issue instead ? > Thanks. A tracepoint is also not needed as a bpftrace program with kfunc on tcp_try_coalesce should be able to access this information already without kernel modifications. Or if it has to be at this line, a program with kprobe at offset, but that requires manual register reading.
On 2024/6/14 20:02, Willem de Bruijn wrote: >>>> On Tue, 2024-06-11 at 12:58 +0800, Philo Lu wrote: >>>>> During tcp coalescence, rx timestamps of the former skb ("to" in >>>>> tcp_try_coalesce), will be lost. This may lead to inaccurate >>>>> timestamping results if skbs come out of order. >>>>> >>>>> Here is an example. >>>>> Assume a message consists of 3 skbs, namely A, B, and C. And these skbs >>>>> are processed by tcp in the following order: >>>>> A -(1us)-> C -(1ms)-> B >>>> >>>> IMHO the above order makes the changelog confusing >>>> >>>>> If C is coalesced to B, the final rx timestamps of the message will be >>>>> those of C. That is, the timestamps show that we received the message >>>>> when C came (including hardware and software). However, we actually >>>>> received it 1ms later (when B came). >>>>> >>>>> With the added tracepoint, we can recognize such cases and report them >>>>> if we want. >>>> >>>> We really need very good reasons to add new tracepoints to TCP. I'm >>>> unsure if the above example match such requirement. The reported >>>> timestamp actually matches the first byte in the aggregate segment, >>>> inferring anything more is IMHO stretching too far the API semantic. >>>> >>> >>> Note the current behavior was a conscious choice, see >>> commit 98aaa913b4ed2503244 ("tcp: Extend SOF_TIMESTAMPING_RX_SOFTWARE >>> to TCP recvmsg") >>> for the rationale. >>> >> >> IIUC, the behavior of returning the timestamp of the skb with highest >> sequence number works well without disorder. But once disorder occurs, >> tcp coalescence can cause this issue. >> >>> Perhaps another application would need to add a new timestamp to report >>> both the oldest and newest timestamps. >> >> I prefer this way, we do need both oldest and newest timestamps of a >> message to find if any packet is unexpected delayed after sending. >> But given there can be both hardware and software timestamps, we may >> need more fields in sk_buff to carry these new timestamps. > > Unfortunately returning multiple timestamps in tcp_recv_timestamp > requires a new extended struct scm_timestamping, and likely an extra > field to store both after coalescing. > > FWIW, I maintain a patch that also changes semantics, by returning not > the timestamp associated with the last byte in the message (which is > the current defined behavior), but the first byte that makes the > socket readable. Usually just the first byte, unless SO_RCVLOWAT is > set. > > It is definitely easier to define a flag like SOF_TIMESTAMPING_POLLIN > that changes behavior of the one timestamp returned, than to return > two timestamps. I believe this is a step forward because now we can choose to get the oldest or newest timestamps. However, even with this option, it seems still unclear whether and how long an skb gets stuck in the out-of-order queue. >>> >>> Or add a socket flag to prevent coalescing for applications needing >>> precise timestamps. >>> >>> Willem might know better about this. >>> >>> I agree the tracepoint seems not needed. What about solving the issue instead ? >> Thanks. > > A tracepoint is also not needed as a bpftrace program with kfunc on > tcp_try_coalesce should be able to access this information already > without kernel modifications. Or if it has to be at this line, a > program with kprobe at offset, but that requires manual register > reading. Though using bpf fentry/kprobe could be feasible, I wonder if there are better solutions with lower cost. (And bpf fentry/kprobe may fail because tcp_try_coalesce is static) I mean, usually we use timestamping to reason where a jitter occurs, so it is expected to keep working in background (with the target application). In this case, kprobe without offset introduces much overhead in such a hot function, while kprobe with offset is of low compatibility for production. The tracepoint can solve the problem, which doesn't bother the receiver with TIMESTAMPING disabled, and is stable enough. And I'm looking forward to any suggestions. Thanks.
diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h index 49b5ee091cf6..c4219ca2bcf0 100644 --- a/include/trace/events/tcp.h +++ b/include/trace/events/tcp.h @@ -411,6 +411,67 @@ TRACE_EVENT(tcp_cong_state_set, __entry->cong_state) ); +/* + * When called, TCP_SKB_CB(from)->has_rxtstamp must be true, but TCP_SKB_CB(to)->has_rxtstamp may + * not. So has_rxtstamp is checked before reading timestamps of skb "to". + */ +TRACE_EVENT(tcp_rxtstamp_coalesce, + + TP_PROTO(const struct sock *sk, const struct sk_buff *to, const struct sk_buff *from), + + TP_ARGS(sk, to, from), + + TP_STRUCT__entry( + __field(__u16, sport) + __field(__u16, dport) + __field(__u16, family) + __array(__u8, saddr, 4) + __array(__u8, daddr, 4) + __array(__u8, saddr_v6, 16) + __array(__u8, daddr_v6, 16) + __field(__u64, to_tstamp) + __field(__u64, to_hwtstamp) + __field(__u64, from_tstamp) + __field(__u64, from_hwtstamp) + ), + + TP_fast_assign( + const struct inet_sock *inet = inet_sk(sk); + __be32 *p32; + + __entry->sport = ntohs(inet->inet_sport); + __entry->dport = ntohs(inet->inet_dport); + __entry->family = sk->sk_family; + + p32 = (__be32 *) __entry->saddr; + *p32 = inet->inet_saddr; + + p32 = (__be32 *) __entry->daddr; + *p32 = inet->inet_daddr; + + TP_STORE_ADDRS(__entry, inet->inet_saddr, inet->inet_daddr, + sk->sk_v6_rcv_saddr, sk->sk_v6_daddr); + + if (TCP_SKB_CB(to)->has_rxtstamp) { + __entry->to_tstamp = to->tstamp; + __entry->to_hwtstamp = skb_shinfo(to)->hwtstamps.hwtstamp; + } else { + __entry->to_tstamp = 0; + __entry->to_hwtstamp = 0; + } + + __entry->from_tstamp = from->tstamp; + __entry->from_hwtstamp = skb_shinfo(from)->hwtstamps.hwtstamp; + ), + + TP_printk("family=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c to_tstamp=%llu to_hwtstamp=%llu from_tstamp=%llu from_hwtstamp=%llu", + show_family_name(__entry->family), + __entry->sport, __entry->dport, __entry->saddr, __entry->daddr, + __entry->saddr_v6, __entry->daddr_v6, + __entry->to_tstamp, __entry->to_hwtstamp, + __entry->from_tstamp, __entry->from_hwtstamp) +); + #endif /* _TRACE_TCP_H */ /* This part must be outside protection */ diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index eb187450e4d7..7024c6ba20ae 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -4827,6 +4827,8 @@ static bool tcp_try_coalesce(struct sock *sk, TCP_SKB_CB(to)->tcp_flags |= TCP_SKB_CB(from)->tcp_flags; if (TCP_SKB_CB(from)->has_rxtstamp) { + trace_tcp_rxtstamp_coalesce(sk, to, from); + TCP_SKB_CB(to)->has_rxtstamp = true; to->tstamp = from->tstamp; skb_hwtstamps(to)->hwtstamp = skb_hwtstamps(from)->hwtstamp;
During tcp coalescence, rx timestamps of the former skb ("to" in tcp_try_coalesce), will be lost. This may lead to inaccurate timestamping results if skbs come out of order. Here is an example. Assume a message consists of 3 skbs, namely A, B, and C. And these skbs are processed by tcp in the following order: A -(1us)-> C -(1ms)-> B If C is coalesced to B, the final rx timestamps of the message will be those of C. That is, the timestamps show that we received the message when C came (including hardware and software). However, we actually received it 1ms later (when B came). With the added tracepoint, we can recognize such cases and report them if we want. Signed-off-by: Philo Lu <lulie@linux.alibaba.com> --- include/trace/events/tcp.h | 61 ++++++++++++++++++++++++++++++++++++++ net/ipv4/tcp_input.c | 2 ++ 2 files changed, 63 insertions(+)