[net-next] tcp: Add tracepoint for rxtstamp coalescing

Message ID	20240611045830.67640-1-lulie@linux.alibaba.com (mailing list archive)
State	Rejected
Delegated to:	Netdev Maintainers
Headers	show Received: from out30-98.freemail.mail.aliyun.com (out30-98.freemail.mail.aliyun.com [115.124.30.98]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 14BDE14290 for <netdev@vger.kernel.org>; Tue, 11 Jun 2024 04:58:39 +0000 (UTC) From: Philo Lu <lulie@linux.alibaba.com> To: netdev@vger.kernel.org Cc: edumazet@google.com, rostedt@goodmis.org, mhiramat@kernel.org, mathieu.desnoyers@efficios.com, davem@davemloft.net, dsahern@kernel.org, kuba@kernel.org, pabeni@redhat.com, xuanzhuo@linux.alibaba.com, dust.li@linux.alibaba.com Subject: [PATCH net-next] tcp: Add tracepoint for rxtstamp coalescing Date: Tue, 11 Jun 2024 12:58:30 +0800 Message-Id: <20240611045830.67640-1-lulie@linux.alibaba.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	[net-next] tcp: Add tracepoint for rxtstamp coalescing \| expand [net-next] tcp: Add tracepoint for rxtstamp coalescing

Context	Check	Description
netdev/series_format	success	Single patches do not need cover letters
netdev/tree_selection	success	Clearly marked for net-next
netdev/ynl	success	Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present	success	Fixes tag not required for -next series
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 874 this patch: 874
netdev/build_tools	success	Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers	warning	1 maintainers not CCed: linux-trace-kernel@vger.kernel.org
netdev/build_clang	success	Errors and warnings before: 856 this patch: 856
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/deprecated_api	success	None detected
netdev/check_selftest	success	No net selftest shell script
netdev/verify_fixes	success	No Fixes tag
netdev/build_allmodconfig_warn	success	Errors and warnings before: 878 this patch: 878
netdev/checkpatch	warning	CHECK: Alignment should match open parenthesis CHECK: Lines should not end with a '(' CHECK: No space is necessary after a cast WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 94 exceeds 80 columns WARNING: line length of 97 exceeds 80 columns
netdev/build_clang_rust	success	No Rust files in patch. Skipping build
netdev/kdoc	success	Errors and warnings before: 1 this patch: 1
netdev/source_inline	success	Was 0 now: 0

Philo Lu June 11, 2024, 4:58 a.m. UTC

During tcp coalescence, rx timestamps of the former skb ("to" in
tcp_try_coalesce), will be lost. This may lead to inaccurate
timestamping results if skbs come out of order.

Here is an example.
Assume a message consists of 3 skbs, namely A, B, and C. And these skbs
are processed by tcp in the following order:
A -(1us)-> C -(1ms)-> B
If C is coalesced to B, the final rx timestamps of the message will be
those of C. That is, the timestamps show that we received the message
when C came (including hardware and software). However, we actually
received it 1ms later (when B came).

With the added tracepoint, we can recognize such cases and report them
if we want.

Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
---
 include/trace/events/tcp.h | 61 ++++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp_input.c       |  2 ++
 2 files changed, 63 insertions(+)

Paolo Abeni June 14, 2024, 8:09 a.m. UTC | #1

On Tue, 2024-06-11 at 12:58 +0800, Philo Lu wrote:
> During tcp coalescence, rx timestamps of the former skb ("to" in
> tcp_try_coalesce), will be lost. This may lead to inaccurate
> timestamping results if skbs come out of order.
> 
> Here is an example.
> Assume a message consists of 3 skbs, namely A, B, and C. And these skbs
> are processed by tcp in the following order:
> A -(1us)-> C -(1ms)-> B

IMHO the above order makes the changelog confusing

> If C is coalesced to B, the final rx timestamps of the message will be
> those of C. That is, the timestamps show that we received the message
> when C came (including hardware and software). However, we actually
> received it 1ms later (when B came).
> 
> With the added tracepoint, we can recognize such cases and report them
> if we want.

We really need very good reasons to add new tracepoints to TCP. I'm
unsure if the above example match such requirement. The reported
timestamp actually matches the first byte in the aggregate segment,
inferring anything more is IMHO stretching too far the API semantic.

Cheers,

Paolo

Eric Dumazet June 14, 2024, 8:25 a.m. UTC | #2

On Fri, Jun 14, 2024 at 10:09 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Tue, 2024-06-11 at 12:58 +0800, Philo Lu wrote:
> > During tcp coalescence, rx timestamps of the former skb ("to" in
> > tcp_try_coalesce), will be lost. This may lead to inaccurate
> > timestamping results if skbs come out of order.
> >
> > Here is an example.
> > Assume a message consists of 3 skbs, namely A, B, and C. And these skbs
> > are processed by tcp in the following order:
> > A -(1us)-> C -(1ms)-> B
>
> IMHO the above order makes the changelog confusing
>
> > If C is coalesced to B, the final rx timestamps of the message will be
> > those of C. That is, the timestamps show that we received the message
> > when C came (including hardware and software). However, we actually
> > received it 1ms later (when B came).
> >
> > With the added tracepoint, we can recognize such cases and report them
> > if we want.
>
> We really need very good reasons to add new tracepoints to TCP. I'm
> unsure if the above example match such requirement. The reported
> timestamp actually matches the first byte in the aggregate segment,
> inferring anything more is IMHO stretching too far the API semantic.
>

Note the current behavior was a conscious choice, see
commit 98aaa913b4ed2503244 ("tcp: Extend SOF_TIMESTAMPING_RX_SOFTWARE
to TCP recvmsg")
for the rationale.

Perhaps another application would need to add a new timestamp to report
both the oldest and newest timestamps.

Or add a socket flag to prevent coalescing for applications needing
precise timestamps.

Willem might know better about this.

I agree the tracepoint seems not needed. What about solving the issue instead ?

Philo Lu June 14, 2024, 11:27 a.m. UTC | #3

On 2024/6/14 16:09, Paolo Abeni wrote:
> On Tue, 2024-06-11 at 12:58 +0800, Philo Lu wrote:
>> During tcp coalescence, rx timestamps of the former skb ("to" in
>> tcp_try_coalesce), will be lost. This may lead to inaccurate
>> timestamping results if skbs come out of order.
>>
>> Here is an example.
>> Assume a message consists of 3 skbs, namely A, B, and C. And these skbs
>> are processed by tcp in the following order:
>> A -(1us)-> C -(1ms)-> B
> 
> IMHO the above order makes the changelog confusing
>
Thank you for pointing out this, I'll try to explain it in another way.

Here it means these packets come in order of A-C-B, and B comes much 
later than C, such as 1ms later. Currently, the 1ms delay would be lost 
because of tcp coalescing.

Philo Lu June 14, 2024, 11:27 a.m. UTC | #4

On 2024/6/14 16:25, Eric Dumazet wrote:
> On Fri, Jun 14, 2024 at 10:09 AM Paolo Abeni <pabeni@redhat.com> wrote:
>>
>> On Tue, 2024-06-11 at 12:58 +0800, Philo Lu wrote:
>>> During tcp coalescence, rx timestamps of the former skb ("to" in
>>> tcp_try_coalesce), will be lost. This may lead to inaccurate
>>> timestamping results if skbs come out of order.
>>>
>>> Here is an example.
>>> Assume a message consists of 3 skbs, namely A, B, and C. And these skbs
>>> are processed by tcp in the following order:
>>> A -(1us)-> C -(1ms)-> B
>>
>> IMHO the above order makes the changelog confusing
>>
>>> If C is coalesced to B, the final rx timestamps of the message will be
>>> those of C. That is, the timestamps show that we received the message
>>> when C came (including hardware and software). However, we actually
>>> received it 1ms later (when B came).
>>>
>>> With the added tracepoint, we can recognize such cases and report them
>>> if we want.
>>
>> We really need very good reasons to add new tracepoints to TCP. I'm
>> unsure if the above example match such requirement. The reported
>> timestamp actually matches the first byte in the aggregate segment,
>> inferring anything more is IMHO stretching too far the API semantic.
>>
> 
> Note the current behavior was a conscious choice, see
> commit 98aaa913b4ed2503244 ("tcp: Extend SOF_TIMESTAMPING_RX_SOFTWARE
> to TCP recvmsg")
> for the rationale.
> 

IIUC, the behavior of returning the timestamp of the skb with highest 
sequence number works well without disorder. But once disorder occurs, 
tcp coalescence can cause this issue.

> Perhaps another application would need to add a new timestamp to report
> both the oldest and newest timestamps.

I prefer this way, we do need both oldest and newest timestamps of a 
message to find if any packet is unexpected delayed after sending.
But given there can be both hardware and software timestamps, we may 
need more fields in sk_buff to carry these new timestamps.

> 
> Or add a socket flag to prevent coalescing for applications needing
> precise timestamps.
> 
> Willem might know better about this.
> 
> I agree the tracepoint seems not needed. What about solving the issue instead ?
Thanks.

Willem de Bruijn June 14, 2024, 12:02 p.m. UTC | #5

> >> On Tue, 2024-06-11 at 12:58 +0800, Philo Lu wrote:
> >>> During tcp coalescence, rx timestamps of the former skb ("to" in
> >>> tcp_try_coalesce), will be lost. This may lead to inaccurate
> >>> timestamping results if skbs come out of order.
> >>>
> >>> Here is an example.
> >>> Assume a message consists of 3 skbs, namely A, B, and C. And these skbs
> >>> are processed by tcp in the following order:
> >>> A -(1us)-> C -(1ms)-> B
> >>
> >> IMHO the above order makes the changelog confusing
> >>
> >>> If C is coalesced to B, the final rx timestamps of the message will be
> >>> those of C. That is, the timestamps show that we received the message
> >>> when C came (including hardware and software). However, we actually
> >>> received it 1ms later (when B came).
> >>>
> >>> With the added tracepoint, we can recognize such cases and report them
> >>> if we want.
> >>
> >> We really need very good reasons to add new tracepoints to TCP. I'm
> >> unsure if the above example match such requirement. The reported
> >> timestamp actually matches the first byte in the aggregate segment,
> >> inferring anything more is IMHO stretching too far the API semantic.
> >>
> >
> > Note the current behavior was a conscious choice, see
> > commit 98aaa913b4ed2503244 ("tcp: Extend SOF_TIMESTAMPING_RX_SOFTWARE
> > to TCP recvmsg")
> > for the rationale.
> >
>
> IIUC, the behavior of returning the timestamp of the skb with highest
> sequence number works well without disorder. But once disorder occurs,
> tcp coalescence can cause this issue.
>
> > Perhaps another application would need to add a new timestamp to report
> > both the oldest and newest timestamps.
>
> I prefer this way, we do need both oldest and newest timestamps of a
> message to find if any packet is unexpected delayed after sending.
> But given there can be both hardware and software timestamps, we may
> need more fields in sk_buff to carry these new timestamps.

Unfortunately returning multiple timestamps in tcp_recv_timestamp
requires a new extended struct scm_timestamping, and likely an extra
field to store both after coalescing.

FWIW, I maintain a patch that also changes semantics, by returning not
the timestamp associated with the last byte in the message (which is
the current defined behavior), but the first byte that makes the
socket readable. Usually just the first byte, unless SO_RCVLOWAT is
set.

It is definitely easier to define a flag like SOF_TIMESTAMPING_POLLIN
that changes behavior of the one timestamp returned, than to return
two timestamps.


> >
> > Or add a socket flag to prevent coalescing for applications needing
> > precise timestamps.
> >
> > Willem might know better about this.
> >
> > I agree the tracepoint seems not needed. What about solving the issue instead ?
> Thanks.

A tracepoint is also not needed as a bpftrace program with kfunc on
tcp_try_coalesce should be able to access this information already
without kernel modifications. Or if it has to be at this line, a
program with kprobe at offset, but that requires manual register
reading.

Philo Lu June 18, 2024, 12:11 p.m. UTC | #6

On 2024/6/14 20:02, Willem de Bruijn wrote:
>>>> On Tue, 2024-06-11 at 12:58 +0800, Philo Lu wrote:
>>>>> During tcp coalescence, rx timestamps of the former skb ("to" in
>>>>> tcp_try_coalesce), will be lost. This may lead to inaccurate
>>>>> timestamping results if skbs come out of order.
>>>>>
>>>>> Here is an example.
>>>>> Assume a message consists of 3 skbs, namely A, B, and C. And these skbs
>>>>> are processed by tcp in the following order:
>>>>> A -(1us)-> C -(1ms)-> B
>>>>
>>>> IMHO the above order makes the changelog confusing
>>>>
>>>>> If C is coalesced to B, the final rx timestamps of the message will be
>>>>> those of C. That is, the timestamps show that we received the message
>>>>> when C came (including hardware and software). However, we actually
>>>>> received it 1ms later (when B came).
>>>>>
>>>>> With the added tracepoint, we can recognize such cases and report them
>>>>> if we want.
>>>>
>>>> We really need very good reasons to add new tracepoints to TCP. I'm
>>>> unsure if the above example match such requirement. The reported
>>>> timestamp actually matches the first byte in the aggregate segment,
>>>> inferring anything more is IMHO stretching too far the API semantic.
>>>>
>>>
>>> Note the current behavior was a conscious choice, see
>>> commit 98aaa913b4ed2503244 ("tcp: Extend SOF_TIMESTAMPING_RX_SOFTWARE
>>> to TCP recvmsg")
>>> for the rationale.
>>>
>>
>> IIUC, the behavior of returning the timestamp of the skb with highest
>> sequence number works well without disorder. But once disorder occurs,
>> tcp coalescence can cause this issue.
>>
>>> Perhaps another application would need to add a new timestamp to report
>>> both the oldest and newest timestamps.
>>
>> I prefer this way, we do need both oldest and newest timestamps of a
>> message to find if any packet is unexpected delayed after sending.
>> But given there can be both hardware and software timestamps, we may
>> need more fields in sk_buff to carry these new timestamps.
> 
> Unfortunately returning multiple timestamps in tcp_recv_timestamp
> requires a new extended struct scm_timestamping, and likely an extra
> field to store both after coalescing.
> 
> FWIW, I maintain a patch that also changes semantics, by returning not
> the timestamp associated with the last byte in the message (which is
> the current defined behavior), but the first byte that makes the
> socket readable. Usually just the first byte, unless SO_RCVLOWAT is
> set.
> 
> It is definitely easier to define a flag like SOF_TIMESTAMPING_POLLIN
> that changes behavior of the one timestamp returned, than to return
> two timestamps.

I believe this is a step forward because now we can choose to get the 
oldest or newest timestamps. However, even with this option, it seems 
still unclear whether and how long an skb gets stuck in the out-of-order 
queue.

>>>
>>> Or add a socket flag to prevent coalescing for applications needing
>>> precise timestamps.
>>>
>>> Willem might know better about this.
>>>
>>> I agree the tracepoint seems not needed. What about solving the issue instead ?
>> Thanks.
> 
> A tracepoint is also not needed as a bpftrace program with kfunc on
> tcp_try_coalesce should be able to access this information already
> without kernel modifications. Or if it has to be at this line, a
> program with kprobe at offset, but that requires manual register
> reading.

Though using bpf fentry/kprobe could be feasible, I wonder if there are 
better solutions with lower cost.
(And bpf fentry/kprobe may fail because tcp_try_coalesce is static)

I mean, usually we use timestamping to reason where a jitter occurs, so 
it is expected to keep working in background (with the target 
application). In this case, kprobe without offset introduces much 
overhead in such a hot function, while kprobe with offset is of low 
compatibility for production.

The tracepoint can solve the problem, which doesn't bother the receiver 
with TIMESTAMPING disabled, and is stable enough. And I'm looking 
forward to any suggestions.

Thanks.

[net-next] tcp: Add tracepoint for rxtstamp coalescing

Checks

Commit Message

Comments

Patch