[net-next] tcp: add tracepoints for data send/recv/acked

Message ID	20231204114322.9218-1-lulie@linux.alibaba.com (mailing list archive)
State	Handled Elsewhere

Philo Lu Dec. 4, 2023, 11:43 a.m. UTC

Add three tracepoints, tcp_data_send, tcp_data_recv, and tcp_data_acked,
which fire every time a TCP data packet is sent, received, or acked:
tcp_data_send: called after a data packet is sent.
tcp_data_recv: called after a data packet is received.
tcp_data_acked: called after a valid ACK packet is processed (some sent
data is acknowledged).

We use these callbacks for fine-grained TCP monitoring, which collects
and analyzes information about every TCP request/response event. The
whole system is described in a SIGMOD'18 paper (see
https://dl.acm.org/doi/pdf/10.1145/3183713.3190659 for details). To
achieve this with BPF, we need hooks for data events that call a BPF
program (1) whenever a data packet is sent, received, or acked, and (2)
after critical TCP state variables (e.g., snd_una, snd_nxt, rcv_nxt)
have been updated. Existing BPF hooks cannot meet these requirements.
In addition, these tracepoints help debug the TCP data send, receive,
and ack paths.

Though kretprobe/fexit can also be used to collect this information,
they will not work if the kernel functions get inlined. For the sake of
stability, we prefer tracepoints as the solution.

Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 include/trace/events/tcp.h | 21 +++++++++++++++++++++
 net/ipv4/tcp_input.c       |  4 ++++
 net/ipv4/tcp_output.c      |  2 ++
 3 files changed, 27 insertions(+)

Eric Dumazet Dec. 4, 2023, 12:28 p.m. UTC | #1

On Mon, Dec 4, 2023 at 12:43 PM Philo Lu <lulie@linux.alibaba.com> wrote:
>
> Add 3 tracepoints, namely tcp_data_send/tcp_data_recv/tcp_data_acked,
> which will be called every time a tcp data packet is sent, received, and
> acked.
> tcp_data_send: called after a data packet is sent.
> tcp_data_recv: called after a data packet is received.
> tcp_data_acked: called after a valid ack packet is processed (some sent
> data are acknowledged).
>
> We use these callbacks for fine-grained tcp monitoring, which collects
> and analyses every tcp request/response event information. The whole
> system has been described in SIGMOD'18 (see
> https://dl.acm.org/doi/pdf/10.1145/3183713.3190659 for details). To
> achieve this with bpf, we require hooks for data events that call bpf
> prog (1) when any data packet is sent/received/acked, and (2) after
> critical tcp state variables have been updated (e.g., snd_una, snd_nxt,
> rcv_nxt). However, existing bpf hooks cannot meet our requirements.
> Besides, these tracepoints help to debug tcp when data send/recv/acked.

This I do not understand.

>
> Though kretprobe/fexit can also be used to collect this information,
> they will not work if the kernel functions get inlined. Considering the
> stability, we prefer tracepoint as the solution.

I dunno, this seems quite weak to me. I see many patches coming to add
tracing in the stack, but no patches fixing any issues.

It really looks like: we do not know how the TCP stack works, we do not
know if there is any issue, let us add tracepoints to help us make
forward progress in our analysis.

These tracepoints will not tell how many segments/bytes were
sent/acked/received. I really do not see how we will avoid adding more
stuff in the future, forcing the compiler to save more state just in
case the tracepoint needs the info.

The argument of "add minimal info", so that we can silently add more
stuff in the future "for free", is not something I buy.

I very much prefer that you make sure the stuff you need is not
inlined, so that the standard kprobe/kretprobe facility can be used.

Xuan Zhuo Dec. 5, 2023, 1:48 a.m. UTC | #2

On Mon, 4 Dec 2023 13:28:21 +0100, Eric Dumazet <edumazet@google.com> wrote:
> On Mon, Dec 4, 2023 at 12:43 PM Philo Lu <lulie@linux.alibaba.com> wrote:
> >
> > Add 3 tracepoints, namely tcp_data_send/tcp_data_recv/tcp_data_acked,
> > which will be called every time a tcp data packet is sent, received, and
> > acked.
> > tcp_data_send: called after a data packet is sent.
> > tcp_data_recv: called after a data packet is received.
> > tcp_data_acked: called after a valid ack packet is processed (some sent
> > data are acknowledged).
> >
> > We use these callbacks for fine-grained tcp monitoring, which collects
> > and analyses every tcp request/response event information. The whole
> > system has been described in SIGMOD'18 (see
> > https://dl.acm.org/doi/pdf/10.1145/3183713.3190659 for details). To
> > achieve this with bpf, we require hooks for data events that call bpf
> > prog (1) when any data packet is sent/received/acked, and (2) after
> > critical tcp state variables have been updated (e.g., snd_una, snd_nxt,
> > rcv_nxt). However, existing bpf hooks cannot meet our requirements.
> > Besides, these tracepoints help to debug tcp when data send/recv/acked.
>
> This I do not understand.
>
> >
> > Though kretprobe/fexit can also be used to collect this information,
> > they will not work if the kernel functions get inlined. Considering the
> > stability, we prefer tracepoint as the solution.
>
> I dunno, this seems quite weak to me. I see many patches coming to add
> tracing in the stack, but no patches fixing any issues.

We have implemented a mechanism to split the request and response from the TCP
connection using these hooks, which can handle various protocols such as
HTTP, HTTPS, Redis, and MySQL. This mechanism allows us to record important
information about each request and response, including the amount of data
uploaded, the time taken by the server to handle the request, and the time taken
for the client to receive the response. This mechanism has been running
internally for many years and has proven to be very useful.

One of the main benefits of this mechanism is that it helps in locating the
source of any issues or problems that may arise. For example, if there is a
problem with the network, the application, or the machine, we can use this
mechanism to identify and isolate the issue.

TCP has long been a challenge when it comes to tracking the transmission of data
on the network. The application can only confirm that it has sent a certain
amount of data to the kernel, but it has limited visibility into whether the
client has actually received this data. Our mechanism addresses this issue by
providing insights into the amount of data received by the client and the time
it was received. Furthermore, we can also detect any packet loss or delays
caused by the server.

https://help-static-aliyun-doc.aliyuncs.com/assets/img/zh-CN/7912288961/9732df025beny.svg

So we are not adding tracepoints for some unspecified debugging.
We have a clear goal; debugging is just an incidental capability.

Thanks.

>
> It really looks like : We do not know how TCP stack works, we do not
> know if there is any issue,
> let us add trace points to help us to make forward progress in our analysis.
>
> These tracepoints will not tell how many segments/bytes were
> sent/acked/received, I really do not see
> how we will avoid adding in the future more stuff, forcing the
> compiler to save more state
> just in case the tracepoint needs the info.
>
> The argument of "add minimal info", so that we can silently add more
> stuff in the future "for free" is not something I buy.
>
> I very much prefer that you make sure the stuff you need is not
> inlined, so that standard kprobe/kretprobe facility can be used.

Eric Dumazet Dec. 5, 2023, 7:39 p.m. UTC | #3

On Tue, Dec 5, 2023 at 3:11 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> On Mon, 4 Dec 2023 13:28:21 +0100, Eric Dumazet <edumazet@google.com> wrote:
> > On Mon, Dec 4, 2023 at 12:43 PM Philo Lu <lulie@linux.alibaba.com> wrote:
> > >
> > > Add 3 tracepoints, namely tcp_data_send/tcp_data_recv/tcp_data_acked,
> > > which will be called every time a tcp data packet is sent, received, and
> > > acked.
> > > tcp_data_send: called after a data packet is sent.
> > > tcp_data_recv: called after a data packet is received.
> > > tcp_data_acked: called after a valid ack packet is processed (some sent
> > > data are acknowledged).
> > >
> > > We use these callbacks for fine-grained tcp monitoring, which collects
> > > and analyses every tcp request/response event information. The whole
> > > system has been described in SIGMOD'18 (see
> > > https://dl.acm.org/doi/pdf/10.1145/3183713.3190659 for details). To
> > > achieve this with bpf, we require hooks for data events that call bpf
> > > prog (1) when any data packet is sent/received/acked, and (2) after
> > > critical tcp state variables have been updated (e.g., snd_una, snd_nxt,
> > > rcv_nxt). However, existing bpf hooks cannot meet our requirements.
> > > Besides, these tracepoints help to debug tcp when data send/recv/acked.
> >
> > This I do not understand.
> >
> > >
> > > Though kretprobe/fexit can also be used to collect this information,
> > > they will not work if the kernel functions get inlined. Considering the
> > > stability, we prefer tracepoint as the solution.
> >
> > I dunno, this seems quite weak to me. I see many patches coming to add
> > tracing in the stack, but no patches fixing any issues.
>
>
> We have implemented a mechanism to split the request and response from the TCP
> connection using these hooks, which can handle various protocols such as
> HTTP, HTTPS, Redis, and MySQL. This mechanism allows us to record important
> information about each request and response, including the amount of data
> uploaded, the time taken by the server to handle the request, and the time taken
> for the client to receive the response. This mechanism has been running
> internally for many years and has proven to be very useful.
>
> One of the main benefits of this mechanism is that it helps in locating the
> source of any issues or problems that may arise. For example, if there is a
> problem with the network, the application, or the machine, we can use this
> mechanism to identify and isolate the issue.
>
> TCP has long been a challenge when it comes to tracking the transmission of data
> on the network. The application can only confirm that it has sent a certain
> amount of data to the kernel, but it has limited visibility into whether the
> client has actually received this data. Our mechanism addresses this issue by
> providing insights into the amount of data received by the client and the time
> it was received. Furthermore, we can also detect any packet loss or delays
> caused by the server.
>
> https://help-static-aliyun-doc.aliyuncs.com/assets/img/zh-CN/7912288961/9732df025beny.svg
>
> So, we do not want to add some tracepoint to do some unknown debug.
> We have a clear goal. debugging is just an incidental capability.
>

We have powerful mechanisms in the stack already that ordinary (no
privilege requested) applications can readily use.

We have been using them for a while.

If existing mechanisms are missing something you need, please expand them.

For reference, start looking at tcp_get_timestamping_opt_stats() history.

Sender side can for instance get precise timestamps.

Combinations of these timestamps reveal different parts of the overall
network latency:

T0: sendmsg() enters TCP
T1: first byte enters qdisc
T2: first byte sent to the NIC
T3: first byte ACKed in TCP
T4: last byte sent to the NIC
T5: last byte ACKed

T1 - T0: how long the first byte was blocked in the TCP layer ("Head
of Line Blocking" latency).
T2 - T1: how long the first byte was blocked in the Linux traffic
shaping layer (known as QDisc).
T3 - T2: the network ‘distance’ (propagation delay + current queuing
delay along the network path and at the receiver).
T5 - T2: how fast the sent chunk was delivered.
Message Size / (T5 - T0): goodput (from application’s perspective)
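
For illustration, the arithmetic in the list above can be sketched in
plain userspace C. This is only a sketch under stated assumptions: the
struct and function names (tx_timestamps, latency_breakdown,
latency_breakdown_compute) are hypothetical, the six timestamps are
assumed to be nanosecond values taken from a single clock, and how they
are obtained (for example via the socket timestamping interface) is out
of scope here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the six timestamps, in nanoseconds. */
struct tx_timestamps {
	uint64_t t0; /* sendmsg() enters TCP */
	uint64_t t1; /* first byte enters qdisc */
	uint64_t t2; /* first byte sent to the NIC */
	uint64_t t3; /* first byte ACKed in TCP */
	uint64_t t4; /* last byte sent to the NIC */
	uint64_t t5; /* last byte ACKed */
};

/* Derived latencies, named after the differences in the list. */
struct latency_breakdown {
	uint64_t hol_blocking_ns;      /* T1 - T0 */
	uint64_t qdisc_ns;             /* T2 - T1 */
	uint64_t network_ns;           /* T3 - T2 */
	uint64_t delivery_ns;          /* T5 - T2 */
	double goodput_bytes_per_sec;  /* msg_bytes / (T5 - T0) */
};

/*
 * Fill *out from *ts. Returns 0 on success, or -1 if the timestamps are
 * not in the expected order (which would make a difference underflow)
 * or if T5 == T0 (which would make the goodput undefined).
 */
int latency_breakdown_compute(const struct tx_timestamps *ts,
			      uint64_t msg_bytes,
			      struct latency_breakdown *out)
{
	assert(ts != NULL && out != NULL);

	if (ts->t1 < ts->t0 || ts->t2 < ts->t1 || ts->t3 < ts->t2 ||
	    ts->t4 < ts->t2 || ts->t5 < ts->t4 || ts->t5 == ts->t0)
		return -1;

	out->hol_blocking_ns = ts->t1 - ts->t0;
	out->qdisc_ns = ts->t2 - ts->t1;
	out->network_ns = ts->t3 - ts->t2;
	out->delivery_ns = ts->t5 - ts->t2;
	/* Convert the nanosecond interval to seconds for bytes/second. */
	out->goodput_bytes_per_sec =
		(double)msg_bytes * 1e9 / (double)(ts->t5 - ts->t0);
	return 0;
}
```

Note that T3 and T4 are deliberately not ordered against each other:
for a multi-segment message the first byte may be ACKed before the last
byte reaches the NIC.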

Steven Rostedt Dec. 5, 2023, 8:50 p.m. UTC | #4

On Tue, 5 Dec 2023 20:39:28 +0100
Eric Dumazet <edumazet@google.com> wrote:

> > So, we do not want to add some tracepoint to do some unknown debug.
> > We have a clear goal. debugging is just an incidental capability.
> >  
> 
> We have powerful mechanisms in the stack already that ordinary (no
> privilege requested) applications can readily use.
>

I'm not arguing for or against this patch set, but tracepoints are
available to other utilities that may have non-privileged access. They are
not just for tracers.

-- Steve

Xuan Zhuo Dec. 7, 2023, 2:28 a.m. UTC | #5

On Tue, 5 Dec 2023 20:39:28 +0100, Eric Dumazet <edumazet@google.com> wrote:
> On Tue, Dec 5, 2023 at 3:11 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> >
> > On Mon, 4 Dec 2023 13:28:21 +0100, Eric Dumazet <edumazet@google.com> wrote:
> > > On Mon, Dec 4, 2023 at 12:43 PM Philo Lu <lulie@linux.alibaba.com> wrote:
> > > >
> > > > Add 3 tracepoints, namely tcp_data_send/tcp_data_recv/tcp_data_acked,
> > > > which will be called every time a tcp data packet is sent, received, and
> > > > acked.
> > > > tcp_data_send: called after a data packet is sent.
> > > > tcp_data_recv: called after a data packet is received.
> > > > tcp_data_acked: called after a valid ack packet is processed (some sent
> > > > data are acknowledged).
> > > >
> > > > We use these callbacks for fine-grained tcp monitoring, which collects
> > > > and analyses every tcp request/response event information. The whole
> > > > system has been described in SIGMOD'18 (see
> > > > https://dl.acm.org/doi/pdf/10.1145/3183713.3190659 for details). To
> > > > achieve this with bpf, we require hooks for data events that call bpf
> > > > prog (1) when any data packet is sent/received/acked, and (2) after
> > > > critical tcp state variables have been updated (e.g., snd_una, snd_nxt,
> > > > rcv_nxt). However, existing bpf hooks cannot meet our requirements.
> > > > Besides, these tracepoints help to debug tcp when data send/recv/acked.
> > >
> > > This I do not understand.
> > >
> > > >
> > > > Though kretprobe/fexit can also be used to collect this information,
> > > > they will not work if the kernel functions get inlined. Considering the
> > > > stability, we prefer tracepoint as the solution.
> > >
> > > I dunno, this seems quite weak to me. I see many patches coming to add
> > > tracing in the stack, but no patches fixing any issues.
> >
> >
> > We have implemented a mechanism to split the request and response from the TCP
> > connection using these hooks, which can handle various protocols such as
> > HTTP, HTTPS, Redis, and MySQL. This mechanism allows us to record important
> > information about each request and response, including the amount of data
> > uploaded, the time taken by the server to handle the request, and the time taken
> > for the client to receive the response. This mechanism has been running
> > internally for many years and has proven to be very useful.
> >
> > One of the main benefits of this mechanism is that it helps in locating the
> > source of any issues or problems that may arise. For example, if there is a
> > problem with the network, the application, or the machine, we can use this
> > mechanism to identify and isolate the issue.
> >
> > TCP has long been a challenge when it comes to tracking the transmission of data
> > on the network. The application can only confirm that it has sent a certain
> > amount of data to the kernel, but it has limited visibility into whether the
> > client has actually received this data. Our mechanism addresses this issue by
> > providing insights into the amount of data received by the client and the time
> > it was received. Furthermore, we can also detect any packet loss or delays
> > caused by the server.
> >
> > https://help-static-aliyun-doc.aliyuncs.com/assets/img/zh-CN/7912288961/9732df025beny.svg
> >
> > So, we do not want to add some tracepoint to do some unknown debug.
> > We have a clear goal. debugging is just an incidental capability.
> >
>
> We have powerful mechanisms in the stack already that ordinary (no
> privilege requested) applications can readily use.
>
> We have been using them for a while.
>
> If existing mechanisms are missing something you need, please expand them.
>
> For reference, start looking at tcp_get_timestamping_opt_stats() history.
>
> Sender side can for instance get precise timestamps.
>
> Combinations of these timestamps reveal different parts of the overall
> network latency,
>
> T0: sendmsg() enters TCP
> T1: first byte enters qdisc
> T2: first byte sent to the NIC
> T3: first byte ACKed in TCP
> T4: last byte sent to the NIC
> T5: last byte ACKed
> T1 - T0: how long the first byte was blocked in the TCP layer ("Head
> of Line Blocking" latency).
> T2 - T1: how long the first byte was blocked in the Linux traffic
> shaping layer (known as QDisc).
> T3 - T2: the network ‘distance’ (propagation delay + current queuing
> delay along the network path and at the receiver).
> T5 - T2: how fast the sent chunk was delivered.
> Message Size / (T5 - T0): goodput (from application’s perspective)


The key point is that with our mechanism, the application does not need to be
modified.

As long as the app's network protocol is request-response, we can trace a TCP
connection at any time to analyze its requests and responses and record their
start and end times. There is also some TTL and other information, of course.
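
To make the idea concrete, here is a minimal userspace C sketch of how
per-connection send/recv/acked events could be folded into
request/response records on the server side. It is not the
implementation discussed in this thread: the names (rr_tracker,
rr_record, rr_on_recv, rr_on_send, rr_on_acked, rr_finish) are
hypothetical, events carry byte counts rather than sequence numbers,
and it assumes a strict request-response protocol without pipelining,
where a new inbound data packet after a response marks the next
request.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum rr_state { RR_IDLE, RR_REQUEST, RR_RESPONSE };

/* One completed request/response exchange (times in nanoseconds). */
struct rr_record {
	uint64_t req_bytes;
	uint64_t resp_bytes;
	uint64_t resp_acked_bytes;
	uint64_t request_ns;  /* first to last request data packet */
	uint64_t server_ns;   /* last request packet to first response packet */
	uint64_t transfer_ns; /* first response packet to last ACK seen */
};

/* Hypothetical per-connection state. */
struct rr_tracker {
	enum rr_state state;
	uint64_t req_start_ns, req_end_ns, req_bytes;
	uint64_t resp_start_ns, resp_acked_ns, resp_bytes, resp_acked_bytes;
};

void rr_init(struct rr_tracker *t)
{
	assert(t != NULL);
	memset(t, 0, sizeof(*t));
	t->state = RR_IDLE;
}

/* Close the current exchange (e.g. on connection close). Returns 1 and
 * fills *out if a response had started, 0 otherwise. */
int rr_finish(struct rr_tracker *t, struct rr_record *out)
{
	int done = (t->state == RR_RESPONSE);

	if (done) {
		out->req_bytes = t->req_bytes;
		out->resp_bytes = t->resp_bytes;
		out->resp_acked_bytes = t->resp_acked_bytes;
		out->request_ns = t->req_end_ns - t->req_start_ns;
		out->server_ns = t->resp_start_ns - t->req_end_ns;
		out->transfer_ns = t->resp_acked_ns - t->resp_start_ns;
	}
	rr_init(t);
	return done;
}

/* Inbound data packet. Returns 1 if it closed a previous exchange. */
int rr_on_recv(struct rr_tracker *t, uint64_t now_ns, uint64_t len,
	       struct rr_record *out)
{
	int completed = 0;

	if (t->state == RR_RESPONSE)
		completed = rr_finish(t, out); /* next request begins */
	if (t->state == RR_IDLE) {
		t->state = RR_REQUEST;
		t->req_start_ns = now_ns;
	}
	t->req_end_ns = now_ns;
	t->req_bytes += len;
	return completed;
}

/* Outbound data packet; data sent with no request pending is ignored. */
void rr_on_send(struct rr_tracker *t, uint64_t now_ns, uint64_t len)
{
	if (t->state == RR_IDLE)
		return;
	if (t->state == RR_REQUEST) {
		t->state = RR_RESPONSE;
		t->resp_start_ns = now_ns;
		t->resp_acked_ns = now_ns; /* no ACK seen yet */
	}
	t->resp_bytes += len;
}

/* Peer acknowledged len more bytes of sent data. */
void rr_on_acked(struct rr_tracker *t, uint64_t now_ns, uint64_t len)
{
	if (t->state != RR_RESPONSE)
		return;
	t->resp_acked_bytes += len;
	t->resp_acked_ns = now_ns;
}
```

In this sketch a record is only emitted when the next request arrives
or rr_finish() is called, so a response written in several chunks with
ACKs in between is still counted as one response.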

Thanks.
