diff mbox series

[net-next] trace: tcp: Add tracepoint for tcp_sendmsg()

Message ID 20250224-tcpsendmsg-v1-1-bac043c59cc8@debian.org (mailing list archive)
State Handled Elsewhere
Series: [net-next] trace: tcp: Add tracepoint for tcp_sendmsg()

Commit Message

Breno Leitao Feb. 24, 2025, 6:24 p.m. UTC
Add a lightweight tracepoint to monitor TCP sendmsg operations, enabling
the tracing of TCP messages being sent.

Meta has been using BPF programs to monitor this function for years,
indicating significant interest in observing this important
functionality. Adding a proper tracepoint provides a stable API for all
users who need visibility into TCP message transmission.

The implementation uses DECLARE_TRACE instead of TRACE_EVENT to avoid
creating unnecessary trace event infrastructure and tracefs exports,
keeping the implementation minimal while stabilizing the API.

Given that this patch creates a raw tracepoint, you can hook into it
using standard tooling such as bpftrace, via the regular raw tracepoint
infrastructure, for example:

	rawtracepoint:tcp_sendmsg_tp {
		....
	}
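As a slightly fuller sketch (hedged: the argument order is assumed from
the patch's TP_PROTO(sk, msg, size), since raw tracepoint arguments
reach bpftrace as untyped arg0..argN and need explicit casts):

	rawtracepoint:tcp_sendmsg_tp {
		$sk = (struct sock *)arg0;
		printf("tcp_sendmsg: sk=%p size=%d\n", $sk, arg2);
	}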

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 include/trace/events/tcp.h | 5 +++++
 net/ipv4/tcp.c             | 2 ++
 2 files changed, 7 insertions(+)


---
base-commit: e13b6da7045f997e1a5a5efd61d40e63c4fc20e8
change-id: 20250224-tcpsendmsg-4f0a236751d7

Best regards,

Comments

Eric Dumazet Feb. 24, 2025, 7:03 p.m. UTC | #1
On Mon, Feb 24, 2025 at 7:24 PM Breno Leitao <leitao@debian.org> wrote:
>
> Add a lightweight tracepoint to monitor TCP sendmsg operations, enabling
> the tracing of TCP messages being sent.
> [...]
>
>         rawtracepoint:tcp_sendmsg_tp {
>                 ....
>         }

I would expect tcp_sendmsg() being stable enough ?

kprobe:tcp_sendmsg {
}
Yonghong Song Feb. 24, 2025, 7:12 p.m. UTC | #2
> On Mon, Feb 24, 2025 at 7:24 PM Breno Leitao <leitao@debian.org> wrote:
>> [...]
>
> I would expect tcp_sendmsg() being stable enough ?
>
> kprobe:tcp_sendmsg {
> }

In LTO mode, tcp_sendmsg could be inlined cross files. For example,

  net/ipv4/tcp.c: 
       int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
  net/ipv4/tcp_bpf.c:
       ...
      return tcp_sendmsg(sk, msg, size);
  net/ipv6/af_inet6.c:
       ...
       return INDIRECT_CALL_2(prot->sendmsg, tcp_sendmsg, udpv6_sendmsg, ...)

And this does happen in our production environment.
David Ahern Feb. 24, 2025, 7:16 p.m. UTC | #3
On 2/24/25 12:03 PM, Eric Dumazet wrote:
> On Mon, Feb 24, 2025 at 7:24 PM Breno Leitao <leitao@debian.org> wrote:
>> [...]
> 
> I would expect tcp_sendmsg() being stable enough ?
> 
> kprobe:tcp_sendmsg {
> }

Also, if a tracepoint is added, inside of tcp_sendmsg_locked would cover
more use cases (see kernel references to it).

We have a patch for a couple years now with a tracepoint inside the

while (msg_data_left(msg)) {
}

loop which is more useful than just entry to sendmsg.
Eric Dumazet Feb. 24, 2025, 7:23 p.m. UTC | #4
On Mon, Feb 24, 2025 at 8:13 PM Yonghong Song <yhs@meta.com> wrote:
>
> > [...]
> >
> > I would expect tcp_sendmsg() being stable enough ?
> >
> > kprobe:tcp_sendmsg {
> > }
>
> In LTO mode, tcp_sendmsg could be inlined cross files. For example,
>
>   net/ipv4/tcp.c:
>        int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
>   net/ipv4/tcp_bpf.c:
>        ...
>       return tcp_sendmsg(sk, msg, size);
>   net/ipv6/af_inet6.c:
>        ...
>        return INDIRECT_CALL_2(prot->sendmsg, tcp_sendmsg, udpv6_sendmsg, ...)
>
> And this does happen in our production environment.

And we do not have a way to make the kprobe work even if LTO decided
to inline a function ?

This seems like a tracing or LTO issue, this could be addressed there
in a generic way
and avoid many other patches to work around this.
Breno Leitao Feb. 25, 2025, 10:58 a.m. UTC | #5
Hello Eric,

On Mon, Feb 24, 2025 at 08:23:00PM +0100, Eric Dumazet wrote:
> On Mon, Feb 24, 2025 at 8:13 PM Yonghong Song <yhs@meta.com> wrote:
> > [...]
> > And this does happen in our production environment.
> 
> And we do not have a way to make the kprobe work even if LTO decided
> to inline a function ?
> 
> This seems like a tracing or LTO issue, this could be addressed there
> in a generic way
> and avoid many other patches to work around this.

I understand that the {raw}tracepoint ensures the compiler cannot
interfere with these hook points. For everything else, we rely on the
hope that the compiler behaves favorably, which is far from ideal.
Breno Leitao Feb. 26, 2025, 4:10 p.m. UTC | #6
Hello David,

On Mon, Feb 24, 2025 at 12:16:04PM -0700, David Ahern wrote:
> On 2/24/25 12:03 PM, Eric Dumazet wrote:
> > [...]
> 
> Also, if a tracepoint is added, inside of tcp_sendmsg_locked would cover
> more use cases (see kernel references to it).

Agree, this seems to provide more useful information

> We have a patch for a couple years now with a tracepoint inside the

Sorry, where do you have this patch? Is it downstream?

> while (msg_data_left(msg)) {
> }
> 
> loop which is more useful than just entry to sendmsg.

Do you mean something like the following?

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 1a40c41ff8c30..23318e252d6b9 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -259,6 +259,11 @@ TRACE_EVENT(tcp_retransmit_synack,
 		  __entry->saddr_v6, __entry->daddr_v6)
 );
 
+DECLARE_TRACE(tcp_sendmsg_tp,
+	TP_PROTO(const struct sock *sk, const struct msghdr *msg, size_t size, ssize_t copied),
+	TP_ARGS(sk, msg, size, copied)
+);
+
 DECLARE_TRACE(tcp_cwnd_reduction_tp,
 	TP_PROTO(const struct sock *sk, int newly_acked_sacked,
 		 int newly_lost, int flag),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 08d73f17e8162..5fcef82275d4a 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1290,6 +1290,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			sk_mem_charge(sk, copy);
 		}
 
+		trace_tcp_sendmsg_tp(sk, msg, size, copy);
+
 		if (!copied)
 			TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH;
David Ahern Feb. 26, 2025, 5:12 p.m. UTC | #7
On 2/26/25 9:10 AM, Breno Leitao wrote:
>> Also, if a tracepoint is added, inside of tcp_sendmsg_locked would cover
>> more use cases (see kernel references to it).
> 
> Agree, this seems to provide more useful information
> 
>> We have a patch for a couple years now with a tracepoint inside the
> 
> Sorry, where do you have this patch? is it downstream?

company tree. Attached. Where to put tracepoints and what arguments to
supply so that it is beneficial to multiple users is always a touchy
subject :-), so I have not tried to push the patch out. sock arg should
be added to it for example.

The key is to see how tcp_sendmsg_locked breaks up the buffers, and then
another one in tcp_write_xmit to see when the actual push out happens.
At the time I was looking at latency in the stack - from sendmsg call to
driver pushing descriptors to hardware.
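As a hedged sketch against the attached patch (event and field names
assumed from its TRACE_EVENT definitions; keying by thread only catches
the synchronous push from sendmsg context, not deferred, ACK-clocked
transmits), that latency could be eyeballed with bpftrace:

	tracepoint:tcp:tcp_sendmsg_locked { @ts[tid] = nsecs; }
	tracepoint:tcp:tcp_write_xmit {
		if (@ts[tid]) {
			@sendmsg_to_xmit_us = hist((nsecs - @ts[tid]) / 1000);
			delete(@ts[tid]);
		}
	}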
From 2298ca66c15bae6a95698abd8d029b9271fbefa3 Mon Sep 17 00:00:00 2001
From: David Ahern <dsahern@gmail.com>
Date: Fri, 24 Mar 2023 22:07:53 +0000
Subject: [PATCH] tcp: Add tracepoints to tcp_write_xmit and tcp_sendmsg_locked

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/trace/events/tcp.h | 57 ++++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp.c             |  4 +++
 net/ipv4/tcp_output.c      |  1 +
 3 files changed, 62 insertions(+)

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 901b440238d5..6b19e1d4d79d 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -187,6 +187,63 @@ DEFINE_EVENT(tcp_event_sk, tcp_rcv_space_adjust,
 	TP_ARGS(sk)
 );
 
+TRACE_EVENT(tcp_sendmsg_locked,
+
+	TP_PROTO(struct msghdr *msg, struct sk_buff *skb, int size_goal),
+
+	TP_ARGS(msg, skb, size_goal),
+
+	TP_STRUCT__entry(
+		__field(__u64, skb)
+		__field(int, skb_len)
+		__field(int, msg_left)
+		__field(int, size_goal)
+	),
+
+	TP_fast_assign(
+		__entry->skb = (__u64)skb;
+		__entry->skb_len = skb ? skb->len : 0;
+		__entry->msg_left = msg_data_left(msg);
+		__entry->size_goal = size_goal;
+	),
+
+	TP_printk("skb %llx len %d msg_left %d size_goal %d",
+		  __entry->skb, __entry->skb_len,
+		  __entry->msg_left, __entry->size_goal)
+);
+
+TRACE_EVENT(tcp_write_xmit,
+
+	TP_PROTO(struct sock *sk, unsigned int mss_now, int nonagle, u32 max_segs),
+
	TP_ARGS(sk, mss_now, nonagle, max_segs),
+
+	TP_STRUCT__entry(
+		__field(__u64, tcp_wstamp_ns)
+		__field(__u64, tcp_clock_cache)
+		__field(unsigned int, mss_now)
+		__field(unsigned int, max_segs)
+		__field(int, nonagle)
+		__field(int, sk_pacing)
+	),
+
+	TP_fast_assign(
+		struct tcp_sock *tp = tcp_sk(sk);
+
+		__entry->mss_now  = mss_now;
+		__entry->nonagle  = nonagle;
+		__entry->max_segs = max_segs;
+		__entry->sk_pacing = tcp_needs_internal_pacing(sk);
+		__entry->tcp_wstamp_ns = tp->tcp_wstamp_ns;
+		__entry->tcp_clock_cache = tp->tcp_clock_cache;
+	),
+
+	TP_printk("mss %u segs %u nonagle %d sk_pacing %d tcp_wstamp_ns %lld tcp_clock_cache %lld",
+		  __entry->mss_now, __entry->max_segs,
+		  __entry->nonagle, __entry->sk_pacing,
+		  __entry->tcp_wstamp_ns, __entry->tcp_clock_cache)
+);
+
 TRACE_EVENT(tcp_retransmit_synack,
 
 	TP_PROTO(const struct sock *sk, const struct request_sock *req),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 6bb8eb803105..0fa2c8e035b7 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -279,6 +279,7 @@
 #include <linux/uaccess.h>
 #include <asm/ioctls.h>
 #include <net/busy_poll.h>
+#include <trace/events/tcp.h>
 
 /* Track pending CMSGs. */
 enum {
@@ -1312,6 +1313,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 		if (skb)
 			copy = size_goal - skb->len;
 
+		trace_tcp_sendmsg_locked(msg, skb, size_goal);
+
 		if (copy <= 0 || !tcp_skb_can_collapse_to(skb)) {
 			bool first_skb;
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 74190518708a..c84f18cd9b7e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2623,6 +2623,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 	}
 
 	max_segs = tcp_tso_segs(sk, mss_now);
+	trace_tcp_write_xmit(sk, mss_now, nonagle, max_segs);
 	while ((skb = tcp_send_head(sk))) {
 		unsigned int limit;
Breno Leitao Feb. 26, 2025, 6:18 p.m. UTC | #8
Hello David,

On Wed, Feb 26, 2025 at 10:12:08AM -0700, David Ahern wrote:
> On 2/26/25 9:10 AM, Breno Leitao wrote:
> >> Also, if a tracepoint is added, inside of tcp_sendmsg_locked would cover
> >> more use cases (see kernel references to it).
> > 
> > Agree, this seems to provide more useful information
> > 
> >> We have a patch for a couple years now with a tracepoint inside the
> > 
> > Sorry, where do you have this patch? is it downstream?
> 
> company tree. Attached. Where to put tracepoints and what arguments to
> supply so that it is beneficial to multiple users is always a touchy
> subject :-)

Thanks. I would like to state that this would be useful for Meta as
well.

Right now, we (Meta) are using a nasty `noinline` attribute on
tcp_sendmsg() in order to keep the API stable, and this tracepoint would
solve that problem, avoiding the `noinline` hack. So, at least two types
of users would benefit from it.

> so I have not tried to push the patch out. sock arg should
> be added to it for example.

True, if it becomes a tracepoint instead of a rawtracepoint, the sock
arg might be useful.

How would you recommend me proceeding in this case?

Thanks
--breno
Eric Dumazet Feb. 26, 2025, 6:27 p.m. UTC | #9
On Wed, Feb 26, 2025 at 7:18 PM Breno Leitao <leitao@debian.org> wrote:
>
> [...]
>
> How would you recommend me proceeding in this case?

In 2022, Menglong Dong added __fix_address

Then later, Yafang Shao added noinline_for_tracing.

Would one of them be sufficient ?
David Ahern Feb. 26, 2025, 6:31 p.m. UTC | #10
On 2/26/25 11:27 AM, Eric Dumazet wrote:
> [...]
> 
> In 2022, Menglong Dong added __fix_address
> 
> Then later , Yafang Shao  added noinline_for_tracing .
> 
> Would one of them be sufficient ?

tcp_sendmsg_locked should not be getting inlined; it is the
sendmsg_locked handler and directly called by several subsystems.

i.e., moving the tracepoint to tcp_sendmsg_locked should solve the inline
problem. From there, the question is inside the loop or at entry to the
function. Inside the loop has been very helpful for me.
Jason Xing Feb. 26, 2025, 10:46 p.m. UTC | #11
Hi David,

On Thu, Feb 27, 2025 at 1:14 AM David Ahern <dsahern@kernel.org> wrote:
>
> [...]
>
> company tree. Attached. Where to put tracepoints and what arguments to
> supply so that it is beneficial to multiple users is always a touchy

Right. I am always eager to establish a standard method that developers
can reach consensus on. It's really hard; I gave it a try before.
Maintainers seem not to like seeing too many tracepoints appearing in
the stack.

> subject :-), so I have not tried to push the patch out. sock arg should
> be added to it for example.
>
> The key is to see how tcp_sendmsg_locked breaks up the buffers, and then
> another one in tcp_write_xmit to see when the actual push out happens.

Agreed on this point, since a fine-grained BPF program can take
advantage of it. But that seems like a separate small topic from
Breno's original motivation in this patch: I guess making
tcp_sendmsg_locked non-inlined would let a BPF program calculate the
delta between when tcp_sendmsg_locked starts and when it ends? I don't
know. Probably, as Eric said, using noinline or something similar is
simpler?
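(A sketch of that delta measurement, assuming tcp_sendmsg_locked stays
a probe-able symbol on the running kernel:)

	kprobe:tcp_sendmsg_locked { @entry[tid] = nsecs; }
	kretprobe:tcp_sendmsg_locked {
		if (@entry[tid]) {
			@locked_dur_us = hist((nsecs - @entry[tid]) / 1000);
			delete(@entry[tid]);
		}
	}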

> At the time I was looking at latency in the stack - from sendmsg call to
> driver pushing descriptors to hardware.

So do I.

Thanks,
Jason
Masami Hiramatsu (Google) Feb. 26, 2025, 11:46 p.m. UTC | #12
On Mon, 24 Feb 2025 20:23:00 +0100
Eric Dumazet <edumazet@google.com> wrote:

> On Mon, Feb 24, 2025 at 8:13 PM Yonghong Song <yhs@meta.com> wrote:
> > [...]
> > And this does happen in our production environment.
> 
> And we do not have a way to make the kprobe work even if LTO decided
> to inline a function ?

There is the `perf probe` command to set this up based on the debuginfo.
But the function parameters are sometimes optimized out and no longer
accessible. (You can try.)

Thus, if you need this hook for more than temporary debugging, I would
recommend using a tracepoint (or trace event).

Thank you,

Breno Leitao Feb. 27, 2025, 4:26 p.m. UTC | #13
On Wed, Feb 26, 2025 at 11:31:49AM -0700, David Ahern wrote:
> On 2/26/25 11:27 AM, Eric Dumazet wrote:
> > In 2022, Menglong Dong added __fix_address
> > 
> > Then later , Yafang Shao  added noinline_for_tracing .
> > 
> > Would one of them be sufficient ?
> 
> tcp_sendmsg_locked should not be getting inlined; it is the
> sendmsg_locked handler and directly called by several subsystems.
> 
> ie., moving the tracepoint to tcp_sendmsg_locked should solve the inline
> problem. From there, the question is inside the loop or at entry to the
> function. Inside the loop has been very helpful for me.

Inside the loop would work for me as well.

Patch

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 1a40c41ff8c30..7c0171d2dacdc 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -259,6 +259,11 @@  TRACE_EVENT(tcp_retransmit_synack,
 		  __entry->saddr_v6, __entry->daddr_v6)
 );
 
+DECLARE_TRACE(tcp_sendmsg_tp,
+	TP_PROTO(const struct sock *sk, const struct msghdr *msg, size_t size),
+	TP_ARGS(sk, msg, size)
+);
+
 DECLARE_TRACE(tcp_cwnd_reduction_tp,
 	TP_PROTO(const struct sock *sk, int newly_acked_sacked,
 		 int newly_lost, int flag),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 08d73f17e8162..5ef86fbd8aa85 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1362,6 +1362,8 @@  int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 {
 	int ret;
 
+	trace_tcp_sendmsg_tp(sk, msg, size);
+
 	lock_sock(sk);
 	ret = tcp_sendmsg_locked(sk, msg, size);
 	release_sock(sk);