
[bpf-next,v1,0/2] Handle possible NULL trusted raw_tp arguments

Message ID 20241101000017.3424165-1-memxor@gmail.com (mailing list archive)

Message

Kumar Kartikeya Dwivedi Nov. 1, 2024, midnight UTC
More context is available in [0], but the TL;DR is that the verifier
incorrectly assumes that any raw tracepoint argument will always be
non-NULL. This means that even when users correctly check for a possibly
NULL argument, the verifier can remove the NULL check due to its incorrect
knowledge of the pointer's NULL-ness. Secondly, kernel helpers and
kfuncs taking these trusted tracepoint arguments incorrectly assume that
all arguments will always be valid and non-NULL.

In this set, we mark raw_tp arguments as PTR_MAYBE_NULL on top of
PTR_TRUSTED, but special-case their behavior when they are dereferenced
or pointer arithmetic is performed on them. When passing trusted
arguments to helpers or kfuncs, raw_tp programs are permitted to pass
such possibly NULL pointers.

Any load from such a maybe-NULL trusted PTR_TO_BTF_ID is promoted to a
PROBE_MEM load so that any resulting page fault is handled. The verifier
will ensure that NULL checks on such pointers are preserved and do not
lead to dead code elimination.
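
For illustration, a minimal sketch of the program pattern this series
makes safe follows. The tracepoint name and argument types here are
hypothetical; the selftest instead adds its own tracepoint to
bpf_testmod:

  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char _license[] SEC("license") = "GPL";

  /* Hypothetical tracepoint whose second argument may be NULL. */
  SEC("tp_btf/example_raw_tp")
  int BPF_PROG(handle_example, struct task_struct *task, struct file *file)
  {
          /* With this series, the verifier keeps this NULL check rather
           * than eliminating it as dead code.
           */
          if (!file)
                  return 0;

          /* Safe to dereference the trusted pointer after the check. */
          bpf_printk("f_flags=%u", file->f_flags);
          return 0;
  }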

This new behavior is not applied when ref_obj_id is non-zero, as such
pointers do not correspond to raw_tp arguments, but to acquired objects.

Since helpers and kfuncs already require attention for PTR_TO_BTF_ID
(non-trusted) pointers, we do not implement any protection for such
cases in this patch set, and leave it as future work for an upcoming
series.

A selftest is included with this patch set to verify the new behavior,
and it crashes the kernel without the first patch.

 [0]: https://lore.kernel.org/bpf/CAADnVQLMPPavJQR6JFsi3dtaaLHB816JN4HCV_TFWohJ61D+wQ@mail.gmail.com

Kumar Kartikeya Dwivedi (2):
  bpf: Mark raw_tp arguments with PTR_MAYBE_NULL
  selftests/bpf: Add tests for raw_tp null handling

 include/linux/bpf.h                           |  6 ++
 kernel/bpf/btf.c                              |  5 +-
 kernel/bpf/verifier.c                         | 75 +++++++++++++++++--
 .../bpf/bpf_testmod/bpf_testmod-events.h      |  8 ++
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   |  2 +
 .../selftests/bpf/prog_tests/raw_tp_null.c    | 25 +++++++
 .../testing/selftests/bpf/progs/raw_tp_null.c | 27 +++++++
 .../bpf/progs/test_tp_btf_nullable.c          |  6 +-
 8 files changed, 145 insertions(+), 9 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/raw_tp_null.c
 create mode 100644 tools/testing/selftests/bpf/progs/raw_tp_null.c


base-commit: e626a13f6fbb4697f8734333432dca577628d09a

Comments

Kumar Kartikeya Dwivedi Nov. 1, 2024, 1:18 p.m. UTC | #1
On Fri, 1 Nov 2024 at 01:00, Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> More context is available in [0], but the TLDR; is that the verifier
> incorrectly assumes that any raw tracepoint argument will always be
> non-NULL. This means that even when users correctly check possible NULL
> arguments, the verifier can remove the NULL check due to incorrect
> knowledge of the NULL-ness of the pointer. Secondly, kernel helpers or
> kfuncs taking these trusted tracepoint arguments incorrectly assume that
> all arguments will always be valid non-NULL.
>
> In this set, we mark raw_tp arguments as PTR_MAYBE_NULL on top of
> PTR_TRUSTED, but special case their behavior when dereferencing them or
> pointer arithmetic over them is involved. When passing trusted args to
> helpers or kfuncs, raw_tp programs are permitted to pass possibly NULL
> pointers in such cases.
>
> Any loads into such maybe NULL trusted PTR_TO_BTF_ID is promoted to a
> PROBE_MEM load to handle emanating page faults. The verifier will ensure
> NULL checks on such pointers are preserved and do not lead to dead code
> elimination.
>
> This new behavior is not applied when ref_obj_id is non-zero, as those
> pointers do not belong to raw_tp arguments, but instead acquired
> objects.
>
> Since helpers and kfuncs already require attention for PTR_TO_BTF_ID
> (non-trusted) pointers, we do not implement any protection for such
> cases in this patch set, and leave it as future work for an upcoming
> series.
>
> A selftest is included with this patch set to verify the new behavior,
> and it crashes the kernel without the first patch.

I see that all selftests except one passed. The one that didn't pass
appears to have been cancelled after running for an hour, having stalled
after select_reuseport:OK.
Looking at the LLVM 18 run
(https://github.com/kernel-patches/bpf/actions/runs/11621768944/job/32366412581?pr=7999)
instead of the LLVM 17 one
(https://github.com/kernel-patches/bpf/actions/runs/11621768944/job/32366400714?pr=7999,
which failed), it seems the next test is send_signal_tracepoint.

Is this known to be flaky? I'm guessing not and it is probably caused
by my patch, but just want to confirm before I begin debugging.

>
>  [...]
>
Eduard Zingerman Nov. 2, 2024, 12:21 a.m. UTC | #2
On Fri, 2024-11-01 at 14:18 +0100, Kumar Kartikeya Dwivedi wrote:

[...]

> I see that all selftests except one passed. The one that didn't
> appears to have been cancelled after running for an hour, and stalled
> after select_reuseport:OK.
> Looking at the LLVM 18
> (https://github.com/kernel-patches/bpf/actions/runs/11621768944/job/32366412581?pr=7999)
> run instead of LLVM 17
> (https://github.com/kernel-patches/bpf/actions/runs/11621768944/job/32366400714?pr=7999,
> which failed), it seems the next test send_signal_tracepoint.
> 
> Is this known to be flaky? I'm guessing not and it is probably caused
> by my patch, but just want to confirm before I begin debugging.

I suspect this is the send_signal test.
It started to hang for me yesterday w/o any apparent reason (on the master branch).
I added a workaround to avoid the stalls, but it does not address the
underlying issue with the test. Workaround follows.

---

diff --git a/tools/testing/selftests/bpf/prog_tests/send_signal.c b/tools/testing/selftests/bpf/prog_tests/send_signal.c
index ee5a221b4103..4af127945417 100644
--- a/tools/testing/selftests/bpf/prog_tests/send_signal.c
+++ b/tools/testing/selftests/bpf/prog_tests/send_signal.c
@@ -18,6 +18,38 @@ static void sigusr1_siginfo_handler(int s, siginfo_t *i, void *v)
 
 static char log_buf[64 * 1024];
 
+static ssize_t read_with_timeout(int fd, void *buf, size_t count)
+{
+	struct timeval tv = { 1, 0 };
+	fd_set fds;
+	int err;
+
+	FD_ZERO(&fds);
+	FD_SET(fd, &fds);
+	err = select(fd + 1, &fds, NULL, NULL, &tv);
+	if (!ASSERT_GE(err, 0, "read select"))
+		return err;
+	if (FD_ISSET(fd, &fds))
+		return read(fd, buf, count);
+	return -EAGAIN;
+}
+
+static ssize_t write_with_timeout(int fd, void *buf, size_t count)
+{
+	struct timeval tv = { 1, 0 };
+	fd_set fds;
+	int err;
+
+	FD_ZERO(&fds);
+	FD_SET(fd, &fds);
+	err = select(fd + 1, NULL, &fds, NULL, &tv);
+	if (!ASSERT_GE(err, 0, "write select"))
+		return err;
+	if (FD_ISSET(fd, &fds))
+		return write(fd, buf, count);
+	return -EAGAIN;
+}
+
 static void test_send_signal_common(struct perf_event_attr *attr,
 				    bool signal_thread, bool remote)
 {
@@ -75,10 +107,10 @@ static void test_send_signal_common(struct perf_event_attr *attr,
 		}
 
 		/* notify parent signal handler is installed */
-		ASSERT_EQ(write(pipe_c2p[1], buf, 1), 1, "pipe_write");
+		ASSERT_EQ(write_with_timeout(pipe_c2p[1], buf, 1), 1, "pipe_write");
 
 		/* make sure parent enabled bpf program to send_signal */
-		ASSERT_EQ(read(pipe_p2c[0], buf, 1), 1, "pipe_read");
+		ASSERT_EQ(read_with_timeout(pipe_p2c[0], buf, 1), 1, "pipe_read");
 
 		/* wait a little for signal handler */
 		for (int i = 0; i < 1000000000 && !sigusr1_received; i++) {
@@ -94,10 +126,10 @@ static void test_send_signal_common(struct perf_event_attr *attr,
 		buf[0] = sigusr1_received;
 
 		ASSERT_EQ(sigusr1_received, 8, "sigusr1_received");
-		ASSERT_EQ(write(pipe_c2p[1], buf, 1), 1, "pipe_write");
+		ASSERT_EQ(write_with_timeout(pipe_c2p[1], buf, 1), 1, "pipe_write");
 
 		/* wait for parent notification and exit */
-		ASSERT_EQ(read(pipe_p2c[0], buf, 1), 1, "pipe_read");
+		ASSERT_EQ(read_with_timeout(pipe_p2c[0], buf, 1), 1, "pipe_read");
 
 		/* restore the old priority */
 		if (!remote)
@@ -158,7 +190,7 @@ static void test_send_signal_common(struct perf_event_attr *attr,
 	}
 
 	/* wait until child signal handler installed */
-	ASSERT_EQ(read(pipe_c2p[0], buf, 1), 1, "pipe_read");
+	ASSERT_EQ(read_with_timeout(pipe_c2p[0], buf, 1), 1, "pipe_read");
 
 	/* trigger the bpf send_signal */
 	skel->bss->signal_thread = signal_thread;
@@ -172,7 +204,7 @@ static void test_send_signal_common(struct perf_event_attr *attr,
 	}
 
 	/* notify child that bpf program can send_signal now */
-	ASSERT_EQ(write(pipe_p2c[1], buf, 1), 1, "pipe_write");
+	ASSERT_EQ(write_with_timeout(pipe_p2c[1], buf, 1), 1, "pipe_write");
 
 	/* For the remote test, the BPF program is triggered from this
 	 * process but the other process/thread is signaled.
@@ -188,7 +220,7 @@ static void test_send_signal_common(struct perf_event_attr *attr,
 	}
 
 	/* wait for result */
-	err = read(pipe_c2p[0], buf, 1);
+	err = read_with_timeout(pipe_c2p[0], buf, 1);
 	if (!ASSERT_GE(err, 0, "reading pipe"))
 		goto disable_pmu;
 	if (!ASSERT_GT(err, 0, "reading pipe error: size 0")) {
@@ -199,7 +231,7 @@ static void test_send_signal_common(struct perf_event_attr *attr,
 	ASSERT_EQ(buf[0], 8, "incorrect result");
 
 	/* notify child safe to exit */
-	ASSERT_EQ(write(pipe_p2c[1], buf, 1), 1, "pipe_write");
+	ASSERT_EQ(write_with_timeout(pipe_p2c[1], buf, 1), 1, "pipe_write");
 
 disable_pmu:
 	close(pmu_fd);
Alexei Starovoitov Nov. 2, 2024, 12:29 a.m. UTC | #3
On Fri, Nov 1, 2024 at 5:21 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Fri, 2024-11-01 at 14:18 +0100, Kumar Kartikeya Dwivedi wrote:
>
> [...]
>
> > I see that all selftests except one passed. The one that didn't
> > appears to have been cancelled after running for an hour, and stalled
> > after select_reuseport:OK.
> > Looking at the LLVM 18
> > (https://github.com/kernel-patches/bpf/actions/runs/11621768944/job/32366412581?pr=7999)
> > run instead of LLVM 17
> > (https://github.com/kernel-patches/bpf/actions/runs/11621768944/job/32366400714?pr=7999,
> > which failed), it seems the next test send_signal_tracepoint.
> >
> > Is this known to be flaky? I'm guessing not and it is probably caused
> > by my patch, but just want to confirm before I begin debugging.
>
> I suspect this is a test send_signal.
> It started to hang for me yesterday w/o any apparent reason (on master branch).
> I added workaround to avoid stalls, but this does not address the
> issue with the test. Workaround follows.

Hmm.
Puranjay touched it last with extra logic.

And before that David Vernet tried to address flakiness
in commit 4a54de65964d.
Yonghong also noticed lockups in paravirt
and added a workaround in commit 7015843afc.

Your additional timeout/workaround makes sense to me,
but it would be good to bisect whether Puranjay's change caused it.

> ---
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/send_signal.c b/tools/testing/selftests/bpf/prog_tests/send_signal.c
> index ee5a221b4103..4af127945417 100644
> --- a/tools/testing/selftests/bpf/prog_tests/send_signal.c
> +++ b/tools/testing/selftests/bpf/prog_tests/send_signal.c
> @@ -18,6 +18,38 @@ static void sigusr1_siginfo_handler(int s, siginfo_t *i, void *v)
>
>  static char log_buf[64 * 1024];
>
> +static ssize_t read_with_timeout(int fd, void *buf, size_t count)
> +{
> +       struct timeval tv = { 1, 0 };
> +       fd_set fds;
> +       int err;
> +
> +       FD_ZERO(&fds);
> +       FD_SET(fd, &fds);
> +       err = select(fd + 1, &fds, NULL, NULL, &tv);
> +       if (!ASSERT_GE(err, 0, "read select"))
> +               return err;
> +       if (FD_ISSET(fd, &fds))
> +               return read(fd, buf, count);
> +       return -EAGAIN;
> +}
> +
> +static ssize_t write_with_timeout(int fd, void *buf, size_t count)
> +{
> +       struct timeval tv = { 1, 0 };
> +       fd_set fds;
> +       int err;
> +
> +       FD_ZERO(&fds);
> +       FD_SET(fd, &fds);
> +       err = select(fd + 1, NULL, &fds, NULL, &tv);
> +       if (!ASSERT_GE(err, 0, "write select"))
> +               return err;
> +       if (FD_ISSET(fd, &fds))
> +               return write(fd, buf, count);
> +       return -EAGAIN;
> +}
> +
>  static void test_send_signal_common(struct perf_event_attr *attr,
>                                     bool signal_thread, bool remote)
>  {
> @@ -75,10 +107,10 @@ static void test_send_signal_common(struct perf_event_attr *attr,
>                 }
>
>                 /* notify parent signal handler is installed */
> -               ASSERT_EQ(write(pipe_c2p[1], buf, 1), 1, "pipe_write");
> +               ASSERT_EQ(write_with_timeout(pipe_c2p[1], buf, 1), 1, "pipe_write");
>
>                 /* make sure parent enabled bpf program to send_signal */
> -               ASSERT_EQ(read(pipe_p2c[0], buf, 1), 1, "pipe_read");
> +               ASSERT_EQ(read_with_timeout(pipe_p2c[0], buf, 1), 1, "pipe_read");
>
>                 /* wait a little for signal handler */
>                 for (int i = 0; i < 1000000000 && !sigusr1_received; i++) {
> @@ -94,10 +126,10 @@ static void test_send_signal_common(struct perf_event_attr *attr,
>                 buf[0] = sigusr1_received;
>
>                 ASSERT_EQ(sigusr1_received, 8, "sigusr1_received");
> -               ASSERT_EQ(write(pipe_c2p[1], buf, 1), 1, "pipe_write");
> +               ASSERT_EQ(write_with_timeout(pipe_c2p[1], buf, 1), 1, "pipe_write");
>
>                 /* wait for parent notification and exit */
> -               ASSERT_EQ(read(pipe_p2c[0], buf, 1), 1, "pipe_read");
> +               ASSERT_EQ(read_with_timeout(pipe_p2c[0], buf, 1), 1, "pipe_read");
>
>                 /* restore the old priority */
>                 if (!remote)
> @@ -158,7 +190,7 @@ static void test_send_signal_common(struct perf_event_attr *attr,
>         }
>
>         /* wait until child signal handler installed */
> -       ASSERT_EQ(read(pipe_c2p[0], buf, 1), 1, "pipe_read");
> +       ASSERT_EQ(read_with_timeout(pipe_c2p[0], buf, 1), 1, "pipe_read");
>
>         /* trigger the bpf send_signal */
>         skel->bss->signal_thread = signal_thread;
> @@ -172,7 +204,7 @@ static void test_send_signal_common(struct perf_event_attr *attr,
>         }
>
>         /* notify child that bpf program can send_signal now */
> -       ASSERT_EQ(write(pipe_p2c[1], buf, 1), 1, "pipe_write");
> +       ASSERT_EQ(write_with_timeout(pipe_p2c[1], buf, 1), 1, "pipe_write");
>
>         /* For the remote test, the BPF program is triggered from this
>          * process but the other process/thread is signaled.
> @@ -188,7 +220,7 @@ static void test_send_signal_common(struct perf_event_attr *attr,
>         }
>
>         /* wait for result */
> -       err = read(pipe_c2p[0], buf, 1);
> +       err = read_with_timeout(pipe_c2p[0], buf, 1);
>         if (!ASSERT_GE(err, 0, "reading pipe"))
>                 goto disable_pmu;
>         if (!ASSERT_GT(err, 0, "reading pipe error: size 0")) {
> @@ -199,7 +231,7 @@ static void test_send_signal_common(struct perf_event_attr *attr,
>         ASSERT_EQ(buf[0], 8, "incorrect result");
>
>         /* notify child safe to exit */
> -       ASSERT_EQ(write(pipe_p2c[1], buf, 1), 1, "pipe_write");
> +       ASSERT_EQ(write_with_timeout(pipe_p2c[1], buf, 1), 1, "pipe_write");
>
>  disable_pmu:
>         close(pmu_fd);
>
Eduard Zingerman Nov. 2, 2024, 12:32 a.m. UTC | #4
On Fri, 2024-11-01 at 17:29 -0700, Alexei Starovoitov wrote:

[...]

> Hmm.
> Puranjay touched it last with extra logic.
> 
> And before that David Vernet tried to address flakiness
> in commit 4a54de65964d.
> Yonghong also noticed lockups in paravirt
> and added workaround 7015843afc.
> 
> Your additional timeout/workaround makes sense to me,
> but would be good to bisect whether Puranjay's change caused it.

I'll debug what's going on some time later today or on Sat.
Eduard Zingerman Nov. 8, 2024, 5:08 a.m. UTC | #5
On Fri, 2024-11-01 at 17:32 -0700, Eduard Zingerman wrote:
> On Fri, 2024-11-01 at 17:29 -0700, Alexei Starovoitov wrote:
>
> [...]
>
> > Hmm.
> > Puranjay touched it last with extra logic.
> >
> > And before that David Vernet tried to address flakiness
> > in commit 4a54de65964d.
> > Yonghong also noticed lockups in paravirt
> > and added workaround 7015843afc.
> >
> > Your additional timeout/workaround makes sense to me,
> > but would be good to bisect whether Puranjay's change caused it.
>
> I'll debug what's going on some time later today or on Sat.

I finally had time to investigate this a bit.
First, here is how to trigger the lockup:

  t1=send_signal/send_signal_perf_thread_remote; \
  t2=send_signal/send_signal_nmi_thread_remote; \
  for i in $(seq 1 100); do ./test_progs -t $t1,$t2; done

For whatever reason, both tests must be run together to reproduce it.
The failing test is 'send_signal_nmi_thread_remote'.

The test is organized as a parent and a child process communicating
various events to each other over a pair of pipes (a simplified,
self-contained sketch of this handshake follows the list below).
The intended sequence of events:
- child:
  - install SIGUSR1 handler
  - notify parent
  - wait for parent
- parent:
  - open PERF_COUNT_SW_CPU_CLOCK event
  - attach BPF program to the event
  - notify child
  - enter busy loop for 10^8 iterations
  - wait for child
- BPF program:
  - send SIGUSR1 to child
- child:
  - poll for SIGUSR1 in a busy loop
  - notify parent
- parent:
  - check value communicated by child,
    terminate test.
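
A sketch of that pipe choreography, with variable names mirroring the
selftest and error handling trimmed; the BPF-driven SIGUSR1 is stood in
for by a plain kill() from the parent, purely to show the
synchronization pattern:

  #include <signal.h>
  #include <stdlib.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  static volatile sig_atomic_t sigusr1_received;

  static void sigusr1_handler(int sig) { (void)sig; sigusr1_received = 1; }

  int main(void)
  {
          int pipe_c2p[2], pipe_p2c[2]; /* child-to-parent, parent-to-child */
          char buf[1] = { 0 };
          pid_t pid;

          pipe(pipe_c2p);
          pipe(pipe_p2c);

          pid = fork();
          if (pid == 0) {                         /* child */
                  signal(SIGUSR1, sigusr1_handler);
                  write(pipe_c2p[1], buf, 1);     /* notify: handler installed */
                  read(pipe_p2c[0], buf, 1);      /* wait: sender is armed */
                  while (!sigusr1_received)       /* busy-poll for the signal */
                          ;
                  buf[0] = sigusr1_received;
                  write(pipe_c2p[1], buf, 1);     /* report result */
                  read(pipe_p2c[0], buf, 1);      /* wait: safe to exit */
                  exit(0);
          }

          /* parent */
          read(pipe_c2p[0], buf, 1);              /* wait: handler installed */
          write(pipe_p2c[1], buf, 1);             /* notify: signal may arrive now */
          kill(pid, SIGUSR1);                     /* stand-in for bpf_send_signal() */
          read(pipe_c2p[0], buf, 1);              /* wait for result */
          write(pipe_p2c[1], buf, 1);             /* notify: safe to exit */
          wait(NULL);
          return buf[0] ? 0 : 1;
  }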

The lockup happens because on every other test run the perf event is not
triggered, the child does not receive SIGUSR1, and thus both parent and
child are stuck.

For 'send_signal_nmi_thread_remote' the perf event is defined as:

	struct perf_event_attr attr = {
		.sample_period = 1,
		.type = PERF_TYPE_HARDWARE,
		.config = PERF_COUNT_HW_CPU_CYCLES,
	};

It is opened for the parent process's pid.

Apparently, the perf event is not always triggered between lines
send_signal.c:165-180. At line 180 the parent enters a system call,
so CPU cycles stop ticking for the parent; thus, if the perf event
has not been triggered already, it won't be triggered at all
(as far as I understand).
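
For reference, a simplified sketch of how such an event is opened for
the parent pid (the selftest wraps this in its own helpers, so the
function name below is illustrative):

  #include <linux/perf_event.h>
  #include <sys/syscall.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* Open a per-task hardware cycles event for the parent, as the test
   * does before attaching the BPF program to it.
   */
  static int open_parent_cycles_event(pid_t parent_pid)
  {
          struct perf_event_attr attr = {
                  .size = sizeof(attr),
                  .sample_period = 1,     /* sample on every counter overflow */
                  .type = PERF_TYPE_HARDWARE,
                  .config = PERF_COUNT_HW_CPU_CYCLES,
          };

          /* pid = parent, cpu = -1: count cycles of this task on whatever
           * CPU it runs on. The counter only advances while the task
           * executes, so once the parent sleeps in a system call waiting
           * for the child, no further samples (and hence no BPF program
           * invocations) are produced.
           */
          return syscall(__NR_perf_event_open, &attr, parent_pid,
                         -1 /* cpu */, -1 /* group_fd */, 0 /* flags */);
  }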

Applying the same fix Yonghong used in 7015843afc is sufficient to
reliably trigger the perf event:

--- a/tools/testing/selftests/bpf/prog_tests/send_signal.c
+++ b/tools/testing/selftests/bpf/prog_tests/send_signal.c
@@ -223,7 +223,8 @@ static void test_send_signal_perf(bool signal_thread, bool remote)
 static void test_send_signal_nmi(bool signal_thread, bool remote)
 {
        struct perf_event_attr attr = {
-               .sample_period = 1,
+               .freq = 1,
+               .sample_freq = 1000,
                .type = PERF_TYPE_HARDWARE,
                .config = PERF_COUNT_HW_CPU_CYCLES,
        };

But I don't understand why.
As far as I can figure from the kernel source code,
sample_period is measured in nanoseconds (is it?),
so the busy loop at send_signal.c:174-175 should run long enough for the
perf event to have been triggered earlier.

Can someone with an understanding of how perf events work explain why
the above change helps?
Alexei Starovoitov Nov. 8, 2024, 8:13 p.m. UTC | #6
On Thu, Nov 7, 2024 at 9:08 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Fri, 2024-11-01 at 17:32 -0700, Eduard Zingerman wrote:
> > On Fri, 2024-11-01 at 17:29 -0700, Alexei Starovoitov wrote:
> >
> > [...]
> >
> > > Hmm.
> > > Puranjay touched it last with extra logic.
> > >
> > > And before that David Vernet tried to address flakiness
> > > in commit 4a54de65964d.
> > > Yonghong also noticed lockups in paravirt
> > > and added workaround 7015843afc.
> > >
> > > Your additional timeout/workaround makes sense to me,
> > > but would be good to bisect whether Puranjay's change caused it.
> >
> > I'll debug what's going on some time later today or on Sat.
>
> I finally had time to investigate this a bit.
> First, here is how to trigger lockup:
>
>   t1=send_signal/send_signal_perf_thread_remote; \
>   t2=send_signal/send_signal_nmi_thread_remote; \
>   for i in $(seq 1 100); do ./test_progs -t $t1,$t2; done
>
> Must be both tests for whatever reason.
> The failing test is 'send_signal_nmi_thread_remote'.
>
> The test is organized as parent and child processes communicating
> various events to each other. The intended sequence of events:
> - child:
>   - install SIGUSR1 handler
>   - notify parent
>   - wait for parent
> - parent:
>   - open PERF_COUNT_SW_CPU_CLOCK event
>   - attach BPF program to the event
>   - notify child
>   - enter busy loop for 10^8 iterations
>   - wait for child
> - BPF program:
>   - send SIGUSR1 to child
> - child:
>   - poll for SIGUSR1 in a busy loop
>   - notify parent
> - parent:
>   - check value communicated by child,
>     terminate test.
>
> The lockup happens because on every other test run perf event is not
> triggered, child does not receive SIGUSR1 and thus both parent and
> child are stuck.
>
> For 'send_signal_nmi_thread_remote' perf event is defined as:
>
>         struct perf_event_attr attr = {
>                 .sample_period = 1,
>                 .type = PERF_TYPE_HARDWARE,
>                 .config = PERF_COUNT_HW_CPU_CYCLES,
>         };
>
> And is opened for parent process pid.
>
> Apparently, the perf event is not always triggered between lines
> send_signal.c:165-180. And at line 180 parent enters system call,
> so cpu cycles stop ticking for 'parent', thus if perf event
> had not been triggered already it won't be triggered at all
> (as far as I understand).
>
> Applying same fix as Yonghong did in 7015843afc is sufficient to
> reliably trigger perf event:
>
> --- a/tools/testing/selftests/bpf/prog_tests/send_signal.c
> +++ b/tools/testing/selftests/bpf/prog_tests/send_signal.c
> @@ -223,7 +223,8 @@ static void test_send_signal_perf(bool signal_thread, bool remote)
>  static void test_send_signal_nmi(bool signal_thread, bool remote)
>  {
>         struct perf_event_attr attr = {
> -               .sample_period = 1,
> +               .freq = 1,
> +               .sample_freq = 1000,
>                 .type = PERF_TYPE_HARDWARE,
>                 .config = PERF_COUNT_HW_CPU_CYCLES,
>         };
>
> But I don't understand why.
> As far as I can figure from kernel source code,
> sample_period is measured in nanoseconds (is it?),

I believe sample_period is a number of samples.
1 means that perf is supposed to generate events very often.
It means nanoseconds only for the SW cpu-clock event.

Let's apply the above workaround and move on. Pls send a patch.
Yonghong Song Nov. 8, 2024, 9:57 p.m. UTC | #7
On 11/8/24 12:13 PM, Alexei Starovoitov wrote:
> On Thu, Nov 7, 2024 at 9:08 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>> On Fri, 2024-11-01 at 17:32 -0700, Eduard Zingerman wrote:
>>> On Fri, 2024-11-01 at 17:29 -0700, Alexei Starovoitov wrote:
>>>
>>> [...]
>>>
>>>> Hmm.
>>>> Puranjay touched it last with extra logic.
>>>>
>>>> And before that David Vernet tried to address flakiness
>>>> in commit 4a54de65964d.
>>>> Yonghong also noticed lockups in paravirt
>>>> and added workaround 7015843afc.
>>>>
>>>> Your additional timeout/workaround makes sense to me,
>>>> but would be good to bisect whether Puranjay's change caused it.
>>> I'll debug what's going on some time later today or on Sat.
>> I finally had time to investigate this a bit.
>> First, here is how to trigger lockup:
>>
>>    t1=send_signal/send_signal_perf_thread_remote; \
>>    t2=send_signal/send_signal_nmi_thread_remote; \
>>    for i in $(seq 1 100); do ./test_progs -t $t1,$t2; done
>>
>> Must be both tests for whatever reason.
>> The failing test is 'send_signal_nmi_thread_remote'.
>>
>> The test is organized as parent and child processes communicating
>> various events to each other. The intended sequence of events:
>> - child:
>>    - install SIGUSR1 handler
>>    - notify parent
>>    - wait for parent
>> - parent:
>>    - open PERF_COUNT_SW_CPU_CLOCK event
>>    - attach BPF program to the event
>>    - notify child
>>    - enter busy loop for 10^8 iterations
>>    - wait for child
>> - BPF program:
>>    - send SIGUSR1 to child
>> - child:
>>    - poll for SIGUSR1 in a busy loop
>>    - notify parent
>> - parent:
>>    - check value communicated by child,
>>      terminate test.
>>
>> The lockup happens because on every other test run perf event is not
>> triggered, child does not receive SIGUSR1 and thus both parent and
>> child are stuck.
>>
>> For 'send_signal_nmi_thread_remote' perf event is defined as:
>>
>>          struct perf_event_attr attr = {
>>                  .sample_period = 1,
>>                  .type = PERF_TYPE_HARDWARE,
>>                  .config = PERF_COUNT_HW_CPU_CYCLES,
>>          };
>>
>> And is opened for parent process pid.
>>
>> Apparently, the perf event is not always triggered between lines
>> send_signal.c:165-180. And at line 180 parent enters system call,
>> so cpu cycles stop ticking for 'parent', thus if perf event
>> had not been triggered already it won't be triggered at all
>> (as far as I understand).
>>
>> Applying same fix as Yonghong did in 7015843afc is sufficient to
>> reliably trigger perf event:
>>
>> --- a/tools/testing/selftests/bpf/prog_tests/send_signal.c
>> +++ b/tools/testing/selftests/bpf/prog_tests/send_signal.c
>> @@ -223,7 +223,8 @@ static void test_send_signal_perf(bool signal_thread, bool remote)
>>   static void test_send_signal_nmi(bool signal_thread, bool remote)
>>   {
>>          struct perf_event_attr attr = {
>> -               .sample_period = 1,
>> +               .freq = 1,
>> +               .sample_freq = 1000,
>>                  .type = PERF_TYPE_HARDWARE,
>>                  .config = PERF_COUNT_HW_CPU_CYCLES,
>>          };
>>
>> But I don't understand why.
>> As far as I can figure from kernel source code,
>> sample_period is measured in nanoseconds (is it?),
> I believe sample_period is a number of samples.
> 1 means that perf suppose to generate event very often.
> It means nanoseconds only for SW cpu_cycles.
>
> let's apply above workaround and move on. Pls send a patch.

sample_period = 1 is intended to make sure we can get at least one
sample, since many samples will be generated. But too many samples may
cause *rare* issues in internal kernel logic in certain *corner*
cases. I agree with Alexei, let us just use a reasonable sample_freq
to fix the issue. Thanks!