diff mbox series

tools/rtla: Add basic test suite

Message ID 20250109133852.360122-1-tglozar@redhat.com (mailing list archive)
State Superseded
Headers show
Series tools/rtla: Add basic test suite | expand

Commit Message

Tomas Glozar Jan. 9, 2025, 1:38 p.m. UTC
Implement a simple TAP-based test engine in bash and a few basic tests
using it, to be used to check for bugs and regressions.

A new "check" target is added to the rtla Makefile that runs the test suite
using the "prove" command implemented by Test::Harness.

The only test format currently supported is running rtla with defined
command arguments per test, checking its exit code. In case the exit
code is non-zero, the output of rtla is displayed.

The test cases are adopted from rtla tests in the Continuous Kernel
Integration (CKI) project [1] with the authors' approval.

[1] https://gitlab.com/redhat/centos-stream/tests/kernel/kernel-tests/-/blob/main/rt-tests/us/rtla/

Signed-off-by: Tomas Glozar <tglozar@redhat.com>
---
 tools/tracing/rtla/Makefile         |  4 +++-
 tools/tracing/rtla/tests/engine.sh  | 29 +++++++++++++++++++++++++++++
 tools/tracing/rtla/tests/hwnoise.t  | 19 +++++++++++++++++++
 tools/tracing/rtla/tests/osnoise.t  | 17 +++++++++++++++++
 tools/tracing/rtla/tests/timerlat.t | 25 +++++++++++++++++++++++++
 5 files changed, 93 insertions(+), 1 deletion(-)
 create mode 100644 tools/tracing/rtla/tests/engine.sh
 create mode 100644 tools/tracing/rtla/tests/hwnoise.t
 create mode 100644 tools/tracing/rtla/tests/osnoise.t
 create mode 100644 tools/tracing/rtla/tests/timerlat.t

Comments

Steven Rostedt Jan. 17, 2025, 1:18 a.m. UTC | #1
On Thu,  9 Jan 2025 14:38:52 +0100
Tomas Glozar <tglozar@redhat.com> wrote:

> Implement a simple TAP-based test engine in bash and a few basic tests
> using it, to be used to check for bugs and regressions.
> 
> A new "check" target is added to the rtla Makefile that runs the test suite
> using the "prove" command implemented by Test::Harness.
> 
> The only test format currently supported is running rtla with defined
> command arguments per test, checking its exit code. In case the exit
> code is non-zero, the output of rtla is displayed.
> 
> The test cases are adopted from rtla tests in the Continuous Kernel
> Integration (CKI) project [1] with the authors' approval.
> 
> [1] https://gitlab.com/redhat/centos-stream/tests/kernel/kernel-tests/-/blob/main/rt-tests/us/rtla/
> 
> Signed-off-by: Tomas Glozar <tglozar@redhat.com>

I tested this out, and it failed the last test:

make[1]: Leaving directory '/work/git/linux.git/tools/tracing/rtla'
RTLA=/work/git/linux.git/tools/tracing/rtla/rtla prove -o -f tests/
tests/hwnoise.t ... ok      
tests/osnoise.t ... ok   
tests/timerlat.t .. 9/9 
not ok 9 - hist test in nanoseconds
# 
tests/timerlat.t .. Failed 1/9 subtests 

Test Summary Report
-------------------
tests/timerlat.t (Wstat: 0 Tests: 9 Failed: 1)
  Failed test:  9
Files=3, Tests=20, 451 wallclock secs ( 0.07 usr  0.11 sys + 10.57 cusr 74.09 csys = 84.84 CPU)
Result: FAIL
make: *** [Makefile:89: check] Error 1

-- Steve
Tomas Glozar Jan. 17, 2025, 10:53 a.m. UTC | #2
pá 17. 1. 2025 v 2:18 odesílatel Steven Rostedt <rostedt@goodmis.org> napsal:
> I tested this out, and it failed the last test:
>

Thank you for testing this.

> make[1]: Leaving directory '/work/git/linux.git/tools/tracing/rtla'
> RTLA=/work/git/linux.git/tools/tracing/rtla/rtla prove -o -f tests/
> tests/hwnoise.t ... ok
> tests/osnoise.t ... ok
> tests/timerlat.t .. 9/9
> not ok 9 - hist test in nanoseconds
> #
> tests/timerlat.t .. Failed 1/9 subtests
>
> Test Summary Report
> -------------------
> tests/timerlat.t (Wstat: 0 Tests: 9 Failed: 1)
>   Failed test:  9
> Files=3, Tests=20, 451 wallclock secs ( 0.07 usr  0.11 sys + 10.57 cusr 74.09 csys = 84.84 CPU)
> Result: FAIL
> make: *** [Makefile:89: check] Error 1
>
>

I have not seen this when testing the patch. It looks like rtla exits
without any output, I will send a v2 that makes the test engine print
the error code in that case, it will be useful not only for this.

That being said, it's not like nanosecond resolution rtla-timerlat is
without bugs:

# RTLA timerlat histogram
# Time unit is nanoseconds (ns)
# Duration:   0 00:00:02
Index   IRQ-000   Thr-000   Usr-000
over:         1         0         0
count:        1         0         0
min:       6102         -         -
avg:       6102         -         -
max:       6102         -         -
ALL:        IRQ       Thr       Usr
count:        1         0         0
min:       6102         -         -
avg:       6102         -         -
max:       6102         -         -
rtla timerlat hit stop tracing
## CPU 0 hit stop tracing, analyzing it ##
 IRQ handler delay:
1442069690324.42 us (23632738287847.00 %)
 IRQ latency:                                               6.10 us
 Blocking thread:
                               :0
------------------------------------------------------------------------
    IRQ latency:                                            6.10 us (100%)
Max timerlat IRQ latency from idle: 6.10 us in cpu 0

Maybe after enough time repeating the command I'm also going to hit
the bug you saw.

Tomas
Steven Rostedt Jan. 17, 2025, 10:59 a.m. UTC | #3
On Fri, 17 Jan 2025 11:53:48 +0100
Tomas Glozar <tglozar@redhat.com> wrote:

> Maybe after enough time repeating the command I'm also going to hit
> the bug you saw.

Note, I am running it on a VM, not necessarily the best place to run an RT app.

Also, I'm in my BRANCH PROFILING phase, where my machines have the
#define if enabled, and is running about 1/5th the normal speed.

Thus, it will trigger large latency and delays.

-- Steve
Steven Rostedt Jan. 17, 2025, 9:02 p.m. UTC | #4
On Fri, 17 Jan 2025 11:53:48 +0100
Tomas Glozar <tglozar@redhat.com> wrote:

> Maybe after enough time repeating the command I'm also going to hit
> the bug you saw.

Nevermind. The divide by zero fix went into my tools/urgent branch and I
pushed it up to Linus, but I had already started the tools/core branch for
the next merge window that I've been adding these patches to. It does not
include the fix.

The reason for that crash was because of the divide by zero bug. After
cherry-picking it for testing, it works.

Now, what I also noticed was. When it did crash and die. The osnoise tracer
no longer creates any osnoise threads. Can you take a look.

Remove the fix, and run that test:

 ~# echo osnoise > /sys/kernel/tracing/current_tracer
 ~# ps aux |grep osnois
root         942  105  0.0      0     0 ?        R    16:00   0:02 [osnoise/0]
root         943  104  0.0      0     0 ?        R    16:00   0:02 [osnoise/1]
root         944  105  0.0      0     0 ?        R    16:00   0:02 [osnoise/2]
root         945  105  0.0      0     0 ?        R    16:00   0:02 [osnoise/3]
root         946  105  0.0      0     0 ?        R    16:00   0:02 [osnoise/4]
root         947  105  0.0      0     0 ?        R    16:00   0:02 [osnoise/5]
root         948  104  0.0      0     0 ?        R    16:00   0:02 [osnoise/6]
root         949  130  0.0      0     0 ?        R    16:00   0:03 [osnoise/7]
root         953  9.0  0.0   6468  2152 pts/0    S+   16:00   0:00 grep osnois

 ~# echo nop > /sys/kernel/tracing/current_tracer
 ~# cd linux.git
 ~# git show cfff4d019b4c01 | patch -p1 -R
 ~# cd tools/tracing/rtla/
 ~# make
 ~# ./rtla timerlat hist -i 2 -c 0 -n -d 30s
 # RTLA timerlat histogram
 # Time unit is nanoseconds (ns)
 # Duration:   0 00:00:02
 Index   IRQ-000   Thr-000   Usr-000
 over:         1         0         0 
 count:        1         0         0 
 min:    1481365         -         - 
 avg:    1481365         -         - 
 max:    1481365         -         - 
 Floating point exception

 ~# rmdir /sys/kernel/tracing/instances/timerlat_aa/
 ~# rmdir /sys/kernel/tracing/instances/timerlat_hist/

 ~# echo osnoise > /sys/kernel/tracing/current_tracer
 ~# ps aux |grep osnois
root         969 11.1  0.0   6468  2244 pts/0    S+   15:54   0:00 grep osnois

So that failure puts the system into a state that can no longer run osnoise.
It takes a reboot to fix it.

That should be fixed where a reboot is not needed.

-- Steve
Tomas Glozar Jan. 20, 2025, 7:38 a.m. UTC | #5
pá 17. 1. 2025 v 22:02 odesílatel Steven Rostedt <rostedt@goodmis.org> napsal:
>
> Now, what I also noticed was. When it did crash and die. The osnoise tracer
> no longer creates any osnoise threads. Can you take a look.
>

It's likely that you have NO_OSNOISE_WORKLOAD set. The default value
is OSNOISE_WORKLOAD; rtla relies on it not being set, and rtla
timerlat -u resets it to the original value after it ends, so usually,
everything works. But it breaks either when you set
NO_OSNOISE_WORKLOAD manually or when rtla timerlat -u crashes (and
fails to set it back).

> So that failure puts the system into a state that can no longer run osnoise.
> It takes a reboot to fix it.
>
> That should be fixed where a reboot is not needed.

I sent patches to fix this for timerlat [1] and as I commented in the
thread [2], the same has to be done for osnoise, too. I just haven't
sent that yet since I was working on the rtla hang in sample
collection bug [3]. I also discovered that OSNOISE_WORKLOAD is not the
only option that rtla relies on being set properly: stop_tracing_us is
also affected and possibly other options. rtla should set all osnoise
options properly on every run, not relying on them being set by the
cleanup of the previous rtla run.

[1] https://lore.kernel.org/linux-trace-kernel/20250107144823.239782-1-tglozar@redhat.com/T/
[2[ https://lore.kernel.org/linux-trace-kernel/CAP4=nvRCw_2hKPUyd7obZv2Xho_mjdSC7EmozFsoeS0OZC9xaA@mail.gmail.com/
[3] https://lore.kernel.org/linux-trace-kernel/20250116144931.649593-1-tglozar@redhat.com/

Tomas
Tomas Glozar Jan. 20, 2025, 12:14 p.m. UTC | #6
po 20. 1. 2025 v 8:38 odesílatel Tomas Glozar <tglozar@redhat.com> napsal:
> It's likely that you have NO_OSNOISE_WORKLOAD set. The default value
> is OSNOISE_WORKLOAD; rtla relies on it not being set, and rtla
> timerlat -u resets it to the original value after it ends, so usually,
> everything works. But it breaks either when you set
> NO_OSNOISE_WORKLOAD manually or when rtla timerlat -u crashes (and
> fails to set it back).
>

I just realized that the tests are not running rtla timerlat with -k
and upstream, it has been defaulting to user threads for a long time,
so that's probably not it.
diff mbox series

Patch

diff --git a/tools/tracing/rtla/Makefile b/tools/tracing/rtla/Makefile
index a6a7dee16622..8b5101457c70 100644
--- a/tools/tracing/rtla/Makefile
+++ b/tools/tracing/rtla/Makefile
@@ -85,4 +85,6 @@  clean: doc_clean fixdep-clean
 	$(Q)find . -name '*.o' -delete -o -name '\.*.cmd' -delete -o -name '\.*.d' -delete
 	$(Q)rm -f rtla rtla-static fixdep FEATURE-DUMP rtla-*
 	$(Q)rm -rf feature
-.PHONY: FORCE clean
+check: $(RTLA)
+	RTLA=$(RTLA) prove -o -f tests/
+.PHONY: FORCE clean check
diff --git a/tools/tracing/rtla/tests/engine.sh b/tools/tracing/rtla/tests/engine.sh
new file mode 100644
index 000000000000..547490960a77
--- /dev/null
+++ b/tools/tracing/rtla/tests/engine.sh
@@ -0,0 +1,29 @@ 
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+test_begin() {
+	# Count tests to allow the test harness to double-check if all were
+	# included correctly.
+	ctr=0
+	[ -z "$RTLA" ] && RTLA="./rtla"
+	[ -n "$TEST_COUNT" ] && echo "1..$TEST_COUNT"
+}
+
+check() {
+	# Simple check: run rtla with given arguments and test exit code.
+	# If TEST_COUNT is set, run the test. Otherwise, just count.
+	ctr=$(($ctr + 1))
+	if [ -n "$TEST_COUNT" ]
+	then
+		# Run rtla; in case of failure, include its output as comment
+		# in the test results.
+		result=$("$RTLA" $2 2>&1) && echo "ok $ctr - $1" || echo "not ok $ctr - $1
+$(echo "$result" | while read line; do echo "# $line"; done)
+"
+	fi
+}
+
+test_end() {
+	# If running without TEST_COUNT, tests are not actually run, just
+	# counted. In that case, re-run the test with the correct count.
+	[ -z "$TEST_COUNT" ] && TEST_COUNT=$ctr exec bash $0 || true
+}
diff --git a/tools/tracing/rtla/tests/hwnoise.t b/tools/tracing/rtla/tests/hwnoise.t
new file mode 100644
index 000000000000..491383e30373
--- /dev/null
+++ b/tools/tracing/rtla/tests/hwnoise.t
@@ -0,0 +1,19 @@ 
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+source tests/engine.sh
+test_begin
+
+check "verify help page" \
+	"hwnoise --help"
+check "detect noise higher than one microsecond" \
+	"hwnoise -c 0 -T 1 -d 5s -q"
+check "set the automatic trace mode" \
+	"hwnoise -a 5 -d 30s"
+check "set scheduling param to the osnoise tracer threads" \
+	"hwnoise -P F:1 -c 0 -r 900000 -d 1M -q"
+check "stop the trace if a single sample is higher than 1 us" \
+	"hwnoise -s 1 -T 1 -t -d 30s"
+check "enable a trace event trigger" \
+	"hwnoise -t -e osnoise:irq_noise trigger=\"hist:key=desc,duration:sort=desc,duration:vals=hitcount\" -d 1m"
+
+test_end
diff --git a/tools/tracing/rtla/tests/osnoise.t b/tools/tracing/rtla/tests/osnoise.t
new file mode 100644
index 000000000000..83a7625d2226
--- /dev/null
+++ b/tools/tracing/rtla/tests/osnoise.t
@@ -0,0 +1,17 @@ 
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+source tests/engine.sh
+test_begin
+
+check "verify help page" \
+	"osnoise --help"
+check "verify the --priority/-P param" \
+	"osnoise top -P F:1 -c 0 -r 900000 -d 1M -q"
+check "verify the --stop/-s param" \
+	"osnoise top -s 30 -T 1 -t"
+check "verify the  --trace param" \
+	"osnoise hist -s 30 -T 1 -t"
+check "verify the --entries/-E param" \
+	"osnoise hist -P F:1 -c 0 -r 900000 -d 1M -b 10 -E 25"
+
+test_end
diff --git a/tools/tracing/rtla/tests/timerlat.t b/tools/tracing/rtla/tests/timerlat.t
new file mode 100644
index 000000000000..a17218a064bb
--- /dev/null
+++ b/tools/tracing/rtla/tests/timerlat.t
@@ -0,0 +1,25 @@ 
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+source tests/engine.sh
+test_begin
+
+check "verify help page" \
+	"timerlat --help"
+check "verify -s/--stack" \
+	"timerlat top -s 3 -T 10 -t"
+check "verify -P/--priority" \
+	"timerlat top -P F:1 -c 0 -d 1M -q"
+check "test in nanoseconds" \
+	"timerlat top -i 2 -c 0 -n -d 30s"
+check "set the automatic trace mode" \
+	"timerlat top -a 5 --dump-tasks"
+check "print the auto-analysis if hits the stop tracing condition" \
+	"timerlat top --aa-only 5"
+check "disable auto-analysis" \
+	"timerlat top -s 3 -T 10 -t --no-aa"
+check "verify -c/--cpus" \
+	"timerlat hist -c 0 -d 30s"
+check "hist test in nanoseconds" \
+	"timerlat hist -i 2 -c 0 -n -d 30s"
+
+test_end