mbox series

[bpf-next,0/4] selftests/bpf: fix for bpf_signal stalls, watchdog for test_progs

Message ID 20241112110906.3045278-1-eddyz87@gmail.com (mailing list archive)
Headers show
Series selftests/bpf: fix for bpf_signal stalls, watchdog for test_progs | expand

Message

Eduard Zingerman Nov. 12, 2024, 11:09 a.m. UTC
Test case 'bpf_signal' had been recently reported to stall, both on
the mailing list [1] and CI [2]. The stall is caused by CPU cycles
perf event not being delivered within expected time frame, before test
process enters system call and waits indefinitely.

This patch-set addresses the issue in several ways:
- A watchdog timer is added to test_progs.c runner:
  - it prints current sub-test name to stderr if sub-test takes longer
    than 10 seconds to finish;
  - it terminates process executing sub-test if sub-test takes longer
    than 120 seconds to finish.
- The test case is updated to await perf event notification with a
  timeout and a few retries, this serves two purposes:
  - busy loops longer to increase the time frame for CPU cycles event
    generation/delivery;
  - makes a timeout, not stall, a worst case scenario.
- The test case is updated to lower frequency of perf events, as high
  frequency of such events caused events generation throttling,
  which in turn delayed events delivery by amount of time sufficient
  to cause test case failure.

Note:

  librt pthread-based timer API is used to implement watchdog timer.
  I chose this API over SIGALRM because signal handler execution
  within test process context was sufficient to trigger perf event
  delivery for send_signal/send_signal_nmi_thread_remote test case,
  w/o any additional changes. Thus I concluded that SIGALRM based
  implementation interferes with tests execution.

[1] https://lore.kernel.org/bpf/CAP01T75OUeE8E-Lw9df84dm8ag2YmHW619f1DmPSVZ5_O89+Bg@mail.gmail.com/
[2] https://github.com/kernel-patches/bpf/actions/runs/11791485271/job/32843996871

Eduard Zingerman (4):
  selftests/bpf: watchdog timer for test_progs
  selftests/bpf: add read_with_timeout() utility function
  selftests/bpf: allow send_signal test to timeout
  selftests/bpf: update send_signal to lower perf evemts frequency

 tools/testing/selftests/bpf/Makefile          |   1 +
 tools/testing/selftests/bpf/io_helpers.c      |  21 ++++
 tools/testing/selftests/bpf/io_helpers.h      |   7 ++
 .../selftests/bpf/prog_tests/bpf_iter.c       |   8 +-
 .../testing/selftests/bpf/prog_tests/iters.c  |   4 +-
 .../selftests/bpf/prog_tests/send_signal.c    |  35 +++---
 tools/testing/selftests/bpf/test_progs.c      | 104 ++++++++++++++++++
 tools/testing/selftests/bpf/test_progs.h      |   6 +
 8 files changed, 166 insertions(+), 20 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/io_helpers.c
 create mode 100644 tools/testing/selftests/bpf/io_helpers.h

Comments

patchwork-bot+netdevbpf@kernel.org Nov. 12, 2024, 10:10 p.m. UTC | #1
Hello:

This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Tue, 12 Nov 2024 03:09:02 -0800 you wrote:
> Test case 'bpf_signal' had been recently reported to stall, both on
> the mailing list [1] and CI [2]. The stall is caused by CPU cycles
> perf event not being delivered within expected time frame, before test
> process enters system call and waits indefinitely.
> 
> This patch-set addresses the issue in several ways:
> - A watchdog timer is added to test_progs.c runner:
>   - it prints current sub-test name to stderr if sub-test takes longer
>     than 10 seconds to finish;
>   - it terminates process executing sub-test if sub-test takes longer
>     than 120 seconds to finish.
> - The test case is updated to await perf event notification with a
>   timeout and a few retries, this serves two purposes:
>   - busy loops longer to increase the time frame for CPU cycles event
>     generation/delivery;
>   - makes a timeout, not stall, a worst case scenario.
> - The test case is updated to lower frequency of perf events, as high
>   frequency of such events caused events generation throttling,
>   which in turn delayed events delivery by amount of time sufficient
>   to cause test case failure.
> 
> [...]

Here is the summary with links:
  - [bpf-next,1/4] selftests/bpf: watchdog timer for test_progs
    https://git.kernel.org/bpf/bpf-next/c/d9d4d127e813
  - [bpf-next,2/4] selftests/bpf: add read_with_timeout() utility function
    https://git.kernel.org/bpf/bpf-next/c/03066ed3105a
  - [bpf-next,3/4] selftests/bpf: allow send_signal test to timeout
    https://git.kernel.org/bpf/bpf-next/c/3209139d00e5
  - [bpf-next,4/4] selftests/bpf: update send_signal to lower perf evemts frequency
    https://git.kernel.org/bpf/bpf-next/c/4edab4c55d2d

You are awesome, thank you!