[v2,0/9] ci: make Git's GitHub workflow output much more helpful

Message ID: pull.1117.v2.git.1646130289.gitgitgadget@gmail.com
Series: ci: make Git's GitHub workflow output much more helpful

Message

Johannes Schindelin via GitGitGadget March 1, 2022, 10:24 a.m. UTC
Changes since v1:

 * In the patch that removed MAKE_TARGETS, a stale comment about that
   variable is also removed.
 * The comment about set -x has been adjusted because it no longer applies
   as-is.
 * The commit message of "ci: make it easier to find failed tests' logs in
   the GitHub workflow" has been adjusted to motivate the improvement
   better.


Background
==========

Recent patches were intended to help readers figure out CI failures much
more quickly than before. Unfortunately, they haven't been entirely positive
for me. For example, they broke the branch protections in Microsoft's fork
of Git, where we require Pull Requests to pass a certain set of Checks
(which are identified by their names), and therefore caused follow-up work.

Using CI and in general making it easier for new contributors is an area I'm
passionate about, and one I'd like to see improved.


The current situation
=====================

Let me walk you through the current experience when a PR build fails: I get
a notification mail that only says that a certain job failed. There's no
indication of which test failed (or whether it was the build that failed). I
can click on a link and it takes me to the workflow run. Once there, all it
says is "Process completed with exit code 1", or even "code 2". Sure, I can
click on one of the failed jobs. It even expands the failed step's log
(collapsing the other steps). And what do I see there?

Let's look at an example of a failed linux-clang (ubuntu-latest) job
[https://github.com/git-for-windows/git/runs/4822802185?check_suite_focus=true]:

[...]
Test Summary Report
-------------------
t1092-sparse-checkout-compatibility.sh           (Wstat: 256 Tests: 53 Failed: 1)
  Failed test:  49
  Non-zero exit status: 1
t3701-add-interactive.sh                         (Wstat: 0 Tests: 71 Failed: 0)
  TODO passed:   45, 47
Files=957, Tests=25489, 645 wallclock secs ( 5.74 usr  1.56 sys + 866.28 cusr 364.34 csys = 1237.92 CPU)
Result: FAIL
make[1]: *** [Makefile:53: prove] Error 1
make[1]: Leaving directory '/home/runner/work/git/git/t'
make: *** [Makefile:3018: test] Error 2


That's it. I count myself lucky not to be a new contributor being faced with
something like this.

Now, since I have been active in the Git project for a couple of days or
so, I can make sense of the "TODO passed" label and know that for the
purpose of fixing the build failures, I need to ignore it and focus on the
"Failed test" part instead.

I also know that I do not have to get myself an ubuntu-latest box just to
reproduce the error; I do not even have to check out the code and run it
just to learn what that "49" means.
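
(If you do want to reproduce such a failure locally, the test number can be
passed straight to the test script; a minimal sketch, assuming a built
source tree and the standard options documented in t/README:)

$ cd t
$ ./t1092-sparse-checkout-compatibility.sh --run=49 -v -x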

I know, and I do not expect any new contributor, not even most seasoned
contributors to know, that I have to patiently collapse the "Run
ci/run-build-and-tests.sh" job's log, and instead expand the "Run
ci/print-test-failures.sh" job log (which did not fail and hence does not
draw any attention to it).

I know, and again: I do not expect many others to know this, that I then
have to click into the "Search logs" box (not the regular web browser's
search via Ctrl+F!) and type in "not ok" to find the log of the failed test
case (and this might still be a "known broken" one that is marked via
test_expect_failure and once again needs to be ignored).

To be excessively clear: This is not a great experience!


Improved output
===============

Our previous Azure Pipelines-based CI builds had a much nicer UI, one that
even showed flaky tests and trends, e.g. how long the test cases ran. When I
ported Git's CI over to GitHub workflows (to make CI more accessible to new
contributors), I knew full well that we would leave this very nice UI
behind, and I had hoped that we would get something similar back via new,
community-contributed GitHub Actions that can be used in GitHub workflows.
However, most likely because we use a home-grown test framework implemented
in opinionated POSIX shell scripts, that did not happen.

So I had a look at what standards exist, e.g. when testing PowerShell
modules, for marking up test output in GitHub workflows, and I was not
disappointed: GitHub workflows support "grouping" of output lines, i.e.
marking sections of the output as a group that is collapsed by default and
can be expanded. It is this feature I decided to use in this patch series,
along with GitHub workflows' commands to display errors or notices that are
also shown on the summary page of the workflow run. Now, in addition to
"Process completed with exit code" on the summary page, we also read
something like:

⊗ linux-gcc (ubuntu-latest)
   failed: t9800.20 submit from detached head


Even better, this message is a link, and following that, the reader is
presented with something like this
[https://github.com/dscho/git/runs/4840190622?check_suite_focus=true]:

⏵ Run ci/run-build-and-tests.sh
⏵ CI setup
  + ln -s /home/runner/none/.prove t/.prove
  + run_tests=t
  + export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
  + group Build make
  + set +x
⏵ Build
⏵ Run tests
  === Failed test: t9800-git-p4-basic ===
⏵ ok: t9800.1 start p4d
⏵ ok: t9800.2 add p4 files
⏵ ok: t9800.3 basic git p4 clone
⏵ ok: t9800.4 depot typo error
⏵ ok: t9800.5 git p4 clone @all
⏵ ok: t9800.6 git p4 sync uninitialized repo
⏵ ok: t9800.7 git p4 sync new branch
⏵ ok: t9800.8 clone two dirs
⏵ ok: t9800.9 clone two dirs, @all
⏵ ok: t9800.10 clone two dirs, @all, conflicting files
⏵ ok: t9800.11 clone two dirs, each edited by submit, single git commit
⏵ ok: t9800.12 clone using non-numeric revision ranges
⏵ ok: t9800.13 clone with date range, excluding some changes
⏵ ok: t9800.14 exit when p4 fails to produce marshaled output
⏵ ok: t9800.15 exit gracefully for p4 server errors
⏵ ok: t9800.16 clone --bare should make a bare repository
⏵ ok: t9800.17 initial import time from top change time
⏵ ok: t9800.18 unresolvable host in P4PORT should display error
⏵ ok: t9800.19 run hook p4-pre-submit before submit
  Error: failed: t9800.20 submit from detached head
⏵ failure: t9800.20 submit from detached head 
  Error: failed: t9800.21 submit from worktree
⏵ failure: t9800.21 submit from worktree 
  === Failed test: t9801-git-p4-branch ===
  [...]


The "Failed test:" lines are colored in yellow to give a better visual clue
about the logs' structure, the "Error:" label is colored in red to draw the
attention to the important part of the log, and the "⏵" characters indicate
that part of the log is collapsed and can be expanded by clicking on it.
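
Under the hood, this is driven by GitHub's "workflow commands": plain text
lines written to standard output. A minimal sketch of the idea (the helper
names here are illustrative, not necessarily the exact ones in ci/lib.sh):

# Start a collapsible group in the job's log
begin_group () {
	echo "::group::$1"
}

# Close the most recently opened group
end_group () {
	echo "::endgroup::"
}

begin_group "Build"
make
end_group

# An "::error::" line is additionally surfaced on the run's summary page:
echo "::error::failed: t9800.20 submit from detached head"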

To drill down, the reader merely needs to expand the (failed) test case's
log by clicking on it, and then study the log. If needed (e.g. when the test
case relies on side effects from previous test cases), the logs of preceding
test cases can be expanded as well. In this example, when expanding
t9800.20, it looks like this (for ease of reading, I cut a few chunks of
lines, indicated by "[...]"):

[...]
⏵ ok: t9800.19 run hook p4-pre-submit before submit
  Error: failed: t9800.20 submit from detached head
⏷ failure: t9800.20 submit from detached head 
      test_when_finished cleanup_git &&
      git p4 clone --dest="$git" //depot &&
        (
          cd "$git" &&
          git checkout p4/master &&
          >detached_head_test &&
          git add detached_head_test &&
          git commit -m "add detached_head" &&
          git config git-p4.skipSubmitEdit true &&
          git p4 submit &&
            git p4 rebase &&
            git log p4/master | grep detached_head
        )
    [...]
    Depot paths: //depot/
    Import destination: refs/remotes/p4/master
    
    Importing revision 9 (100%)Perforce db files in '.' will be created if missing...
    Perforce db files in '.' will be created if missing...
    
    Traceback (most recent call last):
      File "/home/runner/work/git/git/git-p4", line 4455, in <module>
        main()
      File "/home/runner/work/git/git/git-p4", line 4449, in main
        if not cmd.run(args):
      File "/home/runner/work/git/git/git-p4", line 2590, in run
        rebase.rebase()
      File "/home/runner/work/git/git/git-p4", line 4121, in rebase
        if len(read_pipe("git diff-index HEAD --")) > 0:
      File "/home/runner/work/git/git/git-p4", line 297, in read_pipe
        retcode, out, err = read_pipe_full(c, *k, **kw)
      File "/home/runner/work/git/git/git-p4", line 284, in read_pipe_full
        p = subprocess.Popen(
      File "/usr/lib/python3.8/subprocess.py", line 858, in __init__
        self._execute_child(args, executable, preexec_fn, close_fds,
      File "/usr/lib/python3.8/subprocess.py", line 1704, in _execute_child
        raise child_exception_type(errno_num, err_msg, err_filename)
    FileNotFoundError: [Errno 2] No such file or directory: 'git diff-index HEAD --'
    error: last command exited with $?=1
    + cleanup_git
    + retry_until_success rm -r /home/runner/work/git/git/t/trash directory.t9800-git-p4-basic/git
    + nr_tries_left=60
    + rm -r /home/runner/work/git/git/t/trash directory.t9800-git-p4-basic/git
    + test_path_is_missing /home/runner/work/git/git/t/trash directory.t9800-git-p4-basic/git
    + test 1 -ne 1
    + test -e /home/runner/work/git/git/t/trash directory.t9800-git-p4-basic/git
    + retry_until_success mkdir /home/runner/work/git/git/t/trash directory.t9800-git-p4-basic/git
    + nr_tries_left=60
    + mkdir /home/runner/work/git/git/t/trash directory.t9800-git-p4-basic/git
    + exit 1
    + eval_ret=1
    + :
    not ok 20 - submit from detached head
    #    
    #        test_when_finished cleanup_git &&
    #        git p4 clone --dest="$git" //depot &&
    #        (
    #            cd "$git" &&
    #            git checkout p4/master &&
    #            >detached_head_test &&
    #            git add detached_head_test &&
    #            git commit -m "add detached_head" &&
    #            git config git-p4.skipSubmitEdit true &&
    #            git p4 submit &&
    #            git p4 rebase &&
    #            git log p4/master | grep detached_head
    #        )
    #    
  Error: failed: t9800.21 submit from worktree
  [...]


Is this the best UI we can have for test failures in CI runs? I hope we can
do better. Having said that, this patch series presents a pretty good start,
and offers a basis for future improvements.

Johannes Schindelin (9):
  ci: fix code style
  ci/run-build-and-tests: take a more high-level view
  ci: make it easier to find failed tests' logs in the GitHub workflow
  ci/run-build-and-tests: add some structure to the GitHub workflow
    output
  tests: refactor --write-junit-xml code
  test(junit): avoid line feeds in XML attributes
  ci: optionally mark up output in the GitHub workflow
  ci: use `--github-workflow-markup` in the GitHub workflow
  ci: call `finalize_test_case_output` a little later

 .github/workflows/main.yml           |  12 ---
 ci/lib.sh                            |  82 +++++++++++++++--
 ci/run-build-and-tests.sh            |  14 +--
 ci/run-test-slice.sh                 |   5 +-
 t/test-lib-functions.sh              |   4 +-
 t/test-lib-github-workflow-markup.sh |  50 ++++++++++
 t/test-lib-junit.sh                  | 132 +++++++++++++++++++++++++++
 t/test-lib.sh                        | 128 ++++----------------------
 8 files changed, 288 insertions(+), 139 deletions(-)
 create mode 100644 t/test-lib-github-workflow-markup.sh
 create mode 100644 t/test-lib-junit.sh


base-commit: af4e5f569bc89f356eb34a9373d7f82aca6faa8a
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1117%2Fdscho%2Fuse-grouping-in-ci-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1117/dscho/use-grouping-in-ci-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1117

Range-diff vs v1:

  1:  db08b07c37a =  1:  db08b07c37a ci: fix code style
  2:  d2ff51bb5da !  2:  42ff3e170bf ci/run-build-and-tests: take a more high-level view
     @@ ci/run-build-and-tests.sh: pedantic)
       	;;
       esac
       
     - # Any new "test" targets should not go after this "make", but should
     - # adjust $MAKE_TARGETS. Otherwise compilation-only targets above will
     - # start running tests.
     +-# Any new "test" targets should not go after this "make", but should
     +-# adjust $MAKE_TARGETS. Otherwise compilation-only targets above will
     +-# start running tests.
      -make $MAKE_TARGETS
      +make
      +if test -n "$run_tests"
  3:  98891b0d3f7 !  3:  bbbe1623257 ci: make it easier to find failed tests' logs in the GitHub workflow
     @@ Metadata
       ## Commit message ##
          ci: make it easier to find failed tests' logs in the GitHub workflow
      
     +    When investigating a test failure, the time that matters most is the
     +    time it takes from getting aware of the failure to displaying the output
     +    of the failing test case.
     +
          You currently have to know a lot of implementation details when
          investigating test failures in the CI runs. The first step is easy: the
          failed job is marked quite clearly, but when opening it, the failed step
     @@ Commit message
          The actually interesting part is in the detailed log of said failed
          test script. But that log is shown in the CI run's step that runs
          `ci/print-test-failures.sh`. And that step is _not_ expanded in the web
     -    UI by default.
     +    UI by default. It is even marked as "successful", which makes it very
     +    easy to miss that there is useful information hidden in there.
      
          Let's help the reader by showing the failed tests' detailed logs in the
          step that is expanded automatically, i.e. directly after the test suite
  4:  9333ba781b8 !  4:  f72254a9ac6 ci/run-build-and-tests: add some structure to the GitHub workflow output
     @@ ci/lib.sh
      +
      +# Set 'exit on error' for all CI scripts to let the caller know that
      +# something went wrong.
     -+# Set tracing executed commands, primarily setting environment variables
     -+# and installing dependencies.
     ++#
     ++# We already enabled tracing executed commands earlier. This helps by showing
     ++# how environment variables are set and dependencies are installed.
      +set -e
      +
       skip_branch_tip_with_tag () {
     @@ ci/lib.sh: linux-leaks)
      +set -x
      
       ## ci/run-build-and-tests.sh ##
     -@@ ci/run-build-and-tests.sh: esac
     - # Any new "test" targets should not go after this "make", but should
     - # adjust $MAKE_TARGETS. Otherwise compilation-only targets above will
     - # start running tests.
     +@@ ci/run-build-and-tests.sh: pedantic)
     + 	;;
     + esac
     + 
      -make
      +group Build make
       if test -n "$run_tests"
  5:  94dcbe1bc43 =  5:  9eda6574313 tests: refactor --write-junit-xml code
  6:  41230100091 =  6:  c8b240af749 test(junit): avoid line feeds in XML attributes
  7:  98b32630fcd =  7:  15f199e810e ci: optionally mark up output in the GitHub workflow
  8:  1a6bd1846bc =  8:  91ea54f36c5 ci: use `--github-workflow-markup` in the GitHub workflow
  9:  992b1575889 =  9:  be2a83f5da3 ci: call `finalize_test_case_output` a little later

Comments

Junio C Hamano March 1, 2022, 7:07 p.m. UTC | #1
"Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> Changes since v1:
>
>  * In the patch that removed MAKE_TARGETS, a stale comment about that
>    variable is also removed.
>  * The comment about set -x has been adjusted because it no longer applies
>    as-is.
>  * The commit message of "ci: make it easier to find failed tests' logs in
>    the GitHub workflow" has been adjusted to motivate the improvement
>    better.

Will queue.  Thanks.
Ævar Arnfjörð Bjarmason March 2, 2022, 12:22 p.m. UTC | #2
On Tue, Mar 01 2022, Johannes Schindelin via GitGitGadget wrote:

> Changes since v1:
>
>  * In the patch that removed MAKE_TARGETS, a stale comment about that
>    variable is also removed.
>  * The comment about set -x has been adjusted because it no longer applies
>    as-is.
>  * The commit message of "ci: make it easier to find failed tests' logs in
>    the GitHub workflow" has been adjusted to motivate the improvement
>    better.

Just briefly: Much of the feedback I had on v1 is still unanswered, or
in the case of the performance issues I think just saying that this
output is aimed at non-long-time-contributors doesn't really justify the
large observed slowdowns.

I've been clicking around on the various "seen" links you posted, and
while some of the output is improved it's *much* slower, to the point
where I don't find those gains worth it.

For the early part of this series, which teaches "ci/" this GitHub
markup: that's something that's unnecessary in my alternate approach in
[1]. This series is obviously a relatively shorter way to get there,
but knowing that you *can* do it in a simpler way now, it's odd that the
v2 doesn't discuss why getting from "A" to that "B" is still desirable
with this method.

As for the later parts:

>   5:  94dcbe1bc43 =  5:  9eda6574313 tests: refactor --write-junit-xml code
>   6:  41230100091 =  6:  c8b240af749 test(junit): avoid line feeds in XML attributes
>   7:  98b32630fcd =  7:  15f199e810e ci: optionally mark up output in the GitHub workflow
>   8:  1a6bd1846bc =  8:  91ea54f36c5 ci: use `--github-workflow-markup` in the GitHub workflow
>   9:  992b1575889 =  9:  be2a83f5da3 ci: call `finalize_test_case_output` a little later

I think (and I pointed out to you before) that I really don't see why it
makes sense to teach test-lib.sh N output formats when we already have a
machine-readable output format with TAP *and* are in fact running code
with "prove" that is parsing that output format into nicely arranged
objects/structures.

So as a demo for how that can work, here's a quick POC I hacked up a
while ago to use "prove" output plugins for this sort of thing.

First we intentionally break a test:

	$ git -P diff
	diff --git a/t/t0002-gitfile.sh b/t/t0002-gitfile.sh
	index 76052cb5620..f085188baf9 100755
	--- a/t/t0002-gitfile.sh
	+++ b/t/t0002-gitfile.sh
	@@ -44,7 +44,7 @@ test_expect_success 'check hash-object' '
	 
	 test_expect_success 'check cat-file' '
	        git cat-file blob $SHA >actual &&
	-       test_cmp bar actual
	+       test_cmp /dev/null actual
	 '
	 
	 test_expect_success 'check update-index' '
	@@ -67,7 +67,7 @@ test_expect_success 'check commit-tree' '
	 
	 test_expect_success 'check rev-list' '
	        git update-ref "HEAD" "$SHA" &&
	-       test "$SHA" = "$(git rev-list HEAD)"
	+       test "$SHA" = "$(xgit rev-list HEAD)"
	 '

And here's the resulting output, which is the same as the existing
"prove" summary output, but then followed by nicely formatted details
about just the failing tests:

	$ PERL5LIB=$PWD prove --formatter Fmt t0002-gitfile.sh :: -vx
	 
	 test_expect_success 'setup_git_dir twice in subdir' '
	t0002-gitfile.sh .. Dubious, test returned 1 (wstat 256, 0x100)
	Failed 2/14 subtests 
	
	Test Summary Report
	-------------------
	t0002-gitfile.sh (Wstat: 256 Tests: 14 Failed: 2)
	  Failed tests:  6, 10
	  Non-zero exit status: 1
	Files=1, Tests=14,  0 wallclock secs ( 0.02 usr  0.00 sys +  0.13 cusr  0.05 csys =  0.20 CPU)
	Result: FAIL
	Failed in t0002-gitfile.sh#6:
	 ==> test   => not ok 6 - check cat-file
	 ==> source => #
	 ==> source => #        git cat-file blob $SHA >actual &&
	 ==> source => #        test_cmp /dev/null actual
	 ==> source => #
	 ==> trace  => + git cat-file blob 257cc5642cb1a054f08cc83f2d943e56fd3ebe99
	 ==> trace  => + test_cmp /dev/null actual
	 ==> trace  => + test 2 -ne 2
	 ==> trace  => + eval diff -u "$@"
	 ==> trace  => + diff -u /dev/null actual
	 ==> output => --- /dev/null    2022-01-25 04:40:50.187529644 +0000
	 ==> output => +++ actual       2022-03-02 12:29:28.217960535 +0000
	 ==> output => @@ -0,0 +1 @@
	 ==> output => +foo
	 ==> output => error: last command exited with $?=1
	 ==> output => 
	Failed in t0002-gitfile.sh#10:
	 ==> test   => not ok 10 - check rev-list
	 ==> source => #
	 ==> source => #        git update-ref "HEAD" "$SHA" &&
	 ==> source => #        test "$SHA" = "$(xgit rev-list HEAD)"
	 ==> source => #
	 ==> trace  => + git update-ref HEAD f2e10ff57e8c01fe514c650d9de97e913257ba0c
	 ==> trace  => + xgit rev-list HEAD
	 ==> trace  => + test f2e10ff57e8c01fe514c650d9de97e913257ba0c = 
	 ==> output => t0002-gitfile.sh: 5: eval: xgit: not found
	 ==> output => error: last command exited with $?=1
	 ==> output => 

Now, some of that is on top of some output changes to test-lib.sh I had
locally, but the "test", "source" etc. there are not a hardcoded part of
the output; they correspond to the individual object types the TAP::Parser
emits (some of which I re-labeled, e.g. "comment" => "source").

The code to do that is below, with brief (unindented) comments:

	package Git::TAP::Formatter::Session;
	use v5.18.2;
	use strict;
	use warnings;
	use base 'TAP::Formatter::Console::ParallelSession';
	
	our %STATE;
	sub result {
		my $self = shift;
		my $result = shift;

So here we just have the TAP object for the current test result (as in
"ok", "not ok" etc.), and save all of those away (for both successful and
unsuccessful tests) in %STATE for later:
	
		my $res = $self->SUPER::result($result);
		my $test_name = $self->name;
	
		# An AoO of test numbers and their output lines
		$STATE{$test_name} ||= [{lines => []}];
	
		push @{$STATE{$test_name}->[-1]->{lines}} => $result;
	
		# When we see a new test add a new AoA for its output. We do
		# end up with the "plan" type as part of the last test, and
		# might want to split it up
		if ($result->type eq 'test') {
			push @{$STATE{$test_name}} => {};
		}
	
		return $res;
	}

This "Fmt" package is a "prove" plugin. It's just hooking into the code
that emits the current summary, now it just shows the "Test Summary
Report" that your CL notes issues with, but this shows how easy it is to
change or amend it (you can override another accessor here to fully
suppress the current output, I just didn't do that):

	package Fmt;
	use strict;
	use warnings;
	use List::MoreUtils qw(firstidx);
	use base 'TAP::Formatter::Console';
	
	sub open_test {
		my $self = shift;
	
		my $session = $self->SUPER::open_test(@_);
		use Data::Dumper;
		#warn "session is = " . Dumper $session;
		return bless $session => 'Git::TAP::Formatter::Session';
	}
	
	sub summary {
		my $self = shift;
		$self->SUPER::summary(@_);
	
		## This state machine needs to go past the "ok" line and grab
		## the comments emitted by e.g. "say_color_tap_comment_lines" in
		## test_ok_()
		for my $test (sort keys %STATE) {
			for (my $i = 1; $i <= $#{$STATE{$test}}; $i++) {
				my @lines = @{$STATE{$test}->[$i]->{lines}};
				my $break = firstidx { $_->raw eq '' } @lines;
				my @source = splice @lines, 0, $break;
	
				splice @lines, 0, 1; # Splice out the '' item
				push @{$STATE{$test}->[$i - 1]->{lines}} => @source;
	
				$STATE{$test}->[$i]->{lines} = \@lines;
	
				# Since we parsed out the source already,
				# let's make it easily machine-readable, and
				# parse the rest.
				$STATE{$test}->[$i]->{source} = \@source;
				# Mark the "set -x" trace lines, keeping the
				# line objects themselves
				my @trace = grep { $_->raw =~ /^\+ / } @lines;
				$_->{is_trace} = 1 for @trace;
				$STATE{$test}->[$i]->{trace} = \@trace if @trace;
			}
		}
	
		my $aggregator = $_[0];
		for my $failed ($aggregator->failed) {
			my ($parser) = $aggregator->parsers($failed);
			for my $i ($parser->failed) {
				my $idx = $i - 1;
				my @lines = @{$STATE{$failed}->[$idx]->{lines}};
				my ($test) = grep { $_->is_test } @lines;
	
				say "Failed in $failed#$i:";
				say join "\n", map {
					s/^/ ==> /gr;
				} map {
					sprintf("%-6s => %s",
					     $_->{is_trace} ? "trace" :
					     $_->is_unknown ? "output" :
					     $_->is_comment ? "source" :
					     $_->type,
					     $_->raw,
				     );
				} sort {
					# the "is_trace" may be undef
					no warnings qw(uninitialized);
					# The "[not ]ok" line first...
					$b->is_test <=> $a->is_test ||
					# Then "comment" (i.e. test source)
					$b->is_comment <=> $a->is_comment ||
					# Then the "+ " trace of execution
					$b->{is_trace} <=> $a->{is_trace}
				} @lines;
			}
		}
	}
	
	1;


1. https://lore.kernel.org/git/cover-00.25-00000000000-20220221T143936Z-avarab@gmail.com/
2. https://lore.kernel.org/git/220222.86y2236ndp.gmgdl@evledraar.gmail.com/
Johannes Schindelin March 7, 2022, 3:57 p.m. UTC | #3
Hi Ævar,

On Wed, 2 Mar 2022, Ævar Arnfjörð Bjarmason wrote:

>
> On Tue, Mar 01 2022, Johannes Schindelin via GitGitGadget wrote:
>
> > Changes since v1:
> >
> >  * In the patch that removed MAKE_TARGETS, a stale comment about that
> >    variable is also removed.
> >  * The comment about set -x has been adjusted because it no longer applies
> >    as-is.
> >  * The commit message of "ci: make it easier to find failed tests' logs in
> >    the GitHub workflow" has been adjusted to motivate the improvement
> >    better.
>
> Just briefly: Much of the feedback I had on v1 is still unanswered,

Yes, because I got the sense that your feedback ignores the goal of making
it easier to diagnose test failures.

> or in the case of the performance issues I think just saying that this
> output is aimed at non-long-time-contributors doesn't really justify the
> large observed slowdowns.

What good is a quickly-loading web site when it is less than useful?

I'd much rather have a slow-loading web site that gets me to where I need
to be than a fast-loading one that leaves me as puzzled as before.

Ciao,
Johannes
Ævar Arnfjörð Bjarmason March 7, 2022, 4:05 p.m. UTC | #4
On Mon, Mar 07 2022, Johannes Schindelin wrote:

> On Wed, 2 Mar 2022, Ævar Arnfjörð Bjarmason wrote:
>
>>
>> On Tue, Mar 01 2022, Johannes Schindelin via GitGitGadget wrote:
>>
>> > Changes since v1:
>> >
>> >  * In the patch that removed MAKE_TARGETS, a stale comment about that
>> >    variable is also removed.
>> >  * The comment about set -x has been adjusted because it no longer applies
>> >    as-is.
>> >  * The commit message of "ci: make it easier to find failed tests' logs in
>> >    the GitHub workflow" has been adjusted to motivate the improvement
>> >    better.
>>
>> Just briefly: Much of the feedback I had on v1 is still unanswered,
>
> Yes, because I got the sense that your feedback ignores the goal of making
> it easier to diagnose test failures.

I think that's a rather strange conclusion given that I've submitted a
parallel series that makes some of those failures easier to diagnose
than the same changes in this series. I.e. the failures in build
v.s. test phases, not the individual test format output (but those are
orthogonal).

I think it's a fair summary of our differences that we're just placing
different values on UX responsiveness. I'm pretty sure there's some
amount of UX slowdown you'd consider unacceptable, no matter how much
the output was improved.

Clearly we just use it differently.

>> or in the case of the performance issues I think just saying that this
>> output is aimed at non-long-time-contributors doesn't really justify the
>> large observed slowdowns.
>
> What good is a quickly-loading web site when it is less than useful?

For all the flaws in the current output there are cases now where you
can click on a failure, see a summary of the 1-2 tests that failed, and
even find your way through the right place in the rather verbose raw log
output in 1/4 or 1/2 the time it takes the initial page on the new UX to
load.

> I'd much rather have a slow-loading web site that gets me to where I need
> to be than a fast-loading one that leaves me as puzzled as before.

I think it's clear that we're going to disagree on this point, but I'd
still think that:

 * In a re-roll, you should amend these patches to clearly note that's a
   UX trade-off you're making, perhaps with rough before/after timings
   similar to the ones I've posted.

   I.e. now those patches say nothing about the UX change resulting in
   UX that's *much* slower than before. Clearly noting that trade-off
   for reviewers is not the same as saying the trade-off can't be made.

 * I don't see why the changes here can't be made configurable (and
   perhaps you'd argue they should be on by default) via the ci-config
   phase.
Junio C Hamano March 7, 2022, 5:36 p.m. UTC | #5
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> I think that's a rather strange conclusion given that I've submitted a
> parallel series that makes some of those failures easier to diagnose
> than the same changes in this series. I.e. the failures in build
> v.s. test phases, not the individual test format output (but those are
> orthogonal).

If you have a counter-proposal that you feel is solid enough, I do
not mind dropping the topic in question and replacing it with the
counter-proposal to let people see how it fares for a few days.  It
would allow others to view the output easily if you reverted the
merge of this topic into 'seen', replaced it with the counter-proposal,
and pushed it to your own repository; that would be an even better way
to highlight the differences between the two approaches, as that would
allow us to see the same failures side-by-side.

Am I correct to understand that one of the common goals here is
to eliminate the need to discover how to get to the first failure
output without making the output 10x slower to load?

> I think it's clear that we're going to disagree on this point, but I'd
> still think that:
>
>  * In a re-roll, you should amend these patches to clearly note that's a
>    UX trade-off you're making, perhaps with rough before/after timings
>    similar to the ones I've posted.
>
>    I.e. now those patches say nothing about the UX change resulting in
>    UX that's *much* slower than before. Clearly noting that trade-off
>    for reviewers is not the same as saying the trade-off can't be made.

Whether we perform counter-proposal comparison or not, the above is
a reasonable thing to ask.

>  * I don't see why the changes here can't be made configurable (and
>    perhaps you'd argue they should be on by default) via the ci-config
>    phase.

I do not know if such a knob is feasible, though.

Thanks.
Ævar Arnfjörð Bjarmason March 9, 2022, 10:56 a.m. UTC | #6
On Mon, Mar 07 2022, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>> I think that's a rather strange conclusion given that I've submitted a
>> parallel series that makes some of those failures easier to diagnose
>> than the same changes in this series. I.e. the failures in build
>> v.s. test phases, not the individual test format output (but those are
>> orthagonal).
>
> If you have a counter-proposal that you feel is solid enough, I do
> not mind dropping the topic in question and replacing it with the
> counter-proposal to let people see how it fares for a few days.  It
> would allow others to view the output easily if you reverted the
> merge of this topic into 'seen', replaced it with the counter-proposal,
> and pushed it to your own repository; that would be an even better way
> to highlight the differences between the two approaches, as that would
> allow us to see the same failures side-by-side.

I proposed [1] as a counter-proposal to *part of* this series, which has
various UX comparisons.

Note that it doesn't aim to change the test failure output
significantly, but to structurally simplify .github/workflows/main.yml
and ci/ for a significant line-count reduction, and improve part of the
output that we get for "make" v.s. "make test", and to make it obvious
what the parameters of each step are.

Since that is partially overlapping with 1/2 of Johannes's series here, I
think it makes sense to split up those two concerns & address this more
incrementally.

I.e. my series shows that you can get what the first half of this series
proposes to do by adding GitHub-specific output to ci/* without any such
CI-backend-specific output, and the resulting presentation in the UX is
better.

It'll still apply to "master" with this topic ejected, there was a minor
comment (needing commit message rephrasing) from SZEDER Gábor on it, I
could re-roll it if you're interested.

> Am I correct to understand that one of the common goals here is
> to eliminate the need to discover how to get to the first failure
> output without making the output 10x slower to load?

That's definitely an eventual common goal, I have a POC for how to do
that with an alternate approach that doesn't suffer from that slowdown,
and shows you much more targeted failure output (only the specific tests
that failed) at [2].

I just think it makes sense to split up the concerns of how we arrange
.github/workflows/main.yml & ci/* and how doing that differently can
improve the CI UX, v.s. the mostly orthogonal concern of how test-lib.sh
+ prove can best summarize their failure output.

>> I think it's clear that we're going to disagree on this point, but I'd
>> still think that:
>>
>>  * In a re-roll, you should amend these patches to clearly note that's a
>>    UX trade-off you're making, perhaps with rough before/after timings
>>    similar to the ones I've posted.
>>
>>    I.e. now those patches say nothing about the UX change resulting in
>>    UX that's *much* slower than before. Clearly noting that trade-off
>>    for reviewers is not the same as saying the trade-off can't be made.
>
> Whether we perform counter-proposal comparison or not, the above is
> a reasonable thing to ask.
>
>>  * I don't see why the changes here can't be made configurable (and
>>    perhaps you'd argue they should be on by default) via the ci-config
>>    phase.
>
> I do not know if such a knob is feasible, though.

It would be rather trivial. Basically a matter of adding an "if:
needs.ci-config.outputs.style == 'basic'" condition to the step that runs
ci/print-test-failures.sh, and a corresponding flag passed down to
ci/lib.sh to have it invoke test-lib.sh with --github-workflow-markup or
not.

I.e. this series detects it's running in GitHub CI and optionally tells
test-lib.sh to emit different output, so to get the old output you'd
just need to not opt-in to that.
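
As a rough sketch (the CI_OUTPUT_STYLE variable and its plumbing from the
ci-config job's outputs are made up for this example), ci/lib.sh could
then do something like:

	# Assume the workflow exported the ci-config job's "style" output
	# as CI_OUTPUT_STYLE via its "env:" section.
	if test "$CI_OUTPUT_STYLE" != "basic"
	then
		# opt into the new, marked-up output by default
		GIT_TEST_OPTS="$GIT_TEST_OPTS --github-workflow-markup"
	fi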

I think we can have our cake and eat it too though, so I don't think
there's any point in such a knob per se. The only reason I'm suggesting
it is because Johannes doesn't otherwise seem to want to address the
significant CI UX slowdowns in this series.

I do think the approach taken by this series is inherently limited in
solving that problem though, in a way that the approach outlined in [2]
isn't.

I.e. the problem is that we're throwing say ~30k-90k lines of raw CI
output at some GitHub JavaScript to format and present. Your browser
needs to download all the output, and then the browser needs to spin at
100% CPU to present it to you.

The reason for that is that anything that tweaks the test-lib.sh output
is something you need to consume *in its entirety*. I.e. when you have a
sequence like:

    ok 1 test one
    ok 2 test two
    not ok 3 test three
    1..3

You don't know until the third test that you've had a failure. Short of
teaching test-lib.sh even more complexity (e.g. pre-buffering its
"passing" output), a UX layer needs to present all of it, and won't
benefit from a parsed representation of it.
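
To illustrate the kind of pre-buffering that would imply, here's a minimal
sketch (purely illustrative, not code from test-lib.sh) of a filter over
"-v" style TAP output that holds back each test's lines and only emits
them once a "not ok" line reveals a failure:

	LF=$(printf '\nX') && LF=${LF%X}
	buf=
	while IFS= read -r line
	do
		buf="$buf$line$LF"
		case "$line" in
		"ok "*) buf= ;;     # passing test: discard its buffered lines
		"not ok "*) printf '%s' "$buf"; buf= ;;  # failing test: flush
		esac
	done

Piping something like "./t0002-gitfile.sh -v" through such a filter would
print only the chunks belonging to the failing tests.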

Which is why I think adding other output formatters to test-lib.sh is a
dead end approach to this problem.

I mean, sure we could start parsing the new output emitted here, but
that doesn't make sense either.

We already run a full TAP parser over the output of the entire test
suite, which we can easily use to only print details about those tests
that failed[2]. We could also still have the full & raw output, but that
could be in some other tab (or "stage", just like
"ci/print-test-failures.sh" is now.

To be fair that isn't quite as trivial in terms of patch count. In
particular we have edge cases currently where the TAP output isn't valid
due to bugs in test-lib.sh and/or tests, and can't combine it with the
--verbose output. The demo at [2] is on top of some patches I've got
locally to fix all of that.

But I really think it's worth it to move a bit slower there & get it
right rather than heading down the wrong direction of having another
not-quite-machine-readable output target.

I.e. once it's machine readable & parsed we can present it however we
like, and can do *much better* output such as correlating trace output
to the test source, and e.g. showing what specific line we failed
on. Right now all of that needs to be eyeballed & inferred from the
"--verbose -x" output (which is often non-obvious & a pain, especially
when the "test_expect_success" block calls helper functions).

1. https://lore.kernel.org/git/cover-00.25-00000000000-20220221T143936Z-avarab@gmail.com/
2. https://lore.kernel.org/git/220302.86mti87cj2.gmgdl@evledraar.gmail.com/
Johannes Schindelin March 9, 2022, 1:20 p.m. UTC | #7
Hi Ævar,

On Mon, 7 Mar 2022, Ævar Arnfjörð Bjarmason wrote:

> On Mon, Mar 07 2022, Johannes Schindelin wrote:
>
> > On Wed, 2 Mar 2022, Ævar Arnfjörð Bjarmason wrote:
> >
> >>
> >> On Tue, Mar 01 2022, Johannes Schindelin via GitGitGadget wrote:
> >>
> >> > Changes since v1:
> >> >
> >> >  * In the patch that removed MAKE_TARGETS, a stale comment about that
> >> >    variable is also removed.
> >> >  * The comment about set -x has been adjusted because it no longer applies
> >> >    as-is.
> >> >  * The commit message of "ci: make it easier to find failed tests' logs in
> >> >    the GitHub workflow" has been adjusted to motivate the improvement
> >> >    better.
> >>
> >> Just briefly: Much of the feedback I had on v1 is still unanswered,
> >
> > Yes, because I got the sense that your feedback ignores the goal of making
> > it easier to diagnose test failures.
>
> I think that's a rather strange conclusion given that I've submitted a
> parallel series that makes some of those failures easier to diagnose
> than the same changes in this series. I.e. the failures in build
> v.s. test phases, not the individual test format output (but those are
> orthogonal).

I do not know what your parallel series implements, as I did not have the
time to read it yet (and it contains about double the number of patches of
my series, hopefully not intended to make it impossible for me to spare
the time to even take a glimpse at it).

Also: I thought we had the rule of trying to be mindful of other
contributors and avoid interfering with patch series that are in flight?
It could be viewed as unnecessarily adversarial.

> I think it's a fair summary of our differences that we're just placing
> different values on UX responsiveness. I'm pretty sure there's some
> amount of UX slowdown you'd consider unacceptable, no matter how much
> the output was improved.
>
> Clearly we just use it differently.

I would gladly trade my convenience in return for making it easier for
others to diagnose why their PR builds fail.

At the moment, the way our CI/PR builds present test failures very likely
makes every new contributor feel stupid for not knowing how to proceed.
But they are not stupid. The failure is not theirs. The fault lies
squarely with us, for making it so frigging hard.

> >> or in the case of the performance issues I think just saying that this
> >> output is aimed at non-long-time-contributors doesn't really justify the
> >> large observed slowdowns.
> >
> > What good is a quickly-loading web site when it is less than useful?
>
> For all the flaws in the current output there are cases now where you
> can click on a failure, see a summary of the 1-2 tests that failed, and
> even find your way through the right place in the rather verbose raw log
> output in 1/4 or 1/2 the time it takes the initial page on the new UX to
> load.

I wonder where the hard data is that backs up these numbers.

I do not have hard data, either, except for one: apart from you and Junio,
I have yet to talk to any contributor who said "oh yeah, I found the logs
right away" rather than things like "when I finally figured out that the
logs were in `print-test-failures`, I had a chance to make sense of the
failures" or even "zOMG _that_ is where the logs are?". And let me tell
you that I heard this from a lot of people. Way more than a mere two.
Far, far more.

> > I'd much rather have a slow-loading web site that gets me to where I need
> > to be than a fast-loading one that leaves me as puzzled as before.
>
> I think it's clear that we're going to disagree on this point, but I'd
> still think that:
>
>  * In a re-roll, you should amend these patches to clearly note that's a
>    UX trade-off you're making, perhaps with rough before/after timings
>    similar to the ones I've posted.
>
>    I.e. now those patches say nothing about the UX change resulting in
>    UX that's *much* slower than before. Clearly noting that trade-off
>    for reviewers is not the same as saying the trade-off can't be made.

Sure, I can do that.

>  * I don't see why the changes here can't be made configurable (and
>    perhaps you'd argue they should be on by default) via the ci-config
>    phase.

I do see why. If my goal is to unhide the logs by default, so that new
contributors can find them more easily, I will not hide that new behavior
behind a hard-to-find config option, an option that new contributors are
even less likely to find. That would be highly counterproductive. If your
goal is to help new contributors, I am certain that you will agree.

Ciao,
Johannes
Junio C Hamano March 9, 2022, 7:39 p.m. UTC | #8
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> I do not have hard data, either, except for one: apart from you and Junio,
> I have yet to talk to any contributor who said "oh yeah, I found the logs
> right away" rather than things like "when I finally figured out that the
> logs were in `print-test-failures`, I had a chance to make sense of the
> failures" or even "zOMG _that_ is where the logs are?". And let me tell
> you that I heard this from a lot of people. Way more than a mere two.
> Far, far more.

Stop counting me there.  I didn't find the logs right away, and I
already said that it is a good idea to eliminate the need to open
anything other than the step that opens by default.
Ævar Arnfjörð Bjarmason March 9, 2022, 7:47 p.m. UTC | #9
On Wed, Mar 09 2022, Johannes Schindelin wrote:

> Hi Ævar,
>
> On Mon, 7 Mar 2022, Ævar Arnfjörð Bjarmason wrote:
>
>> On Mon, Mar 07 2022, Johannes Schindelin wrote:
>>
>> > On Wed, 2 Mar 2022, Ævar Arnfjörð Bjarmason wrote:
>> >
>> >>
>> >> On Tue, Mar 01 2022, Johannes Schindelin via GitGitGadget wrote:
>> >>
>> >> > Changes since v1:
>> >> >
>> >> >  * In the patch that removed MAKE_TARGETS, a stale comment about that
>> >> >    variable is also removed.
>> >> >  * The comment about set -x has been adjusted because it no longer applies
>> >> >    as-is.
>> >> >  * The commit message of "ci: make it easier to find failed tests' logs in
>> >> >    the GitHub workflow" has been adjusted to motivate the improvement
>> >> >    better.
>> >>
>> >> Just briefly: Much of the feedback I had on v1 is still unanswered,
>> >
>> > Yes, because I got the sense that your feedback ignores the goal of making
>> > it easier to diagnose test failures.
>>
>> I think that's a rather strange conclusion given that I've submitted a
>> parallel series that makes some of those failures easier to diagnose
>> than the same changes in this series. I.e. the failures in build
>> v.s. test phases, not the individual test format output (but those are
>> orthogonal).
>
> I do not know what your parallel series implements, as I did not have the
> time to read it yet (and it contains about double the number of patches of
> my series, hopefully not intended to make it impossible for me to spare
> the time to even taking a glimpse at it).

No, I'm not arranging patches in such a way as to make them harder for
you to review specifically. I thought those changes made sense as a
logical progression.

> Also: I thought we had the rule of trying to be mindful of other
> contributors and avoid interfering with patch series that are in flight?
> It could be viewed as unnecessarily adversarial.

You don't need to look at the whole thing, but in
https://lore.kernel.org/git/cover-00.25-00000000000-20220221T143936Z-avarab@gmail.com/
scroll down to "The following compares CI output" and compare:

  master: https://github.com/avar/git/runs/5274251909?check_suite_focus=true
  this: https://github.com/avar/git/runs/5274464670?check_suite_focus=true
  js: https://github.com/avar/git/runs/5272239403?check_suite_focus=true

I.e. for the build v.s. test "grouping" you're doing early in your
series, we can get the same result with a significantly negative instead
of positive diffstat to .github & ci/, and it frees up the "nested
groups" that you note as a limitation in your 4/9 with another potential
group level (your 4/9:
https://lore.kernel.org/git/9333ba781b8240f704e739b00d274f8c3d887e39.1643050574.git.gitgitgadget@gmail.com/)

So it's not meant to be adversarial, but to help out this topic at large
by showing that a constraint you ran up against is something we don't
need to be limited by, and it makes that part of the CI output better.

I think posting working code to demonstrate that we can indeed do that
is the most productive thing to do, talk being cheap & all that.

So yes, I agree that in general it's better to avoid conflicting topics
etc., but in a case where a topic proposes to add a significant amount
of code & complexity to get to some desired end-state, it makes sense to
demonstrate with a patch or patches that we can get to the same
end-state in some simpler way.

>> I think it's a fair summary of our differences that we're just placing
>> different values on UX responsiveness. I'm pretty sure there's some
>> amount of UX slowdown you'd consider unacceptable, no matter how much
>> the output was improved.
>>
>> Clearly we just use it differently.
>
> I would gladly trade my convenience in return for making it easier for
> others to diagnose why their PR builds fail.
>
> At the moment, the way our CI/PR builds present test failures very likely
> makes every new contributor feel stupid for not knowing how to proceed.
> But they are not stupid. The failure is not theirs. The fault lies
> squarely with us, for making it so frigging hard.

I agree with you that it's relatively bad & could be improved. I don't
have much issue with the end result we're left with once the page
actually loads at the end of this series, just the practicalities of how
slow the resulting UX is.

>> >> or in the case of the performance issues I think just saying that this
>> >> output is aimed at non-long-time-contributors doesn't really justify the
>> >> large observed slowdowns.
>> >
>> > What good is a quickly-loading web site when it is less than useful?
>>
>> For all the flaws in the current output there are cases now where you
>> can click on a failure, see a summary of the 1-2 tests that failed, and
>> even find your way through the right place in the rather verbose raw log
>> output in 1/4 or 1/2 the time it takes the initial page on the new UX to
>> load.
>
> I wonder where the hard data is that backs up these numbers.

I've posted some in several replies to this series,
e.g. https://lore.kernel.org/git/220222.86tucr6kz5.gmgdl@evledraar.gmail.com/;
Have you tried to reproduce some of those?

I.e. the "hard data" is that usually takes me 10-20 seconds to go from a
CI link to the summary output & opening the "raw dump" view now, and the
same page is taking about a minute to just load with the new UX.

> [...]
>> > I'd much rather have a slow-loading web site that gets me to where I need
>> > to be than a fast-loading one that leaves me as puzzled as before.
>>
>> I think it's clear that we're going to disagree on this point, but I'd
>> still think that:
>>
>>  * In a re-roll, you should amend these patches to clearly note that's a
>>    UX trade-off you're making, perhaps with rough before/after timings
>>    similar to the ones I've posted.
>>
>>    I.e. now those patches say nothing about the UX change resulting in
>>    UX that's *much* slower than before. Clearly noting that trade-off
>>    for reviewers is not the same as saying the trade-off can't be made.
>
> Sure, I can do that.
>
>>  * I don't see why the changes here can't be made configurable (and
>>    perhaps you'd argue they should be on by default) via the ci-config
>>    phase.
>
> I do see why. If my goal is to unhide the logs by default, so that new
> contributors can find them more easily, I will not hide that new behavior
> behind a hard-to-find config option, an option that new contributors are
> even less likely to find. That would be highly counterproductive. If your
> goal is to help new contributors, I am certain that you will agree.

I meant that they could be on by default, but that this would relatively
easily give us an out to A/B test the performance of the new fancy
rendering v.s. the dumb raw dump we have now.

If that's done via CI config it's a rather trivial matter of
e.g. re-pushing "master" somewhere, whereas if it needs a patch/revert
on top...
Victoria Dye March 25, 2022, 12:48 a.m. UTC | #10
Johannes Schindelin via GitGitGadget wrote:
> Changes since v1:
> 
>  * In the patch that removed MAKE_TARGETS, a stale comment about that
>    variable is also removed.
>  * The comment about set -x has been adjusted because it no longer applies
>    as-is.
>  * The commit message of "ci: make it easier to find failed tests' logs in
>    the GitHub workflow" has been adjusted to motivate the improvement
>    better.
> 
> 
> [...]
>
> I know, and I do not expect any new contributor, not even most seasoned
> contributors to know, that I have to patiently collapse the "Run
> ci/run-build-and-tests.sh" job's log, and instead expand the "Run
> ci/print-test-failures.sh" job log (which did not fail and hence does not
> draw any attention to it).
> 

You can count me as one of the ones who, until this series, had no idea
this information existed in the CI logs. And it definitely would have been
helpful in figuring out some of the OS-specific bugs I've run into in the
past. :)

> [...]
> and offers a basis for future improvements.
> 

I think these are really valuable improvements over our current state, but I
also understand the concerns about performance elsewhere in this thread
(it's really slow to load for me as well, and scrolling/expanding the log
groups can be a bit glitchy in my browser). That said, I think there are a
couple ways you could improve the load time without sacrificing the (very
helpful) improvements you've made to error log visibility. I experimented a
bit (example result [1]) and came up with some things that may help:

* group errors by test file, rather than by test case (to reduce
  parsing/rendering time for lots of groups).
* print the verbose logs only for the failed test cases (to massively cut
  down on the size of the log, particularly when there's only a couple
  failures in a test file with a lot of passing tests).
* skip printing the full text of the test in 'finalize_test_case_output'
  when creating the group, i.e., use '$1' instead of '$*' (in both passing
  and failing tests, this information is already printed via some other
  means).

If you wanted to make sure a user could still access the full failure logs
(i.e., including the "ok" test results), you could print a link to the
artifacts page as well - that way, all of the information we currently
provide to users can still be found somewhere.

[1] https://github.com/vdye/git/runs/5666973267
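
To make the first two of those suggestions concrete, here is a minimal,
hypothetical sketch (not code from either series) of how a CI script could
emit that kind of markup using GitHub's documented workflow commands
('::group::', '::endgroup::', '::error::'); the helper name and the log
location are illustrative assumptions:

    # Emit one collapsed group per failed test script, showing only the
    # failed test cases rather than every "ok" line.
    emit_failed_script_group () {
        script=$1                          # e.g. "t9800-git-p4-basic"
        out=t/test-results/$script.out     # assumed location of the raw log
        echo "::error::failed: $script"    # surfaces on the summary page
        echo "::group::Failed test: $script"
        # print just the failure markers; the full trace (including the
        # "ok" results) would stay available via the uploaded artifacts
        grep '^not ok ' "$out"
        echo "::endgroup::"
    }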

> Johannes Schindelin (9):
>   ci: fix code style
>   ci/run-build-and-tests: take a more high-level view
>   ci: make it easier to find failed tests' logs in the GitHub workflow
>   ci/run-build-and-tests: add some structure to the GitHub workflow
>     output
>   tests: refactor --write-junit-xml code
>   test(junit): avoid line feeds in XML attributes
>   ci: optionally mark up output in the GitHub workflow
>   ci: use `--github-workflow-markup` in the GitHub workflow
>   ci: call `finalize_test_case_output` a little later
> 

The organization of these commits makes the series a bit confusing to read,
mainly due to the JUnit changes in the middle. Patches 5-6 don't appear to
be dependent on patches 1-4, so they could be moved to the beginning (after
patch 1). With that change, I think this series would flow more smoothly:
"Cleanup/non-functional change" -> "JUnit XML improvements" -> "Log UX
improvements".

>  .github/workflows/main.yml           |  12 ---
>  ci/lib.sh                            |  82 +++++++++++++++--
>  ci/run-build-and-tests.sh            |  14 +--
>  ci/run-test-slice.sh                 |   5 +-
>  t/test-lib-functions.sh              |   4 +-
>  t/test-lib-github-workflow-markup.sh |  50 ++++++++++
>  t/test-lib-junit.sh                  | 132 +++++++++++++++++++++++++++
>  t/test-lib.sh                        | 128 ++++----------------------
>  8 files changed, 288 insertions(+), 139 deletions(-)
>  create mode 100644 t/test-lib-github-workflow-markup.sh
>  create mode 100644 t/test-lib-junit.sh
> 
> 
> base-commit: af4e5f569bc89f356eb34a9373d7f82aca6faa8a
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1117%2Fdscho%2Fuse-grouping-in-ci-v2
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1117/dscho/use-grouping-in-ci-v2
> Pull-Request: https://github.com/gitgitgadget/git/pull/1117
> 
> Range-diff vs v1:
> 
>   1:  db08b07c37a =  1:  db08b07c37a ci: fix code style
>   2:  d2ff51bb5da !  2:  42ff3e170bf ci/run-build-and-tests: take a more high-level view
>      @@ ci/run-build-and-tests.sh: pedantic)
>        	;;
>        esac
>        
>      - # Any new "test" targets should not go after this "make", but should
>      - # adjust $MAKE_TARGETS. Otherwise compilation-only targets above will
>      - # start running tests.
>      +-# Any new "test" targets should not go after this "make", but should
>      +-# adjust $MAKE_TARGETS. Otherwise compilation-only targets above will
>      +-# start running tests.
>       -make $MAKE_TARGETS
>       +make
>       +if test -n "$run_tests"
>   3:  98891b0d3f7 !  3:  bbbe1623257 ci: make it easier to find failed tests' logs in the GitHub workflow
>      @@ Metadata
>        ## Commit message ##
>           ci: make it easier to find failed tests' logs in the GitHub workflow
>       
>      +    When investigating a test failure, the time that matters most is the
>      +    time it takes from getting aware of the failure to displaying the output
>      +    of the failing test case.
>      +
>           You currently have to know a lot of implementation details when
>           investigating test failures in the CI runs. The first step is easy: the
>           failed job is marked quite clearly, but when opening it, the failed step
>      @@ Commit message
>           The actually interesting part is in the detailed log of said failed
>           test script. But that log is shown in the CI run's step that runs
>           `ci/print-test-failures.sh`. And that step is _not_ expanded in the web
>      -    UI by default.
>      +    UI by default. It is even marked as "successful", which makes it very
>      +    easy to miss that there is useful information hidden in there.
>       
>           Let's help the reader by showing the failed tests' detailed logs in the
>           step that is expanded automatically, i.e. directly after the test suite
>   4:  9333ba781b8 !  4:  f72254a9ac6 ci/run-build-and-tests: add some structure to the GitHub workflow output
>      @@ ci/lib.sh
>       +
>       +# Set 'exit on error' for all CI scripts to let the caller know that
>       +# something went wrong.
>      -+# Set tracing executed commands, primarily setting environment variables
>      -+# and installing dependencies.
>      ++#
>      ++# We already enabled tracing executed commands earlier. This helps by showing
>      ++# how environment variables are set and dependencies are installed.
>       +set -e
>       +
>        skip_branch_tip_with_tag () {
>      @@ ci/lib.sh: linux-leaks)
>       +set -x
>       
>        ## ci/run-build-and-tests.sh ##
>      -@@ ci/run-build-and-tests.sh: esac
>      - # Any new "test" targets should not go after this "make", but should
>      - # adjust $MAKE_TARGETS. Otherwise compilation-only targets above will
>      - # start running tests.
>      +@@ ci/run-build-and-tests.sh: pedantic)
>      + 	;;
>      + esac
>      + 
>       -make
>       +group Build make
>        if test -n "$run_tests"
>   5:  94dcbe1bc43 =  5:  9eda6574313 tests: refactor --write-junit-xml code
>   6:  41230100091 =  6:  c8b240af749 test(junit): avoid line feeds in XML attributes
>   7:  98b32630fcd =  7:  15f199e810e ci: optionally mark up output in the GitHub workflow
>   8:  1a6bd1846bc =  8:  91ea54f36c5 ci: use `--github-workflow-markup` in the GitHub workflow
>   9:  992b1575889 =  9:  be2a83f5da3 ci: call `finalize_test_case_output` a little later
>
Ævar Arnfjörð Bjarmason March 25, 2022, 9:02 a.m. UTC | #11
On Thu, Mar 24 2022, Victoria Dye wrote:

> Johannes Schindelin via GitGitGadget wrote:
> [...]
>> Is this the best UI we can have for test failures in CI runs? I hope we can
>> do better. Having said that, this patch series presents a pretty good start,
>> and offers a basis for future improvements.
>> 
>
> I think these are really valuable improvements over our current state, but I
> also understand the concerns about performance elsewhere in this thread
> (it's really slow to load for me as well, and scrolling/expanding the log
> groups can be a bit glitchy in my browser). That said, I think there are a
> couple ways you could improve the load time without sacrificing the (very
> helpful) improvements you've made to error log visibility. I experimented a
> bit (example result [1]) and came up with some things that may help:
>
> * group errors by test file, rather than by test case (to reduce
>   parsing/rendering time for lots of groups).
> * print the verbose logs only for the failed test cases (to massively cut
>   down on the size of the log, particularly when there's only a couple
>   failures in a test file with a lot of passing tests).
> * skip printing the full text of the test in 'finalize_test_case_output'
>   when creating the group, i.e., use '$1' instead of '$*' (in both passing
>   and failing tests, this information is already printed via some other
>   means).
>
> If you wanted to make sure a user could still access the full failure logs
> (i.e., including the "ok" test results), you could print a link to the
> artifacts page as well - that way, all of the information we currently
> provide to users can still be found somewhere.
>
> [1] https://github.com/vdye/git/runs/5666973267

Thanks a lot for trying to address those concerns.

I took a look at this and it definitely performs better, although in
this case the overall output is ~3k lines.

I'd be curious to see how it performs on some of the cases discussed in
earlier threads with >~50k lines of output, although it looks like in this
case that would require failures to be really widespread in the test suite.

I just looked at this briefly, but looking at the branch I see you
removed the "checking known breakage of[...]" etc. from the non-GitHub
markup output; I didn't spot how that was related/needed.

>> Johannes Schindelin (9):
>>   ci: fix code style
>>   ci/run-build-and-tests: take a more high-level view
>>   ci: make it easier to find failed tests' logs in the GitHub workflow
>>   ci/run-build-and-tests: add some structure to the GitHub workflow
>>     output
>>   tests: refactor --write-junit-xml code
>>   test(junit): avoid line feeds in XML attributes
>>   ci: optionally mark up output in the GitHub workflow
>>   ci: use `--github-workflow-markup` in the GitHub workflow
>>   ci: call `finalize_test_case_output` a little later
>> 
>
> The organization of these commits makes the series a bit confusing to read,
> mainly due to the JUnit changes in the middle. Patches 5-6 don't appear to
> be dependent on patches 1-4, so they could be moved to the beginning (after
> patch 1). With that change, I think this series would flow more smoothly:
> "Cleanup/non-functional change" -> "JUnit XML improvements" -> "Log UX
> improvements".

Have you had a chance to look at my suggested alternate approach to the
early part of this series?:
https://lore.kernel.org/git/cover-00.25-00000000000-20220221T143936Z-avarab@gmail.com/

I.e., instead of building up ci/lib.sh to know how to group the "build"
etc. within the "run-build-and-test" step, just pull those to the
top level by running separate build & test steps.
Victoria Dye March 25, 2022, 6:38 p.m. UTC | #12
Ævar Arnfjörð Bjarmason wrote:
> 
> On Thu, Mar 24 2022, Victoria Dye wrote:
> 
>> Johannes Schindelin via GitGitGadget wrote:
>> [...]
>>> Is this the best UI we can have for test failures in CI runs? I hope we can
>>> do better. Having said that, this patch series presents a pretty good start,
>>> and offers a basis for future improvements.
>>>
>>
>> I think these are really valuable improvements over our current state, but I
>> also understand the concerns about performance elsewhere in this thread
>> (it's really slow to load for me as well, and scrolling/expanding the log
>> groups can be a bit glitchy in my browser). That said, I think there are a
>> couple ways you could improve the load time without sacrificing the (very
>> helpful) improvements you've made to error log visibility. I experimented a
>> bit (example result [1]) and came up with some things that may help:
>>
>> * group errors by test file, rather than by test case (to reduce
>>   parsing/rendering time for lots of groups).
>> * print the verbose logs only for the failed test cases (to massively cut
>>   down on the size of the log, particularly when there's only a couple
>>   failures in a test file with a lot of passing tests).
>> * skip printing the full text of the test in 'finalize_test_case_output'
>>   when creating the group, i.e., use '$1' instead of '$*' (in both passing
>>   and failing tests, this information is already printed via some other
>>   means).
>>
>> If you wanted to make sure a user could still access the full failure logs
>> (i.e., including the "ok" test results), you could print a link to the
>> artifacts page as well - that way, all of the information we currently
>> provide to users can still be found somewhere.
>>
>> [1] https://github.com/vdye/git/runs/5666973267
> 
> Thanks a lot for trying to address those concerns.
> 
> I took a look at this and it definitely performs better, although in
> this case the overall output is ~3k lines.
> 
> I'd be curious to see how it performs on some of the cases discussed in
> earlier threads of >~50k lines, although it looks like in this case that
> would require failures to be really widespread in the test suite.
> 

Unfortunately, I don't have a direct comparison to that (the longest I found
elsewhere in the thread was ~33k lines [1], but those failures came from
strange interactions on the 'shears/seen' branch of Git for Windows that I
couldn't easily replicate). If it helps, though, here's a 1:1 comparison of
my "experiment" branch's forced test failures with and without the
optimizations I tried (without optimization, the total log is ~28k lines):

without optimization: https://github.com/vdye/git/runs/5696305589
with optimization: https://github.com/vdye/git/runs/5666973267

So it's definitely faster - it still takes a couple seconds to load, but not
so long that my browser struggles with it (which was my main issue with the
original approach).

[1] https://github.com/dscho/git/runs/4840190622

> I just looked at this briefly, but looking at the branch I see you
> removed the "checking known breakage of[...]" etc. from the non-GitHub
> markup output; I didn't spot how that was related/needed.
> 

It was mostly just another attempt to cut down on extraneous output (since,
if a test fails, the test definition is printed after the failure, we would
end up with the same information twice).

That said, if that were to be incorporated here, it'd need to be smarter
than what I tried - my change removed it entirely from the '.out' logs,
which means that any test that *does* pass wouldn't have its test
definition logged anywhere (I think). The ideal situation would be for the
extraneous test definition to be removed only from the '.markup' files, but
that's probably a change better saved for a future patch/series.

>>> Johannes Schindelin (9):
>>>   ci: fix code style
>>>   ci/run-build-and-tests: take a more high-level view
>>>   ci: make it easier to find failed tests' logs in the GitHub workflow
>>>   ci/run-build-and-tests: add some structure to the GitHub workflow
>>>     output
>>>   tests: refactor --write-junit-xml code
>>>   test(junit): avoid line feeds in XML attributes
>>>   ci: optionally mark up output in the GitHub workflow
>>>   ci: use `--github-workflow-markup` in the GitHub workflow
>>>   ci: call `finalize_test_case_output` a little later
>>>
>>
>> The organization of these commits makes the series a bit confusing to read,
>> mainly due to the JUnit changes in the middle. Patches 5-6 don't appear to
>> be dependent on patches 1-4, so they could be moved to the beginning (after
>> patch 1). With that change, I think this series would flow more smoothly:
>> "Cleanup/non-functional change" -> "JUnit XML improvements" -> "Log UX
>> improvements".
> 
> Have you had a chance to look at my suggested alternate approach to the
> early part of this series?:
> https://lore.kernel.org/git/cover-00.25-00000000000-20220221T143936Z-avarab@gmail.com/
> 
> I.e., instead of building up ci/lib.sh to know how to group the "build"
> etc. within the "run-build-and-test" step, just pull those to the
> top level by running separate build & test steps.
> 

I looked at it a while ago, but I actually had a similar issue following
that series as I did this one; it's difficult to tell what's cleanup,
what's refactoring unrelated to this series, and what's an explicit
difference in approach.

Revisiting it now, I did the same thing I did with dscho's series: ran your
branch with some forced test failures and looked at the results [2]. Based
on that, there are a couple of helpful things I see in your series that
contribute to the same overarching goal as dscho's:

* Separating build & test into different steps.
    * This makes it more immediately obvious to a user whether the issue was
      a compiler error or a test failure. Since test failures can only even
      happen if the compilation passes, this doesn't create (another)
      situation where the relevant failure information is in a different
      step than the auto-expanded failing one.
* Separating 'lib.sh --build' and 'make' into different steps. 
    * I was initially unsure of the value of this (conceptually, wouldn't
      they both be part of "build"?), but I eventually understood it to be
      "setup the environment for [build|test]" followed by "run the
      [build|test]". Since the main thing dscho's series is addressing is
      information visibility, I like that this similarly "unburies" the
      environment configuration at the beginning of build/test.

Those changes are great (and they probably have some positive impact on load
times). But as far as I can tell, nothing else in your series directly
addresses the main problem dscho is fixing here, which is that the verbose
failure logs are effectively hidden from the user (unless you know exactly
where to look). As a result, it doesn't really fit as a "replacement" for
this one for me. Honestly, my ideal "final form" of all of this may be a
combination of both series, having the CI steps:

- setup build environment
- run build (make)
- setup test environment
- run test (make test) & print failure logs

You can still pull the 'make' executions out of 'run-build-and-test.sh', but
I think the "& print failure logs" part of the 'test' step (i.e., the added
'|| handle_failed_tests') is the critical piece that, although it would slow
things down to some extent (and, of course, it's subjective where the "too
slow" line is), it would relevant failure information a whole lot more
accessible. That's the real "value-add" of this series for me, if only
because I know it would have helped me a bunch of times in the past - I
absolutely believe it would similarly help new contributors in the future.
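
As a rough sketch of what that last step could look like (reusing the
existing ci/print-test-failures.sh; the exact wiring via the added
'|| handle_failed_tests' in the series may well differ):

    # Run the suite; on failure, print the failed tests' verbose logs in
    # this same step and keep the step marked as failed, so the step that
    # GitHub auto-expands is also the one containing the details.
    make test || {
        ci/print-test-failures.sh
        exit 1
    }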

[2] https://github.com/vdye/git/runs/5695895629
Johannes Schindelin May 21, 2022, 9:42 p.m. UTC | #13
Hi Victoria,

[sorry for the long delay, I was being pulled in too many directions,
but am incredibly grateful for your fantastic investigation.]

On Thu, 24 Mar 2022, Victoria Dye wrote:

> Johannes Schindelin via GitGitGadget wrote:
> > Changes since v1:
> >
> >  * In the patch that removed MAKE_TARGETS, a stale comment about that
> >    variable is also removed.
> >  * The comment about set -x has been adjusted because it no longer applies
> >    as-is.
> >  * The commit message of "ci: make it easier to find failed tests' logs in
> >    the GitHub workflow" has been adjusted to motivate the improvement
> >    better.
> >
> >
> > Background
> > ==========
> >
> > Recent patches intended to help readers figure out CI failures much quicker
> > than before. Unfortunately, they haven't been entirely positive for me. For
> > example, they broke the branch protections in Microsoft's fork of Git, where
> > we require Pull Requests to pass a certain set of Checks (which are
> > identified by their names) and therefore caused follow-up work.
> >
> > Using CI and in general making it easier for new contributors is an area I'm
> > passionate about, and one I'd like to see improved.
> >
> >
> > The current situation
> > =====================
> >
> > Let me walk you through the current experience when a PR build fails: I get
> > a notification mail that only says that a certain job failed. There's no
> > indication of which test failed (or was it the build?). I can click on a
> > link at it takes me to the workflow run. Once there, all it says is "Process
> > completed with exit code 1", or even "code 2". Sure, I can click on one of
> > the failed jobs. It even expands the failed step's log (collapsing the other
> > steps). And what do I see there?
> >
> > Let's look at an example of a failed linux-clang (ubuntu-latest) job
> > [https://github.com/git-for-windows/git/runs/4822802185?check_suite_focus=true]:
> >
> > [...]
> > Test Summary Report
> > -------------------
> > t1092-sparse-checkout-compatibility.sh           (Wstat: 256 Tests: 53 Failed: 1)
> >   Failed test:  49
> >   Non-zero exit status: 1
> > t3701-add-interactive.sh                         (Wstat: 0 Tests: 71 Failed: 0)
> >   TODO passed:   45, 47
> > Files=957, Tests=25489, 645 wallclock secs ( 5.74 usr  1.56 sys + 866.28 cusr 364.34 csys = 1237.92 CPU)
> > Result: FAIL
> > make[1]: *** [Makefile:53: prove] Error 1
> > make[1]: Leaving directory '/home/runner/work/git/git/t'
> > make: *** [Makefile:3018: test] Error 2
> >
> >
> > That's it. I count myself lucky not to be a new contributor being faced with
> > something like this.
> >
> > Now, since I am active in the Git project for a couple of days or so, I can
> > make sense of the "TODO passed" label and know that for the purpose of
> > fixing the build failures, I need to ignore this, and that I need to focus
> > on the "Failed test" part instead.
> >
> > I also know that I do not have to get myself an ubuntu-latest box just to
> > reproduce the error, I do not even have to check out the code and run it
> > just to learn what that "49" means.
> >
> > I know, and I do not expect any new contributor, not even most seasoned
> > contributors to know, that I have to patiently collapse the "Run
> > ci/run-build-and-tests.sh" job's log, and instead expand the "Run
> > ci/print-test-failures.sh" job log (which did not fail and hence does not
> > draw any attention to it).
> >
>
> You can count me as one of the ones that, until this series, had no idea
> this information existed in the CI logs. And it definitely would have been
> helpful in figuring out some of the OS-specific bugs I've run into in the
> past. :)

Excellent. I think I have enough accounts to consider it a universal truth
that our CI output _needs_ something like this here patch series.

> > I know, and again: I do not expect many others to know this, that I then
> > have to click into the "Search logs" box (not the regular web browser's
> > search via Ctrl+F!) and type in "not ok" to find the log of the failed test
> > case (and this might still be a "known broken" one that is marked via
> > test_expect_failure and once again needs to be ignored).
> >
> > To be excessively clear: This is not a great experience!
> >
> >
> > Improved output
> > ===============
> >
> > Our previous Azure Pipelines-based CI builds had a much nicer UI, one that
> > even showed flaky tests, and trends e.g. how long the test cases ran. When I
> > ported Git's CI over to GitHub workflows (to make CI more accessible to new
> > contributors), I knew fully well that we would leave this very nice UI
> > behind, and I had hoped that we would get something similar back via new,
> > community-contributed GitHub Actions that can be used in GitHub workflows.
> > However, most likely because we use a home-grown test framework implemented
> > in opinionated POSIX shell scripts, that did not happen.
> >
> > So I had a look at what standards exist e.g. when testing PowerShell
> > modules, in the way of marking up their test output in GitHub workflows, and
> > I was not disappointed: GitHub workflows support "grouping" of output lines,
> > i.e. marking sections of the output as a group that is then collapsed by
> > default and can be expanded. And it is this feature I decided to use in this
> > patch series, along with GitHub workflows' commands to display errors or
> > notices that are also shown on the summary page of the workflow run. Now, in
> > addition to "Process completed with exit code" on the summary page, we also
> > read something like:
> >
> > ⊗ linux-gcc (ubuntu-latest)
> >    failed: t9800.20 submit from detached head
> >
> >
> > Even better, this message is a link, and following that, the reader is
> > presented with something like this
> > [https://github.com/dscho/git/runs/4840190622?check_suite_focus=true]:
> >
> > ⏵ Run ci/run-build-and-tests.sh
> > ⏵ CI setup
> >   + ln -s /home/runner/none/.prove t/.prove
> >   + run_tests=t
> >   + export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
> >   + group Build make
> >   + set +x
> > ⏵ Build
> > ⏵ Run tests
> >   === Failed test: t9800-git-p4-basic ===
> > ⏵ ok: t9800.1 start p4d
> > ⏵ ok: t9800.2 add p4 files
> > ⏵ ok: t9800.3 basic git p4 clone
> > ⏵ ok: t9800.4 depot typo error
> > ⏵ ok: t9800.5 git p4 clone @all
> > ⏵ ok: t9800.6 git p4 sync uninitialized repo
> > ⏵ ok: t9800.7 git p4 sync new branch
> > ⏵ ok: t9800.8 clone two dirs
> > ⏵ ok: t9800.9 clone two dirs, @all
> > ⏵ ok: t9800.10 clone two dirs, @all, conflicting files
> > ⏵ ok: t9800.11 clone two dirs, each edited by submit, single git commit
> > ⏵ ok: t9800.12 clone using non-numeric revision ranges
> > ⏵ ok: t9800.13 clone with date range, excluding some changes
> > ⏵ ok: t9800.14 exit when p4 fails to produce marshaled output
> > ⏵ ok: t9800.15 exit gracefully for p4 server errors
> > ⏵ ok: t9800.16 clone --bare should make a bare repository
> > ⏵ ok: t9800.17 initial import time from top change time
> > ⏵ ok: t9800.18 unresolvable host in P4PORT should display error
> > ⏵ ok: t9800.19 run hook p4-pre-submit before submit
> >   Error: failed: t9800.20 submit from detached head
> > ⏵ failure: t9800.20 submit from detached head
> >   Error: failed: t9800.21 submit from worktree
> > ⏵ failure: t9800.21 submit from worktree
> >   === Failed test: t9801-git-p4-branch ===
> >   [...]
> >
> >
> > The "Failed test:" lines are colored in yellow to give a better visual clue
> > about the logs' structure, the "Error:" label is colored in red to draw the
> > attention to the important part of the log, and the "⏵" characters indicate
> > that part of the log is collapsed and can be expanded by clicking on it.
> >
> > To drill down, the reader merely needs to expand the (failed) test case's
> > log by clicking on it, and then study the log. If needed (e.g. when the test
> > case relies on side effects from previous test cases), the logs of preceding
> > test cases can be expanded as well. In this example, when expanding
> > t9800.20, it looks like this (for ease of reading, I cut a few chunks of
> > lines, indicated by "[...]"):
> >
> > [...]
> > ⏵ ok: t9800.19 run hook p4-pre-submit before submit
> >   Error: failed: t9800.20 submit from detached head
> > ⏷ failure: t9800.20 submit from detached head
> >       test_when_finished cleanup_git &&
> >       git p4 clone --dest="$git" //depot &&
> >         (
> >           cd "$git" &&
> >           git checkout p4/master &&
> >           >detached_head_test &&
> >           git add detached_head_test &&
> >           git commit -m "add detached_head" &&
> >           git config git-p4.skipSubmitEdit true &&
> >           git p4 submit &&
> >             git p4 rebase &&
> >             git log p4/master | grep detached_head
> >         )
> >     [...]
> >     Depot paths: //depot/
> >     Import destination: refs/remotes/p4/master
> >
> >     Importing revision 9 (100%)Perforce db files in '.' will be created if missing...
> >     Perforce db files in '.' will be created if missing...
> >
> >     Traceback (most recent call last):
> >       File "/home/runner/work/git/git/git-p4", line 4455, in <module>
> >         main()
> >       File "/home/runner/work/git/git/git-p4", line 4449, in main
> >         if not cmd.run(args):
> >       File "/home/runner/work/git/git/git-p4", line 2590, in run
> >         rebase.rebase()
> >       File "/home/runner/work/git/git/git-p4", line 4121, in rebase
> >         if len(read_pipe("git diff-index HEAD --")) > 0:
> >       File "/home/runner/work/git/git/git-p4", line 297, in read_pipe
> >         retcode, out, err = read_pipe_full(c, *k, **kw)
> >       File "/home/runner/work/git/git/git-p4", line 284, in read_pipe_full
> >         p = subprocess.Popen(
> >       File "/usr/lib/python3.8/subprocess.py", line 858, in __init__
> >         self._execute_child(args, executable, preexec_fn, close_fds,
> >       File "/usr/lib/python3.8/subprocess.py", line 1704, in _execute_child
> >         raise child_exception_type(errno_num, err_msg, err_filename)
> >     FileNotFoundError: [Errno 2] No such file or directory: 'git diff-index HEAD --'
> >     error: last command exited with $?=1
> >     + cleanup_git
> >     + retry_until_success rm -r /home/runner/work/git/git/t/trash directory.t9800-git-p4-basic/git
> >     + nr_tries_left=60
> >     + rm -r /home/runner/work/git/git/t/trash directory.t9800-git-p4-basic/git
> >     + test_path_is_missing /home/runner/work/git/git/t/trash directory.t9800-git-p4-basic/git
> >     + test 1 -ne 1
> >     + test -e /home/runner/work/git/git/t/trash directory.t9800-git-p4-basic/git
> >     + retry_until_success mkdir /home/runner/work/git/git/t/trash directory.t9800-git-p4-basic/git
> >     + nr_tries_left=60
> >     + mkdir /home/runner/work/git/git/t/trash directory.t9800-git-p4-basic/git
> >     + exit 1
> >     + eval_ret=1
> >     + :
> >     not ok 20 - submit from detached head
> >     #
> >     #        test_when_finished cleanup_git &&
> >     #        git p4 clone --dest="$git" //depot &&
> >     #        (
> >     #            cd "$git" &&
> >     #            git checkout p4/master &&
> >     #            >detached_head_test &&
> >     #            git add detached_head_test &&
> >     #            git commit -m "add detached_head" &&
> >     #            git config git-p4.skipSubmitEdit true &&
> >     #            git p4 submit &&
> >     #            git p4 rebase &&
> >     #            git log p4/master | grep detached_head
> >     #        )
> >     #
> >   Error: failed: t9800.21 submit from worktree
> >   [...]
> >
> >
> > Is this the best UI we can have for test failures in CI runs? I hope we can
> > do better. Having said that, this patch series presents a pretty good start,
> > and offers a basis for future improvements.
> >
>
> I think these are really valuable improvements over our current state, but I
> also understand the concerns about performance elsewhere in this thread
> (it's really slow to load for me as well, and scrolling/expanding the log
> groups can be a bit glitchy in my browser). That said, I think there are a
> couple ways you could improve the load time without sacrificing the (very
> helpful) improvements you've made to error log visibility. I experimented a
> bit (example result [1]) and came up with some things that may help:
>
> * group errors by test file, rather than by test case (to reduce
>   parsing/rendering time for lots of groups).

I really would like to avoid that, based on my past experience with
diagnosing test failures. It is definitely helpful if the structure lets
the reader expand individual test cases.

> * print the verbose logs only for the failed test cases (to massively cut
>   down on the size of the log, particularly when there's only a couple
>   failures in a test file with a lot of passing tests).

That's an amazingly simple trick to improve the speed by a ton, indeed.
Thank you for this splendid idea!

> * skip printing the full text of the test in 'finalize_test_case_output'
>   when creating the group, i.e., use '$1' instead of '$*' (in both passing
>   and failing tests, this information is already printed via some other
>   means).
>
> If you wanted to make sure a user could still access the full failure logs
> (i.e., including the "ok" test results), you could print a link to the
> artifacts page as well - that way, all of the information we currently
> provide to users can still be found somewhere.

That's a good point, I added that hint to the output (the link is
unfortunately not available at the time we print that advice).
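
For illustration, such a hint can be printed with GitHub's documented
'notice' workflow command; the wording here is made up, not necessarily
what the series emits:

    # The direct artifact URL is not known while this step is still
    # running, so point the reader at the run's Artifacts section instead.
    echo "::notice::Full test logs (including passing test cases) can be found in this run's Artifacts."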

>
> [1] https://github.com/vdye/git/runs/5666973267
>
> > Johannes Schindelin (9):
> >   ci: fix code style
> >   ci/run-build-and-tests: take a more high-level view
> >   ci: make it easier to find failed tests' logs in the GitHub workflow
> >   ci/run-build-and-tests: add some structure to the GitHub workflow
> >     output
> >   tests: refactor --write-junit-xml code
> >   test(junit): avoid line feeds in XML attributes
> >   ci: optionally mark up output in the GitHub workflow
> >   ci: use `--github-workflow-markup` in the GitHub workflow
> >   ci: call `finalize_test_case_output` a little later
> >
>
> The organization of these commits makes the series a bit confusing to read,
> mainly due to the JUnit changes in the middle. Patches 5-6 don't appear to
> be dependent on patches 1-4, so they could be moved to the beginning (after
> patch 1). With that change, I think this series would flow more smoothly:
> "Cleanup/non-functional change" -> "JUnit XML improvements" -> "Log UX
> improvements".

Great feedback! I changed the order as suggested.

Again, thank you so much for helping me improve the user experience of
Git's CI/PR builds.

Ciao,
Dscho
Junio C Hamano May 21, 2022, 11:05 p.m. UTC | #14
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> * print the verbose logs only for the failed test cases (to massively cut
>>   down on the size of the log, particularly when there's only a couple
>>   failures in a test file with a lot of passing tests).
>
> That's an amazingly simple trick to improve the speed by a ton, indeed.
> Thank you for this splendid idea!
>
>> * skip printing the full text of the test in 'finalize_test_case_output'
>>   when creating the group, i.e., use '$1' instead of '$*' (in both passing
>>   and failing tests, this information is already printed via some other
>>   means).
>>
>> If you wanted to make sure a user could still access the full failure logs
>> (i.e., including the "ok" test results), you could print a link to the
>> artifacts page as well - that way, all of the information we currently
>> provide to users can still be found somewhere.
>
> That's a good point, I added that hint to the output (the link is
> unfortunately not available at the time we print that advice).

https://github.com/git/git/runs/6539786128 shows that all in-flight
topics merged to 'seen', except for ds/bundle-uri-more, pass
the linux-leaks job.  The ds/bundle-uri-more topic introduces some
leaks to commands that happen to be used in tests that are marked as
leak-checker clean, making the job fail.

Which makes a great guinea pig for the CI output improvement topic.

So, I created two variants of 'seen' with this linux-leaks breakage.
One is with the js/ci-github-workflow-markup topic on this thread.
The other one is with the ab/ci-github-workflow-markup topic (which
uses a preliminary clean-up ab/ci-setup-simplify topic as its base).
They should show the identical test results and failures.

And here are their output:

 - https://github.com/git/git/runs/6539835065
 - https://github.com/git/git/runs/6539900608

If I recall correctly, the selling point of the ab/* variant over the
js/* variant was that it would give a quicker UI response; other than
that, each variant's UI is supposed to be as newbie-friendly as the
other's.

When I tried the former, it reacted so poorly to my attempt to
scroll (with a mouse scroll wheel, if it makes a difference) that
sometimes I was staring at a blank dark-gray space for a few seconds
waiting for it to be filled by something, which was a bit of a jarring
experience.  When I tried the latter, it didn't show anything to
help diagnose the details of the breakage in the "run make test" step,
and the user needed to know that "print test failures" needs to be
looked at, which I am not sure is an inherent limitation of the
approach.  After the single extra click, navigating the test output to
find the failed steps among many others that succeeded was not a very
pleasant experience.

Those who are interested in UX experiments may want to visit these
two outputs to see how usable each of them is for themselves.

Thanks.
Johannes Schindelin May 22, 2022, 6:48 p.m. UTC | #15
Hi Junio,

On Sat, 21 May 2022, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> >> * print the verbose logs only for the failed test cases (to massively cut
> >>   down on the size of the log, particularly when there's only a couple
> >>   failures in a test file with a lot of passing tests).
> >
> > That's an amazingly simple trick to improve the speed by a ton, indeed.
> > Thank you for this splendid idea!
> >
> >> * skip printing the full text of the test in 'finalize_test_case_output'
> >>   when creating the group, i.e., use '$1' instead of '$*' (in both passing
> >>   and failing tests, this information is already printed via some other
> >>   means).
> >>
> >> If you wanted to make sure a user could still access the full failure logs
> >> (i.e., including the "ok" test results), you could print a link to the
> >> artifacts page as well - that way, all of the information we currently
> >> provide to users can still be found somewhere.
> >
> > That's a good point, I added that hint to the output (the link is
> > unfortunately not available at the time we print that advice).
>
> https://github.com/git/git/runs/6539786128 shows that all in-flight
> topics merged to 'seen', except for the ds/bundle-uri-more, passes
> the linux-leaks job.  The ds/bundle-uri-more topic introduces some
> leaks to commands that happen to be used in tests that are marked as
> leak-checker clean, making the job fail.
>
> Which makes a great guinea pig for the CI output improvement topic.
>
> So, I created two variants of 'seen' with this linux-leaks breakage.
> One is with the js/ci-github-workflow-markup topic on this thread.
> The other one is with the ab/ci-github-workflow-markup topic (which
> uses a preliminary clean-up ab/ci-setup-simplify topic as its base).
> They should show the identical test results and failures.
>
> And here are their output:
>
>  - https://github.com/git/git/runs/6539835065

I see that this is still with the previous iteration, and therefore
exposes the same speed (or slowness) as was investigated so wonderfully by
Victoria.

So I really do not understand why you pointed to that run, given that it
still contains all the successful test cases' logs, which contributes in a
major way to said slowness.

Maybe you meant to refer to https://github.com/git/git/runs/6540394142
instead, which at least for me loads much faster _and_ makes the output as
helpful as I intended?

Ciao,
Dscho

>  - https://github.com/git/git/runs/6539900608
>
> If I recall correctly, the selling point of the ab/* variant over
> js/* variant was that it would give quicker UI response compared to
> the former, but other than that, both variants' UI are supposed to
> be as newbie friendly as the other.
>
> When I tried the former, it reacted too poorly to my attempt to
> scroll (with mouse scroll wheel, if it makes a difference) that
> sometimes I was staring a blank dark-gray space for a few seconds
> waiting for it to be filled by something, which was a bit jarring
> experience.  When I tried the latter, it didn't show anything to
> help diagnosing the details of the breakage in "run make test" step
> and the user needed to know "print test failures" needs to be looked
> at, which I am not sure is an inherent limitation of the approach.
> After the single extra click, navigating the test output to find the
> failed steps among many others that succeeded was not a very pleasant
> experience.
>
> Those who are interested in UX experiment may want to visit these
> two output to see how usable each of these is for themselves.
>
> Thanks.
>
>
>
>
>
Junio C Hamano May 22, 2022, 7:10 p.m. UTC | #16
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> I see that this is still with the previous iteration, and therefore
> exposes the same speed (or slowness) as was investigated so wonderfully by
> Victoria.
>
> So I really do not understand why you pointed to that run, given that it

Simply because your updated version came to my tree long after I
prepared two trees that are otherwise identical for comparison to
write the message you are responding to.  If the new round is much
improved over the previous one, that is very good news.

I do not appreciate that you have to always talk back to others in
such an aggressive tone, and I do not think it is only to me, by the
way.

You could have said the same thing in a much more cordial way,
e.g. "There is a newer version than those being compared---could you
look at this run instead for comparison, even though admittedly
there probably are changes in other topics in flight so the exact
failures may be different?"
Junio C Hamano May 22, 2022, 11:27 p.m. UTC | #17
Junio C Hamano <gitster@pobox.com> writes:

> Which makes a great guinea pig for the CI output improvement topic.
>
> So, I created two variants of 'seen' with this linux-leaks breakage.
> One is with the js/ci-github-workflow-markup topic on this thread.
> The other one is with the ab/ci-github-workflow-markup topic (which
> uses a preliminary clean-up ab/ci-setup-simplify topic as its base).
> They should show the identical test results and failures.

The two runs to look at have been updated.

 - The one with Ævar's change was missing the primary "workflow
   markup" topic (it only had preliminary clean-up topic), so it is
   not a fair feature-to-feature comparison to begin with.

 - The other one with Johannes's change was done with the version
   before the latest round from yesterday, which has improvements.

With all the other in-flight topics (including the one that shows
failures in the linux-leaks job) merged to the same base in the same
order, I prepared two variants of 'seen' that resulted in these
logs:

 - https://github.com/git/git/runs/6546816978
 - https://github.com/git/git/runs/6546750379

One is with both of the required topics from Ævar (with a fix-up [*]),
and the other is with the latest from Johannes's series.

I do not want to taint other folks' eyes with my observations, so I'd
send my impression in a separate message as a response to this
message after waiting for some time.

Thanks.

[Footnote]

* 76253615 (ci: optionally mark up output in the GitHub workflow,
  2022-04-21) added references to ci/print-test-failures.sh and
  ci/print-test-failures-github.sh to the workflow file, while the
  latter script does not exist, but it appears that these references
  want to run the same script, so I've made a stupid and obvious
  fix-up today before pushing the merged result out.

  This prevented "make test || ci/print-test-failures.sh" from
  running correctly ever since 76253615 (ci: optionally mark up
  output in the GitHub workflow, 2022-04-21) was queued, and it
  seems that nobody noticed nor complained.  Sigh.
Ævar Arnfjörð Bjarmason May 23, 2022, 9:05 a.m. UTC | #18
On Sat, May 21 2022, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
>>> * print the verbose logs only for the failed test cases (to massively cut
>>>   down on the size of the log, particularly when there's only a couple
>>>   failures in a test file with a lot of passing tests).
>>
>> That's an amazingly simple trick to improve the speed by a ton, indeed.
>> Thank you for this splendid idea!
>>
>>> * skip printing the full text of the test in 'finalize_test_case_output'
>>>   when creating the group, i.e., use '$1' instead of '$*' (in both passing
>>>   and failing tests, this information is already printed via some other
>>>   means).
>>>
>>> If you wanted to make sure a user could still access the full failure logs
>>> (i.e., including the "ok" test results), you could print a link to the
>>> artifacts page as well - that way, all of the information we currently
>>> provide to users can still be found somewhere.
>>
>> That's a good point, I added that hint to the output (the link is
>> unfortunately not available at the time we print that advice).
>
> https://github.com/git/git/runs/6539786128 shows that all in-flight
> topics merged to 'seen', except for ds/bundle-uri-more, pass
> the linux-leaks job.  The ds/bundle-uri-more topic introduces some
> leaks to commands that happen to be used in tests that are marked as
> leak-checker clean, making the job fail.
>
> Which makes a great guinea pig for the CI output improvement topic.
>
> So, I created two variants of 'seen' with this linux-leaks breakage.
> One is with the js/ci-github-workflow-markup topic on this thread.
> The other one is with the ab/ci-github-workflow-markup topic (which
> uses a preliminary clean-up ab/ci-setup-simplify topic as its base).
> They should show the identical test results and failures.
>
> And here are their output:
>
>  - https://github.com/git/git/runs/6539835065
>  - https://github.com/git/git/runs/6539900608
>
> If I recall correctly, the selling point of the ab/* variant over the
> js/* variant was that it would give a quicker UI response; other than
> that, each variant's UI is supposed to be as newbie-friendly as the
> other's.

...

> When I tried the former, it reacted so poorly to my attempt to
> scroll (with a mouse scroll wheel, if it makes a difference) that
> sometimes I was staring at a blank dark-gray space for a few seconds
> waiting for it to be filled by something, which was a bit of a jarring
> experience.  When I tried the latter, it didn't show anything to
> help diagnose the details of the breakage in the "run make test" step,
> and the user needed to know that "print test failures" needs to be
> looked at, which I am not sure is an inherent limitation of the
> approach.  After the single extra click, navigating the test output to
> find the failed steps among many others that succeeded was not a very
> pleasant experience.
>
> Those who are interested in UX experiments may want to visit these
> two outputs to see how usable each of them is for themselves.

Re selling point & feature comparison: The point of the ab/* variant was
to re-roll Johannes's series onto a "base" topic that made much of it
unnecessary, because the building up of features to emit GitHub markup
can be replaced by unrolling things like "make" and "make test" to the
top-level.

That has its own UX benefits, e.g. you can see at a glance what command
was run and what the environment was, and "make" and "make test" are now
split up from one monolithic "build and test" step.

But the primary intention was not to provide a prettier UX, but to show
that this arrangement made sense. I was hoping that Johannes would reply
with some variant of "ah, I see what you mean, that does make things
simpler!" and run with it, but alas...

So small bits in the UX like the extra click you pointed out are in
there. That one would be easy to solve: it happens because we "focus"
on the last step with a false exit code, so we'd just have to arrange
for the "print" step to be that step.

Anyway, the v3 CL of Johannes's series claims that the re-roll "improves
the time to load pages".

I ran both the Firefox and Chrome debugger with performance benchmarks
against:

    https://github.com/git/git/runs/6540394142

And:

    https://github.com/avar/git/runs/6551581584?check_suite_focus=true

The former is what Johannes noted as the correct v3 in
https://lore.kernel.org/git/nycvar.QRO.7.76.6.2205222045130.352@tvgsbejvaqbjf.bet/,
the latter is the current "seen" with ab/ci-github-workflow-markup
reverted, i.e. just my "base" changes.

In Chrome/Firefox the time to load the page (as in the spinner stops,
and we "focus" on the right content) is:

    JS: ~60s / ~80s 
    Æ: ~25s / ~18s

This is with Chrome Version 101.0.4951.54 (Official Build) (64-bit) and
Firefox 91.8.0esr (64-bit), both on a Debian Linux x86_64 Dell laptop.

The case of Chrome is quite revealing (since its developer tools seem to
show a better summary). It shows that the "Æ" version spent ~200ms on
"scripting", ~1ms on "rendering", and ~20k ms "idle".

For "JS" that's ~30k ms on "scripting", 15k ms on "rendering", then 7k
ms on "painting" (which is ~0ms in the other one). 7k ms are spent on
"idle".

So these are basically the same performance results as I reported in
earlier iterations.

I think a v4 of this series really deserves a much less terse
CL. I.e. there are specific reports about major slowdowns in the UX. Any
re-roll should really be re-testing those with the same/similar software
and reporting before/after results.

Clearly the primary goal of improving the CI UX should not be to
optimize the rendering of the results as a goal in itself, but in this
case it becomes *so slow* that it precludes certain major use-cases,
such as seeing a failure and being able to view it in some timely
fashion.
Johannes Schindelin May 23, 2022, 12:58 p.m. UTC | #19
Hi Junio,

On Sun, 22 May 2022, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> > I see that this is still with the previous iteration, and therefore
> > exposes the same speed (or slowness) as was investigated so wonderfully by
> > Victoria.
> >
> > So I really do not understand why you pointed to that run, given that it
>
> Simply because your updated version came to my tree a lot after I
> prepared two trees that are otherwise identical for comparison to
> write the message you are responding to.

Oh sorry, I only noticed that your mail came in after I sent the new
iteration, and I incorrectly thought that you had put the patch series
into the Stalled/To-Drop pile, so I assumed that your mail was in response
to my new iteration.

I had missed that you replied to v2 instead of to v3.

> I do not appreciate that you have to always talk back to others in
> such an aggressive tone

I apologize for that. As you might have guessed, it was not my intention
to be aggressive in any way. I merely meant to express my puzzlement, and
your explanation resolved that very nicely.

Thank you,
Dscho
Johannes Schindelin May 23, 2022, 6:41 p.m. UTC | #20
Hi Ævar,

On Mon, 23 May 2022, Ævar Arnfjörð Bjarmason wrote:

> Re selling point & feature comparison: The point of the ab/* variant was
> to re-roll Johannes's onto a "base" topic that made much of his
> unnecessary, because the building up of features to emit GitHub markup
> can be replaced by unrolling things like "make" and "make test" to the
> top-level.
>
> That has its own UX benefits, e.g. you can see at a glance what command
> was run and what the environment was, and "make" and "make test" are now
> split up from one monolithic "build and test" step.
>
> But the primary intention was not to provide a prettier UX, but to show
> that this arrangement made sense. I was hoping that Johannes would reply
> with some variant of "ah, I see what you mean, that does make things
> simpler!" and run with it, but alas...

I believe that we share the goal of making the Git project more welcoming
and easier to navigate for new contributors.

The patch series you wanted me to look at claims to make the CI/PR
definitions/scripts simpler. Since what matters more to contributors is
how to investigate test failures, i.e. what information they are given
about the failures, I disagree that that patch series needs to be
connected to my patch series in any way.

Further, the result does not look like a simplification to me. For
example, I consider it an absolute no-go to remove the remnants of Azure
Pipelines support. As I had hinted, and as you saw on the git-security
list, I require this support for embargoed releases. That’s what I did
when working on the patches that made it into v2.35.2. In my book,
removing such vital (if dormant) code is not a simplification, but a
case of tearing down Chesterton’s Fence. While we do not need to use
Azure Pipelines for our
regular CI, we definitely need it for embargoed releases. “Simply revert
it back” is not an excuse for removing something that should not be
removed in the first place.

As another example where I have a different concept of what constitutes
“simple”: In Git for Windows’ fork, we carry a patch that integrates the
`git-subtree` tests into the CI builds. This patch touches two places,
`ci/run-build-and-tests.sh` and `ci/run-test-slice.sh`. These changes
would be inherited by any CI definition that uses the scripts in `ci/`.
With the proposed patches, there are four places to patch, and they are
all limited to the GitHub workflow definition. Since you asked me for my
assessment: this is de-DRYing the code, making it more cumbersome instead
of simpler.

In other words, I have fundamental objections about the approach and about
tying it to the patches that improve the output of Git’s CI/PR runs.

> In Chrome/Firefox the time to load the page (as in the spinner stops,
> and we "focus" on the right content) is:
>
>     JS: ~60s / ~80s
>     Æ: ~25s / ~18s

My focus is on the experience of occasional and new contributors who need
to investigate test failures in the CI/PR runs. In this thread, we already
discussed the balance between speed of loading the page on the one hand
and how well the reader is guided toward the relevant parts on the other
hand. I disagree with you that the former should be prioritized over the
latter; on the contrary, guiding the readers along a path to success is
much more important than optimizing for a quick page load.

Most contributors who chimed in seemed not to mind a longer page load
time, as long as the result would help them quickly identify what causes
the test failures. Besides, the page load times are only likely to get
better anyway, as GitHub engineers continuously improve Actions.

Ciao,
Dscho
Junio C Hamano May 23, 2022, 6:55 p.m. UTC | #21
Junio C Hamano <gitster@pobox.com> writes:

> I do not want to taint other folks' eyes with my observations, so I'd
> send my impression in a separate message as a response to this
> message after waiting for some time.

Between the previous and the latest iterations of Johannes's topic, the
test output got a lot shorter by discarding the "ok" output and keeping
only the failures and skips.  Because the readers are mostly
interested in seeing failures (they can download the full log if
they want to), this design decision probably makes sense to me.
The same "while scrolling, the user has to stare into the gray void
for several seconds" issue is still there and takes a bit of getting used
to (I do not know if it is a browser problem, or something the
output could help with to give a better user experience---the lines in
the folded part are probably not "counted" correctly, or something
silly like that).

The runs with the topic from Ævar last night, as I've mentioned
already, lacked the main part of the logic, and it wouldn't have
worked correctly because there was a show-stopper bug in one of the
steps in it.  With that fixed, the "extra click" I complained about last
night seems to be gone.  I guess the same "discard the test steps
that ran successfully" trick would give us the same "shorter"
output.  I observe the same "staring into the gray void while
scrolling" when it comes to the print-test-failures output, just as
in the output from Johannes's topic.

Common to both approaches, folding the output from each test piece
to one line (typically an "ok" but sometimes a "failed" heading) may be
the source of the UI responsiveness irritation I have been observing,
but I wonder whether, with the removal of all "ok" pieces, it may make
sense not to fold anything and instead give a flat "here are the traces
of all failed and skipped tests".
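
(For concreteness, the folding I am talking about is what GitHub's
log grouping commands produce; a sketch with a made-up test name:

    echo "::group::t1234-example.sh"
    cat t1234-example.out
    echo "::endgroup::"

where everything between the two markers collapses into a single
clickable line in the web UI.)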

In any case, either implementation seems to give us a good
improvement over what is in 'master'.

Thanks.
Johannes Schindelin May 23, 2022, 7:21 p.m. UTC | #22
Hi Junio,

On Mon, 23 May 2022, Junio C Hamano wrote:

> [...] the test output got a lot shorter by discarding the "ok" output
> and keeping only the failures and skips.  Because the readers are mostly
> interested in seeing failures (they can download the full log if
> they want to), this design decision probably makes sense to me.

For the record, Victoria suggested to group by file rather than by failed
test case.

However, I do speak from a lot of experience diagnosing test failures in
CI/PR runs when I say: it is frequently very helpful to have a look at one
failed test case at a time. I'd much rather suffer a minor lag while
scrolling than have to find the boundaries manually, in particular when
`test_expect_failure` test cases are present (which are reported as
"broken" in the current iteration instead of "failed").

Besides, the scroll issue is probably similar between both approaches to
grouping (and may be independent of the grouping, as you pointed out by
reporting similar issues in the current `print-test-failures` step), and
is something I hope the Actions engineers are working on.

> Common to both approaches, folding the output from each test piece
> to one line (typically an "ok" but sometimes a "failed" heading) may be
> the source of the UI responsiveness irritation I have been observing,
> but I wonder whether, with the removal of all "ok" pieces, it may make
> sense not to fold anything and instead give a flat "here are the traces
> of all failed and skipped tests".

As I mentioned above, I'd rather keep the grouping by failed test case.

Obviously, the ideal way to decide would be to set up some A/B testing
with real people, but I have no way to set up anything like that.

> In any case, either implementation seems to give us a good improvement
> over what is in 'master'.

There are two things I would like to add:

- In the current iteration's summary page, you will see the failed test
  cases' titles in the errors, and they are clickable (they will get you
  to the corresponding part of the logs); see the sketch after this list.
  I find this very convenient.

- The addition of the suggestion to look at the run's artifacts for the
  full logs might not look like a big deal, but I bet that it will help
  new contributors in particular. This was yet another great suggestion
  by Victoria.
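
The clickable annotations mentioned in the first point are produced with
GitHub's error workflow command; a minimal sketch, whose message format
may differ from what my patches actually emit:

    echo "::error::failed: t1092-sparse-checkout-compatibility.sh, test 49"

GitHub renders each such line as an annotation on the summary page,
linking back to the place in the log where it was emitted.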

Thanks,
Dscho
Ævar Arnfjörð Bjarmason May 24, 2022, 8:40 a.m. UTC | #23
On Mon, May 23 2022, Johannes Schindelin wrote:

> Hi Ævar,
>
> On Mon, 23 May 2022, Ævar Arnfjörð Bjarmason wrote:
>
>> Re selling point & feature comparison: The point of the ab/* variant was
>> to re-roll Johannes's series onto a "base" topic that made much of it
>> unnecessary, because the building up of features to emit GitHub markup
>> can be replaced by unrolling things like "make" and "make test" to the
>> top-level.
>>
>> That has its own UX benefits, e.g. you can see at a glance what command
>> was run and what the environment was, and "make" and "make test" are now
>> split up from one monolithic "build and test" step.
>>
>> But the primary intention was not to provide a prettier UX, but to show
>> that this arrangement made sense. I was hoping that Johannes would reply
>> with some variant of "ah, I see what you mean, that does make things
>> simpler!" and run with it, but alas...
>
> I believe that we share the goal to make the Git project more welcoming
> and easier to navigate for new contributors.

Yes, definitely.

> The patch series you wanted me to look at claims to make the CI/PR
> definitions/scripts simpler. But what matters more to contributors is how
> to investigate test failures, i.e. what information they are provided
> about the failures, and I therefore disagree that that patch series needs
> to be connected to my patch series in any way.

Our two sets of patches change different parts of the CI UX, so no. The
set of patches I've been proposing isn't just making the CI/PR
definitions/scripts simpler, although it does that too.

So e.g. in your patches you need to massage the CI output to split the
"build" step from the "test" step. As you can see in an earlier RFC
re-roll of them on top of my topic, that is something you'd get for free:
https://lore.kernel.org/git/RFC-cover-v5-00.10-00000000000-20220421T183001Z-avarab@gmail.com/

> Further, the result does not look like a simplification to me. For
> example, I consider it an absolute no-go to remove the remnants of Azure
> Pipelines support. As I had hinted, and as you saw on the git-security
> list, I require this support for embargoed releases. That’s what I did
> when working on the patches that made it into v2.35.2. In my book,
> removing such vital (if dormant) code is not a simplification, but the
> tearing down of a Chesterton’s Fence. While we do not need to use Azure
> Pipelines for our
> regular CI, we definitely need it for embargoed releases. “Simply revert
> it back” is not an excuse for removing something that should not be
> removed in the first place.

Can you please reply to this three-month-old E-Mail on this topic, which
is still waiting for your reply, so we can figure out a way forward:
https://lore.kernel.org/git/220222.86y2236ndp.gmgdl@evledraar.gmail.com/

> As another example where I have a different concept of what constitutes
> “simple”: In Git for Windows’ fork, we carry a patch that integrates the
> `git-subtree` tests into the CI builds. This patch touches two places,
> `ci/run-build-and-tests.sh` and `ci/run-test-slice.sh`. These changes
> would be inherited by any CI definition that uses the scripts in `ci/`.
> With the proposed patches, there are four places to patch, and they are
> all limited to the GitHub workflow definition. Since you asked me for my
> assessment: this is de-DRYing the code, making it more cumbersome instead
> of simpler.

No, you'd still have two places to patch:

 1. The top-level Makefile to have "make test" run those subtree tests
    depending on some flag, i.e. the same as your
    ci/run-build-and-tests.sh.

 2. ci/run-test-slice.sh as before (which is only needed for the
    Windows-specific tests).

Because we'd have the Makefile drive the logic, you could also run
such a "make test" locally, which is something we should have
anyway. E.g. when I build my own git I run the subtree tests, and would
like to eventually make "run contrib tests too" some configurable
option.

So it is exactly the DRY principle: by avoiding making things needlessly
CI-specific, we can control this behavior with the same flags, both
inside and outside of CI.
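
As a sketch of the usage this would enable (the variable name here is
made up; no such knob exists in the top-level Makefile yet):

    # the same invocation in CI and locally:
    make test GIT_TEST_CONTRIB_SUBTREE=1

i.e. the CI definition would merely pass a flag, instead of carrying its
own logic for running the subtree tests.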

> In other words, I have fundamental objections to the approach and to
> tying it to the patches that improve the output of Git’s CI/PR runs.

I would too if, after my series, you needed to patch every place we run
"make test" or whatever to run your subtree tests, but as noted above
that's not the case. So hopefully this addresses that.

More generally: I noted a while ago that if you pointed out issues like
that I'd be happy to address them for you.  Based on this I see
d08496f2c40 (ci: run `contrib/subtree` tests in CI builds, 2021-08-05),
and that would be easy to generalize.

>> In Chrome/Firefox the time to load the page (as in the spinner stops,
>> and we "focus" on the right content) is:
>>
>>     JS: ~60s / ~80s
>>     Æ: ~25s / ~18s
>
> My focus is on the experience of occasional and new contributors who need
> to investigate test failures in the CI/PR runs. In this thread, we already
> discussed the balance between speed of loading the page on the one hand
> and how well the reader is guided toward the relevant parts on the other
> hand.

First, your re-roll claims that it "improves the time to load pages",
but based on the sort of testing I did before, when I reported the
severe slowness introduced by this topic, I can't reproduce that.

So how exactly are you testing the performance of these load times, and
can you share the numbers you have for master, your previous iteration
and this re-roll?

> I disagree with you that the former should be prioritized over the
> latter; on the contrary, guiding the readers along a path to success is
> much more important than optimizing for a quick page load.

I think a better UX is certainly worth some cost to load times, so I'm
not trying to be difficult by saying "this costs us some milliseconds,
so it's a no-go".

But really, this is making it so slow that it's borderline unusable.

The main way I use this interface is that I'll get an E-Mail with a
failure report, or see the "X" in the UX and click through to the
failure, then see the logs etc, and hopefully be able to see from that
what's wrong, or how I could begin to reproduce it.

Right now that's fast enough that I'll do that all in one browser
click-through session, but if I'm having to wait *more than a minute*
vs. the current 10-20 seconds (which is already quite bad)?

Your latest series also seems either to be buggy (or to trigger some
bug in GitHub Actions?): even after that minute, you'll see almost
nothing on your screen. So a user who doesn't know the UX would end up
waiting much longer than that.

You seemingly need to know that the page has finished loading when it
shows you that blank screen, and then trigger a re-render by scrolling
up or down, which will show you your actual failures.

That's not an issue I saw in any iteration of this before this v3.

> Most contributors who chimed in seemed not to mind a longer page load
> time, as long as the result would help them quickly identify what causes
> the test failures.

Wasn't much of that discussion a follow-up to your initial demos of this
topic?

I don't think those were as slow as what I'm pointing out above, which I
think is just because those failures happened to involve far fewer
lines of log. The slowness seems to be correlated with how many lines
we're dealing with in total.

> Besides, the page load times are only likely to get better anyway, as
> GitHub engineers continuously improve Actions.

Sure, and if this were all magically made better by GH engineers these
concerns would be addressed.

But right now that isn't the case, and we don't know if/when that would
happen, so we need to review these proposed changes on the basis of how
they'd change the current GitHub CI UX overall.