Message ID | 20240509162544.GC1708042@coredump.intra.peff.net (mailing list archive) |
---|---|
State | Accepted |
Commit | 7df2405b38ebd91b9b542167829883ddd6c02295 |
Series | un-breaking osx-gcc ci job |
On Thu, May 09, 2024 at 12:25:44PM -0400, Jeff King wrote:
[snip]
> I'd like to report that this let me get a successful CI run, but I'm
> running into the thing where osx jobs seem to randomly hang sometimes
> and hit the 6-hour timeout. But I did confirm that this lets us get to
> the actual build/test, and not barf while installing dependencies.

Yeah, this one is puzzling to me. We see the same thing on GitLab CI,
and until now I haven't yet figured out why that is.

>  .github/workflows/main.yml | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
> index 5f92a50271..13cc0fe807 100644
> --- a/.github/workflows/main.yml
> +++ b/.github/workflows/main.yml
> @@ -285,7 +285,6 @@ jobs:
>              pool: macos-13
>            - jobname: osx-gcc
>              cc: gcc-13
> -            cc_package: gcc-13

As far as I can see this means that we don't install GCC at all anymore
via Homebrew. Does this mean that we now rely on the GCC version that is
preinstalled by Homebrew? Won't this break every time that Homebrew
changes the default GCC version?

I may be missing the obvious.

Patrick

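A quick way to see what the image actually provides is to list the
versioned compilers directly. This is only an illustrative probe one
could run in a CI step (the /usr/local/bin path assumes the Intel
runners' Homebrew prefix), not something from the patch:

ls /usr/local/bin/gcc-* 2>/dev/null          # versioned GCC binaries on the image
brew list --versions | grep '^gcc' || true   # what Homebrew believes is installed
gcc-13 --version                             # the compiler the osx-gcc job asks for
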
On Fri, May 10, 2024 at 09:00:04AM +0200, Patrick Steinhardt wrote:

> On Thu, May 09, 2024 at 12:25:44PM -0400, Jeff King wrote:
> [snip]
> > I'd like to report that this let me get a successful CI run, but I'm
> > running into the thing where osx jobs seem to randomly hang sometimes
> > and hit the 6-hour timeout. But I did confirm that this lets us get to
> > the actual build/test, and not barf while installing dependencies.
>
> Yeah, this one is puzzling to me. We see the same thing on GitLab CI,
> and until now I haven't yet figured out why that is.

Drat. I was hoping maybe it was a problem in GitHub CI and somebody else
would eventually fix it. ;)

It feels like a deadlock somewhere, though whether it is in our code, or
in our tests, or some system-ish issue with prove, perl, etc, I don't
know. It would be nice to catch it in the act and see what the process
tree looks like. I guess poking around in the test environment with
tmate might work, though I don't know if there's a way to get tmate
running simultaneously with the hung step (so you'd probably have to
connect, kick off the "make test" manually and hope it hangs).

> > diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
> > index 5f92a50271..13cc0fe807 100644
> > --- a/.github/workflows/main.yml
> > +++ b/.github/workflows/main.yml
> > @@ -285,7 +285,6 @@ jobs:
> >              pool: macos-13
> >            - jobname: osx-gcc
> >              cc: gcc-13
> > -            cc_package: gcc-13
>
> As far as I can see this means that we don't install GCC at all anymore
> via Homebrew. Does this mean that we now rely on the GCC version that is
> preinstalled by Homebrew? Won't this break every time that Homebrew
> changes the default GCC version?
>
> I may be missing the obvious.

Yes, we'll have to update to a different version when the runner image
stops carrying gcc-13. But it's not based on homebrew's default.
According to:

https://github.com/actions/runner-images/blob/macos-13/20240506.1/images/macos/macos-13-Readme.md

the runner image contains gcc-11, gcc-12, and gcc-13. So presumably it
would be a while before gcc-13 ages out and we have to bother bumping. I
do agree it would be nice to just use the latest gcc in the image, but I
don't think we can specify that here. I guess we could say "gcc-auto" or
something, and then the actual shell code could poke around for it.

-Peff

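A minimal sketch of that "gcc-auto" idea, purely hypothetical (the
variable handling and the /usr/local/bin path are assumptions, not
existing ci/lib.sh behavior): pick the newest gcc-<N> the image ships
and fall back to plain gcc otherwise.

if test "$CC" = "gcc-auto"
then
	# list versioned binaries, keep the highest major version
	CC=$(ls /usr/local/bin/gcc-[0-9]* 2>/dev/null |
		sed 's|.*/||' |
		sort -t- -k2 -n |
		tail -n 1)
	: "${CC:=gcc}"	# nothing versioned found, use whatever "gcc" is
fi
echo "using compiler: $CC"
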
On Fri, May 10, 2024 at 04:13:48PM -0400, Jeff King wrote:
> On Fri, May 10, 2024 at 09:00:04AM +0200, Patrick Steinhardt wrote:
>
> > On Thu, May 09, 2024 at 12:25:44PM -0400, Jeff King wrote:
> > [snip]
> > > I'd like to report that this let me get a successful CI run, but I'm
> > > running into the thing where osx jobs seem to randomly hang sometimes
> > > and hit the 6-hour timeout. But I did confirm that this lets us get to
> > > the actual build/test, and not barf while installing dependencies.
> >
> > Yeah, this one is puzzling to me. We see the same thing on GitLab CI,
> > and until now I haven't yet figured out why that is.
>
> Drat. I was hoping maybe it was a problem in GitHub CI and somebody else
> would eventually fix it. ;)
>
> It feels like a deadlock somewhere, though whether it is in our code, or
> in our tests, or some system-ish issue with prove, perl, etc, I don't
> know. It would be nice to catch it in the act and see what the process
> tree looks like. I guess poking around in the test environment with
> tmate might work, though I don't know if there's a way to get tmate
> running simultaneously with the hung step (so you'd probably have to
> connect, kick off the "make test" manually and hope it hangs).

My hunch tells me that it's the Perforce tests -- after all, this is
where the jobs get stuck, too. In "lib-git-p4.sh" we already document
that p4d is known to crash at times, and overall the logic to spawn the
server is quite convoluted.

I did try to get more useful logs yesterday. But as usual, once you
_want_ to reproduce a failure like this it doesn't happen anymore.

> > > diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
> > > index 5f92a50271..13cc0fe807 100644
> > > --- a/.github/workflows/main.yml
> > > +++ b/.github/workflows/main.yml
> > > @@ -285,7 +285,6 @@ jobs:
> > >              pool: macos-13
> > >            - jobname: osx-gcc
> > >              cc: gcc-13
> > > -            cc_package: gcc-13
> >
> > As far as I can see this means that we don't install GCC at all anymore
> > via Homebrew. Does this mean that we now rely on the GCC version that is
> > preinstalled by Homebrew? Won't this break every time that Homebrew
> > changes the default GCC version?
> >
> > I may be missing the obvious.
>
> Yes, we'll have to update to a different version when the runner image
> stops carrying gcc-13. But it's not based on homebrew's default.
> According to:
>
> https://github.com/actions/runner-images/blob/macos-13/20240506.1/images/macos/macos-13-Readme.md
>
> the runner image contains gcc-11, gcc-12, and gcc-13. So presumably it
> would be a while before gcc-13 ages out and we have to bother bumping. I
> do agree it would be nice to just use the latest gcc in the image, but I
> don't think we can specify that here. I guess we could say "gcc-auto" or
> something, and then the actual shell code could poke around for it.

Ah, thanks for the explanation. I was worried that things might break at
arbitrary points in time. But if this is only confined to whenever we
upgrade the runner image, then I don't see this as much of a problem.

Patrick

On Sat, May 11, 2024 at 09:17:41AM +0200, Patrick Steinhardt wrote:
> On Fri, May 10, 2024 at 04:13:48PM -0400, Jeff King wrote:
> > On Fri, May 10, 2024 at 09:00:04AM +0200, Patrick Steinhardt wrote:
> >
> > > On Thu, May 09, 2024 at 12:25:44PM -0400, Jeff King wrote:
> > > [snip]
> > > > I'd like to report that this let me get a successful CI run, but I'm
> > > > running into the thing where osx jobs seem to randomly hang sometimes
> > > > and hit the 6-hour timeout. But I did confirm that this lets us get to
> > > > the actual build/test, and not barf while installing dependencies.
> > >
> > > Yeah, this one is puzzling to me. We see the same thing on GitLab CI,
> > > and until now I haven't yet figured out why that is.
> >
> > Drat. I was hoping maybe it was a problem in GitHub CI and somebody else
> > would eventually fix it. ;)
> >
> > It feels like a deadlock somewhere, though whether it is in our code, or
> > in our tests, or some system-ish issue with prove, perl, etc, I don't
> > know. It would be nice to catch it in the act and see what the process
> > tree looks like. I guess poking around in the test environment with
> > tmate might work, though I don't know if there's a way to get tmate
> > running simultaneously with the hung step (so you'd probably have to
> > connect, kick off the "make test" manually and hope it hangs).
>
> My hunch tells me that it's the Perforce tests -- after all, this is
> where the jobs get stuck, too. In "lib-git-p4.sh" we already document
> that p4d is known to crash at times, and overall the logic to spawn the
> server is quite convoluted.
>
> I did try to get more useful logs yesterday. But as usual, once you
> _want_ to reproduce a failure like this it doesn't happen anymore.

I was spending (or rather wasting?) some more time on this. With the
below diff I was able to get a list of processes running after ~50
minutes:

diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
index 98dda42045..d5570b59d3 100755
--- a/ci/run-build-and-tests.sh
+++ b/ci/run-build-and-tests.sh
@@ -51,8 +51,15 @@ esac
 group Build make
 if test -n "$run_tests"
 then
-	group "Run tests" make test ||
+	(
+		sleep 3200 &&
+		mkdir -p t/failed-test-artifacts &&
+		ps -A >t/failed-test-artifacts/ps 2>&1
+	) &
+	pid=$!
+	group "Run tests" gtimeout 1h make test ||
 		handle_failed_tests
+	kill "$pid"
 fi

 check_unignored_build_artifacts

I trimmed that process list to the following set of relevant processes:

  PID TTY           TIME CMD
 5196 ??         0:00.01 /bin/sh t9211-scalar-clone.sh --verbose-log -x
 5242 ??         0:00.00 /bin/sh t9211-scalar-clone.sh --verbose-log -x
 5244 ??         0:00.00 tee -a /Volumes/RAMDisk/test-results/t9211-scalar-clone.out
 5245 ??         0:00.09 /bin/sh t9211-scalar-clone.sh --verbose-log -x
 7235 ??         0:00.02 /Users/gitlab/builds/gitlab-org/git/scalar clone file:///Volumes/RAMDisk/trash directory.t9211-scalar-clone/to-clone maint-fail
 7265 ??         0:00.01 /Users/gitlab/builds/gitlab-org/git/git fetch --quiet --no-progress origin
 7276 ??         0:00.01 /Users/gitlab/builds/gitlab-org/git/git fsmonitor--daemon run --detach --ipc-threads=8

So it seems like the issue is t9211, and the hang happens in "scalar
clone warns when background maintenance fails" specifically. What
exactly the root cause is I have no clue though. Maybe an fsmonitor
race, maybe something else entirely. Hard to say as I have never seen
this happen on any other platform than macOS, and I do not have access
to a Mac myself.

The issue also doesn't seem to occur when running t9211 on its own, but
only when running the full test suite. This may further indicate that
there is a race condition, where the additional load increases the
likelihood of it. Or there is bad interaction with another test.

Patrick

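For anyone repeating that experiment, a slightly extended watchdog along
the same lines (a variation, not the diff that actually ran) records
parent PIDs and elapsed time as well, which makes the stuck process
subtree easier to pick out of the dump:

(
	sleep 3200 &&
	mkdir -p t/failed-test-artifacts &&
	# ppid/etime let you reconstruct the tree and spot long-lived entries
	ps -A -o pid,ppid,etime,stat,command >t/failed-test-artifacts/ps 2>&1
) &
watchdog=$!
# ... run the tests, then: kill "$watchdog"
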
On Thu, May 16, 2024 at 02:36:19PM +0200, Patrick Steinhardt wrote:

> I was spending (or rather wasting?) some more time on this. With the
> below diff I was able to get a list of processes running after ~50
> minutes:

I was going to say "good, now I don't have to waste time on it". But
your findings only nerd-sniped me into digging more. ;)

> So it seems like the issue is t9211, and the hang happens in "scalar
> clone warns when background maintenance fails" specifically. What
> exactly the root cause is I have no clue though. Maybe an fsmonitor
> race, maybe something else entirely. Hard to say as I have never seen
> this happen on any other platform than macOS, and I do not have access
> to a Mac myself.
>
> The issue also doesn't seem to occur when running t9211 on its own, but
> only when running the full test suite. This may further indicate that
> there is a race condition, where the additional load increases the
> likelihood of it. Or there is bad interaction with another test.

I can reproduce it at will and relatively quickly using "--stress" with
t9211. I pushed up a hacky commit that removes all CI jobs except for
osx-clang, and it stops short of running the build/tests and opens a
shell using tmate. For reference (though you'd need to work out
something similar for GitLab):

  https://github.com/peff/git/commit/f825fa36ed95bed414b0d6d9e8b21799e2e167e4

And then just:

  make -j8
  cd t
  ./t9211-scalar-clone.sh --stress

Give it a minute or two, and you'll see most of the jobs have hung, with
one or two "winners" continuing (once most of them are hanging, the load
is low enough that the race doesn't happen). So you'll see 3.17, 3.18,
3.19, and so on, indicating that job 3 is still going and completing its
19th run. But everything else is stuck and stops producing output. You
can likewise see processes in "ps" that are a few minutes old, which is
another way to find the stuck ones.

And I get the same three processes as you: scalar clone, fetch, and
fsmonitor--daemon.

And here's where I ran into tooling issues. Normally I'd "strace -p" to
see what the hung processes are doing. We don't have that on macOS.
Doing "sudo dtruss -p" runs without complaint, but it looks like it
doesn't report on the current syscall (where we're presumably blocking).
I installed gdb, which does seem to work, but attaching to the running
processes doesn't show a useful backtrace (even after making sure to
build with "-g -O0", and confirming that regular "gdb ./git" works OK).

One can guess that scalar is in waitpid() waiting for git-fetch. But
what's fetch waiting on? The other side of upload-pack is dead.
According to lsof, it does have a unix socket open to fsmonitor. So
maybe it's trying to read there?

Curiously killing fsmonitor doesn't un-stick fetch, and nor does killing
fetch unstick scalar. So either my guesses above are wrong, or there's
something else weird causing them to hang.

I imagine there may be better tools to poke at things, but I'm at the
limits of my macOS knowledge. But maybe the recipe above is enough for
somebody more clueful to recreate and investigate the situation (it
probably would also be easy to just run the --stress script locally if
somebody actually has a mac).

-Peff

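On the "better tools" front, one more thing that might be worth a try (a
suggestion, not something verified on the runners): macOS ships
/usr/bin/sample, which records user-space call stacks of a running
process and often works where attaching gdb does not.

# pick whichever of scalar/fetch/fsmonitor--daemon is stuck
pid=$(pgrep -nf 'fsmonitor--daemon run')
sample "$pid" 5 -file /tmp/stuck.sample.txt	# ~5 seconds of stack samples
lsof -p "$pid" >/tmp/stuck.lsof.txt		# open files and sockets for context
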
On Fri, May 17, 2024 at 04:11:32AM -0400, Jeff King wrote:
> On Thu, May 16, 2024 at 02:36:19PM +0200, Patrick Steinhardt wrote:
[snip]
> One can guess that scalar is in waitpid() waiting for git-fetch. But
> what's fetch waiting on? The other side of upload-pack is dead.
> According to lsof, it does have a unix socket open to fsmonitor. So
> maybe it's trying to read there?

That was also my guess. I tried whether disabling fsmonitor via
`core.fsmonitor=false` helps, but that did not seem to be the case.
Either because it didn't have the desired effect, or because the root
cause is not fsmonitor. No idea which of the two it is.

> Curiously killing fsmonitor doesn't un-stick fetch, and nor does killing
> fetch unstick scalar. So either my guesses above are wrong, or there's
> something else weird causing them to hang.
>
> I imagine there may be better tools to poke at things, but I'm at the
> limits of my macOS knowledge. But maybe the recipe above is enough for
> somebody more clueful to recreate and investigate the situation (it
> probably would also be easy to just run the --stress script locally if
> somebody actually has a mac).

Let's hope it is :)

Patrick

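One way to tell which of the two it is would be to ask the cloned
repository directly; a small diagnostic sketch (the trash-directory path
is illustrative, not taken from the test output):

repo="trash directory.t9211-scalar-clone/maint-fail/src"
git -C "$repo" config --show-origin --get core.fsmonitor	# who set it, and to what
git -C "$repo" fsmonitor--daemon status				# is a daemon actually attached?
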
On Fri, May 17, 2024 at 10:25:20AM +0200, Patrick Steinhardt wrote:
> On Fri, May 17, 2024 at 04:11:32AM -0400, Jeff King wrote:
> > On Thu, May 16, 2024 at 02:36:19PM +0200, Patrick Steinhardt wrote:
> [snip]
> > One can guess that scalar is in waitpid() waiting for git-fetch. But
> > what's fetch waiting on? The other side of upload-pack is dead.
> > According to lsof, it does have a unix socket open to fsmonitor. So
> > maybe it's trying to read there?
>
> That was also my guess. I tried whether disabling fsmonitor via
> `core.fsmonitor=false` helps, but that did not seem to be the case.
> Either because it didn't have the desired effect, or because the root
> cause is not fsmonitor. No idea which of the two it is.

The root cause actually is the fsmonitor. I was using your tmate hack to
SSH into one of the failed jobs, and there had been 7 instances of the
fsmonitor lurking. After killing all of them the job got unstuck and ran
to completion.

The reason why setting `core.fsmonitor=false` is ineffective is because
in "scalar.c" we always configure `core.fsmonitor=true` in the repo
config and thus override the setting. I was checking whether it would
make sense to defer enabling the fsmonitor until after the fetch and
checkout have concluded. But funny enough, the below patch caused the
pipeline to now hang deterministically.

Puzzled.

Patrick

diff --git a/scalar.c b/scalar.c
index 7234049a1b..67f85c7adc 100644
--- a/scalar.c
+++ b/scalar.c
@@ -178,13 +178,6 @@ static int set_recommended_config(int reconfigure)
 				      config[i].key, config[i].value);
 	}
 
-	if (have_fsmonitor_support()) {
-		struct scalar_config fsmonitor = { "core.fsmonitor", "true" };
-		if (set_scalar_config(&fsmonitor, reconfigure))
-			return error(_("could not configure %s=%s"),
-				     fsmonitor.key, fsmonitor.value);
-	}
-
 	/*
 	 * The `log.excludeDecoration` setting is special because it allows
 	 * for multiple values.
@@ -539,6 +532,13 @@ static int cmd_clone(int argc, const char **argv)
 	if (res)
 		goto cleanup;
 
+	if (have_fsmonitor_support()) {
+		struct scalar_config fsmonitor = { "core.fsmonitor", "true" };
+		if (set_scalar_config(&fsmonitor, 0))
+			return error(_("could not configure %s=%s"),
+				     fsmonitor.key, fsmonitor.value);
+	}
+
 	res = register_dir();
 
 cleanup:

Part of the problem seems to be that the Github actions runner has a bug
on OSX: https://github.com/actions/runner/issues/884

Based on investigating this for a while by setting up a self-hosted actions
runner, it seems to have to do with a broken pipe triggering incomplete
output capture / termination detection by either Github Action Runner (
see issue thread) or maybe even Dotnet Core's
System.Diagnostics.Process functionality.

As for the actual failing test, t9211, what I got on my machine was a
failure during clone: `unknown repository extension found: refstorage`.
In the trash directory, the .git/config did specify that extension.
Perhaps some interference coming from
t9500-gitweb-standalone-no-errors.sh, since it invokes:

> git config extensions.refstorage "$refstorage"

Cheers,
Phil

On Fri, May 17, 2024 at 7:30 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Fri, May 17, 2024 at 10:25:20AM +0200, Patrick Steinhardt wrote:
> > On Fri, May 17, 2024 at 04:11:32AM -0400, Jeff King wrote:
> > > On Thu, May 16, 2024 at 02:36:19PM +0200, Patrick Steinhardt wrote:
> > [snip]
> > > One can guess that scalar is in waitpid() waiting for git-fetch. But
> > > what's fetch waiting on? The other side of upload-pack is dead.
> > > According to lsof, it does have a unix socket open to fsmonitor. So
> > > maybe it's trying to read there?
> >
> > That was also my guess. I tried whether disabling fsmonitor via
> > `core.fsmonitor=false` helps, but that did not seem to be the case.
> > Either because it didn't have the desired effect, or because the root
> > cause is not fsmonitor. No idea which of the two it is.
>
> The root cause actually is the fsmonitor. I was using your tmate hack to
> SSH into one of the failed jobs, and there had been 7 instances of the
> fsmonitor lurking. After killing all of them the job got unstuck and ran
> to completion.
>
> The reason why setting `core.fsmonitor=false` is ineffective is because
> in "scalar.c" we always configure `core.fsmonitor=true` in the repo
> config and thus override the setting. I was checking whether it would
> make sense to defer enabling the fsmonitor until after the fetch and
> checkout have concluded. But funny enough, the below patch caused the
> pipeline to now hang deterministically.
>
> Puzzled.
>
> Patrick
>
> diff --git a/scalar.c b/scalar.c
> index 7234049a1b..67f85c7adc 100644
> --- a/scalar.c
> +++ b/scalar.c
> @@ -178,13 +178,6 @@ static int set_recommended_config(int reconfigure)
>  				      config[i].key, config[i].value);
>  	}
>
> -	if (have_fsmonitor_support()) {
> -		struct scalar_config fsmonitor = { "core.fsmonitor", "true" };
> -		if (set_scalar_config(&fsmonitor, reconfigure))
> -			return error(_("could not configure %s=%s"),
> -				     fsmonitor.key, fsmonitor.value);
> -	}
> -
>  	/*
>  	 * The `log.excludeDecoration` setting is special because it allows
>  	 * for multiple values.
> @@ -539,6 +532,13 @@ static int cmd_clone(int argc, const char **argv)
>  	if (res)
>  		goto cleanup;
>
> +	if (have_fsmonitor_support()) {
> +		struct scalar_config fsmonitor = { "core.fsmonitor", "true" };
> +		if (set_scalar_config(&fsmonitor, 0))
> +			return error(_("could not configure %s=%s"),
> +				     fsmonitor.key, fsmonitor.value);
> +	}
> +
>  	res = register_dir();
>
> cleanup:

Philip <philip.c.peterson@gmail.com> writes:

> Part of the problem seems to be that the Github actions runner has a bug
> on OSX: https://github.com/actions/runner/issues/884
>
> Based on investigating this for a while by setting up a self-hosted actions
> runner, it seems to have to do with a broken pipe triggering incomplete
> output capture / termination detection by either Github Action Runner (
> see issue thread) or maybe even Dotnet Core's
> System.Diagnostics.Process functionality.

Thanks for digging into this.

> As for the actual failing test, t9211, what I got on my machine was a
> failure during clone: `unknown repository extension found: refstorage`.
> In the trash directory, the .git/config did specify that extension.
> Perhaps some interference coming from
> t9500-gitweb-standalone-no-errors.sh, since it invokes:
>
>> git config extensions.refstorage "$refstorage"

Puzzled.  We run t9211 in "t/trash directory.t9211-whatever/"
directory with its own repository, so that what t9500 does in its
own playpen, "t/trash directory.t9500-gitweb-standalone-no-errors/"
directory would not interfere with it to begin with.  How would that
setting seep through to an unrelated test run next door?  It's not
like they share TCP port number or anything like that?

On Sun, May 26, 2024 at 12:23:18PM -0700, Junio C Hamano wrote:
> Philip <philip.c.peterson@gmail.com> writes:
>
> > Part of the problem seems to be that the Github actions runner has a bug
> > on OSX: https://github.com/actions/runner/issues/884
> >
> > Based on investigating this for a while by setting up a self-hosted actions
> > runner, it seems to have to do with a broken pipe triggering incomplete
> > output capture / termination detection by either Github Action Runner (
> > see issue thread) or maybe even Dotnet Core's
> > System.Diagnostics.Process functionality.
>
> Thanks for digging into this.

Indeed, thanks for digging. In any case, whatever it is, it cannot be
exclusively due to a bug with GitHub given that we see the same issue
happening with GitLab CI.

> > As for the actual failing test, t9211, what I got on my machine was a
> > failure during clone: `unknown repository extension found: refstorage`.
> > In the trash directory, the .git/config did specify that extension.
> > Perhaps some interference coming from
> > t9500-gitweb-standalone-no-errors.sh, since it invokes:
> >
> >> git config extensions.refstorage "$refstorage"
>
> Puzzled.  We run t9211 in "t/trash directory.t9211-whatever/"
> directory with its own repository, so that what t9500 does in its
> own playpen, "t/trash directory.t9500-gitweb-standalone-no-errors/"
> directory would not interfere with it to begin with.  How would that
> setting seep through to an unrelated test run next door?  It's not
> like they share TCP port number or anything like that?

This error looks somewhat weird to me. Why should anything part of Git
not recognize the refstorage extension? It almost feels as if there were
different versions of Git being used.

I'm quite positive by now that the error is somewhere in the fsmonitor.
I was double checking whether there is an issue with reuse of some of
its sockets across test suites. But given that the tests have different
HOMEs and that they have different repository paths, I couldn't really
find anything that might be reused across invocations of scalar(1) or
the git-fsmonitor--daemon(1).

Patrick

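If the "different versions of Git" theory is worth chasing, a quick skew
check from the build directory might confirm or rule it out
(illustrative only, not a command sequence from the thread):

./git version		# the just-built binary the tests should be using
/usr/bin/git version	# the system git that might sneak in instead
which -a git		# every git on PATH, in order
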
On Sun, May 26, 2024 at 02:34:05AM -0400, Philip wrote:

> Part of the problem seems to be that the Github actions runner has a bug
> on OSX: https://github.com/actions/runner/issues/884

I think this may be a red herring. The issue described there is that
processes in the runner environment are spawned with SIGPIPE ignored.
But one of the things we do in common-main.c is make sure SIGPIPE is set
back to the default. So unless the problem is in a non-git program (say,
an issue with the shell, or prove, or something), I don't think this
would affect us.

That issue also goes back to 2020, and anecdotally, I would say that the
hangs we are seeing started in the last 6 months or so.

-Peff

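Whether the runner really does spawn its children with SIGPIPE ignored
is cheap to probe, for example with a one-liner like this (a sketch, not
something from the issue thread): with the default disposition the
process kills itself before printing, while an inherited SIG_IGN lets
the message through.

# prints only if SIGPIPE is ignored in the environment that spawned us
perl -e 'kill "PIPE", $$; print "SIGPIPE is ignored in this environment\n"'
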
diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index 5f92a50271..13cc0fe807 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -285,7 +285,6 @@ jobs:
             pool: macos-13
           - jobname: osx-gcc
             cc: gcc-13
-            cc_package: gcc-13
             pool: macos-13
           - jobname: linux-gcc-default
             cc: gcc

Our osx-gcc job explicitly asks to install gcc-13. But since the GitHub
runner image already comes with gcc-13 installed, this is mostly doing
nothing (or in some cases it may install an incremental update over the
runner image). But worse, it recently started causing errors like:

  ==> Fetching gcc@13
  ==> Downloading https://ghcr.io/v2/homebrew/core/gcc/13/blobs/sha256:fb2403d97e2ce67eb441b54557cfb61980830f3ba26d4c5a1fe5ecd0c9730d1a
  ==> Pouring gcc@13--13.2.0.ventura.bottle.tar.gz
  Error: The `brew link` step did not complete successfully
  The formula built, but is not symlinked into /usr/local
  Could not symlink bin/c++-13
  Target /usr/local/bin/c++-13
  is a symlink belonging to gcc. You can unlink it:
    brew unlink gcc

which cause the whole CI job to bail. I didn't track down the root
cause, but I suspect it may be related to homebrew recently switching
the "gcc" default to gcc-14. And it may even be fixed when a new runner
image is released. But if we don't need to run brew at all, it's one
less thing for us to worry about.

Signed-off-by: Jeff King <peff@peff.net>
---
I don't think this was due to anything on our side. I tried to re-run
versions of git.git which had previously passed and ran into the same
issue.

I'd like to report that this let me get a successful CI run, but I'm
running into the thing where osx jobs seem to randomly hang sometimes
and hit the 6-hour timeout. But I did confirm that this lets us get to
the actual build/test, and not barf while installing dependencies.

 .github/workflows/main.yml | 1 -
 1 file changed, 1 deletion(-)

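If relying on the preinstalled compiler ever becomes a problem, a guard
along these lines in the CI setup step would at least fail fast with a
clear message instead of producing a confusing build error (a sketch,
not part of this patch):

if ! command -v gcc-13 >/dev/null 2>&1
then
	echo >&2 "gcc-13 is not preinstalled on this runner image;" \
		"update the osx-gcc job or reinstate the brew install"
	exit 1
fi
gcc-13 --version
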