Message ID | 20240509162544.GC1708042@coredump.intra.peff.net (mailing list archive) |
---|---|
State | Accepted |
Commit | 7df2405b38ebd91b9b542167829883ddd6c02295 |
Series | un-breaking osx-gcc ci job |
On Thu, May 09, 2024 at 12:25:44PM -0400, Jeff King wrote:
[snip]
> I'd like to report that this let me get a successful CI run, but I'm
> running into the thing where osx jobs seem to randomly hang sometimes
> and hit the 6-hour timeout. But I did confirm that this lets us get to
> the actual build/test, and not barf while installing dependencies.

Yeah, this one is puzzling to me. We see the same thing on GitLab CI,
and until now I haven't yet figured out why that is.

>  .github/workflows/main.yml | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
> index 5f92a50271..13cc0fe807 100644
> --- a/.github/workflows/main.yml
> +++ b/.github/workflows/main.yml
> @@ -285,7 +285,6 @@ jobs:
>              pool: macos-13
>            - jobname: osx-gcc
>              cc: gcc-13
> -            cc_package: gcc-13

As far as I can see this means that we don't install GCC at all anymore
via Homebrew. Does this mean that we now rely on the GCC version that is
preinstalled by Homebrew? Won't this break every time that Homebrew
changes the default GCC version?

I may be missing the obvious.

Patrick

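A quick way to see what the image actually provides is to list the
versioned compilers directly. This is only an illustrative probe one
could run in a CI step (the /usr/local/bin path assumes the Intel
runners' Homebrew prefix), not something from the patch:

ls /usr/local/bin/gcc-* 2>/dev/null          # versioned GCC binaries on the image
brew list --versions | grep '^gcc' || true   # what Homebrew believes is installed
gcc-13 --version                             # the compiler the osx-gcc job asks for
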
On Fri, May 10, 2024 at 09:00:04AM +0200, Patrick Steinhardt wrote:

> On Thu, May 09, 2024 at 12:25:44PM -0400, Jeff King wrote:
> [snip]
> > I'd like to report that this let me get a successful CI run, but I'm
> > running into the thing where osx jobs seem to randomly hang sometimes
> > and hit the 6-hour timeout. But I did confirm that this lets us get to
> > the actual build/test, and not barf while installing dependencies.
>
> Yeah, this one is puzzling to me. We see the same thing on GitLab CI,
> and until now I haven't yet figured out why that is.

Drat. I was hoping maybe it was a problem in GitHub CI and somebody else
would eventually fix it. ;)

It feels like a deadlock somewhere, though whether it is in our code, or
in our tests, or some system-ish issue with prove, perl, etc, I don't
know. It would be nice to catch it in the act and see what the process
tree looks like. I guess poking around in the test environment with
tmate might work, though I don't know if there's a way to get tmate
running simultaneously with the hung step (so you'd probably have to
connect, kick off the "make test" manually and hope it hangs).

> > diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
> > index 5f92a50271..13cc0fe807 100644
> > --- a/.github/workflows/main.yml
> > +++ b/.github/workflows/main.yml
> > @@ -285,7 +285,6 @@ jobs:
> >              pool: macos-13
> >            - jobname: osx-gcc
> >              cc: gcc-13
> > -            cc_package: gcc-13
>
> As far as I can see this means that we don't install GCC at all anymore
> via Homebrew. Does this mean that we now rely on the GCC version that is
> preinstalled by Homebrew? Won't this break every time that Homebrew
> changes the default GCC version?
>
> I may be missing the obvious.

Yes, we'll have to update to a different version when the runner image
stops carrying gcc-13. But it's not based on homebrew's default.
According to:

https://github.com/actions/runner-images/blob/macos-13/20240506.1/images/macos/macos-13-Readme.md

the runner image contains gcc-11, gcc-12, and gcc-13. So presumably it
would be a while before gcc-13 ages out and we have to bother bumping. I
do agree it would be nice to just use the latest gcc in the image, but I
don't think we can specify that here. I guess we could say "gcc-auto" or
something, and then the actual shell code could poke around for it.

-Peff

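A minimal sketch of that "gcc-auto" idea, purely hypothetical (the
variable handling and the /usr/local/bin path are assumptions, not
existing ci/lib.sh behavior): pick the newest gcc-<N> the image ships
and fall back to plain gcc otherwise.

if test "$CC" = "gcc-auto"
then
	# list versioned binaries, keep the highest major version
	CC=$(ls /usr/local/bin/gcc-[0-9]* 2>/dev/null |
		sed 's|.*/||' |
		sort -t- -k2 -n |
		tail -n 1)
	: "${CC:=gcc}"	# nothing versioned found, use whatever "gcc" is
fi
echo "using compiler: $CC"
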
On Fri, May 10, 2024 at 04:13:48PM -0400, Jeff King wrote:
> On Fri, May 10, 2024 at 09:00:04AM +0200, Patrick Steinhardt wrote:
>
> > On Thu, May 09, 2024 at 12:25:44PM -0400, Jeff King wrote:
> > [snip]
> > > I'd like to report that this let me get a successful CI run, but I'm
> > > running into the thing where osx jobs seem to randomly hang sometimes
> > > and hit the 6-hour timeout. But I did confirm that this lets us get to
> > > the actual build/test, and not barf while installing dependencies.
> >
> > Yeah, this one is puzzling to me. We see the same thing on GitLab CI,
> > and until now I haven't yet figured out why that is.
>
> Drat. I was hoping maybe it was a problem in GitHub CI and somebody else
> would eventually fix it. ;)
>
> It feels like a deadlock somewhere, though whether it is in our code, or
> in our tests, or some system-ish issue with prove, perl, etc, I don't
> know. It would be nice to catch it in the act and see what the process
> tree looks like. I guess poking around in the test environment with
> tmate might work, though I don't know if there's a way to get tmate
> running simultaneously with the hung step (so you'd probably have to
> connect, kick off the "make test" manually and hope it hangs).

My hunch tells me that it's the Perforce tests -- after all, this is
where the jobs get stuck, too. In "lib-git-p4.sh" we already document
that p4d is known to crash at times, and overall the logic to spawn the
server is quite convoluted.

I did try to get more useful logs yesterday. But as usual, once you
_want_ to reproduce a failure like this it doesn't happen anymore.

> > > diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
> > > index 5f92a50271..13cc0fe807 100644
> > > --- a/.github/workflows/main.yml
> > > +++ b/.github/workflows/main.yml
> > > @@ -285,7 +285,6 @@ jobs:
> > >              pool: macos-13
> > >            - jobname: osx-gcc
> > >              cc: gcc-13
> > > -            cc_package: gcc-13
> >
> > As far as I can see this means that we don't install GCC at all anymore
> > via Homebrew. Does this mean that we now rely on the GCC version that is
> > preinstalled by Homebrew? Won't this break every time that Homebrew
> > changes the default GCC version?
> >
> > I may be missing the obvious.
>
> Yes, we'll have to update to a different version when the runner image
> stops carrying gcc-13. But it's not based on homebrew's default.
> According to:
>
> https://github.com/actions/runner-images/blob/macos-13/20240506.1/images/macos/macos-13-Readme.md
>
> the runner image contains gcc-11, gcc-12, and gcc-13. So presumably it
> would be a while before gcc-13 ages out and we have to bother bumping. I
> do agree it would be nice to just use the latest gcc in the image, but I
> don't think we can specify that here. I guess we could say "gcc-auto" or
> something, and then the actual shell code could poke around for it.

Ah, thanks for the explanation. I was worried that things might break at
arbitrary points in time. But if this is only confined to whenever we
upgrade the runner image, then I don't see this as much of a problem.

Patrick

On Sat, May 11, 2024 at 09:17:41AM +0200, Patrick Steinhardt wrote:
> On Fri, May 10, 2024 at 04:13:48PM -0400, Jeff King wrote:
> > On Fri, May 10, 2024 at 09:00:04AM +0200, Patrick Steinhardt wrote:
> >
> > > On Thu, May 09, 2024 at 12:25:44PM -0400, Jeff King wrote:
> > > [snip]
> > > > I'd like to report that this let me get a successful CI run, but I'm
> > > > running into the thing where osx jobs seem to randomly hang sometimes
> > > > and hit the 6-hour timeout. But I did confirm that this lets us get to
> > > > the actual build/test, and not barf while installing dependencies.
> > >
> > > Yeah, this one is puzzling to me. We see the same thing on GitLab CI,
> > > and until now I haven't yet figured out why that is.
> >
> > Drat. I was hoping maybe it was a problem in GitHub CI and somebody else
> > would eventually fix it. ;)
> >
> > It feels like a deadlock somewhere, though whether it is in our code, or
> > in our tests, or some system-ish issue with prove, perl, etc, I don't
> > know. It would be nice to catch it in the act and see what the process
> > tree looks like. I guess poking around in the test environment with
> > tmate might work, though I don't know if there's a way to get tmate
> > running simultaneously with the hung step (so you'd probably have to
> > connect, kick off the "make test" manually and hope it hangs).
>
> My hunch tells me that it's the Perforce tests -- after all, this is
> where the jobs get stuck, too. In "lib-git-p4.sh" we already document
> that p4d is known to crash at times, and overall the logic to spawn the
> server is quite convoluted.
>
> I did try to get more useful logs yesterday. But as usual, once you
> _want_ to reproduce a failure like this it doesn't happen anymore.

I was spending (or rather wasting?) some more time on this. With the
below diff I was able to get a list of processes running after ~50
minutes:

diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
index 98dda42045..d5570b59d3 100755
--- a/ci/run-build-and-tests.sh
+++ b/ci/run-build-and-tests.sh
@@ -51,8 +51,15 @@ esac
 group Build make
 if test -n "$run_tests"
 then
-	group "Run tests" make test ||
+	(
+		sleep 3200 &&
+		mkdir -p t/failed-test-artifacts &&
+		ps -A >t/failed-test-artifacts/ps 2>&1
+	) &
+	pid=$!
+	group "Run tests" gtimeout 1h make test ||
 		handle_failed_tests
+	kill "$pid"
 fi

 check_unignored_build_artifacts

I trimmed that process list to the following set of relevant processes:

  PID TTY           TIME CMD
 5196 ??         0:00.01 /bin/sh t9211-scalar-clone.sh --verbose-log -x
 5242 ??         0:00.00 /bin/sh t9211-scalar-clone.sh --verbose-log -x
 5244 ??         0:00.00 tee -a /Volumes/RAMDisk/test-results/t9211-scalar-clone.out
 5245 ??         0:00.09 /bin/sh t9211-scalar-clone.sh --verbose-log -x
 7235 ??         0:00.02 /Users/gitlab/builds/gitlab-org/git/scalar clone file:///Volumes/RAMDisk/trash directory.t9211-scalar-clone/to-clone maint-fail
 7265 ??         0:00.01 /Users/gitlab/builds/gitlab-org/git/git fetch --quiet --no-progress origin
 7276 ??         0:00.01 /Users/gitlab/builds/gitlab-org/git/git fsmonitor--daemon run --detach --ipc-threads=8

So it seems like the issue is t9211, and the hang happens in "scalar
clone warns when background maintenance fails" specifically. What
exactly the root cause is I have no clue though. Maybe an fsmonitor
race, maybe something else entirely. Hard to say as I have never seen
this happen on any other platform than macOS, and I do not have access
to a Mac myself.

The issue also doesn't seem to occur when running t9211 on its own, but
only when running the full test suite. This may further indicate that
there is a race condition, where the additional load increases the
likelihood of it. Or there is bad interaction with another test.

Patrick

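For anyone repeating that experiment, a slightly extended watchdog along
the same lines (a variation, not the diff that actually ran) records
parent PIDs and elapsed time as well, which makes the stuck process
subtree easier to pick out of the dump:

(
	sleep 3200 &&
	mkdir -p t/failed-test-artifacts &&
	# ppid/etime let you reconstruct the tree and spot long-lived entries
	ps -A -o pid,ppid,etime,stat,command >t/failed-test-artifacts/ps 2>&1
) &
watchdog=$!
# ... run the tests, then: kill "$watchdog"
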
On Thu, May 16, 2024 at 02:36:19PM +0200, Patrick Steinhardt wrote:

> I was spending (or rather wasting?) some more time on this. With the
> below diff I was able to get a list of processes running after ~50
> minutes:

I was going to say "good, now I don't have to waste time on it". But
your findings only nerd-sniped me into digging more. ;)

> So it seems like the issue is t9211, and the hang happens in "scalar
> clone warns when background maintenance fails" specifically. What
> exactly the root cause is I have no clue though. Maybe an fsmonitor
> race, maybe something else entirely. Hard to say as I have never seen
> this happen on any other platform than macOS, and I do not have access
> to a Mac myself.
>
> The issue also doesn't seem to occur when running t9211 on its own, but
> only when running the full test suite. This may further indicate that
> there is a race condition, where the additional load increases the
> likelihood of it. Or there is bad interaction with another test.

I can reproduce it at will and relatively quickly using "--stress" with
t9211. I pushed up a hacky commit that removes all CI jobs except for
osx-clang, and it stops short of running the build/tests and opens a
shell using tmate. For reference (though you'd need to work out
something similar for GitLab):

  https://github.com/peff/git/commit/f825fa36ed95bed414b0d6d9e8b21799e2e167e4

And then just:

  make -j8
  cd t
  ./t9211-scalar-clone.sh --stress

Give it a minute or two, and you'll see most of the jobs have hung, with
one or two "winners" continuing (once most of them are hanging, the load
is low enough that the race doesn't happen). So you'll see 3.17, 3.18,
3.19, and so on, indicating that job 3 is still going and completing its
19th run. But everything else is stuck and stops producing output. You
can likewise see processes in "ps" that are a few minutes old, which is
another way to find the stuck ones.

And I get the same three processes as you: scalar clone, fetch, and
fsmonitor--daemon.

And here's where I ran into tooling issues. Normally I'd "strace -p" to
see what the hung processes are doing. We don't have that on macOS.
Doing "sudo dtruss -p" runs without complaint, but it looks like it
doesn't report on the current syscall (where we're presumably blocking).
I installed gdb, which does seem to work, but attaching to the running
processes doesn't show a useful backtrace (even after making sure to
build with "-g -O0", and confirming that regular "gdb ./git" works OK).

One can guess that scalar is in waitpid() waiting for git-fetch. But
what's fetch waiting on? The other side of upload-pack is dead.
According to lsof, it does have a unix socket open to fsmonitor. So
maybe it's trying to read there?

Curiously killing fsmonitor doesn't un-stick fetch, and nor does killing
fetch unstick scalar. So either my guesses above are wrong, or there's
something else weird causing them to hang.

I imagine there may be better tools to poke at things, but I'm at the
limits of my macOS knowledge. But maybe the recipe above is enough for
somebody more clueful to recreate and investigate the situation (it
probably would also be easy to just run the --stress script locally if
somebody actually has a mac).

-Peff

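On the "better tools" front, one more thing that might be worth a try (a
suggestion, not something verified on the runners): macOS ships
/usr/bin/sample, which records user-space call stacks of a running
process and often works where attaching gdb does not.

# pick whichever of scalar/fetch/fsmonitor--daemon is stuck
pid=$(pgrep -nf 'fsmonitor--daemon run')
sample "$pid" 5 -file /tmp/stuck.sample.txt	# ~5 seconds of stack samples
lsof -p "$pid" >/tmp/stuck.lsof.txt		# open files and sockets for context
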
On Fri, May 17, 2024 at 04:11:32AM -0400, Jeff King wrote:
> On Thu, May 16, 2024 at 02:36:19PM +0200, Patrick Steinhardt wrote:
[snip]
> One can guess that scalar is in waitpid() waiting for git-fetch. But
> what's fetch waiting on? The other side of upload-pack is dead.
> According to lsof, it does have a unix socket open to fsmonitor. So
> maybe it's trying to read there?

That was also my guess. I tried whether disabling fsmonitor via
`core.fsmonitor=false` helps, but that did not seem to be the case.
Either because it didn't have the desired effect, or because the root
cause is not fsmonitor. No idea which of the two it is.

> Curiously killing fsmonitor doesn't un-stick fetch, and nor does killing
> fetch unstick scalar. So either my guesses above are wrong, or there's
> something else weird causing them to hang.
>
> I imagine there may be better tools to poke at things, but I'm at the
> limits of my macOS knowledge. But maybe the recipe above is enough for
> somebody more clueful to recreate and investigate the situation (it
> probably would also be easy to just run the --stress script locally if
> somebody actually has a mac).

Let's hope it is :)

Patrick

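One way to tell which of the two it is would be to ask the cloned
repository directly; a small diagnostic sketch (the trash-directory path
is illustrative, not taken from the test output):

repo="trash directory.t9211-scalar-clone/maint-fail/src"
git -C "$repo" config --show-origin --get core.fsmonitor	# who set it, and to what
git -C "$repo" fsmonitor--daemon status				# is a daemon actually attached?
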
On Fri, May 17, 2024 at 10:25:20AM +0200, Patrick Steinhardt wrote:
> On Fri, May 17, 2024 at 04:11:32AM -0400, Jeff King wrote:
> > On Thu, May 16, 2024 at 02:36:19PM +0200, Patrick Steinhardt wrote:
> [snip]
> > One can guess that scalar is in waitpid() waiting for git-fetch. But
> > what's fetch waiting on? The other side of upload-pack is dead.
> > According to lsof, it does have a unix socket open to fsmonitor. So
> > maybe it's trying to read there?
>
> That was also my guess. I tried whether disabling fsmonitor via
> `core.fsmonitor=false` helps, but that did not seem to be the case.
> Either because it didn't have the desired effect, or because the root
> cause is not fsmonitor. No idea which of the two it is.

The root cause actually is the fsmonitor. I was using your tmate hack to
SSH into one of the failed jobs, and there had been 7 instances of the
fsmonitor lurking. After killing all of them the job got unstuck and ran
to completion.

The reason why setting `core.fsmonitor=false` is ineffective is because
in "scalar.c" we always configure `core.fsmonitor=true` in the repo
config and thus override the setting. I was checking whether it would
make sense to defer enabling the fsmonitor until after the fetch and
checkout have concluded. But funny enough, the below patch caused the
pipeline to now hang deterministically.

Puzzled.

Patrick

diff --git a/scalar.c b/scalar.c
index 7234049a1b..67f85c7adc 100644
--- a/scalar.c
+++ b/scalar.c
@@ -178,13 +178,6 @@ static int set_recommended_config(int reconfigure)
 				      config[i].key, config[i].value);
 	}
 
-	if (have_fsmonitor_support()) {
-		struct scalar_config fsmonitor = { "core.fsmonitor", "true" };
-		if (set_scalar_config(&fsmonitor, reconfigure))
-			return error(_("could not configure %s=%s"),
-				     fsmonitor.key, fsmonitor.value);
-	}
-
 	/*
 	 * The `log.excludeDecoration` setting is special because it allows
 	 * for multiple values.
@@ -539,6 +532,13 @@ static int cmd_clone(int argc, const char **argv)
 	if (res)
 		goto cleanup;
 
+	if (have_fsmonitor_support()) {
+		struct scalar_config fsmonitor = { "core.fsmonitor", "true" };
+		if (set_scalar_config(&fsmonitor, 0))
+			return error(_("could not configure %s=%s"),
+				     fsmonitor.key, fsmonitor.value);
+	}
+
 	res = register_dir();
 
 cleanup:

Part of the problem seems to be that the Github actions runner has a bug
on OSX: https://github.com/actions/runner/issues/884

Based on investigating this for a while by setting up a self-hosted actions
runner, it seems to have to do with a broken pipe triggering incomplete
output capture / termination detection by either Github Action Runner (
see issue thread) or maybe even Dotnet Core's
System.Diagnostics.Process functionality.

As for the actual failing test, t9211, what I got on my machine was a
failure during clone: `unknown repository extension found: refstorage`.
In the trash directory, the .git/config did specify that extension.
Perhaps some interference coming from
t9500-gitweb-standalone-no-errors.sh, since it invokes:

> git config extensions.refstorage "$refstorage"

Cheers,
Phil

On Fri, May 17, 2024 at 7:30 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Fri, May 17, 2024 at 10:25:20AM +0200, Patrick Steinhardt wrote:
> > On Fri, May 17, 2024 at 04:11:32AM -0400, Jeff King wrote:
> > > On Thu, May 16, 2024 at 02:36:19PM +0200, Patrick Steinhardt wrote:
> > [snip]
> > > One can guess that scalar is in waitpid() waiting for git-fetch. But
> > > what's fetch waiting on? The other side of upload-pack is dead.
> > > According to lsof, it does have a unix socket open to fsmonitor. So
> > > maybe it's trying to read there?
> >
> > That was also my guess. I tried whether disabling fsmonitor via
> > `core.fsmonitor=false` helps, but that did not seem to be the case.
> > Either because it didn't have the desired effect, or because the root
> > cause is not fsmonitor. No idea which of the two it is.
>
> The root cause actually is the fsmonitor. I was using your tmate hack to
> SSH into one of the failed jobs, and there had been 7 instances of the
> fsmonitor lurking. After killing all of them the job got unstuck and ran
> to completion.
>
> The reason why setting `core.fsmonitor=false` is ineffective is because
> in "scalar.c" we always configure `core.fsmonitor=true` in the repo
> config and thus override the setting. I was checking whether it would
> make sense to defer enabling the fsmonitor until after the fetch and
> checkout have concluded. But funny enough, the below patch caused the
> pipeline to now hang deterministically.
>
> Puzzled.
>
> Patrick
>
> diff --git a/scalar.c b/scalar.c
> index 7234049a1b..67f85c7adc 100644
> --- a/scalar.c
> +++ b/scalar.c
> @@ -178,13 +178,6 @@ static int set_recommended_config(int reconfigure)
>  				      config[i].key, config[i].value);
>  	}
>
> -	if (have_fsmonitor_support()) {
> -		struct scalar_config fsmonitor = { "core.fsmonitor", "true" };
> -		if (set_scalar_config(&fsmonitor, reconfigure))
> -			return error(_("could not configure %s=%s"),
> -				     fsmonitor.key, fsmonitor.value);
> -	}
> -
>  	/*
>  	 * The `log.excludeDecoration` setting is special because it allows
>  	 * for multiple values.
> @@ -539,6 +532,13 @@ static int cmd_clone(int argc, const char **argv)
>  	if (res)
>  		goto cleanup;
>
> +	if (have_fsmonitor_support()) {
> +		struct scalar_config fsmonitor = { "core.fsmonitor", "true" };
> +		if (set_scalar_config(&fsmonitor, 0))
> +			return error(_("could not configure %s=%s"),
> +				     fsmonitor.key, fsmonitor.value);
> +	}
> +
>  	res = register_dir();
>
> cleanup:

Philip <philip.c.peterson@gmail.com> writes:

> Part of the problem seems to be that the Github actions runner has a bug
> on OSX: https://github.com/actions/runner/issues/884
>
> Based on investigating this for a while by setting up a self-hosted actions
> runner, it seems to have to do with a broken pipe triggering incomplete
> output capture / termination detection by either Github Action Runner (
> see issue thread) or maybe even Dotnet Core's
> System.Diagnostics.Process functionality.

Thanks for digging into this.

> As for the actual failing test, t9211, what I got on my machine was a
> failure during clone: `unknown repository extension found: refstorage`.
> In the trash directory, the .git/config did specify that extension.
> Perhaps some interference coming from
> t9500-gitweb-standalone-no-errors.sh, since it invokes:
>
>> git config extensions.refstorage "$refstorage"

Puzzled.  We run t9211 in "t/trash directory.t9211-whatever/"
directory with its own repository, so that what t9500 does in its
own playpen, "t/trash directory.t9500-gitweb-standalone-no-errors/"
directory would not interfere with it to begin with.  How would that
setting seep through to an unrelated test run next door?  It's not
like they share TCP port number or anything like that?

On Sun, May 26, 2024 at 12:23:18PM -0700, Junio C Hamano wrote:
> Philip <philip.c.peterson@gmail.com> writes:
>
> > Part of the problem seems to be that the Github actions runner has a bug
> > on OSX: https://github.com/actions/runner/issues/884
> >
> > Based on investigating this for a while by setting up a self-hosted actions
> > runner, it seems to have to do with a broken pipe triggering incomplete
> > output capture / termination detection by either Github Action Runner (
> > see issue thread) or maybe even Dotnet Core's
> > System.Diagnostics.Process functionality.
>
> Thanks for digging into this.

Indeed, thanks for digging. In any case, whatever it is, it cannot be
exclusively due to a bug with GitHub given that we see the same issue
happening with GitLab CI.

> > As for the actual failing test, t9211, what I got on my machine was a
> > failure during clone: `unknown repository extension found: refstorage`.
> > In the trash directory, the .git/config did specify that extension.
> > Perhaps some interference coming from
> > t9500-gitweb-standalone-no-errors.sh, since it invokes:
> >
> >> git config extensions.refstorage "$refstorage"
>
> Puzzled.  We run t9211 in "t/trash directory.t9211-whatever/"
> directory with its own repository, so that what t9500 does in its
> own playpen, "t/trash directory.t9500-gitweb-standalone-no-errors/"
> directory would not interfere with it to begin with.  How would that
> setting seep through to an unrelated test run next door?  It's not
> like they share TCP port number or anything like that?

This error looks somewhat weird to me. Why should anything part of Git
not recognize the refstorage extension? It almost feels as if there were
different versions of Git being used.

I'm quite positive by now that the error is somewhere in the fsmonitor.
I was double checking whether there is an issue with reuse of some of
its sockets across test suites. But given that the tests have different
HOMEs and that they have different repository paths, I couldn't really
find anything that might be reused across invocations of scalar(1) or
the git-fsmonitor--daemon(1).

Patrick

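If the "different versions of Git" theory is worth chasing, a quick skew
check from the build directory might confirm or rule it out
(illustrative only, not a command sequence from the thread):

./git version		# the just-built binary the tests should be using
/usr/bin/git version	# the system git that might sneak in instead
which -a git		# every git on PATH, in order
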
On Sun, May 26, 2024 at 02:34:05AM -0400, Philip wrote:

> Part of the problem seems to be that the Github actions runner has a bug
> on OSX: https://github.com/actions/runner/issues/884

I think this may be a red herring. The issue described there is that
processes in the runner environment are spawned with SIGPIPE ignored.
But one of the things we do in common-main.c is make sure SIGPIPE is set
back to the default. So unless the problem is in a non-git program (say,
an issue with the shell, or prove, or something), I don't think this
would affect us.

That issue also goes back to 2020, and anecdotally, I would say that the
hangs we are seeing started in the last 6 months or so.

-Peff

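Whether the runner really does spawn its children with SIGPIPE ignored
is cheap to probe, for example with a one-liner like this (a sketch, not
something from the issue thread): with the default disposition the
process kills itself before printing, while an inherited SIG_IGN lets
the message through.

# prints only if SIGPIPE is ignored in the environment that spawned us
perl -e 'kill "PIPE", $$; print "SIGPIPE is ignored in this environment\n"'
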
diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index 5f92a50271..13cc0fe807 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -285,7 +285,6 @@ jobs:
             pool: macos-13
           - jobname: osx-gcc
             cc: gcc-13
-            cc_package: gcc-13
             pool: macos-13
           - jobname: linux-gcc-default
             cc: gcc

Our osx-gcc job explicitly asks to install gcc-13. But since the GitHub
runner image already comes with gcc-13 installed, this is mostly doing
nothing (or in some cases it may install an incremental update over the
runner image). But worse, it recently started causing errors like:

  ==> Fetching gcc@13
  ==> Downloading https://ghcr.io/v2/homebrew/core/gcc/13/blobs/sha256:fb2403d97e2ce67eb441b54557cfb61980830f3ba26d4c5a1fe5ecd0c9730d1a
  ==> Pouring gcc@13--13.2.0.ventura.bottle.tar.gz
  Error: The `brew link` step did not complete successfully
  The formula built, but is not symlinked into /usr/local
  Could not symlink bin/c++-13
  Target /usr/local/bin/c++-13
  is a symlink belonging to gcc. You can unlink it:
    brew unlink gcc

which cause the whole CI job to bail. I didn't track down the root
cause, but I suspect it may be related to homebrew recently switching
the "gcc" default to gcc-14. And it may even be fixed when a new runner
image is released. But if we don't need to run brew at all, it's one
less thing for us to worry about.

Signed-off-by: Jeff King <peff@peff.net>
---
I don't think this was due to anything on our side. I tried to re-run
versions of git.git which had previously passed and ran into the same
issue.

I'd like to report that this let me get a successful CI run, but I'm
running into the thing where osx jobs seem to randomly hang sometimes
and hit the 6-hour timeout. But I did confirm that this lets us get to
the actual build/test, and not barf while installing dependencies.

 .github/workflows/main.yml | 1 -
 1 file changed, 1 deletion(-)

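If relying on the preinstalled compiler ever becomes a problem, a guard
along these lines in the CI setup step would at least fail fast with a
clear message instead of producing a confusing build error (a sketch,
not part of this patch):

if ! command -v gcc-13 >/dev/null 2>&1
then
	echo >&2 "gcc-13 is not preinstalled on this runner image;" \
		"update the osx-gcc job or reinstate the brew install"
	exit 1
fi
gcc-13 --version
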