diff mbox series

builtin/gc: provide hint when maintenance hits a stale schedule lock

Message ID 20241119-pks-maintenance-hint-with-stale-lock-v1-1-f9f9a98e12a0@pks.im (mailing list archive)
State Accepted
Commit 656ca9204a6b5dd19c55842ac345343c3f56111b
Headers show
Series builtin/gc: provide hint when maintenance hits a stale schedule lock | expand

Commit Message

Patrick Steinhardt Nov. 19, 2024, 10:48 a.m. UTC
When running scheduled maintenance via `git maintenance start`, we
acquire a lockfile to ensure that no other scheduled maintenance task is
running in the repository concurrently. If so, we do provide an error to
the user hinting that another process seems to be running in this repo.

There are two important cases why such a lockfile may exist:

  - An actual git-maintenance(1) process is still running in this
    repository.

  - An earlier process may have crashed or was interrupted part way
    through and has left a stale lockfile behind.

In c95547a394 (builtin/gc: fix crash when running `git maintenance
start`, 2024-10-10), we have fixed an issue where git-maintenance(1)
would crash with the "start" subcommand, and the underlying bug causes
the second scenario to trigger quite often now.

Most users don't know how to get out of that situation again though.
Ideally, we'd be removing the stale lock for our users automatically.
But in the context of repository maintenance this is rather risky, as it
can easily run for hours or even days. So finding a clear point where we
know that the old process has exited is basically impossible.

We have the same issue in other subsystems, e.g. when locking refs. Our
lockfile interfaces thus provide the `unable_to_lock_message()` function
for exactly this purpose: it provides a nice hint to the user that
explains what is going on and how to get out of that situation again by
manually removing the file.

Adapt git-maintenance(1) to print a similar hint. While we could use the
above function, we can provide a bit more context as we know exactly
what kind of process would create the lockfile.

Reported-by: Miguel Rincon Barahona <mrincon@gitlab.com>
Reported-by: Kev Kloss <kkloss@gitlab.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
Hi,

this issue was reported to me internally at GitLab with the suggestion
of providing a hint for how to get out of the situation again.

Patrick
---
 builtin/gc.c           | 11 ++++++++++-
 t/t7900-maintenance.sh |  8 ++++++++
 2 files changed, 18 insertions(+), 1 deletion(-)


---
base-commit: 090d24e9af6e9f59c3f7bee97c42bb1ae3c7f559
change-id: 20241119-pks-maintenance-hint-with-stale-lock-35e5417d632b

Comments

Justin Tobler Nov. 19, 2024, 5:17 p.m. UTC | #1
On 24/11/19 11:48AM, Patrick Steinhardt wrote:
> When running scheduled maintenance via `git maintenance start`, we
> acquire a lockfile to ensure that no other scheduled maintenance task is
> running in the repository concurrently. If so, we do provide an error to
> the user hinting that another process seems to be running in this repo.
> 
> There are two important cases why such a lockfile may exist:
> 
>   - An actual git-maintenance(1) process is still running in this
>     repository.
> 
>   - An earlier process may have crashed or was interrupted part way
>     through and has left a stale lockfile behind.
> 
> In c95547a394 (builtin/gc: fix crash when running `git maintenance
> start`, 2024-10-10), we have fixed an issue where git-maintenance(1)
> would crash with the "start" subcommand, and the underlying bug causes
> the second scenario to trigger quite often now.
> 
> Most users don't know how to get out of that situation again though.
> Ideally, we'd be removing the stale lock for our users automatically.
> But in the context of repository maintenance this is rather risky, as it
> can easily run for hours or even days. So finding a clear point where we
> know that the old process has exited is basically impossible.

Indeed, providing a hint to help affected users here make sense to me.

> 
> We have the same issue in other subsystems, e.g. when locking refs. Our
> lockfile interfaces thus provide the `unable_to_lock_message()` function
> for exactly this purpose: it provides a nice hint to the user that
> explains what is going on and how to get out of that situation again by
> manually removing the file.
> 
> Adapt git-maintenance(1) to print a similar hint. While we could use the
> above function, we can provide a bit more context as we know exactly
> what kind of process would create the lockfile.

The added context provided by having a more specific message seems
appropriate. Looking at the message provided through
`unable_to_lock_message()`, I could see the generic example it provides
being a bit confusing here.

> 
> Reported-by: Miguel Rincon Barahona <mrincon@gitlab.com>
> Reported-by: Kev Kloss <kkloss@gitlab.com>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
> Hi,
> 
> this issue was reported to me internally at GitLab with the suggestion
> of providing a hint for how to get out of the situation again.
> 
> Patrick
> ---
>  builtin/gc.c           | 11 ++++++++++-
>  t/t7900-maintenance.sh |  8 ++++++++
>  2 files changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/builtin/gc.c b/builtin/gc.c
> index d52735354c9f87ba4e8acb593dd11aa0482223e1..34848626e47c777112994e5b5c558b65952ac8c2 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -2890,8 +2890,17 @@ static int update_background_schedule(const struct maintenance_start_opts *opts,
>  	char *lock_path = xstrfmt("%s/schedule", the_repository->objects->odb->path);
>  
>  	if (hold_lock_file_for_update(&lk, lock_path, LOCK_NO_DEREF) < 0) {
> +		if (errno == EEXIST)
> +			error(_("unable to create '%s.lock': %s.\n\n"
> +			    "Another scheduled git-maintenance(1) process seems to be running in this\n"
> +			    "repository. Please make sure no other maintenance processes are running and\n"
> +			    "then try again. If it still fails, a git-maintenance(1) process may have\n"
> +			    "crashed in this repository earlier: remove the file manually to continue."),
> +			    absolute_path(lock_path), strerror(errno));
> +		else
> +			error_errno(_("cannot acquire lock for scheduled background maintenance"));
>  		free(lock_path);
> -		return error(_("another process is scheduling background maintenance"));
> +		return -1;
>  	}
>  
>  	for (i = 1; i < ARRAY_SIZE(scheduler_fn); i++) {
> diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
> index c224c8450c85f567bc29258e18b4a59fe6682f0a..6d6ffaaf376bdbadecdb23a460994af1d218dc19 100755
> --- a/t/t7900-maintenance.sh
> +++ b/t/t7900-maintenance.sh
> @@ -1011,4 +1011,12 @@ test_expect_success 'repacking loose objects is quiet' '
>  	)
>  '
>  
> +test_expect_success 'maintenance aborts with existing lock file' '
> +	test_when_finished "rm -rf repo" &&
> +	git init repo &&
> +	: >repo/.git/objects/schedule.lock &&
> +	test_must_fail git -C repo maintenance start 2>err &&
> +	test_grep "Another scheduled git-maintenance(1) process seems to be running" err
> +'
> +
>  test_done
> 
> ---
> base-commit: 090d24e9af6e9f59c3f7bee97c42bb1ae3c7f559
> change-id: 20241119-pks-maintenance-hint-with-stale-lock-35e5417d632b

The changes in this patch are straightforward and look good to me :)

-Justin
Jeff King Nov. 22, 2024, 3:30 p.m. UTC | #2
On Tue, Nov 19, 2024 at 11:48:43AM +0100, Patrick Steinhardt wrote:

> diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
> index c224c8450c85f567bc29258e18b4a59fe6682f0a..6d6ffaaf376bdbadecdb23a460994af1d218dc19 100755
> --- a/t/t7900-maintenance.sh
> +++ b/t/t7900-maintenance.sh
> @@ -1011,4 +1011,12 @@ test_expect_success 'repacking loose objects is quiet' '
>  	)
>  '
>  
> +test_expect_success 'maintenance aborts with existing lock file' '
> +	test_when_finished "rm -rf repo" &&
> +	git init repo &&
> +	: >repo/.git/objects/schedule.lock &&
> +	test_must_fail git -C repo maintenance start 2>err &&
> +	test_grep "Another scheduled git-maintenance(1) process seems to be running" err
> +'

This failed for me in all of the macos jobs of GitHub CI. Looks like the
err file contains:

  fatal: launchctl scheduler is not available

So maybe it's a platform image issue?

-Peff
diff mbox series

Patch

diff --git a/builtin/gc.c b/builtin/gc.c
index d52735354c9f87ba4e8acb593dd11aa0482223e1..34848626e47c777112994e5b5c558b65952ac8c2 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -2890,8 +2890,17 @@  static int update_background_schedule(const struct maintenance_start_opts *opts,
 	char *lock_path = xstrfmt("%s/schedule", the_repository->objects->odb->path);
 
 	if (hold_lock_file_for_update(&lk, lock_path, LOCK_NO_DEREF) < 0) {
+		if (errno == EEXIST)
+			error(_("unable to create '%s.lock': %s.\n\n"
+			    "Another scheduled git-maintenance(1) process seems to be running in this\n"
+			    "repository. Please make sure no other maintenance processes are running and\n"
+			    "then try again. If it still fails, a git-maintenance(1) process may have\n"
+			    "crashed in this repository earlier: remove the file manually to continue."),
+			    absolute_path(lock_path), strerror(errno));
+		else
+			error_errno(_("cannot acquire lock for scheduled background maintenance"));
 		free(lock_path);
-		return error(_("another process is scheduling background maintenance"));
+		return -1;
 	}
 
 	for (i = 1; i < ARRAY_SIZE(scheduler_fn); i++) {
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index c224c8450c85f567bc29258e18b4a59fe6682f0a..6d6ffaaf376bdbadecdb23a460994af1d218dc19 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -1011,4 +1011,12 @@  test_expect_success 'repacking loose objects is quiet' '
 	)
 '
 
+test_expect_success 'maintenance aborts with existing lock file' '
+	test_when_finished "rm -rf repo" &&
+	git init repo &&
+	: >repo/.git/objects/schedule.lock &&
+	test_must_fail git -C repo maintenance start 2>err &&
+	test_grep "Another scheduled git-maintenance(1) process seems to be running" err
+'
+
 test_done