diff mbox series

[3/3] builtin/maintenance: fix loose objects task emitting pack hash

Message ID c25b5333f60a5920c1fade06532e3379c6686908.1724053639.git.ps@pks.im (mailing list archive)
State Accepted
Commit 8311e3b5515ea161c74cfd9b032270c0f459f2b8
Headers show
Series Fixups for git-maintenance(1) tests | expand

Commit Message

Patrick Steinhardt Aug. 19, 2024, 7:48 a.m. UTC
The "loose-objects" maintenance tasks executes git-pack-objects(1) to
pack all loose objects into a new packfile. This command ends up
printing the hash of the packfile to stdout though, which clutters the
output of `git maintenance run`.

Fix this issue by disabling stdout of the child process.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/gc.c           |  6 ++++++
 t/t7900-maintenance.sh | 16 ++++++++++++++++
 2 files changed, 22 insertions(+)

Comments

Jeff King Aug. 19, 2024, 8:55 a.m. UTC | #1
On Mon, Aug 19, 2024 at 09:48:05AM +0200, Patrick Steinhardt wrote:

> The "loose-objects" maintenance tasks executes git-pack-objects(1) to
> pack all loose objects into a new packfile. This command ends up
> printing the hash of the packfile to stdout though, which clutters the
> output of `git maintenance run`.
> 
> Fix this issue by disabling stdout of the child process.

Ah, I wondered where that output was coming from.

> diff --git a/builtin/gc.c b/builtin/gc.c
> index 13bc0572a3..be75efa17a 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -1159,6 +1159,12 @@ static int pack_loose(struct maintenance_run_opts *opts)
>  
>  	pack_proc.in = -1;
>  
> +	/*
> +	 * git-pack-objects(1) ends up writing the pack hash to stdout, which
> +	 * we do not care for.
> +	 */
> +	pack_proc.out = -1;
> +
>  	if (start_command(&pack_proc)) {
>  		error(_("failed to start 'git pack-objects' process"));
>  		return 1;

I have not paid much attention to the "maintenance" stuff. It is a
little weird to me that it is not building on "git repack", which
already handles this, but perhaps there are reasons. Anyway, totally
unrelated to your patch (which looks good to me).

> +++ b/t/t7900-maintenance.sh
> @@ -978,4 +978,20 @@ test_expect_success '--detach causes maintenance to run in background' '
>  	)
>  '
>  
> +test_expect_success 'repacking loose objects is quiet' '
> +	test_when_finished "rm -rf repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +
> +		test_commit something &&
> +		git config set maintenance.gc.enabled false &&
> +		git config set maintenance.loose-objects.enabled true &&
> +		git config set maintenance.loose-objects.auto 1 &&
> +
> +		git maintenance run --quiet >out 2>&1 &&
> +		test_must_be_empty out
> +	)
> +'

I wondered if you needed --no-detach here to avoid a race, but I guess
as a non-auto run, it would never background?

-Peff
Patrick Steinhardt Aug. 19, 2024, 9:07 a.m. UTC | #2
On Mon, Aug 19, 2024 at 04:55:22AM -0400, Jeff King wrote:
> On Mon, Aug 19, 2024 at 09:48:05AM +0200, Patrick Steinhardt wrote:
> 
> > The "loose-objects" maintenance tasks executes git-pack-objects(1) to
> > pack all loose objects into a new packfile. This command ends up
> > printing the hash of the packfile to stdout though, which clutters the
> > output of `git maintenance run`.
> > 
> > Fix this issue by disabling stdout of the child process.
> 
> Ah, I wondered where that output was coming from.
> 
> > diff --git a/builtin/gc.c b/builtin/gc.c
> > index 13bc0572a3..be75efa17a 100644
> > --- a/builtin/gc.c
> > +++ b/builtin/gc.c
> > @@ -1159,6 +1159,12 @@ static int pack_loose(struct maintenance_run_opts *opts)
> >  
> >  	pack_proc.in = -1;
> >  
> > +	/*
> > +	 * git-pack-objects(1) ends up writing the pack hash to stdout, which
> > +	 * we do not care for.
> > +	 */
> > +	pack_proc.out = -1;
> > +
> >  	if (start_command(&pack_proc)) {
> >  		error(_("failed to start 'git pack-objects' process"));
> >  		return 1;
> 
> I have not paid much attention to the "maintenance" stuff. It is a
> little weird to me that it is not building on "git repack", which
> already handles this, but perhaps there are reasons. Anyway, totally
> unrelated to your patch (which looks good to me).

git-repack(1) is way less efficient than running git-pack-objects(1)
directly. I've also noticed that at one point in time when revamping how
we do housekeeping in Git.

It mostly boils down to git-repack(1) doing a connectivity check,
whereas git-pack-objects(1) doesn't. We just soak up every single loose
object, and then eventually we expire them via git-multi-pack-index(1)'s
"expire" subcommand.

> > +++ b/t/t7900-maintenance.sh
> > @@ -978,4 +978,20 @@ test_expect_success '--detach causes maintenance to run in background' '
> >  	)
> >  '
> >  
> > +test_expect_success 'repacking loose objects is quiet' '
> > +	test_when_finished "rm -rf repo" &&
> > +	git init repo &&
> > +	(
> > +		cd repo &&
> > +
> > +		test_commit something &&
> > +		git config set maintenance.gc.enabled false &&
> > +		git config set maintenance.loose-objects.enabled true &&
> > +		git config set maintenance.loose-objects.auto 1 &&
> > +
> > +		git maintenance run --quiet >out 2>&1 &&
> > +		test_must_be_empty out
> > +	)
> > +'
> 
> I wondered if you needed --no-detach here to avoid a race, but I guess
> as a non-auto run, it would never background?

Even the `--auto` run does not background. That was the case for
git-gc(1), but is not the case for git-maintenance(1). You now have to
pass `--detach` explicitly to cause it to background, which I think is
the saner way to do this anyway.

Patrick
Jeff King Aug. 19, 2024, 9:17 a.m. UTC | #3
On Mon, Aug 19, 2024 at 11:07:51AM +0200, Patrick Steinhardt wrote:

> > I have not paid much attention to the "maintenance" stuff. It is a
> > little weird to me that it is not building on "git repack", which
> > already handles this, but perhaps there are reasons. Anyway, totally
> > unrelated to your patch (which looks good to me).
> 
> git-repack(1) is way less efficient than running git-pack-objects(1)
> directly. I've also noticed that at one point in time when revamping how
> we do housekeeping in Git.
> 
> It mostly boils down to git-repack(1) doing a connectivity check,
> whereas git-pack-objects(1) doesn't. We just soak up every single loose
> object, and then eventually we expire them via git-multi-pack-index(1)'s
> "expire" subcommand.

Hmph. I'd have suggested that we should teach git-repack to do the more
efficient thing. I'm a bit worried about having parallel universes of
how maintenance works making it harder to reason about when or how
things happen, and how various concurrent / racy behaviors work.

But it's probably a bit late to re-open that (and certainly it's not
part of your series).

> > I wondered if you needed --no-detach here to avoid a race, but I guess
> > as a non-auto run, it would never background?
> 
> Even the `--auto` run does not background. That was the case for
> git-gc(1), but is not the case for git-maintenance(1). You now have to
> pass `--detach` explicitly to cause it to background, which I think is
> the saner way to do this anyway.

Am I misreading the documentation? The entry for maintenance.autoDetach
on 'next' says:

  If unset, the value of `gc.autoDetach` is used as a fallback. Defaults
  to true if both are unset, meaning that the maintenance process will
  detach.

-Peff
Patrick Steinhardt Aug. 19, 2024, 9:26 a.m. UTC | #4
On Mon, Aug 19, 2024 at 05:17:15AM -0400, Jeff King wrote:
> On Mon, Aug 19, 2024 at 11:07:51AM +0200, Patrick Steinhardt wrote:
> 
> > > I have not paid much attention to the "maintenance" stuff. It is a
> > > little weird to me that it is not building on "git repack", which
> > > already handles this, but perhaps there are reasons. Anyway, totally
> > > unrelated to your patch (which looks good to me).
> > 
> > git-repack(1) is way less efficient than running git-pack-objects(1)
> > directly. I've also noticed that at one point in time when revamping how
> > we do housekeeping in Git.
> > 
> > It mostly boils down to git-repack(1) doing a connectivity check,
> > whereas git-pack-objects(1) doesn't. We just soak up every single loose
> > object, and then eventually we expire them via git-multi-pack-index(1)'s
> > "expire" subcommand.
> 
> Hmph. I'd have suggested that we should teach git-repack to do the more
> efficient thing. I'm a bit worried about having parallel universes of
> how maintenance works making it harder to reason about when or how
> things happen, and how various concurrent / racy behaviors work.
> 
> But it's probably a bit late to re-open that (and certainly it's not
> part of your series).
> 
> > > I wondered if you needed --no-detach here to avoid a race, but I guess
> > > as a non-auto run, it would never background?
> > 
> > Even the `--auto` run does not background. That was the case for
> > git-gc(1), but is not the case for git-maintenance(1). You now have to
> > pass `--detach` explicitly to cause it to background, which I think is
> > the saner way to do this anyway.
> 
> Am I misreading the documentation? The entry for maintenance.autoDetach
> on 'next' says:
> 
>   If unset, the value of `gc.autoDetach` is used as a fallback. Defaults
>   to true if both are unset, meaning that the maintenance process will
>   detach.

You've omitted the important part:

	Many Git commands trigger automatic maintenance after they have
	written data into the repository. This boolean config option
	controls whether this automatic maintenance shall happen in the
	foreground or whether the maintenance process shall detach and
	continue to run in the background.

The `maintenance.autoDetach` setting only impacts auto-maintentance as
run via `run_auto_maintenance()`. The `--auto` flag is somewhat
orthogonal: it asks the git-maintenance(1) job to do nothing in case the
repository is already optimal.

For git-gc(1) we indeed did tie the `--auto` flag to backgrounding,
which is somewhat nonsensical. There are usecases where you may want to
pass `--auto`, but still have it run in the foreground. That's why we
handle this differently for git-maintenance(1), which requires you to
pass an explicit `--detach` flag.

Also, we cannot change the behaviour of git-maintenance(1) retroactively
to make `--auto` detach. While it already essentially did detach for
git-gc(1), that was a bug. E.g. when running as part of the scheduler,
we'd always have detached and thus ended up with a bunch of concurrent
git-gc(1) processes. So even though it does make sense for the scheduler
to use `--auto`, it wouldn't want the process to detach.

Patrick
Jeff King Aug. 19, 2024, 10:26 a.m. UTC | #5
On Mon, Aug 19, 2024 at 11:26:06AM +0200, Patrick Steinhardt wrote:

> > Am I misreading the documentation? The entry for maintenance.autoDetach
> > on 'next' says:
> > 
> >   If unset, the value of `gc.autoDetach` is used as a fallback. Defaults
> >   to true if both are unset, meaning that the maintenance process will
> >   detach.
> 
> You've omitted the important part:
> 
> 	Many Git commands trigger automatic maintenance after they have
> 	written data into the repository. This boolean config option
> 	controls whether this automatic maintenance shall happen in the
> 	foreground or whether the maintenance process shall detach and
> 	continue to run in the background.
> 
> The `maintenance.autoDetach` setting only impacts auto-maintentance as
> run via `run_auto_maintenance()`. The `--auto` flag is somewhat
> orthogonal: it asks the git-maintenance(1) job to do nothing in case the
> repository is already optimal.

Ah. I naively assumed that they did so by passing the "--auto" flag. But
I see now that the caller actually checks the config and passes
"--detach" or not.

That seems kind of unfriendly to scripted porcelains which want to
invoke it, since they have to reimplement that logic. The idea of "git
gc --auto" was that it provided a single API for scripts to invoke,
including respecting the user's config. Now that "maintenance --auto"
has taken that over, I'd have expected it to do the same.

To be clear, I don't feel all that strongly about it, but I'm not sure I
buy the argument that it is orthogonal, or that here:

> For git-gc(1) we indeed did tie the `--auto` flag to backgrounding,
> which is somewhat nonsensical. There are usecases where you may want to
> pass `--auto`, but still have it run in the foreground. That's why we
> handle this differently for git-maintenance(1), which requires you to
> pass an explicit `--detach` flag.

we couldn't just patch "--no-detach" for cases where you want to be sure
it is in the foreground.

> Also, we cannot change the behaviour of git-maintenance(1) retroactively
> to make `--auto` detach. While it already essentially did detach for
> git-gc(1), that was a bug. E.g. when running as part of the scheduler,
> we'd always have detached and thus ended up with a bunch of concurrent
> git-gc(1) processes. So even though it does make sense for the scheduler
> to use `--auto`, it wouldn't want the process to detach.

Backwards compatibility is a more compelling argument here, if we've had
"maintenance --auto" that didn't ever detach (though it sounds like it
did, via gc, anyway). But yes, one kicked off from a scheduler should be
using --no-detach, I'd think.

Like I said, I don't feel strongly enough to work on any changes here.
I'd hoped to never think about repository maintenance ever again. So you
can take these as just impressions of a (relatively) clueful user seeing
it for the first time. ;)

-Peff
Junio C Hamano Aug. 19, 2024, 5:05 p.m. UTC | #6
Jeff King <peff@peff.net> writes:

> On Mon, Aug 19, 2024 at 11:07:51AM +0200, Patrick Steinhardt wrote:
>
>> It mostly boils down to git-repack(1) doing a connectivity check,
>> whereas git-pack-objects(1) doesn't. We just soak up every single loose
>> object, and then eventually we expire them via git-multi-pack-index(1)'s
>> "expire" subcommand.
>
> Hmph. I'd have suggested that we should teach git-repack to do the more
> efficient thing. I'm a bit worried about having parallel universes of
> how maintenance works making it harder to reason about when or how
> things happen, and how various concurrent / racy behaviors work.

I'd suggest being careful before going there.

The above only explains why it is OK not to exclude unreachable
cruft, but does not address another thing we should be worried
about, which is the quality of the resulting pack.

Throwing a random set of object names at pack-objects in the order
that they are discovered by for_each_loose_file_in_objdir(), which
is what gc.c:pack_loose() does, would give no locality benefit that
walking the commits would.  If we assume that we will pack_loose()
often enough that we won't have huge number of objects in the
resulting pack, packing objects that are close in the history may
not matter much, but on the other hand, if we run pack_loose() too
often to produce a small pack, you would not have a great delta base
selection.

So we should probably monitor how much "badness" the pack_loose()
is causing, and if it turns out to be too much, we may need to
reconsider its design.  Being able to produce ultra-quickly a pack
whose layout and delta base choice would hurt runtime performance is
not a feature.

> But it's probably a bit late to re-open that (and certainly it's not
> part of your series).

True.

Thanks.
Patrick Steinhardt Aug. 20, 2024, 7:39 a.m. UTC | #7
On Mon, Aug 19, 2024 at 06:26:02AM -0400, Jeff King wrote:
> On Mon, Aug 19, 2024 at 11:26:06AM +0200, Patrick Steinhardt wrote:
> 
> > > Am I misreading the documentation? The entry for maintenance.autoDetach
> > > on 'next' says:
> > > 
> > >   If unset, the value of `gc.autoDetach` is used as a fallback. Defaults
> > >   to true if both are unset, meaning that the maintenance process will
> > >   detach.
> > 
> > You've omitted the important part:
> > 
> > 	Many Git commands trigger automatic maintenance after they have
> > 	written data into the repository. This boolean config option
> > 	controls whether this automatic maintenance shall happen in the
> > 	foreground or whether the maintenance process shall detach and
> > 	continue to run in the background.
> > 
> > The `maintenance.autoDetach` setting only impacts auto-maintentance as
> > run via `run_auto_maintenance()`. The `--auto` flag is somewhat
> > orthogonal: it asks the git-maintenance(1) job to do nothing in case the
> > repository is already optimal.
> 
> Ah. I naively assumed that they did so by passing the "--auto" flag. But
> I see now that the caller actually checks the config and passes
> "--detach" or not.
> 
> That seems kind of unfriendly to scripted porcelains which want to
> invoke it, since they have to reimplement that logic. The idea of "git
> gc --auto" was that it provided a single API for scripts to invoke,
> including respecting the user's config. Now that "maintenance --auto"
> has taken that over, I'd have expected it to do the same.
> 
> To be clear, I don't feel all that strongly about it, but I'm not sure I
> buy the argument that it is orthogonal, or that here:
> 
> > For git-gc(1) we indeed did tie the `--auto` flag to backgrounding,
> > which is somewhat nonsensical. There are usecases where you may want to
> > pass `--auto`, but still have it run in the foreground. That's why we
> > handle this differently for git-maintenance(1), which requires you to
> > pass an explicit `--detach` flag.
> 
> we couldn't just patch "--no-detach" for cases where you want to be sure
> it is in the foreground.

We certainly could. But honestly, your scripted use case you mention
above is even more of an argument why we shouldn't do it, in my opinion.
We have long had the stance that the behaviour of plumbing tools should
_not_ be impacted by the user configuration. And detaching based on some
config to me very much sounds like the exact opposite.

Mind you, we are all quite used to `git gc --auto` detaching. But if I
were new to the project, I'd find it quite surprising that it may or may
not detach if all I want it to do is to decide for itself whether it
needs to garbage collect or not. It is much more straight forward and
way less surprising for a script writer to use `--detach` if they want
the script to detach, because now the command does what they want
without them having to worry about the user's config.

> > Also, we cannot change the behaviour of git-maintenance(1) retroactively
> > to make `--auto` detach. While it already essentially did detach for
> > git-gc(1), that was a bug. E.g. when running as part of the scheduler,
> > we'd always have detached and thus ended up with a bunch of concurrent
> > git-gc(1) processes. So even though it does make sense for the scheduler
> > to use `--auto`, it wouldn't want the process to detach.
> 
> Backwards compatibility is a more compelling argument here, if we've had
> "maintenance --auto" that didn't ever detach (though it sounds like it
> did, via gc, anyway). But yes, one kicked off from a scheduler should be
> using --no-detach, I'd think.

Yes, we did, but as mentioned it was buggy. Once the scheduler kicks
off, you'd now have N git-gc(1) processes all running in parallel to
each other. With N being large you will certainly face some issues. You
also lose the exit code, which is another issue.

But as you said, you could make the scheduler pass `--no-detach`. In
fact, the first versions of this patch series were using your approach,
where I changed `git maintenance run --auto` to detach based on the
config. But after some thought (and after seeing the negative fallout
that this had on our test suite) I decided to throw this approach away
because it just didn't feel right to me.

> Like I said, I don't feel strongly enough to work on any changes here.
> I'd hoped to never think about repository maintenance ever again. So you
> can take these as just impressions of a (relatively) clueful user seeing
> it for the first time. ;)

I certainly appreciate the discussion, thanks for chiming in! I'm still
not convinced that we should continue to couple auto-maintenance and
backgrounding to each other. In my opinion, this behaviour was a mistake
in the past and continues to surprise now, too. Making it an explicit
option feels more natural to me.

That being said, when others feel strongly about this, as well, then I'm
of course happy to adapt.

Patrick
Junio C Hamano Aug. 20, 2024, 3:58 p.m. UTC | #8
Patrick Steinhardt <ps@pks.im> writes:

> I certainly appreciate the discussion, thanks for chiming in! I'm still
> not convinced that we should continue to couple auto-maintenance and
> backgrounding to each other. In my opinion, this behaviour was a mistake
> in the past and continues to surprise now, too. Making it an explicit
> option feels more natural to me.
>
> That being said, when others feel strongly about this, as well, then I'm
> of course happy to adapt.

FWIW, I find it is a sensible approach to have a separate "run in
the background" that is not strongly tied to "do your thing if you
think the repository really needs it".

Thanks.
diff mbox series

Patch

diff --git a/builtin/gc.c b/builtin/gc.c
index 13bc0572a3..be75efa17a 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -1159,6 +1159,12 @@  static int pack_loose(struct maintenance_run_opts *opts)
 
 	pack_proc.in = -1;
 
+	/*
+	 * git-pack-objects(1) ends up writing the pack hash to stdout, which
+	 * we do not care for.
+	 */
+	pack_proc.out = -1;
+
 	if (start_command(&pack_proc)) {
 		error(_("failed to start 'git pack-objects' process"));
 		return 1;
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index 46a61d66fb..7cc4eb262c 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -978,4 +978,20 @@  test_expect_success '--detach causes maintenance to run in background' '
 	)
 '
 
+test_expect_success 'repacking loose objects is quiet' '
+	test_when_finished "rm -rf repo" &&
+	git init repo &&
+	(
+		cd repo &&
+
+		test_commit something &&
+		git config set maintenance.gc.enabled false &&
+		git config set maintenance.loose-objects.enabled true &&
+		git config set maintenance.loose-objects.auto 1 &&
+
+		git maintenance run --quiet >out 2>&1 &&
+		test_must_be_empty out
+	)
+'
+
 test_done