diff mbox series

[v1] builtin/pack-objects.c: ignore missing links with --stdin-packs

Message ID 815623da67d283e8509fc4ac67e939c6140fc39a.1616168441.git.me@ttaylorr.com (mailing list archive)
State New, archived
Headers show
Series [v1] builtin/pack-objects.c: ignore missing links with --stdin-packs | expand

Commit Message

Taylor Blau March 19, 2021, 3:40 p.m. UTC
When 'git pack-objects --stdin-packs' encounters a commit in a pack, it
marks it as a starting point of a best-effort reachability traversal
that is used to populate the name-hash of the objects listed in the
given packs.

The traversal expects that it should be able to walk the ancestors of
all commits in a pack without issue. Ordinarily this is the case, but it
is possible to having missing parents from an unreachable part of the
repository. In that case, we'd consider any missing objects in the
unreachable portion of the graph to be junk.

This should be handled gracefully: since the traversal is best-effort
(i.e., we don't strictly need to fill in all of the name-hash fields),
we should simply ignore any missing links.

This patch does that (by setting the 'ignore_missing_links' bit on the
rev_info struct), and ensures we don't regress in the future by adding a
test which demonstrates this case.

It is a little over-eager, since it will also ignore missing links in
reachable parts of the packs (which would indicate a corrupted
repository), but '--stdin-packs' is explicitly *not* about reachability.
So this step isn't making anything worse for a repository which contains
packs missing reachable objects (since we never drop objects with
'--stdin-packs').

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
I noticed a small number of repositories using geometric repacks
complained about missing links when doing the best-effort traversal to
fill in missing namehash fields with 'git pack-objects --stdin-packs'.

This patch takes care to handle that situation gracefully. It is based
on tb/geometric-repack, which is on 'next'.

 builtin/pack-objects.c |  1 +
 t/t5300-pack-object.sh | 38 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+)

--
2.30.0.667.g81c0cbc6fd

Comments

Junio C Hamano March 19, 2021, 6:19 p.m. UTC | #1
Taylor Blau <me@ttaylorr.com> writes:

> When 'git pack-objects --stdin-packs' encounters a commit in a pack, it
> marks it as a starting point of a best-effort reachability traversal
> that is used to populate the name-hash of the objects listed in the
> given packs.
>
> The traversal expects that it should be able to walk the ancestors of
> all commits in a pack without issue. Ordinarily this is the case, but it
> is possible to having missing parents from an unreachable part of the
> repository. In that case, we'd consider any missing objects in the
> unreachable portion of the graph to be junk.

Ah, OK.  So a pack that is being consolidated, or more likely a
loose commit that is rolled into the smallest geometric group) may
contain an unreachable commit, whose tree or blob has already been
pruned, which is an expected state (i.e. tree or blob may have been
older than the commit whose message may have been amended recently
before the entire commit got abandoned).  And we do not want to alarm
users by warning.

> This should be handled gracefully: since the traversal is best-effort
> (i.e., we don't strictly need to fill in all of the name-hash fields),
> we should simply ignore any missing links.

Or the entire set of objects that refer them can be omitted from the
resulting set of objects (iow, consider a commit that lacks a tree
or a blob to be checked out stale and prunable without downsides,
and prune it and its remaining trees and blobs by leaving them out
of the resulting pack), but I suspect that is a lot more involved
change.

> This patch does that (by setting the 'ignore_missing_links' bit on the
> rev_info struct), and ensures we don't regress in the future by adding a
> test which demonstrates this case.
>
> It is a little over-eager, since it will also ignore missing links in
> reachable parts of the packs (which would indicate a corrupted
> repository), but '--stdin-packs' is explicitly *not* about reachability.
> So this step isn't making anything worse for a repository which contains
> packs missing reachable objects (since we never drop objects with
> '--stdin-packs').

Yup.  I find the reasoning quite sensible.

Thanks, will queue.

> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
> I noticed a small number of repositories using geometric repacks
> complained about missing links when doing the best-effort traversal to
> fill in missing namehash fields with 'git pack-objects --stdin-packs'.
>
> This patch takes care to handle that situation gracefully. It is based
> on tb/geometric-repack, which is on 'next'.
>
>  builtin/pack-objects.c |  1 +
>  t/t5300-pack-object.sh | 38 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 39 insertions(+)
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 8cb32763b7..f513138513 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -3122,6 +3122,7 @@ static void read_packs_list_from_stdin(void)
>  	revs.blob_objects = 1;
>  	revs.tree_objects = 1;
>  	revs.tag_objects = 1;
> +	revs.ignore_missing_links = 1;
>
>  	while (strbuf_getline(&buf, stdin) != EOF) {
>  		if (!buf.len)
> diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
> index 7138a54595..ab509e8c38 100755
> --- a/t/t5300-pack-object.sh
> +++ b/t/t5300-pack-object.sh
> @@ -629,4 +629,42 @@ test_expect_success '--stdin-packs with loose objects' '
>  	)
>  '
>
> +test_expect_success '--stdin-packs with broken links' '
> +	(
> +		cd stdin-packs &&
> +
> +		# make an unreachable object with a bogus parent
> +		git cat-file -p HEAD >commit &&
> +		sed "s/$(git rev-parse HEAD^)/$(test_oid zero)/" <commit |
> +		git hash-object -w -t commit --stdin >in &&
> +
> +		git pack-objects .git/objects/pack/pack-D <in &&
> +
> +		PACK_A="$(basename .git/objects/pack/pack-A-*.pack)" &&
> +		PACK_B="$(basename .git/objects/pack/pack-B-*.pack)" &&
> +		PACK_C="$(basename .git/objects/pack/pack-C-*.pack)" &&
> +		PACK_D="$(basename .git/objects/pack/pack-D-*.pack)" &&
> +
> +		git pack-objects test3 --stdin-packs --unpacked <<-EOF &&
> +		$PACK_A
> +		^$PACK_B
> +		$PACK_C
> +		$PACK_D
> +		EOF
> +
> +		(
> +			git show-index <$(ls .git/objects/pack/pack-A-*.idx) &&
> +			git show-index <$(ls .git/objects/pack/pack-C-*.idx) &&
> +			git show-index <$(ls .git/objects/pack/pack-D-*.idx) &&
> +			git rev-list --objects --no-object-names \
> +				refs/tags/C..refs/tags/D
> +		) >expect.raw &&
> +		git show-index <$(ls test3-*.idx) >actual.raw &&
> +
> +		cut -d" " -f2 <expect.raw | sort >expect &&
> +		cut -d" " -f2 <actual.raw | sort >actual &&
> +		test_cmp expect actual
> +	)
> +'
> +
>  test_done
> --
> 2.30.0.667.g81c0cbc6fd
Jeff King March 19, 2021, 7:31 p.m. UTC | #2
On Fri, Mar 19, 2021 at 11:19:11AM -0700, Junio C Hamano wrote:

> > The traversal expects that it should be able to walk the ancestors of
> > all commits in a pack without issue. Ordinarily this is the case, but it
> > is possible to having missing parents from an unreachable part of the
> > repository. In that case, we'd consider any missing objects in the
> > unreachable portion of the graph to be junk.
> 
> Ah, OK.  So a pack that is being consolidated, or more likely a
> loose commit that is rolled into the smallest geometric group) may
> contain an unreachable commit, whose tree or blob has already been
> pruned, which is an expected state (i.e. tree or blob may have been
> older than the commit whose message may have been amended recently
> before the entire commit got abandoned).  And we do not want to alarm
> users by warning.

Yes, though it is not just warning, but rather that before this patch
we'd abort the whole repack.

> > This should be handled gracefully: since the traversal is best-effort
> > (i.e., we don't strictly need to fill in all of the name-hash fields),
> > we should simply ignore any missing links.
> 
> Or the entire set of objects that refer them can be omitted from the
> resulting set of objects (iow, consider a commit that lacks a tree
> or a blob to be checked out stale and prunable without downsides,
> and prune it and its remaining trees and blobs by leaving them out
> of the resulting pack), but I suspect that is a lot more involved
> change.

It is safe to omit the whole set from the name-hash traversal, which is
purely an optimization. But it would generally not be a good idea to
leave them out of the resulting pack, since that would mean deleting
them entirely from the repository (because we'll remove the old packs
they were in after pack-objects completes).

If they are truly unreachable, then it is not strictly wrong to delete
them (i.e., we are not corrupting a repository unless it was already
corrupted), but:

  - if the repository _is_ already corrupted, we are definitely making
    things worse

  - we generally try to keep even unreachable parts of the graph
    complete, doing things like keeping unreachable-but-old objects that
    are reachable from unreachable-but-recent. Again, we know here that
    the object graph is incomplete, so anybody pointing a ref at a
    descendant of our broken commit is already corrupting the
    repository. But it probably makes sense to follow the existing rules
    as much as possible, and not make such a situation worse.

> > It is a little over-eager, since it will also ignore missing links in
> > reachable parts of the packs (which would indicate a corrupted
> > repository), but '--stdin-packs' is explicitly *not* about reachability.
> > So this step isn't making anything worse for a repository which contains
> > packs missing reachable objects (since we never drop objects with
> > '--stdin-packs').
> 
> Yup.  I find the reasoning quite sensible.
> 
> Thanks, will queue.

I had seen and discussed the patch before it hit the list, but just to
make it explicit: it also looks good to me. ;)

-Peff
Junio C Hamano March 19, 2021, 8:55 p.m. UTC | #3
Jeff King <peff@peff.net> writes:

> If they are truly unreachable, then it is not strictly wrong to delete
> them (i.e., we are not corrupting a repository unless it was already
> corrupted), but:
>
>   - if the repository _is_ already corrupted, we are definitely making
>     things worse

Yes, that is a reason enough not to eject a commit whose tree is
incomplete.
diff mbox series

Patch

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 8cb32763b7..f513138513 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3122,6 +3122,7 @@  static void read_packs_list_from_stdin(void)
 	revs.blob_objects = 1;
 	revs.tree_objects = 1;
 	revs.tag_objects = 1;
+	revs.ignore_missing_links = 1;

 	while (strbuf_getline(&buf, stdin) != EOF) {
 		if (!buf.len)
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 7138a54595..ab509e8c38 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -629,4 +629,42 @@  test_expect_success '--stdin-packs with loose objects' '
 	)
 '

+test_expect_success '--stdin-packs with broken links' '
+	(
+		cd stdin-packs &&
+
+		# make an unreachable object with a bogus parent
+		git cat-file -p HEAD >commit &&
+		sed "s/$(git rev-parse HEAD^)/$(test_oid zero)/" <commit |
+		git hash-object -w -t commit --stdin >in &&
+
+		git pack-objects .git/objects/pack/pack-D <in &&
+
+		PACK_A="$(basename .git/objects/pack/pack-A-*.pack)" &&
+		PACK_B="$(basename .git/objects/pack/pack-B-*.pack)" &&
+		PACK_C="$(basename .git/objects/pack/pack-C-*.pack)" &&
+		PACK_D="$(basename .git/objects/pack/pack-D-*.pack)" &&
+
+		git pack-objects test3 --stdin-packs --unpacked <<-EOF &&
+		$PACK_A
+		^$PACK_B
+		$PACK_C
+		$PACK_D
+		EOF
+
+		(
+			git show-index <$(ls .git/objects/pack/pack-A-*.idx) &&
+			git show-index <$(ls .git/objects/pack/pack-C-*.idx) &&
+			git show-index <$(ls .git/objects/pack/pack-D-*.idx) &&
+			git rev-list --objects --no-object-names \
+				refs/tags/C..refs/tags/D
+		) >expect.raw &&
+		git show-index <$(ls test3-*.idx) >actual.raw &&
+
+		cut -d" " -f2 <expect.raw | sort >expect &&
+		cut -d" " -f2 <actual.raw | sort >actual &&
+		test_cmp expect actual
+	)
+'
+
 test_done