Message ID | 20231024122631.158415-4-karthik.188@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | rev-list: add support for commits in `--missing` | expand |
Karthik Nayak <karthik.188@gmail.com> writes: > @@ -1176,7 +1181,11 @@ static int process_parents(struct rev_info *revs,... > break; > continue; > } > - return -1; > + > + if (!revs->do_not_die_on_missing_objects) > + return -1; > + else > + oidset_insert(&revs->missing_objects, &p->object.oid); I would suspect that swapping if/else would make it easier to follow. Everybody else in the patch guards the use of the oidset with "were we told not to die on missing objects?", i.e., if (revs->do_not_die_on_missing_objects) oidset_insert(&revs->missing_objects, &p->object.oid); else return -1; /* corrupt repository */ > @@ -3800,6 +3809,9 @@ int prepare_revision_walk(struct rev_info *revs) > FOR_EACH_OBJECT_PROMISOR_ONLY); > } > > + if (revs->do_not_die_on_missing_objects) > + oidset_init(&revs->missing_objects, 0); I read the patch to make sure that .missing_objects oidset is used only when .do_not_die_on_missing_objects is set and the oidset is untouched unless it is initialized. Well done. I know I floated "perhaps oidset can replace the object bits, especially because the number of objects that need marking is expected to be small", but I am curious what the performance implication of this would be. Is this something we can create comparison easily? I noticed that nobody releases the resource held by this new oidset. Shouldn't we do so in revision.c:release_revisions()? Thanks.
Junio C Hamano <gitster@pobox.com> writes: > I noticed that nobody releases the resource held by this new oidset. > Shouldn't we do so in revision.c:release_revisions()? It seems that linux-leaks CI job noticed the same. https://github.com/git/git/actions/runs/6633599458/job/18021612518#step:5:2949 I wonder if the following is sufficient? revision.c | 2 ++ 1 file changed, 2 insertions(+) diff --git c/revision.c w/revision.c index 724a116401..7a67ff74dc 100644 --- c/revision.c +++ w/revision.c @@ -3136,6 +3136,8 @@ void release_revisions(struct rev_info *revs) clear_decoration(&revs->merge_simplification, free); clear_decoration(&revs->treesame, free); line_log_free(revs); + if (revs->do_not_die_on_missing_objects) + oidset_clear(&revs->missing_objects); } static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
On Tue, Oct 24, 2023 at 02:26:31PM +0200, Karthik Nayak wrote: > The `--missing` object option in rev-list currently works only with > missing blobs/trees. For missing commits the revision walker fails with > a fatal error. > > Let's extend the functionality of `--missing` option to also support > commit objects. This is done by adding a `missing_objects` field to > `rev_info`. This field is an `oidset` to which we'll add the missing > commits as we encounter them. The revision walker will now continue the > traversal and call `show_commit()` even for missing commits. In rev-list > we can then check if the commit is a missing commit and call the > existing code for parsing `--missing` objects. > > A scenario where this option would be used is to find the boundary > objects between different object directories. Consider a repository with > a main object directory (GIT_OBJECT_DIRECTORY) and one or more alternate > object directories (GIT_ALTERNATE_OBJECT_DIRECTORIES). In such a > repository, using the `--missing=print` option while disabling the > alternate object directory allows us to find the boundary objects > between the main and alternate object directory. > > Helped-by: Patrick Steinhardt <ps@pks.im> > Signed-off-by: Karthik Nayak <karthik.188@gmail.com> > --- > builtin/rev-list.c | 6 +++ > list-objects.c | 3 ++ > revision.c | 16 +++++++- > revision.h | 4 ++ > t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++++++++++ > 5 files changed, 101 insertions(+), 2 deletions(-) > create mode 100755 t/t6022-rev-list-missing.sh > > diff --git a/builtin/rev-list.c b/builtin/rev-list.c > index 98542e8b3c..37b52520b5 100644 > --- a/builtin/rev-list.c > +++ b/builtin/rev-list.c > @@ -149,6 +149,12 @@ static void show_commit(struct commit *commit, void *data) > > display_progress(progress, ++progress_counter); > > + if (revs->do_not_die_on_missing_objects && > + oidset_contains(&revs->missing_objects, &commit->object.oid)) { > + finish_object__ma(&commit->object); > + return; > + } > + > if (show_disk_usage) > total_disk_usage += get_object_disk_usage(&commit->object); > > diff --git a/list-objects.c b/list-objects.c > index 47296dff2f..260089388c 100644 > --- a/list-objects.c > +++ b/list-objects.c > @@ -389,6 +389,9 @@ static void do_traverse(struct traversal_context *ctx) > */ > if (!ctx->revs->tree_objects) > ; /* do not bother loading tree */ > + else if (ctx->revs->do_not_die_on_missing_objects && > + oidset_contains(&ctx->revs->missing_objects, &commit->object.oid)) > + ; > else if (repo_get_commit_tree(the_repository, commit)) { > struct tree *tree = repo_get_commit_tree(the_repository, > commit); > diff --git a/revision.c b/revision.c > index 219dc76716..e60646c1a7 100644 > --- a/revision.c > +++ b/revision.c > @@ -6,6 +6,7 @@ > #include "object-name.h" > #include "object-file.h" > #include "object-store-ll.h" > +#include "oidset.h" > #include "tag.h" > #include "blob.h" > #include "tree.h" > @@ -1112,6 +1113,9 @@ static int process_parents(struct rev_info *revs, struct commit *commit, > > if (commit->object.flags & ADDED) > return 0; > + if (revs->do_not_die_on_missing_objects && > + oidset_contains(&revs->missing_objects, &commit->object.oid)) Nit: indentation is off here. > + return 0; > commit->object.flags |= ADDED; > > if (revs->include_check && > @@ -1168,7 +1172,8 @@ static int process_parents(struct rev_info *revs, struct commit *commit, > for (parent = commit->parents; parent; parent = parent->next) { > struct commit *p = parent->item; > int gently = revs->ignore_missing_links || > - revs->exclude_promisor_objects; > + revs->exclude_promisor_objects || > + revs->do_not_die_on_missing_objects; > if (repo_parse_commit_gently(revs->repo, p, gently) < 0) { > if (revs->exclude_promisor_objects && > is_promisor_object(&p->object.oid)) { > @@ -1176,7 +1181,11 @@ static int process_parents(struct rev_info *revs, struct commit *commit, > break; > continue; > } > - return -1; > + > + if (!revs->do_not_die_on_missing_objects) > + return -1; > + else > + oidset_insert(&revs->missing_objects, &p->object.oid); > } > if (revs->sources) { > char **slot = revision_sources_at(revs->sources, p); > @@ -3800,6 +3809,9 @@ int prepare_revision_walk(struct rev_info *revs) > FOR_EACH_OBJECT_PROMISOR_ONLY); > } > > + if (revs->do_not_die_on_missing_objects) > + oidset_init(&revs->missing_objects, 0); > + While we're initializing the new oidset, we never clear it. We should probably call `oidset_clear()` in `release_revisions()`. And if we unconditionally initialized the oidset here then we can also call `oiadset_clear()` unconditionally there, which should be preferable given that `oidset_init()` does not allocate memory when no initial size was given. > if (!revs->reflog_info) > prepare_to_use_bloom_filter(revs); > if (!revs->unsorted_input) > diff --git a/revision.h b/revision.h > index c73c92ef40..f6bf422f0e 100644 > --- a/revision.h > +++ b/revision.h > @@ -4,6 +4,7 @@ > #include "commit.h" > #include "grep.h" > #include "notes.h" > +#include "oidset.h" > #include "pretty.h" > #include "diff.h" > #include "commit-slab-decl.h" > @@ -373,6 +374,9 @@ struct rev_info { > > /* Location where temporary objects for remerge-diff are written. */ > struct tmp_objdir *remerge_objdir; > + > + /* Missing objects to be tracked without failing traversal. */ > + struct oidset missing_objects; As far as I can see we only use this set to track missing commits, but none of the other objects. The name thus feels a bit misleading to me, as a reader might rightfully assume that it contains _all_ missing objects after the revwalk. Should we rename it to `missing_commits` to clarify? Patrick > }; > > /** > diff --git a/t/t6022-rev-list-missing.sh b/t/t6022-rev-list-missing.sh > new file mode 100755 > index 0000000000..40265a4f66 > --- /dev/null > +++ b/t/t6022-rev-list-missing.sh > @@ -0,0 +1,74 @@ > +#!/bin/sh > + > +test_description='handling of missing objects in rev-list' > + > +TEST_PASSES_SANITIZE_LEAK=true > +. ./test-lib.sh > + > +# We setup the repository with two commits, this way HEAD is always > +# available and we can hide commit 1. > +test_expect_success 'create repository and alternate directory' ' > + test_commit 1 && > + test_commit 2 && > + test_commit 3 > +' > + > +for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t" > +do > + test_expect_success "rev-list --missing=error fails with missing object $obj" ' > + oid="$(git rev-parse $obj)" && > + path=".git/objects/$(test_oid_to_path $oid)" && > + > + mv "$path" "$path.hidden" && > + test_when_finished "mv $path.hidden $path" && > + > + test_must_fail git rev-list --missing=error --objects \ > + --no-object-names HEAD > + ' > +done > + > +for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t" > +do > + for action in "allow-any" "print" > + do > + test_expect_success "rev-list --missing=$action with missing $obj" ' > + oid="$(git rev-parse $obj)" && > + path=".git/objects/$(test_oid_to_path $oid)" && > + > + # Before the object is made missing, we use rev-list to > + # get the expected oids. > + git rev-list --objects --no-object-names \ > + HEAD ^$obj >expect.raw && > + > + # Blobs are shared by all commits, so evethough a commit/tree > + # might be skipped, its blob must be accounted for. > + if [ $obj != "HEAD:1.t" ]; then > + echo $(git rev-parse HEAD:1.t) >>expect.raw && > + echo $(git rev-parse HEAD:2.t) >>expect.raw > + fi && > + > + mv "$path" "$path.hidden" && > + test_when_finished "mv $path.hidden $path" && > + > + git rev-list --missing=$action --objects --no-object-names \ > + HEAD >actual.raw && > + > + # When the action is to print, we should also add the missing > + # oid to the expect list. > + case $action in > + allow-any) > + ;; > + print) > + grep ?$oid actual.raw && > + echo ?$oid >>expect.raw > + ;; > + esac && > + > + sort actual.raw >actual && > + sort expect.raw >expect && > + test_cmp expect actual > + ' > + done > +done > + > +test_done > -- > 2.42.0 >
On Tue, Oct 24, 2023 at 7:45 PM Junio C Hamano <gitster@pobox.com> wrote: > I would suspect that swapping if/else would make it easier to > follow. Everybody else in the patch guards the use of the oidset > with "were we told not to die on missing objects?", i.e., > > if (revs->do_not_die_on_missing_objects) > oidset_insert(&revs->missing_objects, &p->object.oid); > else > return -1; /* corrupt repository */ > Makes sense. Will change. > > @@ -3800,6 +3809,9 @@ int prepare_revision_walk(struct rev_info *revs) > > FOR_EACH_OBJECT_PROMISOR_ONLY); > > } > > > > + if (revs->do_not_die_on_missing_objects) > > + oidset_init(&revs->missing_objects, 0); > > I read the patch to make sure that .missing_objects oidset is used > only when .do_not_die_on_missing_objects is set and the oidset is > untouched unless it is initialized. Well done. > > I know I floated "perhaps oidset can replace the object bits, > especially because the number of objects that need marking is > expected to be small", but I am curious what the performance > implication of this would be. Is this something we can create > comparison easily? I did a simple comparision here, I randomly deleted commits which had child commits with greater than 2 parents. $ git rev-list --missing=print HEAD | grep "?" | wc -l 828 Using the flag bit: $ hyperfine -w 3 "~/git/bin-wrappers/git rev-list --missing=allow-any HEAD" Benchmark 1: ~/git/bin-wrappers/git rev-list --missing=allow-any HEAD Time (mean ± σ): 860.5 ms ± 15.2 ms [User: 375.3 ms, System: 467.5 ms] Range (min … max): 835.2 ms … 881.0 ms 10 runs Using the oidset: $ hyperfine -w 3 "~/git/bin-wrappers/git rev-list --missing=allow-any HEAD" Benchmark 1: ~/git/bin-wrappers/git rev-list --missing=allow-any HEAD Time (mean ± σ): 901.3 ms ± 9.6 ms [User: 394.3 ms, System: 488.3 ms] Range (min … max): 885.0 ms … 913.2 ms 10 runs Its definitely slower, but not by much. > > > I noticed that nobody releases the resource held by this new oidset. > > Shouldn't we do so in revision.c:release_revisions()? > > It seems that linux-leaks CI job noticed the same. > > https://github.com/git/git/actions/runs/6633599458/job/18021612518#step:5:2949 > > I wonder if the following is sufficient? > > revision.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git c/revision.c w/revision.c > index 724a116401..7a67ff74dc 100644 > --- c/revision.c > +++ w/revision.c > @@ -3136,6 +3136,8 @@ void release_revisions(struct rev_info *revs) > clear_decoration(&revs->merge_simplification, free); > clear_decoration(&revs->treesame, free); > line_log_free(revs); > + if (revs->do_not_die_on_missing_objects) > + oidset_clear(&revs->missing_objects); > } > > static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child) > Yup, this seems right and was missed, will add. On Wed, Oct 25, 2023 at 8:40 AM Patrick Steinhardt <ps@pks.im> wrote: > > > > > if (commit->object.flags & ADDED) > > return 0; > > + if (revs->do_not_die_on_missing_objects && > > + oidset_contains(&revs->missing_objects, &commit->object.oid)) > > Nit: indentation is off here. > Thanks, will fix. > > + > > + /* Missing objects to be tracked without failing traversal. */ > > + struct oidset missing_objects; > > As far as I can see we only use this set to track missing commits, but > none of the other objects. The name thus feels a bit misleading to me, > as a reader might rightfully assume that it contains _all_ missing > objects after the revwalk. Should we rename it to `missing_commits` to > clarify? > Fair enough, I was thinking of being future compatible. But probably best to be specific now and refactor as needed in the future. Will change. Thanks both for the review!
Patrick Steinhardt <ps@pks.im> writes: >> + if (revs->do_not_die_on_missing_objects) >> + oidset_init(&revs->missing_objects, 0); >> + > > While we're initializing the new oidset, we never clear it. We should > probably call `oidset_clear()` in `release_revisions()`. And if we > unconditionally initialized the oidset here then we can also call > `oiadset_clear()` unconditionally there, which should be preferable > given that `oidset_init()` does not allocate memory when no initial size > was given. Yup, I used the conditional one to match the above, but initializing unused oidset is cheap and frees us from having to worry about mistakes. I like your idea much better. >> + >> + /* Missing objects to be tracked without failing traversal. */ >> + struct oidset missing_objects; > > As far as I can see we only use this set to track missing commits, but > none of the other objects. The name thus feels a bit misleading to me, > as a reader might rightfully assume that it contains _all_ missing > objects after the revwalk. Should we rename it to `missing_commits` to > clarify? Again, very good suggestion. Thanks.
diff --git a/builtin/rev-list.c b/builtin/rev-list.c index 98542e8b3c..37b52520b5 100644 --- a/builtin/rev-list.c +++ b/builtin/rev-list.c @@ -149,6 +149,12 @@ static void show_commit(struct commit *commit, void *data) display_progress(progress, ++progress_counter); + if (revs->do_not_die_on_missing_objects && + oidset_contains(&revs->missing_objects, &commit->object.oid)) { + finish_object__ma(&commit->object); + return; + } + if (show_disk_usage) total_disk_usage += get_object_disk_usage(&commit->object); diff --git a/list-objects.c b/list-objects.c index 47296dff2f..260089388c 100644 --- a/list-objects.c +++ b/list-objects.c @@ -389,6 +389,9 @@ static void do_traverse(struct traversal_context *ctx) */ if (!ctx->revs->tree_objects) ; /* do not bother loading tree */ + else if (ctx->revs->do_not_die_on_missing_objects && + oidset_contains(&ctx->revs->missing_objects, &commit->object.oid)) + ; else if (repo_get_commit_tree(the_repository, commit)) { struct tree *tree = repo_get_commit_tree(the_repository, commit); diff --git a/revision.c b/revision.c index 219dc76716..e60646c1a7 100644 --- a/revision.c +++ b/revision.c @@ -6,6 +6,7 @@ #include "object-name.h" #include "object-file.h" #include "object-store-ll.h" +#include "oidset.h" #include "tag.h" #include "blob.h" #include "tree.h" @@ -1112,6 +1113,9 @@ static int process_parents(struct rev_info *revs, struct commit *commit, if (commit->object.flags & ADDED) return 0; + if (revs->do_not_die_on_missing_objects && + oidset_contains(&revs->missing_objects, &commit->object.oid)) + return 0; commit->object.flags |= ADDED; if (revs->include_check && @@ -1168,7 +1172,8 @@ static int process_parents(struct rev_info *revs, struct commit *commit, for (parent = commit->parents; parent; parent = parent->next) { struct commit *p = parent->item; int gently = revs->ignore_missing_links || - revs->exclude_promisor_objects; + revs->exclude_promisor_objects || + revs->do_not_die_on_missing_objects; if (repo_parse_commit_gently(revs->repo, p, gently) < 0) { if (revs->exclude_promisor_objects && is_promisor_object(&p->object.oid)) { @@ -1176,7 +1181,11 @@ static int process_parents(struct rev_info *revs, struct commit *commit, break; continue; } - return -1; + + if (!revs->do_not_die_on_missing_objects) + return -1; + else + oidset_insert(&revs->missing_objects, &p->object.oid); } if (revs->sources) { char **slot = revision_sources_at(revs->sources, p); @@ -3800,6 +3809,9 @@ int prepare_revision_walk(struct rev_info *revs) FOR_EACH_OBJECT_PROMISOR_ONLY); } + if (revs->do_not_die_on_missing_objects) + oidset_init(&revs->missing_objects, 0); + if (!revs->reflog_info) prepare_to_use_bloom_filter(revs); if (!revs->unsorted_input) diff --git a/revision.h b/revision.h index c73c92ef40..f6bf422f0e 100644 --- a/revision.h +++ b/revision.h @@ -4,6 +4,7 @@ #include "commit.h" #include "grep.h" #include "notes.h" +#include "oidset.h" #include "pretty.h" #include "diff.h" #include "commit-slab-decl.h" @@ -373,6 +374,9 @@ struct rev_info { /* Location where temporary objects for remerge-diff are written. */ struct tmp_objdir *remerge_objdir; + + /* Missing objects to be tracked without failing traversal. */ + struct oidset missing_objects; }; /** diff --git a/t/t6022-rev-list-missing.sh b/t/t6022-rev-list-missing.sh new file mode 100755 index 0000000000..40265a4f66 --- /dev/null +++ b/t/t6022-rev-list-missing.sh @@ -0,0 +1,74 @@ +#!/bin/sh + +test_description='handling of missing objects in rev-list' + +TEST_PASSES_SANITIZE_LEAK=true +. ./test-lib.sh + +# We setup the repository with two commits, this way HEAD is always +# available and we can hide commit 1. +test_expect_success 'create repository and alternate directory' ' + test_commit 1 && + test_commit 2 && + test_commit 3 +' + +for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t" +do + test_expect_success "rev-list --missing=error fails with missing object $obj" ' + oid="$(git rev-parse $obj)" && + path=".git/objects/$(test_oid_to_path $oid)" && + + mv "$path" "$path.hidden" && + test_when_finished "mv $path.hidden $path" && + + test_must_fail git rev-list --missing=error --objects \ + --no-object-names HEAD + ' +done + +for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t" +do + for action in "allow-any" "print" + do + test_expect_success "rev-list --missing=$action with missing $obj" ' + oid="$(git rev-parse $obj)" && + path=".git/objects/$(test_oid_to_path $oid)" && + + # Before the object is made missing, we use rev-list to + # get the expected oids. + git rev-list --objects --no-object-names \ + HEAD ^$obj >expect.raw && + + # Blobs are shared by all commits, so evethough a commit/tree + # might be skipped, its blob must be accounted for. + if [ $obj != "HEAD:1.t" ]; then + echo $(git rev-parse HEAD:1.t) >>expect.raw && + echo $(git rev-parse HEAD:2.t) >>expect.raw + fi && + + mv "$path" "$path.hidden" && + test_when_finished "mv $path.hidden $path" && + + git rev-list --missing=$action --objects --no-object-names \ + HEAD >actual.raw && + + # When the action is to print, we should also add the missing + # oid to the expect list. + case $action in + allow-any) + ;; + print) + grep ?$oid actual.raw && + echo ?$oid >>expect.raw + ;; + esac && + + sort actual.raw >actual && + sort expect.raw >expect && + test_cmp expect actual + ' + done +done + +test_done
The `--missing` object option in rev-list currently works only with missing blobs/trees. For missing commits the revision walker fails with a fatal error. Let's extend the functionality of `--missing` option to also support commit objects. This is done by adding a `missing_objects` field to `rev_info`. This field is an `oidset` to which we'll add the missing commits as we encounter them. The revision walker will now continue the traversal and call `show_commit()` even for missing commits. In rev-list we can then check if the commit is a missing commit and call the existing code for parsing `--missing` objects. A scenario where this option would be used is to find the boundary objects between different object directories. Consider a repository with a main object directory (GIT_OBJECT_DIRECTORY) and one or more alternate object directories (GIT_ALTERNATE_OBJECT_DIRECTORIES). In such a repository, using the `--missing=print` option while disabling the alternate object directory allows us to find the boundary objects between the main and alternate object directory. Helped-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Karthik Nayak <karthik.188@gmail.com> --- builtin/rev-list.c | 6 +++ list-objects.c | 3 ++ revision.c | 16 +++++++- revision.h | 4 ++ t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++++++++++ 5 files changed, 101 insertions(+), 2 deletions(-) create mode 100755 t/t6022-rev-list-missing.sh