Message ID | 1fd83f726a04dfb5be27c74cb116618cb76be923.1627896460.git.ps@pks.im (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Speed up connectivity checks | expand |
On Mon, Aug 02 2021, Patrick Steinhardt wrote: > [[PGP Signed Part:Undecided]] > In order to compute whether objects reachable from a set of tips are all > connected, we do a revision walk with these tips as positive references > and `--not --all`. `--not --all` will cause the revision walk to load > all preexisting references as uninteresting, which can be very expensive > in repositories with many references. > > Benchmarking the git-rev-list(1) command highlights that by far the most > expensive single phase is initial sorting of the input revisions: after > all references have been loaded, we first sort commits by author date. > In a real-world repository with about 2.2 million references, it makes > up about 40% of the total runtime of git-rev-list(1). > > Ultimately, the connectivity check shouldn't really bother about the > order of input revisions at all. We only care whether we can actually > walk all objects until we hit the cut-off point. So sorting the input is > a complete waste of time. Really good results: > Introduce a new "--unsorted-input" flag to git-rev-list(1) which will > cause it to not sort the commits and adjust the connectivity check to > always pass the flag. This results in the following speedups, executed > in a clone of gitlab-org/gitlab [1]: > > Benchmark #1: git rev-list --objects --quiet --not --all --not $(cat newrev) > Time (mean ± σ): 7.639 s ± 0.065 s [User: 7.304 s, System: 0.335 s] > Range (min … max): 7.543 s … 7.742 s 10 runs > > Benchmark #2: git rev-list --unsorted-input --objects --quiet --not --all --not $newrev > Time (mean ± σ): 4.995 s ± 0.044 s [User: 4.657 s, System: 0.337 s] > Range (min … max): 4.909 s … 5.048 s 10 runs > > Summary > 'git rev-list --unsorted-input --objects --quiet --not --all --not $(cat newrev)' ran > 1.53 ± 0.02 times faster than 'git rev-list --objects --quiet --not --all --not $newrev' Just bikeshedding for a potential re-roll, perhaps --unordered-input, so that it matches/rhymes with the existing "git cat-file --unordered", which serves the same conceptual purpose (except this one's input, that one's output).
Patrick Steinhardt <ps@pks.im> writes: > In order to compute whether objects reachable from a set of tips are all > connected, we do a revision walk with these tips as positive references > and `--not --all`. `--not --all` will cause the revision walk to load > all preexisting references as uninteresting, which can be very expensive > in repositories with many references. > > Benchmarking the git-rev-list(1) command highlights that by far the most > expensive single phase is initial sorting of the input revisions: after > all references have been loaded, we first sort commits by author date. > In a real-world repository with about 2.2 million references, it makes > up about 40% of the total runtime of git-rev-list(1). Nice finding. > Ultimately, the connectivity check shouldn't really bother about the > order of input revisions at all. We only care whether we can actually > walk all objects until we hit the cut-off point. So sorting the input is > a complete waste of time. Sorting of positive side is done to help both performance and correctness in regular use of the traversal machinery, especially when reachability bitmap is not in effect, but on the negative side I do not think there is any downside to omit sorting offhand. The only case that may get affected is when the revision.c::SLOP kicks in to deal with oddball commits with incorrect committer timestamps, but then the result of the sorting isn't to be trusted anyway, so... > Introduce a new "--unsorted-input" flag to git-rev-list(1) which will > cause it to not sort the commits and adjust the connectivity check to > always pass the flag. This results in the following speedups, executed > in a clone of gitlab-org/gitlab [1]: > ... > [1]: https://gitlab.com/gitlab-org/gitlab.git. Note that not all refs > are visible to clients. So is this the 2.2 million refs thing? > @@ -3584,7 +3586,7 @@ int prepare_revision_walk(struct rev_info *revs) > > if (!revs->reflog_info) > prepare_to_use_bloom_filter(revs); > - if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED) > + if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED && !revs->unsorted_input) > commit_list_sort_by_date(&revs->commits); Looks quite straight-forward. I however suspect that in the longer term it may be cleaner to get rid of REVSISION_WALK_NO_WALK_UNSORTED if we do this. The knob that controls if we sort the initial traversal tips and the knob that controls if we walk from these tips used to be tied to --no-walk only because ca92e59e30b wanted to affect only no-walk case, but with your new finding, it clearly is not limited to the no-walk case to want to avoid sorting.
On Mon, Aug 02, 2021 at 02:49:29PM +0200, Ævar Arnfjörð Bjarmason wrote: > On Mon, Aug 02 2021, Patrick Steinhardt wrote: [snip] > > Introduce a new "--unsorted-input" flag to git-rev-list(1) which will > > cause it to not sort the commits and adjust the connectivity check to > > always pass the flag. This results in the following speedups, executed > > in a clone of gitlab-org/gitlab [1]: > > > > Benchmark #1: git rev-list --objects --quiet --not --all --not $(cat newrev) > > Time (mean ± σ): 7.639 s ± 0.065 s [User: 7.304 s, System: 0.335 s] > > Range (min … max): 7.543 s … 7.742 s 10 runs > > > > Benchmark #2: git rev-list --unsorted-input --objects --quiet --not --all --not $newrev > > Time (mean ± σ): 4.995 s ± 0.044 s [User: 4.657 s, System: 0.337 s] > > Range (min … max): 4.909 s … 5.048 s 10 runs > > > > Summary > > 'git rev-list --unsorted-input --objects --quiet --not --all --not $(cat newrev)' ran > > 1.53 ± 0.02 times faster than 'git rev-list --objects --quiet --not --all --not $newrev' > > Just bikeshedding for a potential re-roll, perhaps --unordered-input, so > that it matches/rhymes with the existing "git cat-file --unordered", > which serves the same conceptual purpose (except this one's input, that > one's output). Yeah, I wasn't quite sure how to name it myself either. Internally, we typically use "unsorted" instead of "unordered", and there's also the preexisting "--no-walk=(sorted|unsorted)" flag for git-rev-list(1). With the latter in mind, I think that "unsorted" fits a bit better given that we already use it in the same command, but I don't particularly mind. For now, I'll keep "--unsorted-input", but please feel free to give me another shout if you disagree with my reasoning. Patrick
On Mon, Aug 02, 2021 at 12:00:02PM -0700, Junio C Hamano wrote: > Patrick Steinhardt <ps@pks.im> writes: [snip] > > Introduce a new "--unsorted-input" flag to git-rev-list(1) which will > > cause it to not sort the commits and adjust the connectivity check to > > always pass the flag. This results in the following speedups, executed > > in a clone of gitlab-org/gitlab [1]: > > ... > > [1]: https://gitlab.com/gitlab-org/gitlab.git. Note that not all refs > > are visible to clients. > > So is this the 2.2 million refs thing? Yeah, it is. The repo itself got 2.2 million refs, even though only 800k are publicly visible. It got even more when one considers its alternate, where it grows to 3.4 million in total. > > @@ -3584,7 +3586,7 @@ int prepare_revision_walk(struct rev_info *revs) > > > > if (!revs->reflog_info) > > prepare_to_use_bloom_filter(revs); > > - if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED) > > + if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED && !revs->unsorted_input) > > commit_list_sort_by_date(&revs->commits); > > Looks quite straight-forward. > > I however suspect that in the longer term it may be cleaner to get > rid of REVSISION_WALK_NO_WALK_UNSORTED if we do this. The knob that > controls if we sort the initial traversal tips and the knob that > controls if we walk from these tips used to be tied to --no-walk > only because ca92e59e30b wanted to affect only no-walk case, but > with your new finding, it clearly is not limited to the no-walk case > to want to avoid sorting. Right. The question also is what to do when the user calls `git rev-list --no-walk=sorted --unsorted-input`. Do we sort? Don't we? Should we mark these options as incompatible with each other and bail out? I guess just bailing out would be the easiest solution for now. Patrick
Patrick Steinhardt <ps@pks.im> writes: >> I however suspect that in the longer term it may be cleaner to get >> rid of REVSISION_WALK_NO_WALK_UNSORTED if we do this. The knob that >> controls if we sort the initial traversal tips and the knob that >> controls if we walk from these tips used to be tied to --no-walk >> only because ca92e59e30b wanted to affect only no-walk case, but >> with your new finding, it clearly is not limited to the no-walk case >> to want to avoid sorting. > > Right. The question also is what to do when the user calls `git rev-list > --no-walk=sorted --unsorted-input`. Do we sort? Don't we? Should we mark > these options as incompatible with each other and bail out? I guess just > bailing out would be the easiest solution for now. I'd say so. Even without the future clean-up to get rid of the NO_WALK_UNSORTED, the issue already exists with this series, and when in doubt, it is easiest to start tight and take our time to figure out what the right behaviour should be while we initially do not allow both to be used at the same time.
On Tue, Aug 03 2021, Patrick Steinhardt wrote: > [[PGP Signed Part:Undecided]] > On Mon, Aug 02, 2021 at 02:49:29PM +0200, Ævar Arnfjörð Bjarmason wrote: >> On Mon, Aug 02 2021, Patrick Steinhardt wrote: > [snip] >> > Introduce a new "--unsorted-input" flag to git-rev-list(1) which will >> > cause it to not sort the commits and adjust the connectivity check to >> > always pass the flag. This results in the following speedups, executed >> > in a clone of gitlab-org/gitlab [1]: >> > >> > Benchmark #1: git rev-list --objects --quiet --not --all --not $(cat newrev) >> > Time (mean ± σ): 7.639 s ± 0.065 s [User: 7.304 s, System: 0.335 s] >> > Range (min … max): 7.543 s … 7.742 s 10 runs >> > >> > Benchmark #2: git rev-list --unsorted-input --objects --quiet --not --all --not $newrev >> > Time (mean ± σ): 4.995 s ± 0.044 s [User: 4.657 s, System: 0.337 s] >> > Range (min … max): 4.909 s … 5.048 s 10 runs >> > >> > Summary >> > 'git rev-list --unsorted-input --objects --quiet --not --all --not $(cat newrev)' ran >> > 1.53 ± 0.02 times faster than 'git rev-list --objects --quiet --not --all --not $newrev' >> >> Just bikeshedding for a potential re-roll, perhaps --unordered-input, so >> that it matches/rhymes with the existing "git cat-file --unordered", >> which serves the same conceptual purpose (except this one's input, that >> one's output). > > Yeah, I wasn't quite sure how to name it myself either. Internally, we > typically use "unsorted" instead of "unordered", and there's also the > preexisting "--no-walk=(sorted|unsorted)" flag for git-rev-list(1). With > the latter in mind, I think that "unsorted" fits a bit better given that > we already use it in the same command, but I don't particularly mind. > > For now, I'll keep "--unsorted-input", but please feel free to give me > another shout if you disagree with my reasoning. Sounds good. I didn't mean it as a strong suggestion, unsorted does indeed sounds better in this context, just a gentle poke to check if you knew about that similar option... Thanks.
diff --git a/connected.c b/connected.c index b18299fdf0..b5f9523a5f 100644 --- a/connected.c +++ b/connected.c @@ -106,6 +106,7 @@ int check_connected(oid_iterate_fn fn, void *cb_data, if (opt->progress) strvec_pushf(&rev_list.args, "--progress=%s", _("Checking connectivity")); + strvec_push(&rev_list.args, "--unsorted-input"); rev_list.git_cmd = 1; rev_list.env = opt->env; diff --git a/revision.c b/revision.c index cddd0542a6..7eb09009ba 100644 --- a/revision.c +++ b/revision.c @@ -2256,6 +2256,8 @@ static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg } else if (!strcmp(arg, "--author-date-order")) { revs->sort_order = REV_SORT_BY_AUTHOR_DATE; revs->topo_order = 1; + } else if (!strcmp(arg, "--unsorted-input")) { + revs->unsorted_input = 1; } else if (!strcmp(arg, "--early-output")) { revs->early_output = 100; revs->topo_order = 1; @@ -3584,7 +3586,7 @@ int prepare_revision_walk(struct rev_info *revs) if (!revs->reflog_info) prepare_to_use_bloom_filter(revs); - if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED) + if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED && !revs->unsorted_input) commit_list_sort_by_date(&revs->commits); if (revs->no_walk) return 0; diff --git a/revision.h b/revision.h index fbb068da9f..ef6998509a 100644 --- a/revision.h +++ b/revision.h @@ -134,6 +134,7 @@ struct rev_info { simplify_history:1, show_pulls:1, topo_order:1, + unsorted_input:1, simplify_merges:1, simplify_by_decoration:1, single_worktree:1,
In order to compute whether objects reachable from a set of tips are all connected, we do a revision walk with these tips as positive references and `--not --all`. `--not --all` will cause the revision walk to load all preexisting references as uninteresting, which can be very expensive in repositories with many references. Benchmarking the git-rev-list(1) command highlights that by far the most expensive single phase is initial sorting of the input revisions: after all references have been loaded, we first sort commits by author date. In a real-world repository with about 2.2 million references, it makes up about 40% of the total runtime of git-rev-list(1). Ultimately, the connectivity check shouldn't really bother about the order of input revisions at all. We only care whether we can actually walk all objects until we hit the cut-off point. So sorting the input is a complete waste of time. Introduce a new "--unsorted-input" flag to git-rev-list(1) which will cause it to not sort the commits and adjust the connectivity check to always pass the flag. This results in the following speedups, executed in a clone of gitlab-org/gitlab [1]: Benchmark #1: git rev-list --objects --quiet --not --all --not $(cat newrev) Time (mean ± σ): 7.639 s ± 0.065 s [User: 7.304 s, System: 0.335 s] Range (min … max): 7.543 s … 7.742 s 10 runs Benchmark #2: git rev-list --unsorted-input --objects --quiet --not --all --not $newrev Time (mean ± σ): 4.995 s ± 0.044 s [User: 4.657 s, System: 0.337 s] Range (min … max): 4.909 s … 5.048 s 10 runs Summary 'git rev-list --unsorted-input --objects --quiet --not --all --not $(cat newrev)' ran 1.53 ± 0.02 times faster than 'git rev-list --objects --quiet --not --all --not $newrev' [1]: https://gitlab.com/gitlab-org/gitlab.git. Note that not all refs are visible to clients. Signed-off-by: Patrick Steinhardt <ps@pks.im> --- connected.c | 1 + revision.c | 4 +++- revision.h | 1 + 3 files changed, 5 insertions(+), 1 deletion(-)