[v4,0/6] Speed up connectivity checks

Message ID	cover.1628162156.git.ps@pks.im (mailing list archive)
Headers	show Return-Path: <git-owner@kernel.org> Date: Thu, 5 Aug 2021 13:25:19 +0200 From: Patrick Steinhardt <ps@pks.im> To: git@vger.kernel.org Cc: Jeff King <peff@peff.net>, Felipe Contreras <felipe.contreras@gmail.com>, SZEDER =?iso-8859-1?q?G=E1bor?= <szeder.dev@gmail.com>, Chris Torek <chris.torek@gmail.com>, =?iso-8859-1?q?=C6var_Arnfj=F6r=F0?= Bjarmason <avarab@gmail.com>, Junio C Hamano <gitster@pobox.com>, Taylor Blau <me@ttaylorr.com> Subject: [PATCH v4 0/6] Speed up connectivity checks Message-ID: <cover.1628162156.git.ps@pks.im> References: <cover.1627896460.git.ps@pks.im> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="ORyOJyGFPjfiSM9a" Content-Disposition: inline In-Reply-To: <cover.1627896460.git.ps@pks.im> Precedence: bulk
Series	Speed up connectivity checks \| expand [v4,0/6] Speed up connectivity checks [v4,1/6] revision: separate walk and unsorted flags [v4,2/6] connected: do not sort input revisions [v4,3/6] revision: stop retrieving reference twice [v4,4/6] revision: avoid loading object headers multiple times [v4,5/6] commit-graph: split out function to search commit position [v4,6/6] revision: avoid hitting packfiles when commits are in commit-graph

Hi, this is version 4 of my series to speed up connectivity checks. Given that v3 has received positive feedback, I finally stuck to the approach and only have a bunch of iterative changes based on your feedback. Changes compared to v3: - Patch 1/6 is new and splits up revs->no_walk into revs->no_walk which now only indicates whether to walk and revs->unserted_input, which indicates whether to sort input. - Patch 2/6 got some documentation for the new `--unsorted-input` option. Furthermore, we refuse `--no-walk`/`--no-walk=sorted` and `--unsorted-input` if used together. I've also added some tests for the new option. - Patch 3/6 has an updated commit message, detailing that `add_pending_oid()` only is a thin wrapper around `add_pending_object()`. - Patch 4/6 has an update commit message, stating that it's a prerequisite for 6/6. - Patch 5/6 is new, splitting out the logic to find a commit ID in the commit graph as a prerequisite for 6/6. - Patch 6/6 now also verifies that commits parsed only via the commit-graph exist in the ODB. I've also renamed `find_object_in_graph()` to `parse_commit_in_graph_gently()` to better reflect what the function does. With the added existence check in 6/6, the speedup is not as big as before (1.47x faster instead of 1.55x). But it's still very much worth it. In total, this patch series decreases `git rev-list --objects --unsorted --not --all --not $newrev` from 7.6s to 3.0s in my test repository. Thanks for all your feedback! Patrick Patrick Steinhardt (6): revision: separate walk and unsorted flags connected: do not sort input revisions revision: stop retrieving reference twice revision: avoid loading object headers multiple times commit-graph: split out function to search commit position revision: avoid hitting packfiles when commits are in commit-graph Documentation/rev-list-options.txt | 8 +++- builtin/log.c | 2 +- builtin/revert.c | 3 +- commit-graph.c | 75 +++++++++++++++++++++--------- commit-graph.h | 7 +++ connected.c | 1 + revision.c | 52 +++++++++++++++++---- revision.h | 7 +-- t/t6000-rev-list-misc.sh | 38 +++++++++++++++ 9 files changed, 153 insertions(+), 40 deletions(-) Range-diff against v3: -: ---------- > 1: 67232910ac revision: separate walk and unsorted flags 1: 1fd83f726a ! 2: 9d7f484907 connected: do not sort input revisions @@ Commit message Signed-off-by: Patrick Steinhardt <ps@pks.im> + ## Documentation/rev-list-options.txt ## +@@ Documentation/rev-list-options.txt: list of the missing objects. Object IDs are prefixed with a ``?'' character. + objects. + endif::git-rev-list[] + ++--unsorted-input:: ++ Show commits in the order they were given on the command line instead ++ of sorting them in reverse chronological order by commit time. Cannot ++ be combined with `--no-walk` or `--no-walk=sorted`. ++ + --no-walk[=(sorted|unsorted)]:: + Only show the given commits, but do not traverse their ancestors. + This has no effect if a range is specified. If the argument +@@ Documentation/rev-list-options.txt: endif::git-rev-list[] + given on the command line. Otherwise (if `sorted` or no argument + was given), the commits are shown in reverse chronological order + by commit time. +- Cannot be combined with `--graph`. ++ Cannot be combined with `--graph`. Cannot be combined with ++ `--unsorted-input` if `sorted` or no argument was given. + + --do-walk:: + Overrides a previous `--no-walk`. + ## connected.c ## @@ connected.c: int check_connected(oid_iterate_fn fn, void *cb_data, if (opt->progress) @@ revision.c: static int handle_revision_opt(struct rev_info *revs, int argc, cons revs->sort_order = REV_SORT_BY_AUTHOR_DATE; revs->topo_order = 1; + } else if (!strcmp(arg, "--unsorted-input")) { ++ if (revs->no_walk && !revs->unsorted_input) ++ die(_("--unsorted-input is incompatible with --no-walk and --no-walk=sorted")); + revs->unsorted_input = 1; } else if (!strcmp(arg, "--early-output")) { revs->early_output = 100; revs->topo_order = 1; -@@ revision.c: int prepare_revision_walk(struct rev_info *revs) - - if (!revs->reflog_info) - prepare_to_use_bloom_filter(revs); -- if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED) -+ if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED && !revs->unsorted_input) - commit_list_sort_by_date(&revs->commits); - if (revs->no_walk) - return 0; +@@ revision.c: static int handle_revision_pseudo_opt(const char *submodule, + } else if (!strcmp(arg, "--not")) { + *flags ^= UNINTERESTING | BOTTOM; + } else if (!strcmp(arg, "--no-walk")) { ++ if (revs->unsorted_input) ++ die(_("--no-walk is incompatible with --no-walk=unsorted and --unsorted-input")); + revs->no_walk = 1; + } else if (skip_prefix(arg, "--no-walk=", &optarg)) { + /* +@@ revision.c: static int handle_revision_pseudo_opt(const char *submodule, + * not allowed, since the argument is optional. + */ + revs->no_walk = 1; +- if (!strcmp(optarg, "sorted")) ++ if (!strcmp(optarg, "sorted")) { ++ if (revs->unsorted_input) ++ die(_("--no-walk=sorted is incompatible with --no-walk=unsorted " ++ "and --unsorted-input")); + revs->unsorted_input = 0; +- else if (!strcmp(optarg, "unsorted")) ++ } else if (!strcmp(optarg, "unsorted")) + revs->unsorted_input = 1; + else + return error("invalid argument to --no-walk"); - ## revision.h ## -@@ revision.h: struct rev_info { - simplify_history:1, - show_pulls:1, - topo_order:1, -+ unsorted_input:1, - simplify_merges:1, - simplify_by_decoration:1, - single_worktree:1, + ## t/t6000-rev-list-misc.sh ## +@@ t/t6000-rev-list-misc.sh: test_expect_success 'rev-list --count --objects' ' + test_line_count = $count actual + ' + ++test_expect_success 'rev-list --unsorted-input results in different sorting' ' ++ git rev-list --unsorted-input HEAD HEAD~ >first && ++ git rev-list --unsorted-input HEAD~ HEAD >second && ++ ! test_cmp first second && ++ sort first >first.sorted && ++ sort second >second.sorted && ++ test_cmp first.sorted second.sorted ++' ++ ++test_expect_success 'rev-list --unsorted-input compatible with --no-walk=unsorted' ' ++ git rev-list --unsorted-input --no-walk=unsorted HEAD HEAD~ >actual && ++ git rev-parse HEAD >expect && ++ git rev-parse HEAD~ >>expect && ++ test_cmp expect actual ++' ++ ++test_expect_success 'rev-list --unsorted-input incompatible with --no-walk=sorted' ' ++ cat >expect <<-EOF && ++ fatal: --no-walk is incompatible with --no-walk=unsorted and --unsorted-input ++ EOF ++ test_must_fail git rev-list --unsorted-input --no-walk HEAD 2>error && ++ test_cmp expect error && ++ ++ cat >expect <<-EOF && ++ fatal: --no-walk=sorted is incompatible with --no-walk=unsorted and --unsorted-input ++ EOF ++ test_must_fail git rev-list --unsorted-input --no-walk=sorted HEAD 2>error && ++ test_cmp expect error && ++ ++ cat >expect <<-EOF && ++ fatal: --unsorted-input is incompatible with --no-walk and --no-walk=sorted ++ EOF ++ test_must_fail git rev-list --no-walk --unsorted-input HEAD 2>error && ++ test_cmp expect error && ++ test_must_fail git rev-list --no-walk=sorted --unsorted-input HEAD 2>error && ++ test_cmp expect error ++' ++ + test_done 2: db85480649 ! 3: d8e63d0943 revision: stop retrieving reference twice @@ Commit message revision: stop retrieving reference twice When queueing up references for the revision walk, `handle_one_ref()` - will resolve the reference's object ID and then queue the ID as pending - object via `add_pending_oid()`. But `add_pending_oid()` will again try - to resolve the object ID to an object, effectively duplicating the work - its caller already did before. + will resolve the reference's object ID via `get_reference()` and then + queue the ID as pending object via `add_pending_oid()`. But given that + `add_pending_oid()` is only a thin wrapper around `add_pending_object()` + which fist calls `get_reference()`, we effectively resolve the reference + twice and thus duplicate some of the work. - Fix the issue by instead calling `add_pending_object()`, which takes the - already-resolved object as input. In a repository with lots of refs, - this translates in a nearly 10% speedup: + Fix the issue by instead calling `add_pending_object()` directly, which + takes the already-resolved object as input. In a repository with lots of + refs, this translates into a near 10% speedup: Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev Time (mean ± σ): 5.015 s ± 0.038 s [User: 4.698 s, System: 0.316 s] 3: b9897e102a ! 4: ba8df5cad0 revision: avoid loading object headers multiple times @@ Commit message either a tag or a commit, we'd have a modest increase in memory consumption of about 12.5% here. + Note that on its own, this patch may not seem like a clear win. But it + is a prerequisite for the following patch, which will result in another + 37% speedup. + Signed-off-by: Patrick Steinhardt <ps@pks.im> ## revision.c ## -: ---------- > 5: e33cd51ebf commit-graph: split out function to search commit position 4: f6fc2a5e6d ! 6: 900c5a9c60 revision: avoid hitting packfiles when commits are in commit-graph @@ Commit message directly fill in the commit object, otherwise we can still hit the disk to determine the object's type. - Expose a new function `find_object_in_graph()`, which fills in an object - of unknown type in case we find its object ID in the graph. This - provides a big performance win in cases where there is a commit-graph - available in the repository in case we load lots of references. The - following has been executed in a real-world repository with about 2.2 - million refs: + Expose a new function `parse_commit_in_graph_gently()`, which fills in + an object of unknown type in case we find its object ID in the graph. + This provides a big performance win in cases where there is a + commit-graph available in the repository in case we load lots of + references. The following has been executed in a real-world repository + with about 2.2 million refs: Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev - Time (mean ± σ): 4.465 s ± 0.037 s [User: 4.144 s, System: 0.320 s] - Range (min … max): 4.411 s … 4.514 s 10 runs + Time (mean ± σ): 4.508 s ± 0.039 s [User: 4.131 s, System: 0.377 s] + Range (min … max): 4.455 s … 4.576 s 10 runs Benchmark #2: HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev - Time (mean ± σ): 2.886 s ± 0.032 s [User: 2.570 s, System: 0.316 s] - Range (min … max): 2.826 s … 2.933 s 10 runs + Time (mean ± σ): 3.072 s ± 0.031 s [User: 2.707 s, System: 0.365 s] + Range (min … max): 3.040 s … 3.144 s 10 runs Summary 'HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' ran - 1.55 ± 0.02 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' + 1.47 ± 0.02 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' Signed-off-by: Patrick Steinhardt <ps@pks.im> ## commit-graph.c ## -@@ commit-graph.c: static int fill_commit_in_graph(struct repository *r, - return 1; +@@ commit-graph.c: static int find_commit_pos_in_graph(struct commit *item, struct commit_graph *g, + } } -+static int find_object_id_in_graph(const struct object_id *id, struct commit_graph *g, uint32_t *pos) -+{ -+ struct commit_graph *cur_g = g; -+ uint32_t lex_index; -+ -+ while (cur_g && !bsearch_graph(cur_g, (struct object_id *)id, &lex_index)) -+ cur_g = cur_g->base_graph; -+ -+ if (cur_g) { -+ *pos = lex_index + cur_g->num_commits_in_base; -+ return 1; -+ } -+ -+ return 0; -+} -+ -+int find_object_in_graph(struct repository *repo, struct object *object) ++int parse_commit_in_graph_gently(struct repository *repo, struct object *object) +{ + struct commit *commit; + uint32_t pos; @@ commit-graph.c: static int fill_commit_in_graph(struct repository *r, + if (!repo->objects->commit_graph) + return -1; + -+ if (!find_object_id_in_graph(&object->oid, repo->objects->commit_graph, &pos)) ++ if (!search_commit_pos_in_graph(&object->oid, repo->objects->commit_graph, &pos)) + return -1; + + commit = object_as_type(object, OBJ_COMMIT, 1); @@ commit-graph.c: static int fill_commit_in_graph(struct repository *r, + return 0; +} + - static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos) - { - uint32_t graph_pos = commit_graph_position(item); -@@ commit-graph.c: static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin - *pos = graph_pos; - return 1; - } else { -- struct commit_graph *cur_g = g; -- uint32_t lex_index; -- -- while (cur_g && !bsearch_graph(cur_g, &(item->object.oid), &lex_index)) -- cur_g = cur_g->base_graph; -- -- if (cur_g) { -- *pos = lex_index + cur_g->num_commits_in_base; -- return 1; -- } -- -- return 0; -+ return find_object_id_in_graph(&item->object.oid, g, pos); - } - } - + static int parse_commit_in_graph_one(struct repository *r, + struct commit_graph *g, + struct commit *item) ## commit-graph.h ## -@@ commit-graph.h: int write_commit_graph(struct object_directory *odb, - enum commit_graph_write_flags flags, - const struct commit_graph_opts *opts); +@@ commit-graph.h: int open_commit_graph(const char *graph_file, int *fd, struct stat *st); + */ + int parse_commit_in_graph(struct repository *r, struct commit *item); -+int find_object_in_graph(struct repository *repo, struct object *object); ++/* ++ * Given an object of unknown type, try to fill in the object in case it is a ++ * commit part of the commit-graph. Returns 0 if the object is a parsed commit ++ * or if it could be filled in via the commit graph, otherwise it returns -1. ++ */ ++int parse_commit_in_graph_gently(struct repository *repo, struct object *object); + - #define COMMIT_GRAPH_VERIFY_SHALLOW (1 << 0) - - int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags); + /* + * It is possible that we loaded commit contents from the commit buffer, + * but we also want to ensure the commit-graph content is correctly ## revision.c ## @@ revision.c: static struct object *get_reference(struct rev_info *revs, const char *name, @@ revision.c: static struct object *get_reference(struct rev_info *revs, const cha if (object->type == OBJ_NONE) { - int type = oid_object_info(revs->repo, oid, NULL); - if (type < 0 || !object_as_type(object, type, 1)) { -- object = NULL; -- goto out; -+ if (find_object_in_graph(revs->repo, object) < 0) { ++ /* ++ * It's likely that the reference points to a commit, so we ++ * first try to look it up via the commit-graph. If successful, ++ * then we know it's a commit and don't have to unpack the ++ * object header. We still need to assert that the object ++ * exists, but given that we don't request any info about the ++ * object this is a lot faster than `oid_object_info()`. ++ */ ++ if (parse_commit_in_graph_gently(revs->repo, object) < 0) { + int type = oid_object_info(revs->repo, oid, NULL); + if (type < 0 || !object_as_type(object, type, 1)) { + object = NULL; + goto out; + } ++ } else if (!repo_has_object_file(revs->repo, oid)) { + object = NULL; + goto out; } - } -

[v4,0/6] Speed up connectivity checks

Message