Message ID | 31cf8f87a149c0fc8013b869e0e30364f3c60e01.1643364888.git.ps@pks.im (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | fetch: speed up mirror-fetches with many refs | expand |
(+cc Stolee, in case anything I'm saying here is wrong) On Fri, Jan 28, 2022 at 11:17:03AM +0100, Patrick Steinhardt wrote: > One thing to keep in mind though is that the commit-graph corrects > committer dates: > > * A commit with at least one parent has corrected committer date > equal to the maximum of its commiter date and one more than the > largest corrected committer date among its parents. This snippet refers to how correct committer dates are computed, not how the commit dates themselves are stored. Indeed, the corrected committer date is used to compute the corrected commit date offset, which is the "v2" generation number scheme (as opposed to topological levels, which make up "v1"). But that is entirely separate from the committer dates stored by the commit-graph file, which are faithful representations of the exact committer date attached to each commit. Looking at the very last few lines of the main loop in write_graph_chunk_data() (where the committer dates are stored): if (sizeof((*list)->date) > 4) packedDate[0] = htonl(((*list)->date >> 32) & 0x3); else packedDate[0] = 0; packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2); packedDate[1] = htonl((*list)->date); hashwrite(f, packedDate, 8); the low-order 34 bits are used to store the commit's `->date` field, and the remaining high-order 30 bits are used to store the generation number. (You can look in `fill_commit_graph_info()` to see that we only use those 34 bits to write back the date field). So I think this paragraph (and the ones related to it) about this being an approximation and that being OK since this is a heuristic can all go away. Thanks, Taylor
On Mon, Jan 31, 2022 at 05:53:14PM -0500, Taylor Blau wrote: > (+cc Stolee, in case anything I'm saying here is wrong) > > On Fri, Jan 28, 2022 at 11:17:03AM +0100, Patrick Steinhardt wrote: > > One thing to keep in mind though is that the commit-graph corrects > > committer dates: > > > > * A commit with at least one parent has corrected committer date > > equal to the maximum of its commiter date and one more than the > > largest corrected committer date among its parents. > > This snippet refers to how correct committer dates are computed, not how > the commit dates themselves are stored. > > Indeed, the corrected committer date is used to compute the corrected > commit date offset, which is the "v2" generation number scheme (as > opposed to topological levels, which make up "v1"). > > But that is entirely separate from the committer dates stored by the > commit-graph file, which are faithful representations of the exact > committer date attached to each commit. > > Looking at the very last few lines of the main loop in > write_graph_chunk_data() (where the committer dates are stored): > > if (sizeof((*list)->date) > 4) > packedDate[0] = htonl(((*list)->date >> 32) & 0x3); > else > packedDate[0] = 0; > > packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2); > packedDate[1] = htonl((*list)->date); > hashwrite(f, packedDate, 8); > > the low-order 34 bits are used to store the commit's `->date` field, and > the remaining high-order 30 bits are used to store the generation > number. (You can look in `fill_commit_graph_info()` to see that we only > use those 34 bits to write back the date field). > > So I think this paragraph (and the ones related to it) about this being > an approximation and that being OK since this is a heuristic can all go > away. > > Thanks, > Taylor Aha, that makes a lot of sense. Thanks a lot for correcting my misconception! I'll send a v2 of this patch series with the corrected commit message. Patrick
diff --git a/fetch-pack.c b/fetch-pack.c index dd6ec449f2..c5967e228e 100644 --- a/fetch-pack.c +++ b/fetch-pack.c @@ -696,26 +696,30 @@ static void mark_complete_and_common_ref(struct fetch_negotiator *negotiator, trace2_region_enter("fetch-pack", "parse_remote_refs_and_find_cutoff", NULL); for (ref = *refs; ref; ref = ref->next) { - struct object *o; + struct commit *commit; - if (!has_object_file_with_flags(&ref->old_oid, + commit = lookup_commit_in_graph(the_repository, &ref->old_oid); + if (!commit) { + struct object *o; + + if (!has_object_file_with_flags(&ref->old_oid, OBJECT_INFO_QUICK | - OBJECT_INFO_SKIP_FETCH_OBJECT)) - continue; - o = parse_object(the_repository, &ref->old_oid); - if (!o) - continue; + OBJECT_INFO_SKIP_FETCH_OBJECT)) + continue; + o = parse_object(the_repository, &ref->old_oid); + if (!o || o->type != OBJ_COMMIT) + continue; + + commit = (struct commit *)o; + } /* * We already have it -- which may mean that we were * in sync with the other side at some time after * that (it is OK if we guess wrong here). */ - if (o->type == OBJ_COMMIT) { - struct commit *commit = (struct commit *)o; - if (!cutoff || cutoff < commit->date) - cutoff = commit->date; - } + if (!cutoff || cutoff < commit->date) + cutoff = commit->date; } trace2_region_leave("fetch-pack", "parse_remote_refs_and_find_cutoff", NULL);
During packfile negotiation we iterate over all refs announced by the remote side to check whether their IDs refer to commits already known to us. If a commit is known to us already, then its date is a potential cutoff point for commits we have in common with the remote side. There is potentially a lot of commits announced by the remote depending on how many refs there are in the remote repository, and for every one of them we need to search for it in our object database and, if found, parse the corresponding object to find out whether it is a candidate for the cutoff date. This can be sped up by trying to look up commits via the commit-graph first, which is a lot more efficient. One thing to keep in mind though is that the commit-graph corrects committer dates: * A commit with at least one parent has corrected committer date equal to the maximum of its commiter date and one more than the largest corrected committer date among its parents. As a result, it may be that the commit date we load via the commit graph is more recent than it would have been when loaded via the ODB, and as a result we may also choose a more recent cutoff point. But as the code documents, this is only a heuristic and it is okay if we determine a wrong cutoff date. The worst that can happen is that we report more commits as HAVEs to the server when using corrected dates. Loading commits via the commit-graph is typically much faster than loading commits via the object database. Benchmarks in a repository with about 2,1 million refs and an up-to-date commit-graph show a 20% speedup when mirror-fetching: Benchmark 1: git fetch --atomic +refs/*:refs/* (v2.35.0) Time (mean ± σ): 75.264 s ± 1.115 s [User: 68.199 s, System: 10.094 s] Range (min … max): 74.145 s … 76.862 s 5 runs Benchmark 2: git fetch --atomic +refs/*:refs/* (HEAD) Time (mean ± σ): 62.350 s ± 0.854 s [User: 55.412 s, System: 9.976 s] Range (min … max): 61.224 s … 63.216 s 5 runs Summary 'git fetch --atomic +refs/*:refs/* (HEAD)' ran 1.21 ± 0.02 times faster than 'git fetch --atomic +refs/*:refs/* (v2.35.0)' Signed-off-by: Patrick Steinhardt <ps@pks.im> --- fetch-pack.c | 28 ++++++++++++++++------------ 1 file changed, 16 insertions(+), 12 deletions(-)