Message ID | 646ac90e62aab4e4aec595d6848b60233bbe8c77.1629452412.git.ps@pks.im (mailing list archive)
---|---
State | Superseded
Series | Speed up mirror-fetches with many refs
On 8/20/2021 6:08 AM, Patrick Steinhardt wrote:
> When fetching refs, we are doing two connectivity checks:
>
>     - The first one in `fetch_refs()` is done such that we can
>       short-circuit the case where we already have all objects
>       referenced by the updated set of refs.
>
>     - The second one in `store_updated_refs()` does a sanity check that
>       we have all objects after we have fetched the packfile.
>
> We always execute both connectivity checks, but this is wasteful in case
> the first connectivity check already notices that we have all objects
> locally available.
>
> Refactor the code to do both connectivity checks in `fetch_refs()`,
> which allows us to easily skip the second connectivity check if we
> already have all objects available. This refactoring is safe to do given
> that we always call `fetch_refs()` followed by `consume_refs()`, which
> is the only caller of `store_updated_refs()`.

Should we try to make it more clear that fetch_refs() must be followed
by consume_refs() via a comment above the fetch_refs(), or possibly even
its call sites?

Thanks,
-Stolee
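For illustration, the kind of comment being asked about might look roughly like the sketch below; the wording is only a guess at what such a comment could say and is not something proposed in the thread:

    /*
     * fetch_refs() only fetches the objects referenced by ref_map and
     * verifies that they are connected; it does not update any local
     * refs.  Callers must therefore always follow it up with a call to
     * consume_refs(), which is what ends up calling store_updated_refs().
     */
    static int fetch_refs(struct transport *transport, struct ref *ref_map);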
On Fri, Aug 20, 2021 at 10:47:11AM -0400, Derrick Stolee wrote:
> On 8/20/2021 6:08 AM, Patrick Steinhardt wrote:
> > When fetching refs, we are doing two connectivity checks:
> >
> >     - The first one in `fetch_refs()` is done such that we can
> >       short-circuit the case where we already have all objects
> >       referenced by the updated set of refs.
> >
> >     - The second one in `store_updated_refs()` does a sanity check that
> >       we have all objects after we have fetched the packfile.
> >
> > We always execute both connectivity checks, but this is wasteful in case
> > the first connectivity check already notices that we have all objects
> > locally available.
> >
> > Refactor the code to do both connectivity checks in `fetch_refs()`,
> > which allows us to easily skip the second connectivity check if we
> > already have all objects available. This refactoring is safe to do given
> > that we always call `fetch_refs()` followed by `consume_refs()`, which
> > is the only caller of `store_updated_refs()`.
>
> Should we try to make it more clear that fetch_refs() must be followed
> by consume_refs() via a comment above the fetch_refs(), or possibly even
> its call sites?

I wasn't quite happy with this outcome, either. How about we instead
merge both functions into `fetch_and_consume_refs()`? Both are quite
short, and that makes sure we always call them together to make the
requirement explicit. I'll add another patch to do this refactoring.

Patrick
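To make that idea concrete, a merged function could look roughly like the sketch below. The name `fetch_and_consume_refs()` is taken from the mail above, but the body is only pieced together for illustration from `fetch_refs()` and `consume_refs()` as they stand after this patch; the actual follow-up patch may well differ:

    static int fetch_and_consume_refs(struct transport *transport, struct ref *ref_map)
    {
        int ret;

        /*
         * Short-circuit: if all refs are already satisfied by local
         * objects, there is nothing to fetch and nothing to check.
         */
        ret = check_exist_and_connected(ref_map);
        if (ret) {
            trace2_region_enter("fetch", "fetch_refs", the_repository);
            ret = transport_fetch_refs(transport, ref_map);
            trace2_region_leave("fetch", "fetch_refs", the_repository);
            if (ret)
                goto out;

            /*
             * Unless the transport already verified connectivity for us,
             * do it here; this is the check that used to live in
             * store_updated_refs().
             */
            if (!transport->smart_options ||
                !transport->smart_options->connectivity_checked) {
                /* Iterate over a copy so that ref_map stays intact. */
                struct ref *rm = ref_map;
                if (check_connected(iterate_ref_map, &rm, NULL)) {
                    ret = error(_("remote did not send all necessary objects\n"));
                    goto out;
                }
            }
        }

        trace2_region_enter("fetch", "consume_refs", the_repository);
        ret = store_updated_refs(transport->url, transport->remote->name, ref_map);
        trace2_region_leave("fetch", "consume_refs", the_repository);

    out:
        transport_unlock_pack(transport);
        return ret;
    }

The point of such a merge is that callers only ever see a single entry point, so the pairing requirement discussed above no longer needs to be documented or remembered at each call site.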
diff --git a/builtin/fetch.c b/builtin/fetch.c
index 20fcfe0f45..088a8af13b 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1068,7 +1068,7 @@ N_("It took %.2f seconds to check forced updates. You can use\n"
    " to avoid this check.\n");
 
 static int store_updated_refs(const char *raw_url, const char *remote_name,
-			      int connectivity_checked, struct ref *ref_map)
+			      struct ref *ref_map)
 {
 	struct fetch_head fetch_head;
 	struct commit *commit;
@@ -1090,16 +1090,6 @@ static int store_updated_refs(const char *raw_url, const char *remote_name,
 	else
 		url = xstrdup("foreign");
 
-	if (!connectivity_checked) {
-		struct check_connected_options opt = CHECK_CONNECTED_INIT;
-
-		rm = ref_map;
-		if (check_connected(iterate_ref_map, &rm, &opt)) {
-			rc = error(_("%s did not send all necessary objects\n"), url);
-			goto abort;
-		}
-	}
-
 	if (atomic_fetch) {
 		transaction = ref_transaction_begin(&err);
 		if (!transaction) {
@@ -1302,6 +1292,18 @@ static int fetch_refs(struct transport *transport, struct ref *ref_map)
 		return ret;
 	}
 
+	/*
+	 * If the transport didn't yet check for us, we need to verify
+	 * ourselves that we have obtained all missing objects now.
+	 */
+	if (!transport->smart_options || !transport->smart_options->connectivity_checked) {
+		if (check_connected(iterate_ref_map, &ref_map, NULL)) {
+			ret = error(_("remote did not send all necessary objects\n"));
+			transport_unlock_pack(transport);
+			return ret;
+		}
+	}
+
 	/*
 	 * Keep the new pack's ".keep" file around to allow the caller
 	 * time to update refs to reference the new objects.
@@ -1312,13 +1314,10 @@ static int fetch_refs(struct transport *transport, struct ref *ref_map)
 /* Update local refs based on the ref values fetched from a remote */
 static int consume_refs(struct transport *transport, struct ref *ref_map)
 {
-	int connectivity_checked = transport->smart_options
-		? transport->smart_options->connectivity_checked : 0;
 	int ret;
 
 	trace2_region_enter("fetch", "consume_refs", the_repository);
 	ret = store_updated_refs(transport->url, transport->remote->name,
-				 connectivity_checked,
 				 ref_map);
 	transport_unlock_pack(transport);
 	trace2_region_leave("fetch", "consume_refs", the_repository);
When fetching refs, we are doing two connectivity checks:

    - The first one in `fetch_refs()` is done such that we can
      short-circuit the case where we already have all objects
      referenced by the updated set of refs.

    - The second one in `store_updated_refs()` does a sanity check that
      we have all objects after we have fetched the packfile.

We always execute both connectivity checks, but this is wasteful in case
the first connectivity check already notices that we have all objects
locally available.

Refactor the code to do both connectivity checks in `fetch_refs()`,
which allows us to easily skip the second connectivity check if we
already have all objects available. This refactoring is safe to do given
that we always call `fetch_refs()` followed by `consume_refs()`, which
is the only caller of `store_updated_refs()`.

This gives us a nice speedup when doing a mirror-fetch in a repository
with about 2.3M refs where the fetching repo already has all objects:

    Benchmark #1: HEAD~: git-fetch
      Time (mean ± σ):     31.232 s ±  0.082 s    [User: 27.901 s, System: 5.178 s]
      Range (min … max):   31.118 s … 31.301 s    5 runs

    Benchmark #2: HEAD: git-fetch
      Time (mean ± σ):     26.616 s ±  0.100 s    [User: 23.675 s, System: 4.752 s]
      Range (min … max):   26.544 s … 26.788 s    5 runs

    Summary
      'HEAD: git-fetch' ran
        1.17 ± 0.01 times faster than 'HEAD~: git-fetch'

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/fetch.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)