Message ID | pull.1266.v2.git.git.1652690501963.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Accepted |
Commit | f7400da8005e8af979b61ab2f42bb2dacfa656c8 |
Series | [v2] fetch: limit shared symref check only for local branches |
"Orgad Shaneh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Orgad Shaneh <orgads@gmail.com>
>
> This check was introduced in 8ee5d73137f (Fix fetch/pull when run without
> --update-head-ok, 2008-10-13) in order to protect against replacing the ref
> of the active branch by mistake, for example by running git fetch origin
> master:master.
>
> It was later extended in 8bc1f39f411 (fetch: protect branches checked out
> in all worktrees, 2021-12-01) to scan all worktrees.
>
> This operation is very expensive (takes about 30s in my repository) when
> there are many tags or branches, and it is executed on every fetch, even if
> no local heads are updated at all.
>
> Limit it to protect only refs/heads/* to improve fetch performance.

The point of the check is to prevent the index+working tree in the
worktrees to go out of sync with HEAD, and HEAD by definition can
point only into refs/heads/*, this change should be OK.

It is surprising find_shared_symref() is so expensive, though.  If
you have a dozen worktrees linked to the current repository, there
are at most a dozen HEAD that point at various refs in refs/heads/
namespace.  Even if you need to check a thousand ref_map elements,
it should cost almost nothing if you build a hashmap to find matches
with these dozen HEADs upfront, no?

Another thing that is surprising is that you say this loop is
expensive when there are many tags or branches.  Do you mean it is
expensive when there are many tags and branches that are updated, or
it is expensive to merely have thousands of dormant tags and
branches?  If the latter, I wonder if it is sensible to limit the
check only to the refs that are going to be updated.
> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index e3791f09ed5..eeee5ac8f15 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -1440,6 +1440,7 @@ static void check_not_current_branch(struct ref *ref_map,
>  	const struct worktree *wt;
>
>  	for (; ref_map; ref_map = ref_map->next)
>  		if (ref_map->peer_ref &&
> +		    starts_with(ref_map->peer_ref->name, "refs/heads/") &&
>  		    (wt = find_shared_symref(worktrees, "HEAD",
>  					     ref_map->peer_ref->name)) &&
>  		    !wt->is_bare)
>
> base-commit: 277cf0bc36094f6dc4297d8c9cef79df045b735d
Junio C Hamano <gitster@pobox.com> writes:

>> Limit it to protect only refs/heads/* to improve fetch performance.
>
> The point of the check is to prevent the index+working tree in the
> worktrees to go out of sync with HEAD, and HEAD by definition can
> point only into refs/heads/*, this change should be OK.

I am willing to take the above change as-is as a standalone patch,
but independently from that ...

> It is surprising find_shared_symref() is so expensive, though.  If
> you have a dozen worktrees linked to the current repository, there
> are at most a dozen HEAD that point at various refs in refs/heads/
> namespace.  Even if you need to check a thousand ref_map elements,
> it should cost almost nothing if you build a hashmap to find matches
> with these dozen HEADs upfront, no?
>
> Another thing that is surprising is that you say this loop is
> expensive when there are many tags or branches.  Do you mean it is
> expensive when there are many tags and branches that are updated, or
> it is expensive to merely have thousands of dormant tags and
> branches?  If the latter, I wonder if it is sensible to limit the
> check only to the refs that are going to be updated.

... I was wondering if an approach like the attached might be a
better way to proceed in the longer run.  Instead of (or in addition
to) reducing the number of calls to the function, we should see if
we can make the function less expensive.

In short, find_shared_symref(), and is_worktree_being_FUTZed(), are
following a wrong API pattern that encourages a loop to call them
repeatedly, by taking a 'target' parameter and doing the comparison
themselves.  If you want to find out whether any of your 1000 refs
violate what they are trying to check, you'd need to call them 1000
times.  Instead, you should be able to learn which branch is to be
protected in each worktree and do the asking about your 1000 refs
yourself.
If we instead find out what branch, other than the one pointed at by
the HEAD symref, each worktree is working on, just like we find out
what branch is pointed at by the HEAD symref, and store those
findings in the worktree structure, you should be able to walk the
list of worktrees just once to build a hashmap that lets you tell
which branches are not to be messed with, before deciding if the
fetch should go through.

The following is not even compile tested, and some details may be
wrong, but I hope it is good enough to illustrate the idea.

 worktree.c | 73 ++++++++++++++++++++++++++++++++++----------------------------
 worktree.h |  5 +++++
 2 files changed, 45 insertions(+), 33 deletions(-)

diff --git c/worktree.c w/worktree.c
index 90fc085f76..2d96bd9dd1 100644
--- c/worktree.c
+++ w/worktree.c
@@ -15,6 +15,7 @@ void free_worktrees(struct worktree **worktrees)
 		free(worktrees[i]->path);
 		free(worktrees[i]->id);
 		free(worktrees[i]->head_ref);
+		free(worktrees[i]->protected_branch);
 		free(worktrees[i]->lock_reason);
 		free(worktrees[i]->prune_reason);
 		free(worktrees[i]);
@@ -22,13 +23,28 @@ void free_worktrees(struct worktree **worktrees)
 	free (worktrees);
 }

+int is_worktree_being_rebased(const struct worktree *wt,
+			      const char *target)
+{
+	return ((wt->protected_reason == WT_BRANCH_REBASING) &&
+		(!strcmp(target, wt->protected_branch)));
+}
+
+int is_worktree_being_bisected(const struct worktree *wt,
+			       const char *target)
+{
+	return ((wt->protected_reason == WT_BRANCH_BISECTING) &&
+		(!strcmp(target, wt->protected_branch)));
+}
+
 /**
- * Update head_oid, head_ref and is_detached of the given worktree
+ * Grab HEAD-related information of the given worktree
  */
 static void add_head_info(struct worktree *wt)
 {
 	int flags;
 	const char *target;
+	struct wt_status_state state;

 	target = refs_resolve_ref_unsafe(get_worktree_ref_store(wt),
 					 "HEAD",
@@ -41,6 +57,29 @@ static void add_head_info(struct worktree *wt)
 		wt->head_ref = xstrdup(target);
 	else
 		wt->is_detached = 1;
+
+	wt->protected_reason = 0;
+	memset(&state, 0, sizeof(state));
+	if (wt_status_check_rebase(wt, &state) &&
+	    (state.rebase_in_progress ||
+	     state.rebase_interactive_in_progress) &&
+	    state.branch &&
+	    skip_prefix(state.branch, "refs/heads/", &target)) {
+		wt->protected_branch = xstrdup(target);
+		wt->protected_reason = WT_BRANCH_REBASING;
+	}
+	wt_status_state_free_buffers(&state);
+
+	memset(&state, 0, sizeof(state));
+	if (wt_status_check_bisect(wt, &state) && state.branch &&
+	    skip_prefix(state.branch, "refs/heads/", &target)) {
+		if (wt->protected_reason)
+			BUG("branch '%s' being bisected and rebased at the same time?",
+			    wt->protected_branch);
+		wt->protected_branch = xstrdup(target);
+		wt->protected_reason = WT_BRANCH_BISECTING;
+	}
+	wt_status_state_free_buffers(&state);
 }

 /**
@@ -365,38 +404,6 @@ void update_worktree_location(struct worktree *wt, const char *path_)
 	strbuf_release(&path);
 }

-int is_worktree_being_rebased(const struct worktree *wt,
-			      const char *target)
-{
-	struct wt_status_state state;
-	int found_rebase;
-
-	memset(&state, 0, sizeof(state));
-	found_rebase = wt_status_check_rebase(wt, &state) &&
-		       (state.rebase_in_progress ||
-			state.rebase_interactive_in_progress) &&
-		       state.branch &&
-		       skip_prefix(target, "refs/heads/", &target) &&
-		       !strcmp(state.branch, target);
-	wt_status_state_free_buffers(&state);
-	return found_rebase;
-}
-
-int is_worktree_being_bisected(const struct worktree *wt,
-			       const char *target)
-{
-	struct wt_status_state state;
-	int found_bisect;
-
-	memset(&state, 0, sizeof(state));
-	found_bisect = wt_status_check_bisect(wt, &state) &&
-		       state.branch &&
-		       skip_prefix(target, "refs/heads/", &target) &&
-		       !strcmp(state.branch, target);
-	wt_status_state_free_buffers(&state);
-	return found_bisect;
-}
-
 /*
  * note: this function should be able to detect shared symref even if
  * HEAD is temporarily detached (e.g. in the middle of rebase or

diff --git c/worktree.h w/worktree.h
index e9e839926b..4e9d06c26a 100644
--- c/worktree.h
+++ w/worktree.h
@@ -10,6 +10,11 @@ struct worktree {
 	char *path;
 	char *id;
 	char *head_ref;		/* NULL if HEAD is broken or detached */
+	char *protected_branch;	/* being rebased or bisected, otherwise NULL */
+	enum {
+		WT_BRANCH_REBASING = 1,
+		WT_BRANCH_BISECTING,
+	} protected_reason;
 	char *lock_reason;	/* private - use worktree_lock_reason */
 	char *prune_reason;	/* private - use worktree_prune_reason */
 	struct object_id head_oid;
On Mon, May 16, 2022 at 7:01 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Orgad Shaneh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Orgad Shaneh <orgads@gmail.com>
> >
> > This check was introduced in 8ee5d73137f (Fix fetch/pull when run without
> > --update-head-ok, 2008-10-13) in order to protect against replacing the ref
> > of the active branch by mistake, for example by running git fetch origin
> > master:master.
> >
> > It was later extended in 8bc1f39f411 (fetch: protect branches checked out
> > in all worktrees, 2021-12-01) to scan all worktrees.
> >
> > This operation is very expensive (takes about 30s in my repository) when
> > there are many tags or branches, and it is executed on every fetch, even if
> > no local heads are updated at all.
> >
> > Limit it to protect only refs/heads/* to improve fetch performance.
>
> The point of the check is to prevent the index+working tree in the
> worktrees to go out of sync with HEAD, and HEAD by definition can
> point only into refs/heads/*, this change should be OK.
>
> It is surprising find_shared_symref() is so expensive, though.  If
> you have a dozen worktrees linked to the current repository, there
> are at most a dozen HEAD that point at various refs in refs/heads/
> namespace.  Even if you need to check a thousand ref_map elements,
> it should cost almost nothing if you build a hashmap to find matches
> with these dozen HEADs upfront, no?

I also had this idea, but I'm not familiar enough with the codebase
to implement it. I see you already started that.

> Another thing that is surprising is that you say this loop is
> expensive when there are many tags or branches.  Do you mean it is
> expensive when there are many tags and branches that are updated, or
> it is expensive to merely have thousands of dormant tags and
> branches?  If the latter, I wonder if it is sensible to limit the
> check only to the refs that are going to be updated.

It's expensive even when *nothing* is updated.

I have a repo with 44K tags, 13K of the tags are annotated, 134
remote branches and 4 worktrees (except the main repo) with 33 local
branches.

I counted the calls to find_shared_symref - it was called 35755 times,
and refs_read_raw_ref was called 357585 times.

- Orgad
Orgad Shaneh <orgads@gmail.com> writes:

>> Another thing that is surprising is that you say this loop is
>> expensive when there are many tags or branches.  Do you mean it is
>> expensive when there are many tags and branches that are updated, or
>> it is expensive to merely have thousands of dormant tags and
>> branches?  If the latter, I wonder if it is sensible to limit the
>> check only to the refs that are going to be updated.
>
> It's expensive even when *nothing* is updated. I have a repo with 44K
> tags, 13K of the tags are annotated, 134 remote branches and 4
> worktrees (except the main repo) with 33 local branches.
>
> I counted the calls to find_shared_symref - it was called 35755 times,
> and refs_read_raw_ref was called 357585 times.

That is exactly why I asked, as the above number hints that it could
be a viable optimization to omit calls for refs whose old_ and
new_oid are the same, just like you omit calls for refs that are not
inside refs/heads/ in your patch, perhaps?
On Tue, May 17, 2022 at 1:27 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Orgad Shaneh <orgads@gmail.com> writes:
>
> >> Another thing that is surprising is that you say this loop is
> >> expensive when there are many tags or branches.  Do you mean it is
> >> expensive when there are many tags and branches that are updated, or
> >> it is expensive to merely have thousands of dormant tags and
> >> branches?  If the latter, I wonder if it is sensible to limit the
> >> check only to the refs that are going to be updated.
> >
> > It's expensive even when *nothing* is updated. I have a repo with 44K
> > tags, 13K of the tags are annotated, 134 remote branches and 4
> > worktrees (except the main repo) with 33 local branches.
> >
> > I counted the calls to find_shared_symref - it was called 35755 times,
> > and refs_read_raw_ref was called 357585 times.
>
> That is exactly why I asked, as the above number hints that it could
> be a viable optimization to omit calls for refs whose old_ and
> new_oid are the same, just like you omit calls for refs that are not
> inside refs/heads/ in your patch, perhaps?

This would require shuffling the code. check_not_current_branch() is
called by do_fetch before fetch_and_consume_refs (which updates
ref->old_oid and ref->new_oid).

- Orgad
Orgad Shaneh <orgads@gmail.com> writes:

> This would require shuffling the code. check_not_current_branch() is
> called by do_fetch before fetch_and_consume_refs (which updates
> ref->old_oid and ref->new_oid).

Oh, that's unfortunate.

We may not be matching them with the current values of the local
copies in ref_map in today's code until fetch_and_consume_refs()
calls store_updated_refs() to update the .new_oid member, but I
would have thought that we learned the equivalent of "git ls-remote"
output a lot earlier, at least most of the time, via a call to
transport_get_remote_refs(), and that was why I thought such an
optimization was doable.

Thanks.
diff --git a/builtin/fetch.c b/builtin/fetch.c
index e3791f09ed5..eeee5ac8f15 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1440,6 +1440,7 @@ static void check_not_current_branch(struct ref *ref_map,
 	const struct worktree *wt;

 	for (; ref_map; ref_map = ref_map->next)
 		if (ref_map->peer_ref &&
+		    starts_with(ref_map->peer_ref->name, "refs/heads/") &&
 		    (wt = find_shared_symref(worktrees, "HEAD",
 					     ref_map->peer_ref->name)) &&
 		    !wt->is_bare)