[v2] fetch: limit shared symref check only for local branches

Message ID pull.1266.v2.git.git.1652690501963.gitgitgadget@gmail.com (mailing list archive)
State Accepted
Commit f7400da8005e8af979b61ab2f42bb2dacfa656c8
Series [v2] fetch: limit shared symref check only for local branches

Commit Message

Orgad Shaneh May 16, 2022, 8:41 a.m. UTC
From: Orgad Shaneh <orgads@gmail.com>

This check was introduced in 8ee5d73137f (Fix fetch/pull when run without
--update-head-ok, 2008-10-13) in order to protect against replacing the ref
of the active branch by mistake, for example by running git fetch origin
master:master.

It was later extended in 8bc1f39f411 (fetch: protect branches checked out
in all worktrees, 2021-12-01) to scan all worktrees.

This operation is very expensive (takes about 30s in my repository) when
there are many tags or branches, and it is executed on every fetch, even if
no local heads are updated at all.

Limit it to protect only refs/heads/* to improve fetch performance.

Signed-off-by: Orgad Shaneh <orgads@gmail.com>
---
    fetch: limit shared symref check only for local branches

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1266%2Forgads%2Ffetch-perf-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1266/orgads/fetch-perf-v2
Pull-Request: https://github.com/git/git/pull/1266

Range-diff vs v1:

 1:  5e86dc86d3d ! 1:  72bea90b26f fetch: limit shared symref check only for local branches
     @@ builtin/fetch.c: static void check_not_current_branch(struct ref *ref_map,
       	const struct worktree *wt;
       	for (; ref_map; ref_map = ref_map->next)
       		if (ref_map->peer_ref &&
     -+			starts_with(ref_map->peer_ref->name, "refs/heads/") &&
     ++		    starts_with(ref_map->peer_ref->name, "refs/heads/") &&
       		    (wt = find_shared_symref(worktrees, "HEAD",
       					     ref_map->peer_ref->name)) &&
       		    !wt->is_bare)


 builtin/fetch.c | 1 +
 1 file changed, 1 insertion(+)


base-commit: 277cf0bc36094f6dc4297d8c9cef79df045b735d

Comments

Junio C Hamano May 16, 2022, 4 p.m. UTC | #1
"Orgad Shaneh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Orgad Shaneh <orgads@gmail.com>
>
> This check was introduced in 8ee5d73137f (Fix fetch/pull when run without
> --update-head-ok, 2008-10-13) in order to protect against replacing the ref
> of the active branch by mistake, for example by running git fetch origin
> master:master.
>
> It was later extended in 8bc1f39f411 (fetch: protect branches checked out
> in all worktrees, 2021-12-01) to scan all worktrees.
>
> This operation is very expensive (takes about 30s in my repository) when
> there are many tags or branches, and it is executed on every fetch, even if
> no local heads are updated at all.
>
> Limit it to protect only refs/heads/* to improve fetch performance.

The point of the check is to prevent the index+working tree in the
worktrees from going out of sync with HEAD, and since HEAD by
definition can point only into refs/heads/*, this change should be OK.

It is surprising find_shared_symref() is so expensive, though.  If
you have a dozen worktrees linked to the current repository, there
are at most a dozen HEAD that point at various refs in refs/heads/
namespace.  Even if you need to check a thousand ref_map elements,
it should cost almost nothing if you build a hashmap to find matches
with these dozen HEADs upfront, no?
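
A minimal sketch of what I mean, using a sorted string_list in place
of a hashmap for brevity; it covers only the head_ref comparison (not
the rebase/bisect states that find_shared_symref() also checks), the
die() message is paraphrased, and it is not compile tested:

static void check_not_current_branch(struct ref *ref_map,
				     struct worktree **worktrees)
{
	struct string_list heads = STRING_LIST_INIT_DUP;
	int i;

	/* one pass over the handful of worktrees... */
	for (i = 0; worktrees[i]; i++)
		if (!worktrees[i]->is_bare && worktrees[i]->head_ref)
			string_list_insert(&heads, worktrees[i]->head_ref);

	/* ...then a cheap sorted-list lookup per ref_map element */
	for (; ref_map; ref_map = ref_map->next)
		if (ref_map->peer_ref &&
		    string_list_has_string(&heads, ref_map->peer_ref->name))
			die(_("refusing to fetch into branch '%s' "
			      "checked out elsewhere"),
			    ref_map->peer_ref->name);

	string_list_clear(&heads, 0);
}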

Another thing that is surprising is that you say this loop is
expensive when there are many tags or branches.  Do you mean it is
expensive when there are many tags and branches that are updated, or
it is expensive to merely have thousands of dormant tags and
branches?  If the latter, I wonder if it is sensible to limit the
check only to the refs that are going to be updated.

> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index e3791f09ed5..eeee5ac8f15 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -1440,6 +1440,7 @@ static void check_not_current_branch(struct ref *ref_map,
>  	const struct worktree *wt;
>  	for (; ref_map; ref_map = ref_map->next)
>  		if (ref_map->peer_ref &&
> +		    starts_with(ref_map->peer_ref->name, "refs/heads/") &&
>  		    (wt = find_shared_symref(worktrees, "HEAD",
>  					     ref_map->peer_ref->name)) &&
>  		    !wt->is_bare)
>
> base-commit: 277cf0bc36094f6dc4297d8c9cef79df045b735d
Junio C Hamano May 16, 2022, 5:57 p.m. UTC | #2
Junio C Hamano <gitster@pobox.com> writes:

>> Limit it to protect only refs/heads/* to improve fetch performance.
>
> The point of the check is to prevent the index+working tree in the
> worktrees from going out of sync with HEAD, and since HEAD by
> definition can point only into refs/heads/*, this change should be OK.

I am willing to take the above change as-is as a standalone patch,
but independently from that ...

> It is surprising find_shared_symref() is so expensive, though.  If
> you have a dozen worktrees linked to the current repository, there
> are at most a dozen HEAD that point at various refs in refs/heads/
> namespace.  Even if you need to check a thousand ref_map elements,
> it should cost almost nothing if you build a hashmap to find matches
> with these dozen HEADs upfront, no?
>
> Another thing that is surprising is that you say this loop is
> expensive when there are many tags or branches.  Do you mean it is
> expensive when there are many tags and branches that are updated, or
> it is expensive to merely have thousands of dormant tags and
> branches?  If the latter, I wonder if it is sensible to limit the
> check only to the refs that are going to be updated.

... I was wondering if an approach like the attached might be a
better way to proceed in the longer run.  Instead of (or in addition
to) reducing the number of calls to the function, we should see if
we can make the function less expensive.

In short, find_shared_symref(), and is_worktree_being_FUTZed(), are
following a wrong API pattern that encourages a loop to call them
repeatedly, by taking a 'target' parameter and doing the comparison
themselves.  If you want to find out whether any of your 1000 refs
violate what they are trying to check, you'd need to call them 1000
times.

Instead, you should be able to learn which branch is to be protected
in each worktree and do the asking about your 1000 refs yourself.

If we instead find out what branch, other than the one pointed at by
the HEAD symref, each worktree is working on, just like we find out
what branch is pointed at by the HEAD symref, and store those
findings in the worktree structure, you should be able to walk the
list of worktrees just once to build a hashmap that lets you tell
which branches are not to be messed with before deciding if the
fetch should go through.

The following is not even compile tested, and some details may be
wrong, but I hope it is good enough to illustrate the idea.

 worktree.c | 73 ++++++++++++++++++++++++++++++++++----------------------------
 worktree.h |  5 +++++
 2 files changed, 45 insertions(+), 33 deletions(-)

diff --git c/worktree.c w/worktree.c
index 90fc085f76..2d96bd9dd1 100644
--- c/worktree.c
+++ w/worktree.c
@@ -15,6 +15,7 @@ void free_worktrees(struct worktree **worktrees)
 		free(worktrees[i]->path);
 		free(worktrees[i]->id);
 		free(worktrees[i]->head_ref);
+		free(worktrees[i]->protected_branch);
 		free(worktrees[i]->lock_reason);
 		free(worktrees[i]->prune_reason);
 		free(worktrees[i]);
@@ -22,13 +23,28 @@ void free_worktrees(struct worktree **worktrees)
 	free (worktrees);
 }
 
+int is_worktree_being_rebased(const struct worktree *wt,
+			      const char *target)
+{
+	return ((wt->protected_reason == WT_BRANCH_REBASING) &&
+		(!strcmp(target, wt->protected_branch)));
+}
+
+int is_worktree_being_bisected(const struct worktree *wt,
+			       const char *target)
+{
+	return ((wt->protected_reason == WT_BRANCH_BISECTING) &&
+		(!strcmp(target, wt->protected_branch)));
+}
+
 /**
- * Update head_oid, head_ref and is_detached of the given worktree
+ * Grab HEAD-related information of the given worktree
  */
 static void add_head_info(struct worktree *wt)
 {
 	int flags;
 	const char *target;
+	struct wt_status_state state;
 
 	target = refs_resolve_ref_unsafe(get_worktree_ref_store(wt),
 					 "HEAD",
@@ -41,6 +57,29 @@ static void add_head_info(struct worktree *wt)
 		wt->head_ref = xstrdup(target);
 	else
 		wt->is_detached = 1;
+
+	wt->protected_reason = 0;
+	memset(&state, 0, sizeof(state));
+	if (wt_status_check_rebase(wt, &state) &&
+	    (state.rebase_in_progress ||
+	     state.rebase_interactive_in_progress) &&
+	    state.branch &&
+	    skip_prefix(state.branch, "refs/heads/", &target)) {
+		wt->protected_branch = xstrdup(target);
+		wt->protected_reason = WT_BRANCH_REBASING;
+	}
+	wt_status_state_free_buffers(&state);
+
+	memset(&state, 0, sizeof(state));
+	if (wt_status_check_bisect(wt, &state) && state.branch &&
+	    skip_prefix(state.branch, "refs/heads/", &target)) {
+		if (wt->protected_reason)
+			BUG("branch '%s' being bisected and rebased at the same time?",
+			    wt->protected_branch);
+		wt->protected_branch = xstrdup(target);
+		wt->protected_reason = WT_BRANCH_BISECTING;
+	}
+	wt_status_state_free_buffers(&state);
 }
 
 /**
@@ -365,38 +404,6 @@ void update_worktree_location(struct worktree *wt, const char *path_)
 	strbuf_release(&path);
 }
 
-int is_worktree_being_rebased(const struct worktree *wt,
-			      const char *target)
-{
-	struct wt_status_state state;
-	int found_rebase;
-
-	memset(&state, 0, sizeof(state));
-	found_rebase = wt_status_check_rebase(wt, &state) &&
-		       (state.rebase_in_progress ||
-			state.rebase_interactive_in_progress) &&
-		       state.branch &&
-		       skip_prefix(target, "refs/heads/", &target) &&
-		       !strcmp(state.branch, target);
-	wt_status_state_free_buffers(&state);
-	return found_rebase;
-}
-
-int is_worktree_being_bisected(const struct worktree *wt,
-			       const char *target)
-{
-	struct wt_status_state state;
-	int found_bisect;
-
-	memset(&state, 0, sizeof(state));
-	found_bisect = wt_status_check_bisect(wt, &state) &&
-		       state.branch &&
-		       skip_prefix(target, "refs/heads/", &target) &&
-		       !strcmp(state.branch, target);
-	wt_status_state_free_buffers(&state);
-	return found_bisect;
-}
-
 /*
  * note: this function should be able to detect shared symref even if
  * HEAD is temporarily detached (e.g. in the middle of rebase or
diff --git c/worktree.h w/worktree.h
index e9e839926b..4e9d06c26a 100644
--- c/worktree.h
+++ w/worktree.h
@@ -10,6 +10,11 @@ struct worktree {
 	char *path;
 	char *id;
 	char *head_ref;		/* NULL if HEAD is broken or detached */
+	char *protected_branch; /* being rebased or bisected, otherwise NULL */
+	enum {
+		WT_BRANCH_REBASING = 1,
+		WT_BRANCH_BISECTING,
+	} protected_reason;
 	char *lock_reason;	/* private - use worktree_lock_reason */
 	char *prune_reason;     /* private - use worktree_prune_reason */
 	struct object_id head_oid;
Orgad Shaneh May 17, 2022, 6:05 a.m. UTC | #3
On Mon, May 16, 2022 at 7:01 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Orgad Shaneh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Orgad Shaneh <orgads@gmail.com>
> >
> > This check was introduced in 8ee5d73137f (Fix fetch/pull when run without
> > --update-head-ok, 2008-10-13) in order to protect against replacing the ref
> > of the active branch by mistake, for example by running git fetch origin
> > master:master.
> >
> > It was later extended in 8bc1f39f411 (fetch: protect branches checked out
> > in all worktrees, 2021-12-01) to scan all worktrees.
> >
> > This operation is very expensive (takes about 30s in my repository) when
> > there are many tags or branches, and it is executed on every fetch, even if
> > no local heads are updated at all.
> >
> > Limit it to protect only refs/heads/* to improve fetch performance.
>
> The point of the check is to prevent the index+working tree in the
> worktrees from going out of sync with HEAD, and since HEAD by
> definition can point only into refs/heads/*, this change should be OK.
>
> It is surprising find_shared_symref() is so expensive, though.  If
> you have a dozen worktrees linked to the current repository, there
> are at most a dozen HEAD that point at various refs in refs/heads/
> namespace.  Even if you need to check a thousand ref_map elements,
> it should cost almost nothing if you build a hashmap to find matches
> with these dozen HEADs upfront, no?

I also had this idea, but I'm not familiar enough with the codebase to
implement it. I see you already started that.

> Another thing that is surprising is that you say this loop is
> expensive when there are many tags or branches.  Do you mean it is
> expensive when there are many tags and branches that are updated, or
> it is expensive to merely have thousands of dormant tags and
> branches?  If the latter, I wonder if it is sensible to limit the
> check only to the refs that are going to be updated.

It's expensive even when *nothing* is updated. I have a repo with 44K
tags (13K of them annotated), 134 remote branches, and 4 worktrees
(besides the main repo) with 33 local branches.

I counted the calls to find_shared_symref - it was called 35755 times,
and refs_read_raw_ref was called 357585 times.

- Orgad
Junio C Hamano May 17, 2022, 10:27 a.m. UTC | #4
Orgad Shaneh <orgads@gmail.com> writes:

>> Another thing that is surprising is that you say this loop is
>> expensive when there are many tags or branches.  Do you mean it is
>> expensive when there are many tags and branches that are updated, or
>> it is expensive to merely have thousands of dormant tags and
>> branches?  If the latter, I wonder if it is sensible to limit the
>> check only to the refs that are going to be updated.
>
> It's expensive even when *nothing* is updated. I have a repo with 44K
> tags (13K of them annotated), 134 remote branches, and 4 worktrees
> (besides the main repo) with 33 local branches.
>
> I counted the calls to find_shared_symref - it was called 35755 times,
> and refs_read_raw_ref was called 357585 times.

That is exactly why I asked, as the above numbers hint that it could
be a viable optimization to omit calls for refs whose old_ and
new_oid are the same, just like you omit calls for refs that are not
inside refs/heads/ in your patch, perhaps?
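
Illustratively, on top of your patch (a sketch only; it presumes the
oids have already been learned by the time this check runs, which is
exactly what is in question below):

 	const struct worktree *wt;
 	for (; ref_map; ref_map = ref_map->next)
 		if (ref_map->peer_ref &&
+		    !oideq(&ref_map->old_oid, &ref_map->new_oid) &&
 		    starts_with(ref_map->peer_ref->name, "refs/heads/") &&
 		    (wt = find_shared_symref(worktrees, "HEAD",
 					     ref_map->peer_ref->name)) &&
 		    !wt->is_bare)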
Orgad Shaneh May 17, 2022, 10:41 a.m. UTC | #5
On Tue, May 17, 2022 at 1:27 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Orgad Shaneh <orgads@gmail.com> writes:
>
> >> Another thing that is surprising is that you say this loop is
> >> expensive when there are many tags or branches.  Do you mean it is
> >> expensive when there are many tags and branches that are updated, or
> >> it is expensive to merely have thousands of dormant tags and
> >> branches?  If the latter, I wonder if it is sensible to limit the
> >> check only to the refs that are going to be updated.
> >
> > It's expensive even when *nothing* is updated. I have a repo with 44K
> > tags (13K of them annotated), 134 remote branches, and 4 worktrees
> > (besides the main repo) with 33 local branches.
> >
> > I counted the calls to find_shared_symref - it was called 35755 times,
> > and refs_read_raw_ref was called 357585 times.
>
> That is exactly why I asked, as the above number hints that it could
> be a viable optimization to omit calls for refs whose old_ and
> new_oid are the same, just like you omit calls for refs that are not
> inside refs/heads/ in your patch, perhaps?

This would require shuffling the code. check_not_current_branch() is
called by do_fetch before fetch_and_consume_refs (which updates
ref->old_oid and ref->new_oid).

- Orgad
Junio C Hamano May 18, 2022, 3:50 p.m. UTC | #6
Orgad Shaneh <orgads@gmail.com> writes:

> This would require shuffling the code. check_not_current_branch() is
> called by do_fetch before fetch_and_consume_refs (which updates
> ref->old_oid and ref->new_oid).

Oh, that's unfortunate.

In today's code we may not match them against the current values of
the local copies in ref_map until fetch_and_consume_refs() calls
store_updated_refs() to update the .new_oid member.  But I would
have thought that we learn the equivalent of "git ls-remote" output
a lot earlier, at least most of the time, via a call to
transport_get_remote_refs(), which is why I thought such an
optimization was doable.

Thanks.

Patch

diff --git a/builtin/fetch.c b/builtin/fetch.c
index e3791f09ed5..eeee5ac8f15 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1440,6 +1440,7 @@ static void check_not_current_branch(struct ref *ref_map,
 	const struct worktree *wt;
 	for (; ref_map; ref_map = ref_map->next)
 		if (ref_map->peer_ref &&
+		    starts_with(ref_map->peer_ref->name, "refs/heads/") &&
 		    (wt = find_shared_symref(worktrees, "HEAD",
 					     ref_map->peer_ref->name)) &&
 		    !wt->is_bare)