[1/3] commit-reach: use one walk in remove_redundant()

Message ID	3fe74e339fc5b7083398f2df51baae5a4a008060.1611851095.git.gitgitgadget@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <git-owner@kernel.org> Message-Id: <3fe74e339fc5b7083398f2df51baae5a4a008060.1611851095.git.gitgitgadget@gmail.com> In-Reply-To: <pull.852.git.1611851095.gitgitgadget@gmail.com> References: <pull.852.git.1611851095.gitgitgadget@gmail.com> Date: Thu, 28 Jan 2021 16:24:52 +0000 Subject: [PATCH 1/3] commit-reach: use one walk in remove_redundant() Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: Michael Haggerty <mhagger@alum.mit.edu>, me@ttaylorr.com, peff@peff.net, gitster@pobox.net, Derrick Stolee <derrickstolee@github.com>, Derrick Stolee <dstolee@microsoft.com> Precedence: bulk From: Derrick Stolee <dstolee@microsoft.com>
Series	Speed up remove_redundant() \| expand [0/3] Speed up remove_redundant() [1/3] commit-reach: use one walk in remove_redundant() [2/3] commit-reach: move compare_commits_by_gen [3/3] commit-reach: use heuristic in remove_redundant()

Derrick Stolee Jan. 28, 2021, 4:24 p.m. UTC

From: Derrick Stolee <dstolee@microsoft.com>

The current implementation of remove_redundant() uses several calls to
paint_down_to_common() to determine that commits are independent of each
other. This leads to quadratic behavior when many inputs are passed to
commands such as 'git merge-base'.

For example, in the Linux kernel repository, I tested the performance
by passing all tags:

 git merge-base --independent $(git for-each-ref refs/tags --format="$(refname)")

(Note: I had to delete the tags v2.6.11-tree and v2.6.11 as they do
not point to commits.)

Here is the performance improvement introduced by this change:

 Before: 16.4s
  After:  1.1s

The basic approach is to do one commit walk instead of many. First, scan
all commits in the list and mark their _parents_ with the STALE flag.
This flag will indicate commits that are reachable from one of the
inputs, except not including themselves. Then, walk commits until
covering all commits up to the minimum generation number pushing the
STALE flag throughout.

At the end of the walk, commits in the input list that have the STALE
flag are reachable from a _different_ commit in the list. These should
be moved to the end of the array while the others are shifted to the
front.

This logic is covered by tests in t6600-test-reach.sh, so the behavior
does not change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-reach.c | 108 +++++++++++++++++++++++++++++--------------------
 1 file changed, 65 insertions(+), 43 deletions(-)

Junio C Hamano Jan. 28, 2021, 8:51 p.m. UTC | #1

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +	/* Mark all parents of the input as STALE */
> +	for (i = 0; i < cnt; i++) {
> +		struct commit_list *parents;
> +		timestamp_t generation;
>  
>  		repo_parse_commit(r, array[i]);
> +		parents = array[i]->parents;
> +
> +		while (parents) {
> +			repo_parse_commit(r, parents->item);
> +			if (!(parents->item->object.flags & STALE)) {
> +				parents->item->object.flags |= STALE;
> +				prio_queue_put(&queue, parents->item);
> +			}
> +			parents = parents->next;
> +		}
> +
> +		generation = commit_graph_generation(array[i]);
> +
> +		if (generation < min_generation)
> +			min_generation = generation;
> +	}
> +
> +	/* push the STALE bits up to min generation */
> +	while (queue.nr) {
> +		struct commit_list *parents;
> +		struct commit *c = prio_queue_get(&queue);
> +
> +		repo_parse_commit(r, c);
>  
> +		if (commit_graph_generation(c) < min_generation)
>  			continue;
>  
> +		parents = c->parents;
> +		while (parents) {
> +			if (!(parents->item->object.flags & STALE)) {
> +				parents->item->object.flags |= STALE;
> +				prio_queue_put(&queue, parents->item);
> +			}
> +			parents = parents->next;
> +		}
> +	}

So, the inner loop makes sure we won't revisit STALE parent, but
keep digging parents we haven't seen, and stop when the generation
is old enough.  What happens when there is no generation number
computed yet, I wonder...  We'll keep getting infinity and dig all
the way down to root?

> +	/* rearrange array */
> +	dup = xcalloc(cnt, sizeof(struct commit *));
> +	COPY_ARRAY(dup, array, cnt);
> +	for (i = 0; i < cnt; i++) {
> +		if (dup[i]->object.flags & STALE) {
> +			int insert = cnt - 1 - (i - count_non_stale);
> +			array[insert] = dup[i];
> +		} else {
> +			array[count_non_stale] = dup[i];
> +			count_non_stale++;
> +		}
> +	}
> +	free(dup);

The "fill stale ones from the end, non-stale ones from the
beginning" in the loop looks unnecessarily complex to me.  I wonder
if we can do only the "fill non-stale ones from the beginning" half,
i.e.

	for (i = count_non_stale = 0; i < cnt; i++) {
		if (dup[i] is not stale)
			array[count_non_stale++] = dup[i];
	}

without the "keep the stale one at the end of array[]", and clear
marks using what is in dup[] as starting points before discarding
dup[]?

Or do the callers still look at the entries beyond count_non_stale?

Other than that, nicely done.

> +	/* clear marks */
> +	for (i = 0; i < cnt; i++) {
> +		struct commit_list *parents;
> +		parents = array[i]->parents;
> +
> +		while (parents) {
> +			clear_commit_marks(parents->item, STALE);
> +			parents = parents->next;
>  		}
> -		common = paint_down_to_common(r, array[i], filled,
> -					      work, min_generation);
> -		if (array[i]->object.flags & PARENT2)
> -			redundant[i] = 1;
> -		for (j = 0; j < filled; j++)
> -			if (work[j]->object.flags & PARENT1)
> -				redundant[filled_index[j]] = 1;
> -		clear_commit_marks(array[i], all_flags);
> -		clear_commit_marks_many(filled, work, all_flags);
> -		free_commit_list(common);
>  	}
>  
> -	/* Now collect the result */
> -	COPY_ARRAY(work, array, cnt);
> -	for (i = filled = 0; i < cnt; i++)
> -		if (!redundant[i])
> -			array[filled++] = work[i];
> -	for (j = filled, i = 0; i < cnt; i++)
> -		if (redundant[i])
> -			array[j++] = work[i];
> -	free(work);
> -	free(redundant);
> -	free(filled_index);
> -	return filled;
> +	return count_non_stale;
>  }
>  
>  static struct commit_list *get_merge_bases_many_0(struct repository *r,

René Scharfe Jan. 29, 2021, 5:10 p.m. UTC | #2

Am 28.01.21 um 17:24 schrieb Derrick Stolee via GitGitGadget:
> +	/* rearrange array */
> +	dup = xcalloc(cnt, sizeof(struct commit *));

You could use CALLOC_ARRAY instead here, which is shorter and uses the
correct type automatically.  Or -- seeing that the next line overwrites
all items anyway --  ALLOC_ARRAY.

> +	COPY_ARRAY(dup, array, cnt);

René

René Scharfe Jan. 29, 2021, 5:11 p.m. UTC | #3

Am 28.01.21 um 21:51 schrieb Junio C Hamano:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> +	/* rearrange array */
>> +	dup = xcalloc(cnt, sizeof(struct commit *));
>> +	COPY_ARRAY(dup, array, cnt);
>> +	for (i = 0; i < cnt; i++) {
>> +		if (dup[i]->object.flags & STALE) {
>> +			int insert = cnt - 1 - (i - count_non_stale);
>> +			array[insert] = dup[i];
>> +		} else {
>> +			array[count_non_stale] = dup[i];
>> +			count_non_stale++;
>> +		}
>> +	}
>> +	free(dup);
>
> The "fill stale ones from the end, non-stale ones from the
> beginning" in the loop looks unnecessarily complex to me.  I wonder
> if we can do only the "fill non-stale ones from the beginning" half,
> i.e.
>
> 	for (i = count_non_stale = 0; i < cnt; i++) {
> 		if (dup[i] is not stale)
> 			array[count_non_stale++] = dup[i];
> 	}
>
> without the "keep the stale one at the end of array[]", and clear
> marks using what is in dup[] as starting points before discarding
> dup[]?
>
> Or do the callers still look at the entries beyond count_non_stale?

Had the same reaction.  Both callers ignore the stale entries.

> Other than that, nicely done.
>
>> +	/* clear marks */
>> +	for (i = 0; i < cnt; i++) {
>> +		struct commit_list *parents;
>> +		parents = array[i]->parents;
>> +
>> +		while (parents) {
>> +			clear_commit_marks(parents->item, STALE);
>> +			parents = parents->next;
>>  		}

This loop clears STALE from the parents of both the non-stale and
stale entries.  OK.  Should it also clear it from the stale entries
themselves?

>> -		common = paint_down_to_common(r, array[i], filled,
>> -					      work, min_generation);
>> -		if (array[i]->object.flags & PARENT2)
>> -			redundant[i] = 1;
>> -		for (j = 0; j < filled; j++)
>> -			if (work[j]->object.flags & PARENT1)
>> -				redundant[filled_index[j]] = 1;
>> -		clear_commit_marks(array[i], all_flags);
>> -		clear_commit_marks_many(filled, work, all_flags);
>> -		free_commit_list(common);
>>  	}
>>
>> -	/* Now collect the result */
>> -	COPY_ARRAY(work, array, cnt);
>> -	for (i = filled = 0; i < cnt; i++)
>> -		if (!redundant[i])
>> -			array[filled++] = work[i];
>> -	for (j = filled, i = 0; i < cnt; i++)
>> -		if (redundant[i])
>> -			array[j++] = work[i];
>> -	free(work);
>> -	free(redundant);
>> -	free(filled_index);
>> -	return filled;
>> +	return count_non_stale;
>>  }
>>
>>  static struct commit_list *get_merge_bases_many_0(struct repository *r,

Derrick Stolee Jan. 31, 2021, 3:52 a.m. UTC | #4

On 1/29/2021 12:11 PM, René Scharfe wrote:
> Am 28.01.21 um 21:51 schrieb Junio C Hamano:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>> +	/* rearrange array */
>>> +	dup = xcalloc(cnt, sizeof(struct commit *));
>>> +	COPY_ARRAY(dup, array, cnt);
>>> +	for (i = 0; i < cnt; i++) {
>>> +		if (dup[i]->object.flags & STALE) {
>>> +			int insert = cnt - 1 - (i - count_non_stale);
>>> +			array[insert] = dup[i];
>>> +		} else {
>>> +			array[count_non_stale] = dup[i];
>>> +			count_non_stale++;
>>> +		}
>>> +	}
>>> +	free(dup);
>>
>> The "fill stale ones from the end, non-stale ones from the
>> beginning" in the loop looks unnecessarily complex to me.  I wonder
>> if we can do only the "fill non-stale ones from the beginning" half,
>> i.e.
>>
>> 	for (i = count_non_stale = 0; i < cnt; i++) {
>> 		if (dup[i] is not stale)
>> 			array[count_non_stale++] = dup[i];
>> 	}
>>
>> without the "keep the stale one at the end of array[]", and clear
>> marks using what is in dup[] as starting points before discarding
>> dup[]?
>>
>> Or do the callers still look at the entries beyond count_non_stale?
> 
> Had the same reaction.  Both callers ignore the stale entries.

Ok, I can update that logic accordingly. I wanted to keep consistent
with the comment at the start of the method:

	/*
	 * Some commit in the array may be an ancestor of
	 * another commit.  Move such commit to the end of
	 * the array, and return the number of commits that
	 * are independent from each other.
	 */

but if no caller actually needs that, then I can remove this
behavior. Anyone mind if it is a follow-up patch to change this
part of the behavior?

>> Other than that, nicely done.
>>
>>> +	/* clear marks */
>>> +	for (i = 0; i < cnt; i++) {
>>> +		struct commit_list *parents;
>>> +		parents = array[i]->parents;
>>> +
>>> +		while (parents) {
>>> +			clear_commit_marks(parents->item, STALE);
>>> +			parents = parents->next;
>>>  		}
> 
> This loop clears STALE from the parents of both the non-stale and
> stale entries.  OK.  Should it also clear it from the stale entries
> themselves?

clear_commit_marks() walks commits starting from the input commit
(parents->item in this case) and clears the STALE bit as long as
it is present. This way, the accumulated clear_commit_marks() will
walk each commit only once _and_ will visit any of the commits from
'array' that received the STALE bit during the above walk.

Thanks,
-Stolee

Derrick Stolee Jan. 31, 2021, 3:59 a.m. UTC | #5

On 1/28/2021 3:51 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> +		parents = c->parents;
>> +		while (parents) {
>> +			if (!(parents->item->object.flags & STALE)) {
>> +				parents->item->object.flags |= STALE;
>> +				prio_queue_put(&queue, parents->item);
>> +			}
>> +			parents = parents->next;
>> +		}
>> +	}
> 
> So, the inner loop makes sure we won't revisit STALE parent, but
> keep digging parents we haven't seen, and stop when the generation
> is old enough.  What happens when there is no generation number
> computed yet, I wonder...  We'll keep getting infinity and dig all
> the way down to root?

If we are on commits that have no generation number yet, then we
will walk until reaching commits in the commit-graph file that have
a computed generation (or in the heuristic case, when we have reached
all but one of the commits).

In the case of the commit-graph, all commits will have generation
number "infinity". In such a case, perhaps the old algorithm _is_
the best we can do, at least for now.

The trade-off is that we might walk more commits in unusual cases
with few input commits. That quadratic behavior will take over as
the input size grows, no matter if there is a commit-graph or not.

I can do a big more digging into these unusual cases, especially
when we cannot rely on commit-graphs being present.

One way to ensure we do not regress from the current behavior
would be to condition the new algorithm with

	if (generation_numbers_enabled(the_repository))
		new_algorithm();
	else
		old_algorithm();

much like in repo_is_descendant_of().

Is that a good plan?

Thanks,
-Stolee

Derrick Stolee Jan. 31, 2021, 8:13 p.m. UTC | #6

On 1/30/2021 10:59 PM, Derrick Stolee wrote:
> On 1/28/2021 3:51 PM, Junio C Hamano wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>> +		parents = c->parents;
>>> +		while (parents) {
>>> +			if (!(parents->item->object.flags & STALE)) {
>>> +				parents->item->object.flags |= STALE;
>>> +				prio_queue_put(&queue, parents->item);
>>> +			}
>>> +			parents = parents->next;
>>> +		}
>>> +	}
>>
>> So, the inner loop makes sure we won't revisit STALE parent, but
>> keep digging parents we haven't seen, and stop when the generation
>> is old enough.  What happens when there is no generation number
>> computed yet, I wonder...  We'll keep getting infinity and dig all
>> the way down to root?
> 
> If we are on commits that have no generation number yet, then we
> will walk until reaching commits in the commit-graph file that have
> a computed generation (or in the heuristic case, when we have reached
> all but one of the commits).
> 
> In the case of the commit-graph, all commits will have generation
> number "infinity". In such a case, perhaps the old algorithm _is_
> the best we can do, at least for now.
> 
> The trade-off is that we might walk more commits in unusual cases
> with few input commits. That quadratic behavior will take over as
> the input size grows, no matter if there is a commit-graph or not.
> 
> I can do a big more digging into these unusual cases, especially
> when we cannot rely on commit-graphs being present.

Indeed, the old algorithm is better for cases where generation
numbers do not help and there are multiple independent commits.

For example, this command in the Linux kernel repo changes from
0.047s in the old algorithm to 6.6s with the new algorithm:

git -c core.commitGraph=false merge-base --independent 2ecedd756908 d2360a398f0b >/dev/null

> One way to ensure we do not regress from the current behavior
> would be to condition the new algorithm with
> 
> 	if (generation_numbers_enabled(the_repository))
> 		new_algorithm();
> 	else
> 		old_algorithm();
> 
> much like in repo_is_descendant_of().

I will include this in a v2.

Thanks,
-Stolee

Junio C Hamano Jan. 31, 2021, 8:25 p.m. UTC | #7

Derrick Stolee <stolee@gmail.com> writes:

>> So, the inner loop makes sure we won't revisit STALE parent, but
>> keep digging parents we haven't seen, and stop when the generation
>> is old enough.  What happens when there is no generation number
>> computed yet, I wonder...  We'll keep getting infinity and dig all
>> the way down to root?
>
> If we are on commits that have no generation number yet, then we
> will walk until reaching commits in the commit-graph file that have
> a computed generation (or in the heuristic case, when we have reached
> all but one of the commits).
>
> In the case of the commit-graph, all commits will have generation
> number "infinity". In such a case, perhaps the old algorithm _is_
> the best we can do, at least for now.

Hmph, I am afraid that such is life X-<.

> One way to ensure we do not regress from the current behavior
> would be to condition the new algorithm with
>
> 	if (generation_numbers_enabled(the_repository))
> 		new_algorithm();
> 	else
> 		old_algorithm();
>
> much like in repo_is_descendant_of().
>
> Is that a good plan?

It would certainly avoid one particular form of regression, so it is
better than nothing.

But at the same time, we'd probably want to encourage people to
enable and maintain generation numbers for the majority of commits
in their repository, but unless you "gc" twice a day or something,
you'd inevitably have a mixture, say, all commits that are more than
two weeks old are covered by commit-graph, but more recent ones are
not yet enumerated, and you have to traverse at runtime.

And the performance characteristics we would care the most in the
longer term is to make sure that we perform well in such a mixed
environment for the parts of the history that are not old enough.

Many things can be sped up by precomputing and storing the result in
the commit-graph file and that is not all that interesting or
surprising part of the story, I would think.  Rather, we want to
ensure that we do not perform on the youngest part of the history
any worse---that way, people will have strong incentive to enable
commit-graph, as things will work superbly for older parts of the
history, while not doing any worse than the original system for the
newest parts of the history.

There was a side thread where somebody wished if they can remove
support for all the codepaths that do not use commit-graph, but
would this be an example of how such a wish is impractical, I have
to wonder?

Thanks.

Derrick Stolee Feb. 1, 2021, 3:55 a.m. UTC | #8

On 1/31/2021 3:25 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>>> So, the inner loop makes sure we won't revisit STALE parent, but
>>> keep digging parents we haven't seen, and stop when the generation
>>> is old enough.  What happens when there is no generation number
>>> computed yet, I wonder...  We'll keep getting infinity and dig all
>>> the way down to root?
>>
>> If we are on commits that have no generation number yet, then we
>> will walk until reaching commits in the commit-graph file that have
>> a computed generation (or in the heuristic case, when we have reached
>> all but one of the commits).
>>
>> In the case of the commit-graph, all commits will have generation
>> number "infinity". In such a case, perhaps the old algorithm _is_
>> the best we can do, at least for now.
> 
> Hmph, I am afraid that such is life X-<.
> 
>> One way to ensure we do not regress from the current behavior
>> would be to condition the new algorithm with
>>
>> 	if (generation_numbers_enabled(the_repository))
>> 		new_algorithm();
>> 	else
>> 		old_algorithm();
>>
>> much like in repo_is_descendant_of().
>>
>> Is that a good plan?
> 
> It would certainly avoid one particular form of regression, so it is
> better than nothing.
> 
> But at the same time, we'd probably want to encourage people to
> enable and maintain generation numbers for the majority of commits
> in their repository, but unless you "gc" twice a day or something,
> you'd inevitably have a mixture, say, all commits that are more than
> two weeks old are covered by commit-graph, but more recent ones are
> not yet enumerated, and you have to traverse at runtime.
> 
> And the performance characteristics we would care the most in the
> longer term is to make sure that we perform well in such a mixed
> environment for the parts of the history that are not old enough.

You're right, that in the mixed case of "all of these input commits
have infinite generation number" is the worst possible case for
this algorithm. However, if even a single commit has a finite
generation number, then this new algorithm should out-perform the
one using paint_down_to_common() in all cases.

I tested the painful case of

	git merge-base --independent 2ecedd756908 d2360a398f0b

in the Linux kernel repository with different commit-graph files.
I found that the new algorithm performed worse than the old
algorithm until there were fewer than 2,500 commits reachable
from these two but not in the commit-graph file. That's probably
too small of a gap to expect of a typical user.

I will modify my check

	if (generation_numbers_enabled(r)) {
		int i;

		/*
		 * If we have a single commit with finite generation
		 * number, then the _with_gen algorithm is preferred.
		 */
		for (i = 0; i < cnt; i++) {
			if (commit_graph_generation(array[i]) < GENERATION_NUMBER_INFINITY)
				return remove_redundant_with_gen(r, array, cnt);
		}
	}

	return remove_redundant_no_gen(r, array, cnt);

> Many things can be sped up by precomputing and storing the result in
> the commit-graph file and that is not all that interesting or
> surprising part of the story, I would think.  Rather, we want to
> ensure that we do not perform on the youngest part of the history
> any worse---that way, people will have strong incentive to enable
> commit-graph, as things will work superbly for older parts of the
> history, while not doing any worse than the original system for the
> newest parts of the history.

I'm definitely biased to a case where there is an up-to-date
commit-graph file, since I care most about server-side performance
or users with background maintenance enabled (and computing the
commit-graph frequently). You are right to call out that that is
not always the case.
 
> There was a side thread where somebody wished if they can remove
> support for all the codepaths that do not use commit-graph, but
> would this be an example of how such a wish is impractical, I have
> to wonder?

Some cases might allow de-duplication, such as
sort_in_topological_order(), that would have roughly the same
performance, give or take some overhead. It takes time to check
and see if that is the case, however.

Thanks,
-Stolee

[1/3] commit-reach: use one walk in remove_redundant()

Commit Message

Comments

Patch