diff mbox series

[v4,01/10] commit-graph: fix regression when computing Bloom filters

Message ID fae81b534b14c8227454ff94e385fb16faee0e99.1602079785.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series [v4,01/10] commit-graph: fix regression when computing Bloom filters | expand

Commit Message

Phillip Wood via GitGitGadget Oct. 7, 2020, 2:09 p.m. UTC
From: Abhishek Kumar <abhishekkumar8222@gmail.com>

commit_gen_cmp is used when writing a commit-graph to sort commits in
generation order before computing Bloom filters. Since c49c82aa (commit:
move members graph_pos, generation to a slab, 2020-06-17) made it so
that 'commit_graph_generation()' returns 'GENERATION_NUMBER_INFINITY'
during writing, we cannot call it within this function. Instead, access
the generation number directly through the slab (i.e., by calling
'commit_graph_data_at(c)->generation') in order to access it while
writing.

While measuring performance with `git commit-graph write --reachable
--changed-paths` on the linux repository led to around 1m40s for both
HEAD and master (and could be due to fault in my measurements), it is
still the "right" thing to do.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Jakub Narębski Oct. 24, 2020, 11:16 p.m. UTC | #1
"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> commit_gen_cmp is used when writing a commit-graph to sort commits in
> generation order before computing Bloom filters. Since c49c82aa (commit:
> move members graph_pos, generation to a slab, 2020-06-17) made it so
> that 'commit_graph_generation()' returns 'GENERATION_NUMBER_INFINITY'
> during writing, we cannot call it within this function. Instead, access
> the generation number directly through the slab (i.e., by calling
> 'commit_graph_data_at(c)->generation') in order to access it while
> writing.

This description is all right, but I think it can be made more clear:

  When running `git commit-graph write --reachable --changed-paths` to
  compute Bloom filters for changed paths, commits are first sorted by
  generation number using 'commit_gen_cmp()'.  Commits with similar
  generation are more likely to have many trees in common, making the
  diff faster, see 3d112755.

  However, since c49c82aa (commit: move members graph_pos, generation to
  a slab, 2020-06-17) made it so that 'commit_graph_generation()'
  returns 'GENERATION_NUMBER_INFINITY' during writing, we cannot call it
  within this function.  Instead, access the generation number directly
  through the slab (i.e., by calling 'commit_graph_data_at(c)->generation')
  in order to access it while writing.

Or something like that.

We should also add an explanation why avoiding getter is safe here,
perhaps adding the following line to the second paragraph:

  It is safe to do because 'commit_gen_cmp()' from commit-graph.c is
  static and used only when writing Bloom filters, and because writing
  changed-paths filters is done after computing generation numbers (if
  necessary).

Or something like that.

>
> While measuring performance with `git commit-graph write --reachable
> --changed-paths` on the linux repository led to around 1m40s for both
> HEAD and master (and could be due to fault in my measurements), it is
> still the "right" thing to do.

I had to read the above paragraph several times to understand it,
possibly because I have expected here to be a fix for a performance
regression.  The commit message for 3d112755 (commit-graph: examine
commits by generation number) describes reduction of computation time
from 3m00s to 1m37s.  So I would expect performance with HEAD (i.e.
before those changes) to be around 3m, not the same before and after
changes being around 1m40s.

Can anyone recheck this before-and-after benchmark, please?

Anyway, it might be more clear to write it as the following:

  On the Linux kernel repository, this patch didn't reduce the
  computation time for 'git commit-graph write --reachable
  --changed-paths', which is around 1m40s both before and after this
  change.  This could be a fault in my measurements; it is still the
  "right" thing to do.

Or something like that.

> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---

Anyway, it is nice and clear change.

>  commit-graph.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index cb042bdba8..94503e584b 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
>  	const struct commit *a = *(const struct commit **)va;
>  	const struct commit *b = *(const struct commit **)vb;
>  
> -	uint32_t generation_a = commit_graph_generation(a);
> -	uint32_t generation_b = commit_graph_generation(b);
> +	uint32_t generation_a = commit_graph_data_at(a)->generation;
> +	uint32_t generation_b = commit_graph_data_at(b)->generation;
>  	/* lower generation commits first */
>  	if (generation_a < generation_b)
>  		return -1;

Best,
Taylor Blau Oct. 25, 2020, 8:58 p.m. UTC | #2
On Sun, Oct 25, 2020 at 01:16:48AM +0200, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > While measuring performance with `git commit-graph write --reachable
> > --changed-paths` on the linux repository led to around 1m40s for both
> > HEAD and master (and could be due to fault in my measurements), it is
> > still the "right" thing to do.
>
> I had to read the above paragraph several times to understand it,
> possibly because I have expected here to be a fix for a performance
> regression.  The commit message for 3d112755 (commit-graph: examine
> commits by generation number) describes reduction of computation time
> from 3m00s to 1m37s.  So I would expect performance with HEAD (i.e.
> before those changes) to be around 3m, not the same before and after
> changes being around 1m40s.
>
> Can anyone recheck this before-and-after benchmark, please?

My hunch is that our heuristic to fall back to the commits 'date'
value is saving us here. commit_gen_cmp() first compares the generation
numbers, breaking ties by 'date' as a heuristic. But since all
generation number queries return GENERATION_NUMBER_INFINITY during
writing, we're relying on our heuristic entirely.

I haven't looked much further than that, other than to see that I could
get about a ~4sec speed-up with this patch as compared to v2.29.1 in the
computing Bloom filters region on the kernel.

> Anyway, it might be more clear to write it as the following:
>
>   On the Linux kernel repository, this patch didn't reduce the
>   computation time for 'git commit-graph write --reachable
>   --changed-paths', which is around 1m40s both before and after this
>   change.  This could be a fault in my measurements; it is still the
>   "right" thing to do.
>
> Or something like that.

Assuming that we are in fact being saved by the "date" heuristic, I'd
probably write the following commit message instead:

  Before computing Bloom filters, the commit-graph machinery uses
  commit_gen_cmp to sort commits by generation order for improved diff
  performance. 3d11275505 (commit-graph: examine commits by generation
  number, 2020-03-30) claims that this sort can reduce the time spent to
  compute Bloom filters by nearly half.

  But since c49c82aa4c (commit: move members graph_pos, generation to a
  slab, 2020-06-17), this optimization is broken, since asking for
  'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
  while writing.

  Not all hope is lost, though: 'commit_graph_generation()' falls
  back to comparing commits by their date when they have equal generation
  number, and so since c49c82aa4c is purely a date comparison function.
  This heuristic is good enough that we don't seem to loose appreciable
  performance while computing Bloom filters. [Benchmark that we loose
  about ~4sec before/after c49c82aa4c9...]

  So, avoid the uesless 'commit_graph_generation()' while writing by
  instead accessing the slab directly. This returns the newly-computed
  generation numbers, and allows us to avoid the heuristic by directly
  comparing generation numbers.

Thanks,
Taylor
Abhishek Kumar Nov. 3, 2020, 5:36 a.m. UTC | #3
On Sun, Oct 25, 2020 at 04:58:14PM -0400, Taylor Blau wrote:
> On Sun, Oct 25, 2020 at 01:16:48AM +0200, Jakub Narębski wrote:
> > "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >
> > > While measuring performance with `git commit-graph write --reachable
> > > --changed-paths` on the linux repository led to around 1m40s for both
> > > HEAD and master (and could be due to fault in my measurements), it is
> > > still the "right" thing to do.
> >
> > I had to read the above paragraph several times to understand it,
> > possibly because I have expected here to be a fix for a performance
> > regression.  The commit message for 3d112755 (commit-graph: examine
> > commits by generation number) describes reduction of computation time
> > from 3m00s to 1m37s.  So I would expect performance with HEAD (i.e.
> > before those changes) to be around 3m, not the same before and after
> > changes being around 1m40s.
> >
> > Can anyone recheck this before-and-after benchmark, please?
> 
> My hunch is that our heuristic to fall back to the commits 'date'
> value is saving us here. commit_gen_cmp() first compares the generation
> numbers, breaking ties by 'date' as a heuristic. But since all
> generation number queries return GENERATION_NUMBER_INFINITY during
> writing, we're relying on our heuristic entirely.
> 
> I haven't looked much further than that, other than to see that I could
> get about a ~4sec speed-up with this patch as compared to v2.29.1 in the
> computing Bloom filters region on the kernel.
> 

Thanks for benchmarking it. I wasn't sure if I am testing it correctly
or the patch made no difference.

> > Anyway, it might be more clear to write it as the following:
> >
> >   On the Linux kernel repository, this patch didn't reduce the
> >   computation time for 'git commit-graph write --reachable
> >   --changed-paths', which is around 1m40s both before and after this
> >   change.  This could be a fault in my measurements; it is still the
> >   "right" thing to do.
> >
> > Or something like that.
> 
> Assuming that we are in fact being saved by the "date" heuristic, I'd
> probably write the following commit message instead:
> 
>   Before computing Bloom filters, the commit-graph machinery uses
>   commit_gen_cmp to sort commits by generation order for improved diff
>   performance. 3d11275505 (commit-graph: examine commits by generation
>   number, 2020-03-30) claims that this sort can reduce the time spent to
>   compute Bloom filters by nearly half.
> 
>   But since c49c82aa4c (commit: move members graph_pos, generation to a
>   slab, 2020-06-17), this optimization is broken, since asking for
>   'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
>   while writing.
> 
>   Not all hope is lost, though: 'commit_graph_generation()' falls
>   back to comparing commits by their date when they have equal generation
>   number, and so since c49c82aa4c is purely a date comparison function.
>   This heuristic is good enough that we don't seem to loose appreciable
>   performance while computing Bloom filters. [Benchmark that we loose
>   about ~4sec before/after c49c82aa4c9...]
> 
>   So, avoid the uesless 'commit_graph_generation()' while writing by
>   instead accessing the slab directly. This returns the newly-computed
>   generation numbers, and allows us to avoid the heuristic by directly
>   comparing generation numbers.
> 

That's a lot better, will change.

> Thanks,
> Taylor
diff mbox series

Patch

diff --git a/commit-graph.c b/commit-graph.c
index cb042bdba8..94503e584b 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -144,8 +144,8 @@  static int commit_gen_cmp(const void *va, const void *vb)
 	const struct commit *a = *(const struct commit **)va;
 	const struct commit *b = *(const struct commit **)vb;
 
-	uint32_t generation_a = commit_graph_generation(a);
-	uint32_t generation_b = commit_graph_generation(b);
+	uint32_t generation_a = commit_graph_data_at(a)->generation;
+	uint32_t generation_b = commit_graph_data_at(b)->generation;
 	/* lower generation commits first */
 	if (generation_a < generation_b)
 		return -1;