Message ID | fae81b534b14c8227454ff94e385fb16faee0e99.1602079785.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v4,01/10] commit-graph: fix regression when computing Bloom filters | expand |
"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes: > From: Abhishek Kumar <abhishekkumar8222@gmail.com> > > commit_gen_cmp is used when writing a commit-graph to sort commits in > generation order before computing Bloom filters. Since c49c82aa (commit: > move members graph_pos, generation to a slab, 2020-06-17) made it so > that 'commit_graph_generation()' returns 'GENERATION_NUMBER_INFINITY' > during writing, we cannot call it within this function. Instead, access > the generation number directly through the slab (i.e., by calling > 'commit_graph_data_at(c)->generation') in order to access it while > writing. This description is all right, but I think it can be made more clear: When running `git commit-graph write --reachable --changed-paths` to compute Bloom filters for changed paths, commits are first sorted by generation number using 'commit_gen_cmp()'. Commits with similar generation are more likely to have many trees in common, making the diff faster, see 3d112755. However, since c49c82aa (commit: move members graph_pos, generation to a slab, 2020-06-17) made it so that 'commit_graph_generation()' returns 'GENERATION_NUMBER_INFINITY' during writing, we cannot call it within this function. Instead, access the generation number directly through the slab (i.e., by calling 'commit_graph_data_at(c)->generation') in order to access it while writing. Or something like that. We should also add an explanation why avoiding getter is safe here, perhaps adding the following line to the second paragraph: It is safe to do because 'commit_gen_cmp()' from commit-graph.c is static and used only when writing Bloom filters, and because writing changed-paths filters is done after computing generation numbers (if necessary). Or something like that. > > While measuring performance with `git commit-graph write --reachable > --changed-paths` on the linux repository led to around 1m40s for both > HEAD and master (and could be due to fault in my measurements), it is > still the "right" thing to do. I had to read the above paragraph several times to understand it, possibly because I have expected here to be a fix for a performance regression. The commit message for 3d112755 (commit-graph: examine commits by generation number) describes reduction of computation time from 3m00s to 1m37s. So I would expect performance with HEAD (i.e. before those changes) to be around 3m, not the same before and after changes being around 1m40s. Can anyone recheck this before-and-after benchmark, please? Anyway, it might be more clear to write it as the following: On the Linux kernel repository, this patch didn't reduce the computation time for 'git commit-graph write --reachable --changed-paths', which is around 1m40s both before and after this change. This could be a fault in my measurements; it is still the "right" thing to do. Or something like that. > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> > --- Anyway, it is nice and clear change. > commit-graph.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/commit-graph.c b/commit-graph.c > index cb042bdba8..94503e584b 100644 > --- a/commit-graph.c > +++ b/commit-graph.c > @@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb) > const struct commit *a = *(const struct commit **)va; > const struct commit *b = *(const struct commit **)vb; > > - uint32_t generation_a = commit_graph_generation(a); > - uint32_t generation_b = commit_graph_generation(b); > + uint32_t generation_a = commit_graph_data_at(a)->generation; > + uint32_t generation_b = commit_graph_data_at(b)->generation; > /* lower generation commits first */ > if (generation_a < generation_b) > return -1; Best,
On Sun, Oct 25, 2020 at 01:16:48AM +0200, Jakub Narębski wrote: > "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > While measuring performance with `git commit-graph write --reachable > > --changed-paths` on the linux repository led to around 1m40s for both > > HEAD and master (and could be due to fault in my measurements), it is > > still the "right" thing to do. > > I had to read the above paragraph several times to understand it, > possibly because I have expected here to be a fix for a performance > regression. The commit message for 3d112755 (commit-graph: examine > commits by generation number) describes reduction of computation time > from 3m00s to 1m37s. So I would expect performance with HEAD (i.e. > before those changes) to be around 3m, not the same before and after > changes being around 1m40s. > > Can anyone recheck this before-and-after benchmark, please? My hunch is that our heuristic to fall back to the commits 'date' value is saving us here. commit_gen_cmp() first compares the generation numbers, breaking ties by 'date' as a heuristic. But since all generation number queries return GENERATION_NUMBER_INFINITY during writing, we're relying on our heuristic entirely. I haven't looked much further than that, other than to see that I could get about a ~4sec speed-up with this patch as compared to v2.29.1 in the computing Bloom filters region on the kernel. > Anyway, it might be more clear to write it as the following: > > On the Linux kernel repository, this patch didn't reduce the > computation time for 'git commit-graph write --reachable > --changed-paths', which is around 1m40s both before and after this > change. This could be a fault in my measurements; it is still the > "right" thing to do. > > Or something like that. Assuming that we are in fact being saved by the "date" heuristic, I'd probably write the following commit message instead: Before computing Bloom filters, the commit-graph machinery uses commit_gen_cmp to sort commits by generation order for improved diff performance. 3d11275505 (commit-graph: examine commits by generation number, 2020-03-30) claims that this sort can reduce the time spent to compute Bloom filters by nearly half. But since c49c82aa4c (commit: move members graph_pos, generation to a slab, 2020-06-17), this optimization is broken, since asking for 'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY while writing. Not all hope is lost, though: 'commit_graph_generation()' falls back to comparing commits by their date when they have equal generation number, and so since c49c82aa4c is purely a date comparison function. This heuristic is good enough that we don't seem to loose appreciable performance while computing Bloom filters. [Benchmark that we loose about ~4sec before/after c49c82aa4c9...] So, avoid the uesless 'commit_graph_generation()' while writing by instead accessing the slab directly. This returns the newly-computed generation numbers, and allows us to avoid the heuristic by directly comparing generation numbers. Thanks, Taylor
On Sun, Oct 25, 2020 at 04:58:14PM -0400, Taylor Blau wrote: > On Sun, Oct 25, 2020 at 01:16:48AM +0200, Jakub Narębski wrote: > > "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > > > While measuring performance with `git commit-graph write --reachable > > > --changed-paths` on the linux repository led to around 1m40s for both > > > HEAD and master (and could be due to fault in my measurements), it is > > > still the "right" thing to do. > > > > I had to read the above paragraph several times to understand it, > > possibly because I have expected here to be a fix for a performance > > regression. The commit message for 3d112755 (commit-graph: examine > > commits by generation number) describes reduction of computation time > > from 3m00s to 1m37s. So I would expect performance with HEAD (i.e. > > before those changes) to be around 3m, not the same before and after > > changes being around 1m40s. > > > > Can anyone recheck this before-and-after benchmark, please? > > My hunch is that our heuristic to fall back to the commits 'date' > value is saving us here. commit_gen_cmp() first compares the generation > numbers, breaking ties by 'date' as a heuristic. But since all > generation number queries return GENERATION_NUMBER_INFINITY during > writing, we're relying on our heuristic entirely. > > I haven't looked much further than that, other than to see that I could > get about a ~4sec speed-up with this patch as compared to v2.29.1 in the > computing Bloom filters region on the kernel. > Thanks for benchmarking it. I wasn't sure if I am testing it correctly or the patch made no difference. > > Anyway, it might be more clear to write it as the following: > > > > On the Linux kernel repository, this patch didn't reduce the > > computation time for 'git commit-graph write --reachable > > --changed-paths', which is around 1m40s both before and after this > > change. This could be a fault in my measurements; it is still the > > "right" thing to do. > > > > Or something like that. > > Assuming that we are in fact being saved by the "date" heuristic, I'd > probably write the following commit message instead: > > Before computing Bloom filters, the commit-graph machinery uses > commit_gen_cmp to sort commits by generation order for improved diff > performance. 3d11275505 (commit-graph: examine commits by generation > number, 2020-03-30) claims that this sort can reduce the time spent to > compute Bloom filters by nearly half. > > But since c49c82aa4c (commit: move members graph_pos, generation to a > slab, 2020-06-17), this optimization is broken, since asking for > 'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY > while writing. > > Not all hope is lost, though: 'commit_graph_generation()' falls > back to comparing commits by their date when they have equal generation > number, and so since c49c82aa4c is purely a date comparison function. > This heuristic is good enough that we don't seem to loose appreciable > performance while computing Bloom filters. [Benchmark that we loose > about ~4sec before/after c49c82aa4c9...] > > So, avoid the uesless 'commit_graph_generation()' while writing by > instead accessing the slab directly. This returns the newly-computed > generation numbers, and allows us to avoid the heuristic by directly > comparing generation numbers. > That's a lot better, will change. > Thanks, > Taylor
diff --git a/commit-graph.c b/commit-graph.c index cb042bdba8..94503e584b 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb) const struct commit *a = *(const struct commit **)va; const struct commit *b = *(const struct commit **)vb; - uint32_t generation_a = commit_graph_generation(a); - uint32_t generation_b = commit_graph_generation(b); + uint32_t generation_a = commit_graph_data_at(a)->generation; + uint32_t generation_b = commit_graph_data_at(b)->generation; /* lower generation commits first */ if (generation_a < generation_b) return -1;