[1/1] commit-graph.c: avoid unnecessary tag dereference when merging

Message ID 4c79a9ea909ebff8c0987bcf95692da92e79bda4.1584762087.git.me@ttaylorr.com (mailing list archive)
State New, archived
Series commit-graph: avoid unnecessary tag dereference when merging

Commit Message

Taylor Blau March 21, 2020, 3:44 a.m. UTC
When performing a 'git commit-graph write' with '--split', the
commit-graph machinery calls 'merge_commit_graph()' after deciding on a
split strategy to optionally clean up any existing commit-graph
layers that were made obsolete by the split strategy [1].

At this time, 'merge_commit_graph()' checks each commit that it writes
into the merged graph to make sure that it still exists in the object
store.

To do this, it uses 'lookup_commit_reference_gently()', which accepts
either a commit object, or a tag that refers to a commit. However, since
all 'oid' arguments passed to this function are from within the
commit-graphs being merged, we never pass a tag, and so any
time we spend in 'deref_tag()' is wasted.

Improve the situation by using 'repo_has_object_file' to check if the
object still exists, and '{lookup,repo_parse}_commit()' to turn it into
a bona-fide 'struct commit *'.

In my testing environment, this improves the time to "merge" a split
commit-graph containing all reachable commits in the kernel by
re-writing the same commit-graph (effectively measuring the time it
takes to check that all of those commits still exist) from:

  Attempt 1: 9.614
  Attempt 2: 10.984
  Attempt 3: 10.39
  Attempt 4: 9.14
  Attempt 5: 9.439

  real	0m9.140s
  user	0m8.207s
  sys	0m0.602s

to:

  Attempt 1: 9.12
  Attempt 2: 8.904
  Attempt 3: 9.361
  Attempt 4: 9.288
  Attempt 5: 9.677

  real	0m8.904s
  user	0m8.208s
  sys	0m0.596s

yielding a modest ~2.6% improvement in the best timings from each run,
and ~7.4% improvement on average.

[1]: This can happen if, for example, the new commit-graph exceeds the
     maximum allowed factor on the number of commits.

Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Comments

Jeff King March 21, 2020, 5 a.m. UTC | #1
On Fri, Mar 20, 2020 at 09:44:23PM -0600, Taylor Blau wrote:

> In my testing environment, this improves the time to "merge" a split
> commit-graph containing all reachable commits in the kernel by
> re-writing the same commit-graph (effectively measuring the time it
> takes to check that all of those commits still exist) from:
> 
>   Attempt 1: 9.614
>   Attempt 2: 10.984
>   Attempt 3: 10.39
>   Attempt 4: 9.14
>   Attempt 5: 9.439
> 
>   real	0m9.140s
>   user	0m8.207s
>   sys	0m0.602s
> 
> to:
> 
>   Attempt 1: 9.12
>   Attempt 2: 8.904
>   Attempt 3: 9.361
>   Attempt 4: 9.288
>   Attempt 5: 9.677
> 
>   real	0m8.904s
>   user	0m8.208s
>   sys	0m0.596s
> 
> yielding a modest ~2.6% improvement in the best timings from each run,
> and ~7.4% improvement on average.

That still seems really slow to me. If we were truly eliding the load of
most of the commit objects, I'd expect an order of magnitude or so
improvement. For example, with a fully constructed commit graph in
linux.git, I get:

  $ time git -c core.commitGraph=1 rev-list HEAD | wc -l
  886922

  real	0m1.088s
  user	0m0.659s
  sys	0m1.161s

  $ time git -c core.commitGraph=0 rev-list HEAD | wc -l
  886922

  real	0m7.185s
  user	0m6.729s
  sys	0m1.882s

Obviously not the same operation, but that should give us a rough idea
that commit graph lookups are 6-7 times cheaper than loading the actual
objects. I don't remember the details of the case that originally led us
towards this patch. Can you share more of the setup you used to generate
the numbers above (which repo, but more importantly the commands to
create the initial state and then time the test).

The patch otherwise still makes sense to me, but I suspect there are
other similar optimizations nearby that we'll need to do in tandem.

-Peff
Junio C Hamano March 21, 2020, 5:01 a.m. UTC | #2
Taylor Blau <me@ttaylorr.com> writes:

> When performing a 'git commit-graph write' with '--split', the
> commit-graph machinery calls 'merge_commit_graph()' after deciding on a
> split strategy to optionally clean up any existing commit-graph
> layers that were made obsolete by the split strategy [1].
>
> At this time, 'merge_commit_graph()' checks each commit that it writes
> into the merged graph to make sure that it still exists in the object
> store.
>
> To do this, it uses 'lookup_commit_reference_gently()', which accepts
> either a commit object, or a tag that refers to a commit. However, since
> all 'oid' arguments passed to this function are from within the
> commit-graphs being merged, we never pass a tag, and so any
> time we spend in 'deref_tag()' is wasted.

Ahh, so my question on the cover letter was utterly off the mark.
It is that feeding a commit to deref_tag() is unneeded.  It is quite
surprising to hear that the deref_tag() call is _so_ expensive that it
wastes 7% of the total cycles.

The patch itself looks good.

Thanks.


>  	for (i = 0; i < g->num_commits; i++) {
>  		struct object_id oid;
> -		struct commit *result;
> +		struct commit *result = NULL;
>  
>  		display_progress(ctx->progress, i + 1);
>  
>  		load_oid_from_graph(g, i + offset, &oid);
>  
>  		/* only add commits if they still exist in the repo */
> -		result = lookup_commit_reference_gently(ctx->r, &oid, 1);
> +		if (repo_has_object_file(ctx->r, &oid)) {
> +			result = lookup_commit(ctx->r, &oid);
> +			if (repo_parse_commit(ctx->r, result))
> +				result = NULL;
> +		}
>  
>  		if (result) {
>  			ctx->commits.list[ctx->commits.nr] = result;
Taylor Blau March 21, 2020, 6:11 a.m. UTC | #3
On Sat, Mar 21, 2020 at 01:00:25AM -0400, Jeff King wrote:
> On Fri, Mar 20, 2020 at 09:44:23PM -0600, Taylor Blau wrote:
>
> > In my testing environment, this improves the time to "merge" a split
> > commit-graph containing all reachable commits in the kernel by
> > re-writing the same commit-graph (effectively measuring the time it
> > takes to check that all of those commits still exist) from:
> >
> >   Attempt 1: 9.614
> >   Attempt 2: 10.984
> >   Attempt 3: 10.39
> >   Attempt 4: 9.14
> >   Attempt 5: 9.439
> >
> >   real	0m9.140s
> >   user	0m8.207s
> >   sys	0m0.602s
> >
> > to:
> >
> >   Attempt 1: 9.12
> >   Attempt 2: 8.904
> >   Attempt 3: 9.361
> >   Attempt 4: 9.288
> >   Attempt 5: 9.677
> >
> >   real	0m8.904s
> >   user	0m8.208s
> >   sys	0m0.596s
> >
> > yielding a modest ~2.6% improvement in the best timings from each run,
> > and ~7.4% improvement on average.
>
> That still seems really slow to me. If we were truly eliding the load of
> most of the commit objects, I'd expect an order of magnitude or so
> improvement. For example, with a fully constructed commit graph in
> linux.git, I get:
>
>   $ time git -c core.commitGraph=1 rev-list HEAD | wc -l
>   886922
>
>   real	0m1.088s
>   user	0m0.659s
>   sys	0m1.161s
>
>   $ time git -c core.commitGraph=0 rev-list HEAD | wc -l
>   886922
>
>   real	0m7.185s
>   user	0m6.729s
>   sys	0m1.882s
>
> Obviously not the same operation, but that should give us a rough idea
> that commit graph lookups are 6-7 times cheaper than loading the actual
> objects. I don't remember the details of the case that originally led us
> towards this patch. Can you share more of the setup you used to generate
> the numbers above (which repo, but more importantly the commands to
> create the initial state and then time the test).

Sure. I'm running best-of-five on the time it takes to re-generate and
merge a commit-graph based on in-pack commits.

The script is (in linux.git):

  $ best-of-five \
      -p 'rm -rf .git/objects/info/commit-graph{,s/}; git commit-graph write --split=no-merge 2>/dev/null' \
      git commit-graph write --split=merge-all

So we're measuring the time it takes to crawl all the packs, decide on
the splitting strategy, and then check that all commits in the new
merged graph still exist in the object store.
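
For context, 'best-of-five' is just a small local wrapper, not anything
that ships with Git: it runs the '-p' prep command before each of five
timed runs of the remaining arguments, prints each attempt's wall-clock
time, and then prints the timing of the fastest run. A simplified
sketch, not the exact script, looks something like:

  #!/bin/bash
  # best-of-five [-p <prep>] <cmd> [args...]: time <cmd> five times,
  # running <prep> before each attempt, and report the fastest run.
  prep=
  if test "$1" = "-p"
  then
  	prep=$2
  	shift 2
  fi

  tmp=$(mktemp)
  best=$(mktemp)
  best_real=

  for i in 1 2 3 4 5
  do
  	test -n "$prep" && eval "$prep"
  	# bash's 'time' keyword writes its report to the shell's stderr
  	{ time -p "$@" >/dev/null 2>&1; } 2>"$tmp"
  	real=$(awk '/^real/ { print $2 }' "$tmp")
  	echo "Attempt $i: $real"
  	if test -z "$best_real" ||
  	   awk "BEGIN { exit !($real < $best_real) }"
  	then
  		best_real=$real
  		cp "$tmp" "$best"
  	fi
  done

  echo
  cat "$best"
  rm -f "$tmp" "$best"

So in the numbers above and below, the "Attempt N:" lines are per-run
wall-clock times, and the trailing real/user/sys block comes from the
fastest attempt.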

But, here's where things get... Bizarre. I was trying to come up with a
way to do fewer things and spend proportionally more time in
'merge_commit_graphs', so I did something like:

  - Generate a pack containing a single, empty commit.
  - Generate a split commit-graph containing commits in the single large
    pack containing all of history.
  - Generate a commit-graph of the small pack, and merge it with the
    large pack.

That script is:

  $ git --version
  $ git commit -m "empty" --allow-empty
  $ pack="pack-$(git rev-parse HEAD | git pack-objects .git/objects/pack/pack).idx"
  $ best-of-five \
      -p "rm -rf .git/objects/info/commit-graphs && cp -r .git/objects/info/commit-graphs{.bak,}" \
      sh -c "echo $pack | git commit-graph write --split=merge-all"

but things get... slower with this patch? Here are the results before
and after:

  Attempt 1: 8.444
  Attempt 2: 8.453
  Attempt 3: 8.391
  Attempt 4: 8.376
  Attempt 5: 7.859

  real	0m7.859s
  user	0m7.309s
  sys	0m0.511s

vs:

  Attempt 1: 8.69
  Attempt 2: 8.735
  Attempt 3: 8.619
  Attempt 4: 8.747
  Attempt 5: 8.695

  real	0m8.619s
  user	0m8.030s
  sys	0m0.538s

Without more profiling, I'm puzzled by why this patch seems to make
things *worse* under this scenario.

So, I'd appreciate your thoughts: does this set-up seem reasonable? Is
it introducing some latency that isn't being accounted for in the
original setup?

> The patch otherwise still makes sense to me, but I suspect there are
> other similar optimizations nearby that we'll need to do in tandem.
>
> -Peff

Thanks,
Taylor
Taylor Blau March 21, 2020, 6:24 a.m. UTC | #4
On Sat, Mar 21, 2020 at 12:11:41AM -0600, Taylor Blau wrote:
> On Sat, Mar 21, 2020 at 01:00:25AM -0400, Jeff King wrote:
> > On Fri, Mar 20, 2020 at 09:44:23PM -0600, Taylor Blau wrote:
> >
> > > In my testing environment, this improves the time to "merge" a split
> > > commit-graph containing all reachable commits in the kernel by
> > > re-writing the same commit-graph (effectively measuring the time it
> > > takes to check that all of those commits still exist) from:
> > >
> > >   Attempt 1: 9.614
> > >   Attempt 2: 10.984
> > >   Attempt 3: 10.39
> > >   Attempt 4: 9.14
> > >   Attempt 5: 9.439
> > >
> > >   real	0m9.140s
> > >   user	0m8.207s
> > >   sys	0m0.602s
> > >
> > > to:
> > >
> > >   Attempt 1: 9.12
> > >   Attempt 2: 8.904
> > >   Attempt 3: 9.361
> > >   Attempt 4: 9.288
> > >   Attempt 5: 9.677
> > >
> > >   real	0m8.904s
> > >   user	0m8.208s
> > >   sys	0m0.596s
> > >
> > > yielding a modest ~2.6% improvement in the best timings from each run,
> > > and ~7.4% improvement on average.
> >
> > That still seems really slow to me. If we were truly eliding the load of
> > most of the commit objects, I'd expect an order of magnitude or so
> > improvement. For example, with a fully constructed commit graph in
> > linux.git, I get:
> >
> >   $ time git -c core.commitGraph=1 rev-list HEAD | wc -l
> >   886922
> >
> >   real	0m1.088s
> >   user	0m0.659s
> >   sys	0m1.161s
> >
> >   $ time git -c core.commitGraph=0 rev-list HEAD | wc -l
> >   886922
> >
> >   real	0m7.185s
> >   user	0m6.729s
> >   sys	0m1.882s
> >
> > Obviously not the same operation, but that should give us a rough idea
> > that commit graph lookups are 6-7 times cheaper than loading the actual
> > objects. I don't remember the details of the case that originally led us
> > towards this patch. Can you share more of the setup you used to generate
> > the numbers above (which repo, but more importantly the commands to
> > create the initial state and then time the test).
>
> Sure. I'm running best-of-five on the time it takes to re-generate and
> merge a commit-graph based on in-pack commits.
>
> The script is (in linux.git):
>
>   $ best-of-five \
>       -p 'rm -rf .git/objects/info/commit-graph{,s/}; git commit-graph write --split=no-merge 2>/dev/null' \
>       git commit-graph write --split=merge-all
>
> So we're measuring the time it takes to crawl all the packs, decide on
> the splitting strategy, and then check that all commits in the new
> merged graph still exist in the object store.
>
> But, here's where things get... Bizarre. I was trying to come up with a
> way to do fewer things and spend proportionally more time in
> 'merge_commit_graphs', so I did something like:
>
>   - Generate a pack containing a single, empty commit.
>   - Generate a split commit-graph containing commits in the single large
>     pack containing all of history.
>   - Generate a commit-graph of the small pack, and merge it with the
>     large pack.
>
> That script is:
>
>   $ git --version
>   $ git commit -m "empty" --allow-empty
>   $ pack="pack-$(git rev-parse HEAD | git pack-objects .git/objects/pack/pack).idx"
>   $ best-of-five \
>       -p "rm -rf .git/objects/info/commit-graphs && cp -r .git/objects/info/commit-graphs{.bak,}" \
>       sh -c "echo $pack | git commit-graph write --split=merge-all"
>
> but things get... slower with this patch? Here are the results before
> and after:
>
>   Attempt 1: 8.444
>   Attempt 2: 8.453
>   Attempt 3: 8.391
>   Attempt 4: 8.376
>   Attempt 5: 7.859
>
>   real	0m7.859s
>   user	0m7.309s
>   sys	0m0.511s
>
> vs:
>
>   Attempt 1: 8.69
>   Attempt 2: 8.735
>   Attempt 3: 8.619
>   Attempt 4: 8.747
>   Attempt 5: 8.695
>
>   real	0m8.619s
>   user	0m8.030s
>   sys	0m0.538s
>
> Without more profiling, I'm puzzled by why this patch seems to make
> things *worse* under this scenario.

Hmm. I tried roughly the same setup as the latter of the above two, but
this time under perf. On a warm cache, it does look like this patch is
an improvement:

  $ sudo perf diff perf.data perf.data.patch | grep merge_commit
       0.11%   -0.02%  git                [.] merge_commit_graph

(perf.data is from running 'next', perf.data.patch is from running
'next' with this patch applied on top). So... a slight improvement, but
nothing like what I had reported in the original patch.

I am puzzled.

> So, I'd appreciate your thoughts: does this set-up seem reasonable? Is
> it introducing some latency that isn't being accounted for in the
> original setup?
>
> > The patch otherwise still makes sense to me, but I suspect there are
> > other similar optimizations nearby that we'll need to do in tandem.
> >
> > -Peff
>
> Thanks,
> Taylor

Thanks,
Taylor
Jeff King March 21, 2020, 7:03 a.m. UTC | #5
On Sat, Mar 21, 2020 at 12:11:41AM -0600, Taylor Blau wrote:

> Sure. I'm running best-of-five on the time it takes to re-generate and
> merge a commit-graph based on in-pack commits.
> 
> The script is (in linux.git):
> 
>   $ best-of-five \
>       -p 'rm -rf .git/objects/info/commit-graph{,s/}; git commit-graph write --split=no-merge 2>/dev/null' \
>       git commit-graph write --split=merge-all

If I build Git without your patch and run the "--split=merge-all"
command under a debugger, and then break on parse_object() I find that
all of the commits are already parsed. This happens in
close_reachable().

So we weren't actually reading all of the commits even under the old
code. We were just going into deref_tag(), seeing that the object is
already parsed, and then quickly returning when we see that we already
have an OBJ_COMMIT. I suspect most of your timing differences are mostly
noise.

Perhaps a more interesting case is when you're _not_ adding all of the
existing packed commits as input. There we'd feed only a few objects to
close_reachable(), and it would stop traversing as soon as it hits a
parent that's already in a graph file. So most of the existing objects
would remain unparsed.

I'm not sure how to do that, though. Saying "--input=none" still puts
all of those existing graphed objects into the list of oids to include.
I think you'd need a case where you were legitimately only adding a few
commits, but the merge rules say we need to create one big commit-graph
file.

I guess --input=stdin-commits is a good way to simulate that. Try this
(assuming there's already a split-graph file with all of the commits in
it):

  git rev-parse HEAD >input
  time git commit-graph write --input=stdin-commits --split=merge-all <input

Without your patch on linux.git I get:

  real	0m11.713s
  user	0m11.349s
  sys	0m0.341s

but with it I get:

  real	0m2.305s
  user	0m2.177s
  sys	0m0.100s

A more realistic case would probably be feeding a new small pack to
--input=stdin-packs.

> But, here's where things get... Bizarre. I was trying to come up with a
> way to do fewer things and spend proportionally more time in
> 'merge_commit_graphs', so I did something like:
> 
>   - Generate a pack containing a single, empty commit.
>   - Generate a split commit-graph containing commits in the single large
>     pack containing all of history.
>   - Generate a commit-graph of the small pack, and merge it with the
>     large pack.
> 
> That script is:
> 
>   $ git --version
>   $ git commit -m "empty" --allow-empty
>   $ pack="pack-$(git rev-parse HEAD | git pack-objects .git/objects/pack/pack).idx"
>   $ best-of-five \
>       -p "rm -rf .git/objects/info/commit-graphs && cp -r .git/objects/info/commit-graphs{.bak,}" \
>       sh -c "echo $pack | git commit-graph write --split=merge-all"

I think you'd need --stdin-packs in the actual timed command?

At any rate, I think there is a demonstrable speedup there. But
moreover, I'm pretty sure this existing code is not doing what it
expects:

  /* only add commits if they still exist in the repo */
  result = lookup_commit_reference_gently(ctx->r, &oid, 1);

That won't look at the object database at all if the commit is already
marked as parsed. And that parsing might have come from the commit graph
itself, as our earlier attempts showed. So switching to a real
has_object_file() call is an important _correctness_ fix, even leaving
aside the speed improvements.

-Peff
Taylor Blau March 21, 2020, 5:27 p.m. UTC | #6
On Sat, Mar 21, 2020 at 03:03:33AM -0400, Jeff King wrote:
> On Sat, Mar 21, 2020 at 12:11:41AM -0600, Taylor Blau wrote:
>
> > Sure. I'm running best-of-five on the time it takes to re-generate and
> > merge a commit-graph based on in-pack commits.
> >
> > The script is (in linux.git):
> >
> >   $ best-of-five \
> >       -p 'rm -rf .git/objects/info/commit-graph{,s/}; git commit-graph write --split=no-merge 2>/dev/null' \
> >       git commit-graph write --split=merge-all
>
> If I build Git without your patch and run the "--split=merge-all"
> command under a debugger, and then break on parse_object() I find that
> all of the commits are already parsed. This happens in
> close_reachable().
>
> So we weren't actually reading all of the commits even under the old
> code. We were just going into deref_tag(), seeing that the object is
> already parsed, and then quickly returning when we see that we already
> have an OBJ_COMMIT. I suspect most of your timing differences are mostly
> noise.

Aha. Thanks for reasoning through this. That makes sense, and I think
that you're probably right.

> Perhaps a more interesting case is when you're _not_ adding all of the
> existing packed commits as input. There we'd feed only a few objects to
> close_reachable(), and it would stop traversing as soon as it hits a
> parent that's already in a graph file. So most of the existing objects
> would remain unparsed.
>
> I'm not sure how to do that, though. Saying "--input=none" still puts
> all of those existing graphed objects into the list of oids to include.
> I think you'd need a case where you were legitimately only adding a few
> commits, but the merge rules say we need to create one big commit-graph
> file.

You have to be careful, since we're taking the reachability closure over
any commits that you do give. So, one thing you could do to ensure that
you have an actually small graph is to take something from the output of
'git rev-list --max-parents=0 HEAD'.

To try and reproduce your results, I used '1da177e4c3', which is the
kernel's first commit in Git. If my interpretation of your setup is
faithful, it goes something like:

  $ graphdir=.git/objects/info/commit-graphs
  $ git rev-parse 1da177e4c3 |
    git commit-graph write --split=no-merge --stdin-commits
  $ cp -r "$graphdir"{,.bak}

  $ best-of-five -p "rm -rf $graphdir && cp -r $graphdir{.bak,}" \
    'git commit-graph write --split=merge-all'

Where the last step is taking all commits listed in any pack, which is
cheap to iterate.

> I guess --input=stdin-commits is a good way to simulate that. Try this
> (assuming there's already a split-graph file with all of the commits in
> it):
>
>   git rev-parse HEAD >input
>   time git commit-graph write --input=stdin-commits --split=merge-all <input
>
> Without your patch on linux.git I get:
>
>   real	0m11.713s
>   user	0m11.349s
>   sys	0m0.341s
>
> but with it I get:
>
>   real	0m2.305s
>   user	0m2.177s
>   sys	0m0.100s

In the above setup, I get something like:

  git version 2.26.0.rc2.221.ge327a58236
  Attempt 1: 16.933
  Attempt 2: 18.101
  Attempt 3: 17.603
  Attempt 4: 20.404
  Attempt 5: 18.871

  real	0m16.933s
  user	0m16.440s
  sys	0m0.472s

versus:

  git version 2.26.0.rc2.222.g295e7905ee
  Attempt 1: 5.34
  Attempt 2: 4.623
  Attempt 3: 5.263
  Attempt 4: 5.268
  Attempt 5: 5.606

  real	0m4.623s
  user	0m4.428s
  sys	0m0.176s

which is a best-case savings of ~72.7%, and an average savings of
~71.5%. That seems much better.

If you think that this is a realistic setup, I'll amend the original
patch text to include these updated numbers, along with the reproduction
steps.

> A more realistic case would probably be feeding a new small pack to
> --input=stdin-packs.
>
> > But, here's where things get... Bizarre. I was trying to come up with a
> > way to do fewer things and spend proportionally more time in
> > 'merge_commit_graphs', so I did something like:
> >
> >   - Generate a pack containing a single, empty commit.
> >   - Generate a split commit-graph containing commits in the single large
> >     pack containing all of history.
> >   - Generate a commit-graph of the small pack, and merge it with the
> >     large pack.
> >
> > That script is:
> >
> >   $ git --version
> >   $ git commit -m "empty" --allow-empty
> >   $ pack="pack-$(git rev-parse HEAD | git pack-objects .git/objects/pack/pack).idx"
> >   $ best-of-five \
> >       -p "rm -rf .git/objects/info/commit-graphs && cp -r .git/objects/info/commit-graphs{.bak,}" \
> >       sh -c "echo $pack | git commit-graph write --split=merge-all"
>
> I think you'd need --stdin-packs in the actual timed command?
>
> At any rate, I think there is a demonstrable speedup there. But
> moreover, I'm pretty sure this existing code is not doing what it
> expects:
>
>   /* only add commits if they still exist in the repo */
>   result = lookup_commit_reference_gently(ctx->r, &oid, 1);
>
> That won't look at the object database at all if the commit is already
> marked as parsed. And that parsing might have come from the commit graph
> itself, as our earlier attempts showed. So switching to a real
> has_object_file() call is an important _correctness_ fix, even leaving
> aside the speed improvements.

That's a great point. I hadn't thought of this myself, but I agree with
your reasoning here, too, and think that this is an important
correctness fix even if it yielded no performance benefit (thankfully,
we get both :)).

> -Peff

Thanks,
Taylor
Junio C Hamano March 21, 2020, 6:50 p.m. UTC | #7
Jeff King <peff@peff.net> writes:

> So we weren't actually reading all of the commits even under the old
> code. We were just going into deref_tag(), seeing that the object is
> already parsed, and then quickly returning when we see that we already
> have an OBJ_COMMIT. I suspect most of your timing differences are mostly
> noise.

Noise that contributes largely to ~7% fluctuation?  That sounds big.

> I guess --input=stdin-commits is a good way to simulate that. Try this
> (assuming there's already a split-graph file with all of the commits in
> it):
>
>   git rev-parse HEAD >input
>   time git commit-graph write --input=stdin-commits --split=merge-all <input
> ...
> A more realistic case would probably be feeding a new small pack to
> --input=stdin-packs.

Makes sense.

> At any rate, I think there is a demonstrable speedup there. But
> moreover, I'm pretty sure this existing code is not doing what it
> expects:
>
>   /* only add commits if they still exist in the repo */
>   result = lookup_commit_reference_gently(ctx->r, &oid, 1);
>
> That won't look at the object database at all if the commit is already
> marked as parsed. And that parsing might have come from the commit graph
> itself, as our earlier attempts showed. So switching to a real
> has_object_file() call is an important _correctness_ fix, even leaving
> aside the speed improvements.

True. has_object_file() asks the question without even the
lookup-replace flag set, so it would try to see if the object is
truly there (modulo the "pretended" ones declared to exist via
the cached-object mechanism).

Do we need to worry about INFO_QUICK and SKIP_FETCH_OBJECT in this
codepath, by the way?
Derrick Stolee March 22, 2020, 12:03 a.m. UTC | #8
On 3/21/2020 2:50 PM, Junio C Hamano wrote:
> Do we need to worry about INFO_QUICK and SKIP_FETCH_OBJECT in this
> codepath, by the way?

I was coming back to this thread to bring up these exact flags for
consideration. The good news is that in a partial clone with any
amount of filtering we will still have all reachable commits, which
are necessary for the commit-graph to make sense. The only ones that
would fail has_object_file() are ones removed by GC, but they may
still exist on the remote. So without SKIP_FETCH_OBJECT, we would
generate a network call even if the server has GC'd to remove the
commits. This gets particularly bad when the server returns all
reachable objects from that commit!

Thanks,
-Stolee
Taylor Blau March 22, 2020, 12:20 a.m. UTC | #9
On Sat, Mar 21, 2020 at 08:03:01PM -0400, Derrick Stolee wrote:
> On 3/21/2020 2:50 PM, Junio C Hamano wrote:
> > Do we need to worry about INFO_QUICK and SKIP_FETCH_OBJECT in this
> > codepath, by the way?
>
> I was coming back to this thread to bring up these exact flags for
> consideration. The good news is that in a partial clone with any
> amount of filtering we will still have all reachable commits, which
> are necessary for the commit-graph to make sense. The only ones that
> would fail has_object_file() are ones removed by GC, but they may
> still exist on the remote. So without SKIP_FETCH_OBJECT, we would
> generate a network call even if the server has GC'd to remove the
> commits. This gets particularly bad when the server returns all
> reachable objects from that commit!

That makes sense. Do you think something like this should be applied?

diff --git a/commit-graph.c b/commit-graph.c
index c7cfadc786..0097318798 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1594,6 +1594,7 @@ static void merge_commit_graph(struct write_commit_graph_context *ctx,
 {
        uint32_t i;
        uint32_t offset = g->num_commits_in_base;
+       int flags = OBJECT_INFO_QUICK | OBJECT_INFO_SKIP_FETCH_OBJECT;

        ALLOC_GROW(ctx->commits.list, ctx->commits.nr + g->num_commits, ctx->commits.alloc);

@@ -1606,7 +1607,7 @@ static void merge_commit_graph(struct write_commit_graph_context *ctx,
                load_oid_from_graph(g, i + offset, &oid);

                /* only add commits if they still exist in the repo */
-               if (repo_has_object_file(ctx->r, &oid)) {
+               if (repo_has_object_file_with_flags(ctx->r, &oid, flags)) {
                        result = lookup_commit(ctx->r, &oid);
                        if (repo_parse_commit(ctx->r, result))
                                result = NULL;

> Thanks,
> -Stolee

Thanks,
Taylor
Derrick Stolee March 22, 2020, 12:23 a.m. UTC | #10
On 3/21/2020 8:20 PM, Taylor Blau wrote:
> On Sat, Mar 21, 2020 at 08:03:01PM -0400, Derrick Stolee wrote:
>> On 3/21/2020 2:50 PM, Junio C Hamano wrote:
>>> Do we need to worry about INFO_QUICK and SKIP_FETCH_OBJECT in this
>>> codepath, by the way?
>>
>> I was coming back to this thread to bring up these exact flags for
>> consideration. The good news is that in a partial clone with any
>> amount of filtering we will still have all reachable commits, which
>> are necessary for the commit-graph to make sense. The only ones that
>> would fail has_object_file() are ones removed by GC, but they may
>> still exist on the remote. So without SKIP_FETCH_OBJECT, we would
>> generate a network call even if the server has GC'd to remove the
>> commits. This gets particularly bad when the server returns all
>> reachable objects from that commit!
> 
> That makes sense. Do you think something like this should be applied?

Yes, I think this is the appropriate update. Thanks!

> diff --git a/commit-graph.c b/commit-graph.c
> index c7cfadc786..0097318798 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1594,6 +1594,7 @@ static void merge_commit_graph(struct write_commit_graph_context *ctx,
>  {
>         uint32_t i;
>         uint32_t offset = g->num_commits_in_base;
> +       int flags = OBJECT_INFO_QUICK | OBJECT_INFO_SKIP_FETCH_OBJECT;
> 
>         ALLOC_GROW(ctx->commits.list, ctx->commits.nr + g->num_commits, ctx->commits.alloc);
> 
> @@ -1606,7 +1607,7 @@ static void merge_commit_graph(struct write_commit_graph_context *ctx,
>                 load_oid_from_graph(g, i + offset, &oid);
> 
>                 /* only add commits if they still exist in the repo */
> -               if (repo_has_object_file(ctx->r, &oid)) {
> +               if (repo_has_object_file_with_flags(ctx->r, &oid, flags)) {
>                         result = lookup_commit(ctx->r, &oid);
>                         if (repo_parse_commit(ctx->r, result))
>                                 result = NULL;
Jeff King March 22, 2020, 5:36 a.m. UTC | #11
On Sat, Mar 21, 2020 at 11:27:16AM -0600, Taylor Blau wrote:

> > I'm not sure how to do that, though. Saying "--input=none" still puts
> > all of those existing graphed objects into the list of oids to include.
> > I think you'd need a case where you were legitimately only adding a few
> > commits, but the merge rules say we need to create one big commit-graph
> > file.
> 
> You have to be careful, since we're taking the reachability closure over
> any commits that you do give. So, one thing you could do to ensure that
> you have an actually small graph is to take something from the output of
> 'git rev-list --max-parents=0 HEAD'.

I don't think you need to be that careful, though. In split-mode,
close_reachable() will stop traversing when it finds a graphed commit.
That's why using the tip of HEAD in my previous example worked.

> To try and reproduce your results, I used '1da177e4c3', which is the
> kernel's first commit in Git. If my interpretation of your setup is
> faithful, it goes something like:
> 
>   $ graphdir=.git/objects/info/commit-graphs
>   $ git rev-parse 1da177e4c3 |
>     git commit-graph write --split=no-merge --stdin-commits
>   $ cp -r "$graphdir"{,.bak}
> 
>   $ best-of-five -p "rm -rf $graphdir && cp -r $graphdir{.bak,}" \
>     'git commit-graph write --split=merge-all'

My case is the opposite, isn't it? Here it looks like you've made a very
_tiny_ commit-graph file (with one commit), and then you're going to end
up adding in all of the new objects. I don't think it would be improved
much by this patch (which makes me very confused by the numbers you got
below).

I also don't think it's that interesting a real-world case.

The more interesting one is where you do already have a big
commit-graph, and want to add just a bit more to it. In the real world,
that might look something like this:

  # create a fake server repo
  git clone --bare . dst.git

  # imagine the server already has everything in a graph file
  git -C dst.git commit-graph write --split=no-merge --reachable

  # and now do a small push
  git commit --allow-empty -m foo
  git push dst.git

  # the server might do an incremental immediately to cover the new
  # objects; here we'll use --stdin-commits with the new data, but a
  # real server might feed the new packfile. We'd probably just use
  # regular --split here in practice, but let's imagine that we're
  # starting to have a lot of graph files, and that triggers a desire to
  # merge. We'll force that state with --split=merge-all.
  git rev-list HEAD^..HEAD |
  git -C dst.git commit-graph write --split=merge-all --stdin-commits

Without your patch, that takes ~11s for me. With it, it takes ~2s.

Another equally interesting case is if the per-push generation _doesn't_
merge anything, and just creates a new, tiny graph file. And then later
we want to do a real maintenance, merging them all down. I think that
would be something like:

  git -C dst.git commit-graph write --input=none --split=merge-all

But that _isn't_ improved by your patch. For the simple reason that all
of the commits will already have been parsed. The "--input=none" option
isn't "no input"; it's actually "take all existing graphed objects as
input" (i.e., it implies --append). So each of those objects will
already have been examined in an earlier stage.

> Where the last step is taking all commits listed in any pack, which is
> cheap to iterate.

I'm not sure it's all that cheap. It has to find the type of every
object in every pack. And finding types involves walking delta chains.
That's something like 7s on my machine for linux.git (compared to the 2s
in which I can just merge down the existing graph files).

> In the above setup, I get something like:
> 
>   git version 2.26.0.rc2.221.ge327a58236
>   Attempt 1: 16.933
>   Attempt 2: 18.101
>   Attempt 3: 17.603
>   Attempt 4: 20.404
>   Attempt 5: 18.871
> 
>   real	0m16.933s
>   user	0m16.440s
>   sys	0m0.472s
> 
> versus:
> 
>   git version 2.26.0.rc2.222.g295e7905ee
>   Attempt 1: 5.34
>   Attempt 2: 4.623
>   Attempt 3: 5.263
>   Attempt 4: 5.268
>   Attempt 5: 5.606
> 
>   real	0m4.623s
>   user	0m4.428s
>   sys	0m0.176s
> 
> > which is a best-case savings of ~72.7%, and an average savings of
> > ~71.5%. That seems much better.

I'm still puzzled by this. In the setup you showed, hardly anything is
graphed. But the change is only in the graph-merge code.

-Peff
Jeff King March 22, 2020, 5:49 a.m. UTC | #12
On Sat, Mar 21, 2020 at 08:23:13PM -0400, Derrick Stolee wrote:

> On 3/21/2020 8:20 PM, Taylor Blau wrote:
> > On Sat, Mar 21, 2020 at 08:03:01PM -0400, Derrick Stolee wrote:
> >> On 3/21/2020 2:50 PM, Junio C Hamano wrote:
> >>> Do we need to worry about INFO_QUICK and SKIP_FETCH_OBJECT in this
> >>> codepath, by the way?
> >>
> >> I was coming back to this thread to bring up these exact flags for
> >> consideration. The good news is that in a partial clone with any
> >> amount of filtering we will still have all reachable commits, which
> >> are necessary for the commit-graph to make sense. The only ones that
> >> would fail has_object_file() are ones removed by GC, but they may
> >> still exist on the remote. So without SKIP_FETCH_OBJECT, we would
> >> generate a network call even if the server has GC'd to remove the
> >> commits. This gets particularly bad when the server returns all
> >> reachable objects from that commit!
> > 
> > That makes sense. Do you think something like this should be applied?
> > +       int flags = OBJECT_INFO_QUICK | OBJECT_INFO_SKIP_FETCH_OBJECT;
> [...]
> 
> Yes, I think this is the appropriate update. Thanks!

I'm not so sure.

In a non-partial clone, when would we expect QUICK to matter? If we find
the object is missing, then either:

  - we are racing with repack, and we would benefit from double-checking
    that the object really is gone

  - we used to have it (since it was mentioned in the graph file) but it
    went away

Using QUICK means we won't waste time double-checking in the second
case. But it means we won't catch the first case, and we may generate a
new graph file that omits the object. They're both optimizations, and I
don't think we're impacting correctness[1], but it's not clear to me
whether one is a win over the other. We don't generally expect objects
we have to go away often.

Skipping fetching seems a little more straight-forward to me. If we had
it in a graph file, presumably we had the actual object before, too. And
either we're in the first case above (we really do have it and just need
to double-check) in which case not saying QUICK would be enough. Or we
intentionally got rid of it. In which case downloading it just to
generate a cache is quite silly.

So it seems like SKIP_FETCH_OBJECT but _not_ QUICK might be reasonable
here.

-Peff

[1] I'm actually not quite sure about correctness here. It should be
    fine to generate a graph file without any given commit; readers will
    just have to load that commit the old-fashioned way. But at this
    phase of "commit-graph write", I think we'll already have done the
    close_reachable() check. What does it mean to throw away a commit at
    this stage? If we're the parent of another commit, then it will have
    trouble referring to us by a uint32_t. Will the actual writing phase
    barf, or will we generate an invalid graph file?
Jeff King March 22, 2020, 6:04 a.m. UTC | #13
On Sun, Mar 22, 2020 at 01:49:16AM -0400, Jeff King wrote:

> [1] I'm actually not quite sure about correctness here. It should be
>     fine to generate a graph file without any given commit; readers will
>     just have to load that commit the old-fashioned way. But at this
>     phase of "commit-graph write", I think we'll already have done the
>     close_reachable() check. What does it mean to throw away a commit at
>     this stage? If we're the parent of another commit, then it will have
>     trouble referring to us by a uint32_t. Will the actual writing phase
>     barf, or will we generate an invalid graph file?

It doesn't seem great. If I instrument Git like this to simulate an
object temporarily "missing" (if it were really missing the whole repo
would be corrupt; we're trying to see what would happen if a race causes
us to momentarily not see it):

diff --git a/commit-graph.c b/commit-graph.c
index 3da52847e4..71419c2532 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1596,6 +1596,19 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 	}
 }
 
+static int pretend_commit_is_missing(const struct object_id *oid)
+{
+	static int initialized;
+	static struct object_id missing;
+	if (!initialized) {
+		const char *x = getenv("PRETEND_COMMIT_IS_MISSING");
+		if (x)
+			get_oid_hex(x, &missing);
+		initialized = 1;
+	}
+	return oideq(&missing, oid);
+}
+
 static void merge_commit_graph(struct write_commit_graph_context *ctx,
 			       struct commit_graph *g)
 {
@@ -1612,6 +1625,11 @@ static void merge_commit_graph(struct write_commit_graph_context *ctx,
 
 		load_oid_from_graph(g, i + offset, &oid);
 
+		if (pretend_commit_is_missing(&oid)) {
+			warning("pretending %s is missing", oid_to_hex(&oid));
+			continue;
+		}
+
 		/* only add commits if they still exist in the repo */
 		result = lookup_commit_reference_gently(ctx->r, &oid, 1);
 

and then I make a fully-graphed repo like this:

  git init repo
  cd repo
  for i in $(seq 10); do
    git commit --allow-empty -m $i
  done
  git commit-graph write --input=reachable --split=no-merge

if we pretend a parent is missing, I get a BUG():

  $ git rev-parse HEAD |
    PRETEND_COMMIT_IS_MISSING=$(git rev-parse HEAD^) \
    git commit-graph write --stdin-commits --split=merge-all
  warning: pretending 35e6e15c738cf2bfbe495957b2a941c2efe86dd9 is missing
  BUG: commit-graph.c:879: missing parent 35e6e15c738cf2bfbe495957b2a941c2efe86dd9 for commit d4141fb57a9bbe26b247f23c790d63d078977833
  Aborted

So it seems like just skipping here (either with the new patch or
without) isn't really a good strategy.

-Peff
SZEDER Gábor March 22, 2020, 11:04 a.m. UTC | #14
On Sun, Mar 22, 2020 at 01:36:35AM -0400, Jeff King wrote:
> The "--input=none" option
> isn't "no input"; it's actually "take all existing graphed objects as
> input" (i.e., it implies --append). So each of those objects will
> already have been examined in an earlier stage.

Yeah, we discussed earlier that '--input=none' doesn't properly
describe what it tries to accomplish, and suggested '--input=exists'
(or was it 'existing'?) instead.

> > Where the last step is taking all commits listed in any pack, which is
> > cheap to iterate.
> 
> I'm not sure it's all that cheap. It has to find the type of every
> object in every pack. And finding types involves walking delta chains.

Indeed, that's just about the most expensive way to find all commits;
in my experience it can be well over an order of magnitude slower than
'--reachable'.
Taylor Blau March 22, 2020, 3:44 p.m. UTC | #15
On Sun, Mar 22, 2020 at 01:49:16AM -0400, Jeff King wrote:
> On Sat, Mar 21, 2020 at 08:23:13PM -0400, Derrick Stolee wrote:
>
> > On 3/21/2020 8:20 PM, Taylor Blau wrote:
> > > On Sat, Mar 21, 2020 at 08:03:01PM -0400, Derrick Stolee wrote:
> > >> On 3/21/2020 2:50 PM, Junio C Hamano wrote:
> > >>> Do we need to worry about INFO_QUICK and SKIP_FETCH_OBJECT in this
> > >>> codepath, by the way?
> > >>
> > >> I was coming back to this thread to bring up these exact flags for
> > >> consideration. The good news is that in a partial clone with any
> > >> amount of filtering we will still have all reachable commits, which
> > >> are necessary for the commit-graph to make sense. The only ones that
> > >> would fail has_object_file() are ones removed by GC, but they may
> > >> still exist on the remote. So without SKIP_FETCH_OBJECT, we would
> > >> generate a network call even if the server has GC'd to remove the
> > >> commits. This gets particularly bad when the server returns all
> > >> reachable objects from that commit!
> > >
> > > That makes sense. Do you think something like this should be applied?
> > > +       int flags = OBJECT_INFO_QUICK | OBJECT_INFO_SKIP_FETCH_OBJECT;
> > [...]
> >
> > Yes, I think this is the appropriate update. Thanks!
>
> I'm not so sure.
>
> In a non-partial clone, when would we expect QUICK to matter? If we find
> the object is missing, then either:
>
>   - we are racing with repack, and we would benefit from double-checking
>     that the object really is gone
>
>   - we used to have it (since it was mentioned in the graph file) but it
>     went away
>
> Using QUICK means we won't waste time double-checking in the second
> case. But it means we won't catch the first case, and we may generate a
> new graph file that omits the object. They're both optimizations, and I
> don't think we're impacting correctness[1], but it's not clear to me
> whether one is a win over the other. We don't generally expect objects
> we have to go away often.
>
> Skipping fetching seems a little more straight-forward to me. If we had
> it in a graph file, presumably we had the actual object before, too. And
> either we're in the first case above (we really do have it and just need
> to double-check) in which case not saying QUICK would be enough. Or we
> intentionally got rid of it. In which case downloading it just to
> generate a cache is quite silly.

I was going to write that I'm not entirely sure of this, but I tried to
talk myself through it below, and I think that the right flag is *only*
OBJECT_INFO_SKIP_FETCH_OBJECT.

First, I agree with your reasoning that we shouldn't have
OBJECT_INFO_QUICK, that much makes sense to me.

To consider OBJECT_INFO_SKIP_FETCH_OBJECT, let's say that we used to
have some commit, and it got GC'd away. This shouldn't happen unless a
commit is unreachable, so if we ever find ourselves in a situation where
the parent of some other commit is missing, we know that that represents
a true corruption, not the result of a normal 'git gc'.

Now, if we do find ourselves in this case, we know that a 'git
commit-graph write --input=none --split' *would* work because we would
find that the unreachable commit (if it were in the graph) was missing,
and 'OBJECT_INFO_SKIP_FETCH_OBJECT' would tell us not to go looking for
it. Likewise, if the commit weren't already in the graph, this would
work fine, too.

But, most of the time we won't even get a chance to write a new
commit-graph based on existing ones, since 'gc.writeCommitGraph' will
take care of this for us in-process with 'git gc'. This uses
'--reachable' without '--split', so we won't ever invoke
'merge_commit_graph{s,}'.

> So it seems like SKIP_FETCH_OBJECT but _not_ QUICK might be reasonable
> here.

I agree with your original reasoning that OBJECT_INFO_QUICK is the wrong
choice, but I think my reasoning above that
OBJECT_INFO_SKIP_FETCH_OBJECT *is* an appropriate flag is sound.
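
Concretely, I'd expect the hunk from my earlier patch to end up looking
something like this (just a sketch of where I think this is heading, not
necessarily the final version):

  		/* only add commits if they still exist in the repo */
  		if (repo_has_object_file_with_flags(ctx->r, &oid,
  						    OBJECT_INFO_SKIP_FETCH_OBJECT)) {
  			/* no OBJECT_INFO_QUICK, per the reasoning above */
  			result = lookup_commit(ctx->r, &oid);
  			if (repo_parse_commit(ctx->r, result))
  				result = NULL;
  		}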

> -Peff
>
> [1] I'm actually not quite sure about correctness here. It should be
>     fine to generate a graph file without any given commit; readers will
>     just have to load that commit the old-fashioned way. But at this
>     phase of "commit-graph write", I think we'll already have done the
>     close_reachable() check. What does it mean to throw away a commit at
>     this stage? If we're the parent of another commit, then it will have
>     trouble referring to us by a uint32_t. Will the actual writing phase
>     barf, or will we generate an invalid graph file?

Thanks,
Taylor
Taylor Blau March 22, 2020, 3:47 p.m. UTC | #16
On Sun, Mar 22, 2020 at 02:04:34AM -0400, Jeff King wrote:
> On Sun, Mar 22, 2020 at 01:49:16AM -0400, Jeff King wrote:
>
> > [1] I'm actually not quite sure about correctness here. It should be
> >     fine to generate a graph file without any given commit; readers will
> >     just have to load that commit the old-fashioned way. But at this
> >     phase of "commit-graph write", I think we'll already have done the
> >     close_reachable() check. What does it mean to throw away a commit at
> >     this stage? If we're the parent of another commit, then it will have
> >     trouble referring to us by a uint32_t. Will the actual writing phase
> >     barf, or will we generate an invalid graph file?
>
> It doesn't seem great. If I instrument Git like this to simulate an
> object temporarily "missing" (if it were really missing the whole repo
> would be corrupt; we're trying to see what would happen if a race causes
> us to momentarily not see it):

This is definitely a problem on either side of this patch, which is
demonstrated by the fact that you applied your changes without my patch
on top (and that my patch isn't changing anything substantial in this
area like removing the 'continue' statement).

Should we address this before moving on with my patch? I think that we
*could*, but I'd rather go forward with what we have for now, since it's
only improving the situation, and not introducing a new bug.

> diff --git a/commit-graph.c b/commit-graph.c
> index 3da52847e4..71419c2532 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1596,6 +1596,19 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
>  	}
>  }
>
> +static int pretend_commit_is_missing(const struct object_id *oid)
> +{
> +	static int initialized;
> +	static struct object_id missing;
> +	if (!initialized) {
> +		const char *x = getenv("PRETEND_COMMIT_IS_MISSING");
> +		if (x)
> +			get_oid_hex(x, &missing);
> +		initialized = 1;
> +	}
> +	return oideq(&missing, oid);
> +}
> +
>  static void merge_commit_graph(struct write_commit_graph_context *ctx,
>  			       struct commit_graph *g)
>  {
> @@ -1612,6 +1625,11 @@ static void merge_commit_graph(struct write_commit_graph_context *ctx,
>
>  		load_oid_from_graph(g, i + offset, &oid);
>
> +		if (pretend_commit_is_missing(&oid)) {
> +			warning("pretending %s is missing", oid_to_hex(&oid));
> +			continue;
> +		}
> +
>  		/* only add commits if they still exist in the repo */
>  		result = lookup_commit_reference_gently(ctx->r, &oid, 1);
>
>
> and then I make a fully-graphed repo like this:
>
>   git init repo
>   cd repo
>   for i in $(seq 10); do
>     git commit --allow-empty -m $i
>   done
>   git commit-graph write --input=reachable --split=no-merge
>
> if we pretend a parent is missing, I get a BUG():
>
>   $ git rev-parse HEAD |
>     PRETEND_COMMIT_IS_MISSING=$(git rev-parse HEAD^) \
>     git commit-graph write --stdin-commits --split=merge-all
>   warning: pretending 35e6e15c738cf2bfbe495957b2a941c2efe86dd9 is missing
>   BUG: commit-graph.c:879: missing parent 35e6e15c738cf2bfbe495957b2a941c2efe86dd9 for commit d4141fb57a9bbe26b247f23c790d63d078977833
>   Aborted
>
> So it seems like just skipping here (either with the new patch or
> without) isn't really a good strategy.
>
> -Peff
Thanks,
Taylor
Taylor Blau March 22, 2020, 4:45 p.m. UTC | #17
On Sun, Mar 22, 2020 at 01:36:35AM -0400, Jeff King wrote:
> On Sat, Mar 21, 2020 at 11:27:16AM -0600, Taylor Blau wrote:
>
> > > I'm not sure how to do that, though. Saying "--input=none" still puts
> > > all of those existing graphed objects into the list of oids to include.
> > > I think you'd need a case where you were legitimately only adding a few
> > > commits, but the merge rules say we need to create one big commit-graph
> > > file.
> >
> > You have to be careful, since we're taking the reachability closure over
> > any commits that you do give. So, one thing you could do to ensure that
> > you have an actually small graph is to take something from the output of
> > 'git rev-list --max-parents=0 HEAD'.
>
> I don't think you need to be that careful, though. In split-mode,
> close_reachable() will stop traversing when it finds a graphed commit.
> That's why using the tip of HEAD in my previous example worked.
>
> > To try and reproduce your results, I used '1da177e4c3', which is the
> > kernel's first commit in Git. If my interpretation of your setup is
> > faithful, it goes something like:
> >
> >   $ graphdir=.git/objects/info/commit-graphs
> >   $ git rev-parse 1da177e4c3 |
> >     git commit-graph write --split=no-merge --stdin-commits
> >   $ cp -r "$graphdir"{,.bak}
> >
> >   $ best-of-five -p "rm -rf $graphdir && cp -r $graphdir{.bak,}" \
> >     'git commit-graph write --split=merge-all'
>
> My case is the opposite, isn't it? Here it looks like you've made a very
> _tiny_ commit-graph file (with one commit), and then you're going to end
> up adding in all of the new objects. I don't think it would be improved
> much by this patch (which makes me very confused by the numbers you got
> below).
>
> I also don't think it's that interesting a real-world case.
>
> The more interesting one is where you do already have a big
> commit-graph, and want to add just a bit more to it. In the real world,
> that might look something like this:
>
>   # create a fake server repo
>   git clone --bare . dst.git
>
>   # imagine the server already has everything in a graph file
>   git -C dst.git commit-graph write --split=no-merge --reachable
>
>   # and now do a small push
>   git commit --allow-empty -m foo
>   git push dst.git
>
>   # the server might do an incremental immediately to cover the new
>   # objects; here we'll use --stdin-commits with the new data, but a
>   # real server might feed the new packfile. We'd probably just use
>   # regular --split here in practice, but let's imagine that we're
>   # starting to have a lot of graph files, and that triggers a desire to
>   # merge. We'll force that state with --split=merge-all.
>   git rev-list HEAD^..HEAD |
>   git -C dst.git commit-graph write --split=merge-all --stdin-commits
>
> Without your patch, that takes ~11s for me. With it, it takes ~2s.
>
> Another equally interesting case is if the per-push generation _doesn't_
> merge anything, and just creates a new, tiny graph file. And then later
> we want to do a real maintenance, merging them all down. I think that
> would be something like:
>
>   git -C dst.git commit-graph write --input=none --split=merge-all
>
> But that _isn't_ improved by your patch. For the simple reason that all
> of the commits will already have been parsed. The "--input=none" option
> isn't "no input"; it's actually "take all existing graphed objects as
> input" (i.e., it implies --append). So each of those objects will
> already have been examined in an earlier stage.
>
> > Where the last step is taking all commits listed in any pack, which is
> > cheap to iterate.
>
> I'm not sure it's all that cheap. It has to find the type of every
> object in every pack. And finding types involves walking delta chains.
> That's something like 7s on my machine for linux.git (compared to the 2s
> in which I can just merge down the existing graph files).
>
> > In the above setup, I get something like:
> >
> >   git version 2.26.0.rc2.221.ge327a58236
> >   Attempt 1: 16.933
> >   Attempt 2: 18.101
> >   Attempt 3: 17.603
> >   Attempt 4: 20.404
> >   Attempt 5: 18.871
> >
> >   real	0m16.933s
> >   user	0m16.440s
> >   sys	0m0.472s
> >
> > versus:
> >
> >   git version 2.26.0.rc2.222.g295e7905ee
> >   Attempt 1: 5.34
> >   Attempt 2: 4.623
> >   Attempt 3: 5.263
> >   Attempt 4: 5.268
> >   Attempt 5: 5.606
> >
> >   real	0m4.623s
> >   user	0m4.428s
> >   sys	0m0.176s
> >
> > which is a best-case savings of ~72.7%, and an average savings of
> > ~71.5%. That seems much better.
>
> I'm still puzzled by this. In the setup you showed, hardly anything is
> graphed. But the change is only in the graph-merge code.

I agree with your reasoning. I tried to recreate these timings this
morning, and was unable to do so. Here's my setup (where 'git' is built
on the tip of 'next'):

  $ graphdir=".git/objects/info/commit-graphs"
  $ pack="pack-b10a98360eacb1f5a5ff67cb8d8113d8b3d0b39f.idx"
  $ rm -rf "$graphdir"
  $ git rev-list --max-parents=0 HEAD | tail -1 |
    git commit-graph write --input=stdin-commits --split=no-merge
  $ cp -r "$graphdir" "$graphdir.bak"

  $ git version
  git version 2.26.0.rc2.221.ge327a58236

  # repeat the below twice to ensure a warm cache
  $ rm -rf "$graphdir" && cp -r "$graphdir.bak" "$graphdir" &&
    echo "$pack" |
    sudo perf record -F 997 ~ttaylorr/bin/git commit-graph write \
      --split=merge-all --input=stdin-packs
  $ mv perf.data{,.base}

  $ git version
  git version 2.26.0.rc2.222.g1931b73955

  # do the above again, and mv perf.data{,.patch}

  $ sudo perf diff perf.data.base perf.data.patch
  # Event 'cpu-clock'
  #
  # Baseline    Delta  Shared Object       Symbol
  # ........  .......  ..................  ..........................................
  #
      18.42%   +0.34%  libc-2.24.so        [.] __memcmp_sse4_1
      15.50%   -0.45%  libz.so.1.2.8       [.] 0x00000000000083c9
       8.42%   -0.01%  libz.so.1.2.8       [.] inflate
       6.33%   +0.30%  libc-2.24.so        [.] __memmove_avx_unaligned_erms
       4.67%   -0.36%  git                 [.] lookup_object
       3.62%   -0.28%  git                 [.] unpack_object_header_buffer
       2.64%   +0.20%  git                 [.] commit_to_sha1
       2.41%   -0.03%  git                 [.] get_delta_base
       2.05%   -0.17%  git                 [.] sha1_compression_states
       2.00%   +0.06%  git                 [.] use_pack
       1.66%   +0.14%  git                 [.] nth_packed_object_offset
       1.63%   +0.11%  git                 [.] write_graph_chunk_extra_edges
       1.44%   +0.21%  libz.so.1.2.8       [.] adler32
       1.42%   -0.03%  git                 [.] sort_revindex
       1.42%   -0.06%  git                 [.] parse_commit_date
       1.41%   -0.15%  git                 [.] oidhash
       1.37%   +0.07%  git                 [.] commit_list_count
       1.11%   -0.02%  git                 [.] in_window
       1.08%   -0.05%  git                 [.] packed_to_object_type
       0.90%   -0.18%  git                 [.] ubc_check
       0.82%   -0.23%  git                 [.] hex2chr
       0.75%   -0.02%  git                 [.] hashcmp
       0.71%   +0.17%  libc-2.24.so        [.] msort_with_tmp.part.0
       0.70%   +0.25%  git                 [.] sha1_pos
       0.70%   +0.12%  git                 [.] repo_parse_commit_internal
       0.66%   -0.06%  git                 [.] bsearch_hash
       0.65%   -0.11%  git                 [.] compute_generation_numbers
       0.64%   +0.03%  git                 [.] unpack_object_header
       0.60%   -0.06%  git                 [.] find_entry_ptr
       0.59%   +0.09%  libc-2.24.so        [.] _int_malloc
       0.55%   -0.01%  git                 [.] hexval
       0.51%   +0.07%  git                 [.] entry_equals
       0.44%   +0.12%  git                 [.] get_sha1_hex
       0.41%   -0.13%  git                 [.] packed_object_info

...and many other functions whose deltas can be attributed to noise, all
comfortably less than 1%.

So, I'm not sure what was going on with my original measurements, other
than that I took them around midnight, so the chance for error was
probably high. I'll go with your measurements, since they are both
closer to what we'd see in the real world, and actually representative
of the change here.

I'll wait for some of the discussion in the sub-thread about
'OBJECT_INFO_SKIP_FETCH' to settle, and then send v2 in the next couple
of days or so.

> -Peff

Thanks,
Taylor
Jeff King March 22, 2020, 6:45 p.m. UTC | #18
On Sun, Mar 22, 2020 at 12:04:24PM +0100, SZEDER Gábor wrote:

> > > Where the last step is taking all commits listed in any pack, which is
> > > cheap to iterate.
> > 
> > I'm not sure it's all that cheap. It has to find the type of every
> > object in every pack. And finding types involves walking delta chains.
> 
> Indeed, that's just about the most expensive way to find all commits,
> in my experience it can be well over an order of magnitude slower than
> '--reachable'.

They don't necessarily produce the same set of commits, of course, but
for the purposes of "commit-graph write" I think they can generally be
considered interchangeable.

The speed depends on a couple of factors. One is just the makeup of the
repository: tree size, how many objects are stored as deltas, etc. But a
big one is whether commit-graphs are already enabled. :)

On linux.git without commit-graphs, both of these take about 7.5s for
me:

  git cat-file --batch-all-objects --unordered --batch-check='%(objecttype) %(objectname)'

  git rev-list --all

But with graphs, the latter takes about 700ms.
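
(For completeness: the "with graphs" number assumes the graph has
already been generated, e.g. with something like

  git commit-graph write --reachable

whose one-time cost isn't included in either timing.)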

I was coincidentally thinking about this problem ("show me all commits")
for something unrelated recently, and I do think that cat-file
invocation could be sped up in some cases. Here are a few ideas:

1. If the pack has a .bitmap file, it contains 4 "type" bitmaps that you
   can use to quickly determine the type of a given object. But:

   - the current code for loading bitmap files is slower than we'd
     need; it indexes all of the commit bitmaps, but we wouldn't use
     them. So we'd want something that just loads the type bitmaps and
     stops.

   - you have to know the bit position of the object you're interested
     in, which implies computing the revidx. We could provide a "bit
     position" cache in the bitmap file (basically one uint32_t for each
     object giving the packfile order position, but stored in pack .idx
     order).

   Neither is too hard a problem, but it's clearly not just a small
   patch to consult the bitmap.

   It could be patched directly into oid_object_info_extended(), which
   could do individual type lookups this way. But it would be faster
   still if callers could say "I'm only interested in commits", in which
   case we could just quickly walk the commit-type bitmap in one pass
   (to get the commits in just that packfile, and then go through the
   other objects the old-fashioned way).

2. The commit-graph file can quickly tell us whether an object is a
   commit. But unfortunately it can give us a definite "yes", but not a
   definite "no" (because it might be a commit that just isn't (yet) in
   the graph file). So we'd have to find the real type of any
   non-commits, and I think that's where we spend the bulk of our time
   anyway.

3. To find the type of a delta object, we have to walk back to a base
   object with a concrete type. When we access the actual object
   contents, we cache intermediate bases to avoid digging down the same
   chain over and over. But I don't think there's any such cache for the
   type lookup. So if you have a 50-deep delta chain and want the type
   of each object, you'll walk down 49 links, then 48 links, and so on.
   It probably wouldn't be that hard to have a small type cache (or just
   allow the existing cache to have entries for "I know the type, but
   don't have the contents").

I suspect (1) would give the biggest speedup, but is more work and
requires bitmaps. I think (3) is likely to give a moderate win and
probably isn't that big a patch.
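
To make (2) a bit more concrete, the positive-only check could look
something like the untested sketch below. The helper name is made up;
lookup_commit(), parse_commit_in_graph() and oid_object_info() are the
existing APIs, but note that blindly calling lookup_commit() on
arbitrary oids would allocate a bogus "struct commit" for every
non-commit, so a real patch would need to be smarter about that:

/*
 * Untested sketch: use the commit-graph as a positive-only type
 * oracle, and only fall back to the pack (which may have to walk
 * delta chains) when the graph cannot answer.
 */
static enum object_type cheap_object_type(struct repository *r,
					  const struct object_id *oid)
{
	struct commit *c = lookup_commit(r, oid);

	if (c && parse_commit_in_graph(r, c))
		return OBJ_COMMIT; /* definite "yes" from the graph */

	return oid_object_info(r, oid, NULL); /* slow path for everything else */
}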

-Peff
Jeff King March 22, 2020, 7:18 p.m. UTC | #19
On Sun, Mar 22, 2020 at 02:45:19PM -0400, Jeff King wrote:

> 3. To find the type of a delta object, we have to walk back to a base
>    object with a concrete type. When we access the actual object
>    contents, we cache intermediate bases to avoid digging down the same
>    chain over and over. But I don't think there's any such cache for the
>    type lookup. So if you have a 50-deep delta chain and want the type
>    of each object, you'll walk down 49 links, then 48 links, and so on.
>    It probably wouldn't be that hard to have a small type cache (or just
>    allow the existing cache to have entries for "I know the type, but
>    don't have the contents").
> 
> I suspect (1) would give the biggest speedup, but is more work and
> requires bitmaps. I think (3) is likely to give a moderate win and
> probably isn't that big a patch.

Sadly, it doesn't seem to help. Here's a hacky implementation (I tested
it only on my limited cat-file case; I wouldn't be at all surprised if
it causes problems for users of the delta cache that want the actual
object contents).

After some rudimentary profiling (which I _should_ have done in the
first place before opening my mouth), it looks like finding the types
isn't actually the most expensive thing anyway: it's just dealing with
the sheer number of objects in the pack.

That implies that teaching packed_object_info() to use bitmaps to find
the individual types wouldn't help much either (but being able to ask
"just give me the commits" and iterating over the commit-type bitmap
probably _would_).

Anyway, here's the patch for the curious.

---
diff --git a/packfile.c b/packfile.c
index f4e752996d..482733689a 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1273,6 +1273,13 @@ static int retry_bad_packed_offset(struct repository *r,
 
 #define POI_STACK_PREALLOC 64
 
+/* some reordering of static functions could avoid these */
+static enum object_type get_delta_base_cache_type(struct packed_git *p,
+						  off_t curpos);
+static void add_delta_base_cache(struct packed_git *p, off_t base_offset,
+				 void *base, unsigned long base_size,
+				 enum object_type type);
+
 static enum object_type packed_to_object_type(struct repository *r,
 					      struct packed_git *p,
 					      off_t obj_offset,
@@ -1287,6 +1294,14 @@ static enum object_type packed_to_object_type(struct repository *r,
 	while (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
 		off_t base_offset;
 		unsigned long size;
+		enum object_type cached_type;
+
+		cached_type = get_delta_base_cache_type(p, obj_offset);
+		if (cached_type > 0) {
+			type = cached_type;
+			goto out;
+		}
+
 		/* Push the object we're going to leave behind */
 		if (poi_stack_nr >= poi_stack_alloc && poi_stack == small_poi_stack) {
 			poi_stack_alloc = alloc_nr(poi_stack_nr);
@@ -1326,6 +1341,11 @@ static enum object_type packed_to_object_type(struct repository *r,
 	}
 
 out:
+	if (type != OBJ_BAD) {
+		size_t i;
+		for (i = 0; i < poi_stack_nr; i++)
+			add_delta_base_cache(p, poi_stack[i], NULL, 0, type);
+	}
 	if (poi_stack != small_poi_stack)
 		free(poi_stack);
 	return type;
@@ -1385,6 +1405,13 @@ get_delta_base_cache_entry(struct packed_git *p, off_t base_offset)
 	return e ? container_of(e, struct delta_base_cache_entry, ent) : NULL;
 }
 
+static enum object_type get_delta_base_cache_type(struct packed_git *p, off_t curpos)
+{
+	struct delta_base_cache_entry *ent =
+		get_delta_base_cache_entry(p, curpos);
+	return ent ? ent->type : OBJ_BAD;
+}
+
 static int delta_base_cache_key_eq(const struct delta_base_cache_key *a,
 				   const struct delta_base_cache_key *b)
 {
@@ -1433,7 +1460,7 @@ static void *cache_or_unpack_entry(struct repository *r, struct packed_git *p,
 	struct delta_base_cache_entry *ent;
 
 	ent = get_delta_base_cache_entry(p, base_offset);
-	if (!ent)
+	if (!ent || !ent->data)
 		return unpack_entry(r, p, base_offset, type, base_size);
 
 	if (type)
@@ -1677,7 +1704,7 @@ void *unpack_entry(struct repository *r, struct packed_git *p, off_t obj_offset,
 		struct delta_base_cache_entry *ent;
 
 		ent = get_delta_base_cache_entry(p, curpos);
-		if (ent) {
+		if (ent && ent->data) {
 			type = ent->type;
 			data = ent->data;
 			size = ent->size;
Taylor Blau March 23, 2020, 8:15 p.m. UTC | #20
On Sun, Mar 22, 2020 at 12:04:24PM +0100, SZEDER Gábor wrote:
> On Sun, Mar 22, 2020 at 01:36:35AM -0400, Jeff King wrote:
> > The "--input=none" option
> > isn't "no input"; it's actually "take all existing graphed objects as
> > input" (i.e., it implies --append). So each of those objects will
> > already have been examined in an earlier stage.
>
> Yeah, we discussed it earlier that '--input=none' doesn't properly
> describe what it tries to accomplish, and suggested '--input=exists'
> (or was it 'existing'?) instead.

Good idea. I don't remember discussing '--input=exist{s,ing}',
specifically, but I do think that '--input=none' is not as clear as it
could be.

I sent [1], which I think may be a step in the right direction. I'll
look forward to your thoughts.

> > > Where the last step is taking all commits listed in any pack, which is
> > > cheap to iterate.
> >
> > I'm not sure it's all that cheap. It has to find the type of every
> > object in every pack. And finding types involves walking delta chains.
>
> Indeed, that's just about the most expensive way to find all commits,
> in my experience it can be well over an order of magnitude slower than
> '--reachable'.

Thanks,
Taylor

[1]: https://lore.kernel.org/git/e0f42a2f3c0162a5d43bb2bce0f69264b59f92e9.1584994172.git.me@ttaylorr.com/
Jeff King March 24, 2020, 6:06 a.m. UTC | #21
On Sun, Mar 22, 2020 at 10:45:31AM -0600, Taylor Blau wrote:

> So, I'm not sure what was going on with my original measurements, other
> than that I took them around midnight, so the chance for error was
> probably high. I'll go with your measurements, since they are both
> closer to what we'd see in the real world, and actually representative
> of the change here.

I wondered if perhaps you had stray commit-graph files sitting around. I
don't think we'd look at them if they aren't mentioned in the
commit-graphs index, though. And anyway, feeding all of the packed
objects would mean those would all be parsed in the earlier stage, so
even if we _did_ have a bunch of stuff to merge, it would have been fast
even with the old code.

So...I dunno.

One other thing that's bitten me in the past is accidentally compiling
with the wrong optimization settings. I have my config.mak set up to
compile with -O0 by default (because it's faster and because I'm
frequently debugging anyway), and then I do a -O2 build for my daily
driver. But when performance testing, I sometimes accidentally compare
-O0 and -O2 builds to each other. That's all very specific to my
config.mak setup, but I know you've picked up some of it, so it's
possible you inherited that subtlety. :)

-Peff
Jeff King March 24, 2020, 6:11 a.m. UTC | #22
On Sun, Mar 22, 2020 at 09:47:49AM -0600, Taylor Blau wrote:

> > > [1] I'm actually not quite sure about correctness here. It should be
> > >     fine to generate a graph file without any given commit; readers will
> > >     just have to load that commit the old-fashioned way. But at this
> > >     phase of "commit-graph write", I think we'll already have done the
> > >     close_reachable() check. What does it mean to throw away a commit at
> > >     this stage? If we're the parent of another commit, then it will have
> > >     trouble referring to us by a uint32_t. Will the actual writing phase
> > >     barf, or will we generate an invalid graph file?
> >
> > It doesn't seem great. If I instrument Git like this to simulate an
> > object temporarily "missing" (if it were really missing the whole repo
> > would be corrupt; we're trying to see what would happen if a race causes
> > us to momentarily not see it):
> 
> This is definitely a problem on either side of this patch, which is
> demonstrated by the fact that you applied your changes without my patch
> on top (and that my patch isn't changing anything substantial in this
> area like removing the 'continue' statement).
> 
> Should we address this before moving on with my patch? I think that we
> *could*, but I'd rather go forward with what we have for now, since it's
> only improving the situation, and not introducing a new bug.

I do agree it's a problem before your patch. But I think your patch may
make it a lot more common, if only because it means we'd _actually_ be
dropping entries for objects that went away, instead of accidentally
keeping them due to re-using the graph result. So it probably is worth
trying to deal with it now, or at least thinking hard about it.

The trouble is that I'm not sure what _should_ happen. Aborting the
whole commit-graph generation seems overboard (and would be annoying for
cases where whole swaths of history became unreachable and went away;
presumably we'd be dropping _all_ of those objects, and the write phase
would be just fine). I have a feeling the correct solution is to do this
merging pass earlier, before we check close_reachable(). Or if not a
true merging pass, at least a pass to check which existing entries are
still valid. But I don't know how painful that reordering would be.

-Peff
Jeff King March 24, 2020, 6:14 a.m. UTC | #23
On Sun, Mar 22, 2020 at 09:44:39AM -0600, Taylor Blau wrote:

> > Using QUICK means we won't waste time double-checking in the second
> > case. But it means we won't catch the first case, and we may generate a
> > new graph file that omits the object. They're both optimizations, and I
> > don't think we're impacting correctness[1], but it's not clear to me
> > whether one is a win over the other. We don't generally expect objects
> > we have to go away often.
> >
> > Skipping fetching seems a little more straight-forward to me. If we had
> > it in a graph file, presumably we had the actual object before, too. And
> > either we're in the first case above (we really do have it and just need
> > to double-check) in which case not saying QUICK would be enough. Or we
> > intentionally got rid of it. In which case downloading it just to
> > generate a cache is quite silly.
> 
> I was going to write that I'm not entirely sure of this, but I tried to
> talk myself through it below, and I think that the right flag is *only*
> OBJECT_INFO_SKIP_FETCH_OBJECT.

Re-reading what I wrote, I think I didn't say it very well. But yes,
that's what I think we ought to do, too: only use SKIP_FETCH_OBJECT.
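
On top of your patch below, I'd expect that to boil down to something
like this untested sketch (reusing the existing
repo_has_object_file_with_flags() helper; the exact shape of v2 is up
to you):

	/*
	 * Untested: skip fetching from promisor remotes, but don't pass
	 * QUICK, so a seemingly-missing object is still re-checked (after
	 * re-scanning the packs) before we drop it from the merged graph.
	 */
	if (repo_has_object_file_with_flags(ctx->r, &oid,
					    OBJECT_INFO_SKIP_FETCH_OBJECT)) {
		result = lookup_commit(ctx->r, &oid);
		if (repo_parse_commit(ctx->r, result))
			result = NULL;
	}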

-Peff
Taylor Blau March 24, 2020, 11:08 p.m. UTC | #24
On Tue, Mar 24, 2020 at 02:11:59AM -0400, Jeff King wrote:
> On Sun, Mar 22, 2020 at 09:47:49AM -0600, Taylor Blau wrote:
>
> > > > [1] I'm actually not quite sure about correctness here. It should be
> > > >     fine to generate a graph file without any given commit; readers will
> > > >     just have to load that commit the old-fashioned way. But at this
> > > >     phase of "commit-graph write", I think we'll already have done the
> > > >     close_reachable() check. What does it mean to throw away a commit at
> > > >     this stage? If we're the parent of another commit, then it will have
> > > >     trouble referring to us by a uint32_t. Will the actual writing phase
> > > >     barf, or will we generate an invalid graph file?
> > >
> > > It doesn't seem great. If I instrument Git like this to simulate an
> > > object temporarily "missing" (if it were really missing the whole repo
> > > would be corrupt; we're trying to see what would happen if a race causes
> > > us to momentarily not see it):
> >
> > This is definitely a problem on either side of this patch, which is
> > demonstrated by the fact that you applied your changes without my patch
> > on top (and that my patch isn't changing anything substantial in this
> > area like removing the 'continue' statement).
> >
> > Should we address this before moving on with my patch? I think that we
> > *could*, but I'd rather go forward with what we have for now, since it's
> > only improving the situation, and not introducing a new bug.
>
> I do agree it's a problem before your patch. But I think your patch may
> make it a lot more common, if only because it means we'd _actually_ be
> dropping entries for objects that went away, instead of accidentally
> keeping them due to re-using the graph result. So it probably is worth
> trying to deal with it now, or at least thinking hard about it.
>
> The trouble is that I'm not sure what _should_ happen. Aborting the
> whole commit-graph generation seems overboard (and would be annoying for
> cases where whole swaths of history became unreachable and went away;
> presumably we'd be dropping _all_ of those objects, and the write phase
> would be just fine). I have a feeling the correct solution is to do this
> merging pass earlier, before we check close_reachable(). Or if not a
> true merging pass, at least a pass to check which existing entries are
> still valid. But I don't know how painful that reordering would be.

Maybe, but I'm not sure that moving 'close_reachable' up would
necessarily solve all of our problems. Or, at least, that it wouldn't
create a set of new problems :).

Let's say that you have (1) some connected component of your history
that is written to a commit-graph, where (2) part of that cluster has
been dropped (e.g., due to corruption, a rogue 'git gc', etc), and (3)
you're writing a new graph with '--input=graphed'.

What is 'close_reachable' supposed to do? If some of the now-missing
commits are in the reachability closure of the commits that still exist
in the repository, then we're back to where we started. We *could* have
'close_reachable' take all missing commits and drop their ancestors and
descendants, but this creates quite the headache for 'close_reachable',
which now needs to keep track of such things.

I'm hopeful that this isn't so common in practice, but I'm also not
entirely sure one way or another. I can plan to deploy this patch to
GitHub's servers for a ~month and see if we experience it. If it turns
out to be common, I'll assume that others have this problem, too, in
which case we can go back and think more about it.

But, I'm hopeful that this problem is rare enough that it isn't worth
worrying about for anybody, GitHub or not.

> -Peff

Thanks,
Taylor
Jeff King March 27, 2020, 8:42 a.m. UTC | #25
On Tue, Mar 24, 2020 at 05:08:26PM -0600, Taylor Blau wrote:

> > The trouble is that I'm not sure what _should_ happen. Aborting the
> > whole commit-graph generation seems overboard (and would be annoying for
> > cases where whole swaths of history became unreachable and went away;
> > presumably we'd be dropping _all_ of those objects, and the write phase
> > would be just fine). I have a feeling the correct solution is to do this
> > merging pass earlier, before we check close_reachable(). Or if not a
> > true merging pass, at least a pass to check which existing entries are
> > still valid. But I don't know how painful that reordering would be.
> 
> Maybe, but I'm not sure that moving 'close_reachable' up would
> necessarily solve all of our problems. Or, at least, that it wouldn't
> create a set of new problems :).
> 
> Let's say that you have (1) some connected component of your history
> that is written to a commit-graph, where (2) part of that cluster has
> been dropped (e.g., due to corruption, a rogue 'git gc', etc), and (3)
> you're writing a new graph with '--input=graphed'.
> 
> What is 'close_reachable' supposed to do? If some of the now-missing
> commits are in the reachability closure of the commits that still exist
> in the repository, then we're back to where we started. We *could* have
> 'close_reachable' take all missing commits and drop their ancestors and
> descendants, but this creates quite the headache for 'close_reachable',
> which now needs to keep track of such things.

Yeah, I think my suggestion was dumb. I was thinking that
close_reachable() might somehow fix things up, but all it ever does with
the current code is _add_ commits to the graph. And we would want the
opposite: removing ones which can no longer be present, because their
ancestors are gone. That would be possible, but would be quite a pain
(you'd have to walk an inversion of the parent DAG).
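
For concreteness, here's a toy, self-contained sketch of that inverted
walk (nothing below is real commit-graph code; the fixed commit table
and missing[] flags are made up purely for illustration): build child
lists from the parent edges, then flood-fill a "drop" bit from each
missing commit to all of its descendants.

#include <stdio.h>

#define NR 6

/* toy history: parents[i] are the (up to two) parents of commit i, -1 = none */
static const int parents[NR][2] = {
	{ -1, -1 }, /* 0: root */
	{  0, -1 }, /* 1 */
	{  1, -1 }, /* 2 */
	{  1, -1 }, /* 3 */
	{  2,  3 }, /* 4: merge */
	{  4, -1 }, /* 5 */
};
static const int missing[NR] = { 0, 0, 1, 0, 0, 0 }; /* pretend commit 2 is gone */

int main(void)
{
	int child_head[NR], child_next[2 * NR], child_to[2 * NR];
	int drop[NR] = { 0 };
	int queue[NR], q_head = 0, q_tail = 0;
	int nr_edges = 0;
	int i, j, e;

	for (i = 0; i < NR; i++)
		child_head[i] = -1;

	/* invert the parent edges: for each parent, a linked list of its children */
	for (i = 0; i < NR; i++) {
		for (j = 0; j < 2; j++) {
			int p = parents[i][j];
			if (p < 0)
				continue;
			child_to[nr_edges] = i;
			child_next[nr_edges] = child_head[p];
			child_head[p] = nr_edges++;
		}
	}

	/* seed the walk with every missing commit */
	for (i = 0; i < NR; i++) {
		if (missing[i]) {
			drop[i] = 1;
			queue[q_tail++] = i;
		}
	}

	/* BFS over the inverted edges: every descendant of a dropped commit is dropped */
	while (q_head < q_tail) {
		int cur = queue[q_head++];
		for (e = child_head[cur]; e >= 0; e = child_next[e]) {
			if (!drop[child_to[e]]) {
				drop[child_to[e]] = 1;
				queue[q_tail++] = child_to[e];
			}
		}
	}

	for (i = 0; i < NR; i++)
		printf("commit %d: %s\n", i, drop[i] ? "drop" : "keep");
	return 0;
}

The annoying part in commit-graph.c proper is that the graph file only
stores parent edges, so those child lists (or some equivalent reverse
index) would have to be built up front for every merge.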

> I'm hopeful that this isn't so common in practice, but I'm also not
> entirely sure one way or another. I can plan to deploy this patch to
> GitHub's servers for a ~month and see if we experience it. If it turns
> out to be common, I'll assume that others have this problem, too, in
> which case we can go back and think more about it.

It _shouldn't_ be too common, because it implies that we have commit X
but not its ancestor X^. And the pruning and repacking code tries to
avoid throwing away X^ in such a case. It could still happen due to a
race, though.

So yeah, I think we should mostly ignore it. It's not a new problem with
your series (and the only reason it is more common with your series is
that the old code was actually incorrectly handling some cases). It
might be worth turning the "missing parent" BUG() into a die(), since we
know it's reachable. But that's largely academic, since in either case
commit-graph is going to barf and fail to write.

As a follow-up to this part for anyone else following the discussion:

> I can plan to deploy this patch to GitHub's servers for a ~month and
> see if we experience it.

...I don't think we'll actually generate good data here. We're probably
going to end up doing our "big maintenance" commit-graph roll-ups by
just feeding --reachable as input, and dropping all of the existing
graphs.

-Peff
Taylor Blau March 27, 2020, 3:03 p.m. UTC | #26
On Fri, Mar 27, 2020 at 04:42:48AM -0400, Jeff King wrote:
> On Tue, Mar 24, 2020 at 05:08:26PM -0600, Taylor Blau wrote:
> > I can plan to deploy this patch to GitHub's servers for a ~month and
> > see if we experience it.
>
> ...I don't think we'll actually generate good data here. We're probably
> going to end up doing our "big maintenance" commit-graph roll-ups by
> just feeding --reachable as input, and dropping all of the existing
> graphs.

For what it's worth (and I'm not sure that it's worth much), this has
only been true for the last day or so. Before, we were running:

  $ git commit-graph write --split=merge-all --input=none

which *did* exercise this code quite frequently (and thus would have
been helped by this patch). But now, we are running something like:

  $ git commit-graph write --split=replace --input=reachable

...where '--split=replace' means "write a split commit-graph, but drop
all existing layers before writing it". This case is obviously not
helped by this patch, although I think the patch is worthwhile for
callers who do the first thing.

I'll post patches about that shortly after they've been a little more
well-vetted (we're only generating split commit-graphs on a percentage
of repositories for now).

> -Peff

Thanks,
Taylor
diff mbox series

Patch

diff --git a/commit-graph.c b/commit-graph.c
index f013a84e29..c7cfadc786 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1599,14 +1599,18 @@  static void merge_commit_graph(struct write_commit_graph_context *ctx,
 
 	for (i = 0; i < g->num_commits; i++) {
 		struct object_id oid;
-		struct commit *result;
+		struct commit *result = NULL;
 
 		display_progress(ctx->progress, i + 1);
 
 		load_oid_from_graph(g, i + offset, &oid);
 
 		/* only add commits if they still exist in the repo */
-		result = lookup_commit_reference_gently(ctx->r, &oid, 1);
+		if (repo_has_object_file(ctx->r, &oid)) {
+			result = lookup_commit(ctx->r, &oid);
+			if (repo_parse_commit(ctx->r, result))
+				result = NULL;
+		}
 
 		if (result) {
 			ctx->commits.list[ctx->commits.nr] = result;