Message ID | 4c79a9ea909ebff8c0987bcf95692da92e79bda4.1584762087.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | New, archived |
Series | commit-graph: avoid unnecessary tag dereference when merging |
On Fri, Mar 20, 2020 at 09:44:23PM -0600, Taylor Blau wrote: > In my testing environment, this improves the time to "merge" a split > commit-graph containing all reachable commits in the kernel by > re-writing the same commit-graph (effectively measuring the time it > takes to check that all of those commits still exist) from: > > Attempt 1: 9.614 > Attempt 2: 10.984 > Attempt 3: 10.39 > Attempt 4: 9.14 > Attempt 5: 9.439 > > real 0m9.140s > user 0m8.207s > sys 0m0.602s > > to: > > Attempt 1: 9.12 > Attempt 2: 8.904 > Attempt 3: 9.361 > Attempt 4: 9.288 > Attempt 5: 9.677 > > real 0m8.904s > user 0m8.208s > sys 0m0.596s > > yielding a modest ~2.6% improvement in the best timings from each run, > and ~7.4% improvement on average. That still seems really slow to me. If we were truly eliding the load of most of the commit objects, I'd expect an order of magnitude or so improvement. For example, with a fully constructed commit graph in linux.git, I get: $ time git -c core.commitGraph=1 rev-list HEAD | wc -l 886922 real 0m1.088s user 0m0.659s sys 0m1.161s $ time git -c core.commitGraph=0 rev-list HEAD | wc -l 886922 real 0m7.185s user 0m6.729s sys 0m1.882s Obviously not the same operation, but that should give us a rough idea that commit graph lookups are 6-7 times cheaper than loading the actual objects. I don't remember the details of the case that originally led us towards this patch. Can you share more of the setup you used to generate the numbers above (which repo, but more importantly the commands to create the initial state and then time the test). The patch otherwise still makes sense to me, but I suspect there are other similar optimizations nearby that we'll need to do in tandem. -Peff
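[Editor's note: for readers who want a comparable baseline, one way to reach the "fully constructed commit graph" state Peff is timing against is sketched below. This is illustrative only; the exact numbers depend on hardware and repository state.]

$ cd linux                                # assumes a local clone of linux.git
$ git commit-graph write --reachable      # build a full, non-split graph first
$ time git -c core.commitGraph=1 rev-list HEAD | wc -l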
Taylor Blau <me@ttaylorr.com> writes: > When performing a 'git commit-graph write' with '--split', the > commit-graph machinery calls 'merge_commit_graph()' after deciding on a > split strategy to optionally clean up any existing commit-graph > layers that were made obsolete by the split strategy [1]. > > At this time, 'merge_commit_graph()' checks each commit that it writes > into the merged graph to make sure that it still exists in the object > store. > > To do this, it uses 'lookup_commit_reference_gently()', which accepts > either a commit object, or a tag that refers to a commit. However, since > all 'oid' arguments passed to this function are from within the > commit-graphs being merged, we never pass a commit reference, and so any > time we spend in 'deref_tag()' is wasted. Ahh, so my question on the cover letter was utterly off the mark. It is that feeding a commit to deref_tag() is unneeded. It is quite surprising to hear that deref_tag() call is _so_ expensive that it wastes 7% of the total cycles. The patch itself looks good. Thanks. > for (i = 0; i < g->num_commits; i++) { > struct object_id oid; > - struct commit *result; > + struct commit *result = NULL; > > display_progress(ctx->progress, i + 1); > > load_oid_from_graph(g, i + offset, &oid); > > /* only add commits if they still exist in the repo */ > - result = lookup_commit_reference_gently(ctx->r, &oid, 1); > + if (repo_has_object_file(ctx->r, &oid)) { > + result = lookup_commit(ctx->r, &oid); > + if (repo_parse_commit(ctx->r, result)) > + result = NULL; > + } > > if (result) { > ctx->commits.list[ctx->commits.nr] = result;
On Sat, Mar 21, 2020 at 01:00:25AM -0400, Jeff King wrote: > On Fri, Mar 20, 2020 at 09:44:23PM -0600, Taylor Blau wrote: > > > In my testing environment, this improves the time to "merge" a split > > commit-graph containing all reachable commits in the kernel by > > re-writing the same commit-graph (effectively measuring the time it > > takes to check that all of those commits still exist) from: > > > > Attempt 1: 9.614 > > Attempt 2: 10.984 > > Attempt 3: 10.39 > > Attempt 4: 9.14 > > Attempt 5: 9.439 > > > > real 0m9.140s > > user 0m8.207s > > sys 0m0.602s > > > > to: > > > > Attempt 1: 9.12 > > Attempt 2: 8.904 > > Attempt 3: 9.361 > > Attempt 4: 9.288 > > Attempt 5: 9.677 > > > > real 0m8.904s > > user 0m8.208s > > sys 0m0.596s > > > > yielding a modest ~2.6% improvement in the best timings from each run, > > and ~7.4% improvement on average. > > That still seems really slow to me. If we were truly eliding the load of > most of the commit objects, I'd expect an order of magnitude or so > improvement. For example, with a fully constructed commit graph in > linux.git, I get: > > $ time git -c core.commitGraph=1 rev-list HEAD | wc -l > 886922 > > real 0m1.088s > user 0m0.659s > sys 0m1.161s > > $ time git -c core.commitGraph=0 rev-list HEAD | wc -l > 886922 > > real 0m7.185s > user 0m6.729s > sys 0m1.882s > > Obviously not the same operation, but that should give us a rough idea > that commit graph lookups are 6-7 times cheaper than loading the actual > objects. I don't remember the details of the case that originally led us > towards this patch. Can you share more of the setup you used to generate > the numbers above (which repo, but more importantly the commands to > create the initial state and then time the test). Sure. I'm running best-of-five on the time it takes to re-generate and merge a commit-graph based on in-pack commits. The script is (in linux.git): $ best-of-five \ -p 'rm -rf .git/objects/info/commit-graph{,s/}; git commit-graph write --split=no-merge 2>/dev/null' \ git commit-graph write --split=merge-all So we're measuring the time it takes to crawl all the packs, decide on the splitting strategy, and then compare all commits in the new merged graph to make sure that they don't already exist in the object store. But, here's where things get... Bizarre. I was trying to come up with a way to do fewer things and spend proportionally more time in 'merge_commit_graphs', so I did something like: - Generate a pack containing a single, empty commit. - Generate a split commit-graph containing commits in the single large pack containing all of history. - Generate a commit-graph of the small pack, and merge it with the large pack. That script is: $ git --version $ git commit -m "empty" --allow-empty $ pack="pack-$(git rev-parse HEAD | git pack-objects .git/objects/pack/pack).idx" $ best-of-five \ -p "rm -rf .git/objects/info/commit-graphs && cp -r .git/objects/info/commit-graphs{.bak,}" \ sh -c "echo $pack | git commit-graph write --split=merge-all" but things get... slower with this patch? Here are the results before and after: Attempt 1: 8.444 Attempt 2: 8.453 Attempt 3: 8.391 Attempt 4: 8.376 Attempt 5: 7.859 real 0m7.859s user 0m7.309s sys 0m0.511s vs: Attempt 1: 8.69 Attempt 2: 8.735 Attempt 3: 8.619 Attempt 4: 8.747 Attempt 5: 8.695 real 0m8.619s user 0m8.030s sys 0m0.538s Without more profiling, I'm puzzled by why this patch seems to make things *worse* under this scenario. So, I'd appreciate your thoughts: does this set-up seem reasonable? 
Is it introducing some latency that isn't being accounted for in the original setup? > The patch otherwise still makes sense to me, but I suspect there are > other similar optimizations nearby that we'll need to do in tandem. > > -Peff Thanks, Taylor
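[Editor's note: `best-of-five` above is not a Git command but a small personal timing wrapper. A minimal, hypothetical stand-in with the same interface might look like the script below; it assumes GNU time for the -f option and does not reproduce the real/user/sys summary shown in the thread.]

#!/bin/sh
# Hypothetical stand-in for the "best-of-five" wrapper used in this thread;
# the real script is a personal helper, not part of Git.
# Usage: best-of-five [-p <prep-command>] <command> [args...]
prep=:
if test "$1" = -p
then
	prep=$2
	shift 2
fi
for i in 1 2 3 4 5
do
	sh -c "$prep" >/dev/null 2>&1                       # reset state before each run
	/usr/bin/time -f "Attempt $i: %e" "$@" >/dev/null   # elapsed seconds go to stderr
done

It would be invoked as in the scripts above, e.g. best-of-five -p '<reset command>' git commit-graph write --split=merge-all.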
On Sat, Mar 21, 2020 at 12:11:41AM -0600, Taylor Blau wrote: > On Sat, Mar 21, 2020 at 01:00:25AM -0400, Jeff King wrote: > > On Fri, Mar 20, 2020 at 09:44:23PM -0600, Taylor Blau wrote: > > > > > In my testing environment, this improves the time to "merge" a split > > > commit-graph containing all reachable commits in the kernel by > > > re-writing the same commit-graph (effectively measuring the time it > > > takes to check that all of those commits still exist) from: > > > > > > Attempt 1: 9.614 > > > Attempt 2: 10.984 > > > Attempt 3: 10.39 > > > Attempt 4: 9.14 > > > Attempt 5: 9.439 > > > > > > real 0m9.140s > > > user 0m8.207s > > > sys 0m0.602s > > > > > > to: > > > > > > Attempt 1: 9.12 > > > Attempt 2: 8.904 > > > Attempt 3: 9.361 > > > Attempt 4: 9.288 > > > Attempt 5: 9.677 > > > > > > real 0m8.904s > > > user 0m8.208s > > > sys 0m0.596s > > > > > > yielding a modest ~2.6% improvement in the best timings from each run, > > > and ~7.4% improvement on average. > > > > That still seems really slow to me. If we were truly eliding the load of > > most of the commit objects, I'd expect an order of magnitude or so > > improvement. For example, with a fully constructed commit graph in > > linux.git, I get: > > > > $ time git -c core.commitGraph=1 rev-list HEAD | wc -l > > 886922 > > > > real 0m1.088s > > user 0m0.659s > > sys 0m1.161s > > > > $ time git -c core.commitGraph=0 rev-list HEAD | wc -l > > 886922 > > > > real 0m7.185s > > user 0m6.729s > > sys 0m1.882s > > > > Obviously not the same operation, but that should give us a rough idea > > that commit graph lookups are 6-7 times cheaper than loading the actual > > objects. I don't remember the details of the case that originally led us > > towards this patch. Can you share more of the setup you used to generate > > the numbers above (which repo, but more importantly the commands to > > create the initial state and then time the test). > > Sure. I'm running best-of-five on the time it takes to re-generate and > merge a commit-graph based on in-pack commits. > > The script is (in linux.git): > > $ best-of-five \ > -p 'rm -rf .git/objects/info/commit-graph{,s/}; git commit-graph write --split=no-merge 2>/dev/null' \ > git commit-graph write --split=merge-all > > So we're measuring the time it takes to crawl all the packs, decide on > the splitting strategy, and then compare all commits in the new merged > graph to make sure that they don't already exist in the object store. > > But, here's where things get... Bizarre. I was trying to come up with a > way to do fewer things and spend proportionally more time in > 'merge_commit_graphs', so I did something like: > > - Generate a pack containing a single, empty commit. > - Generate a split commit-graph containing commits in the single large > pack containing all of history. > - Generate a commit-graph of the small pack, and merge it with the > large pack. > > That script is: > > $ git --version > $ git commit -m "empty" --allow-empty > $ pack="pack-$(git rev-parse HEAD | git pack-objects .git/objects/pack/pack).idx" > $ best-of-five \ > -p "rm -rf .git/objects/info/commit-graphs && cp -r .git/objects/info/commit-graphs{.bak,}" \ > sh -c "echo $pack | git commit-graph write --split=merge-all" > > but things get... slower with this patch? 
Here are the results before > and after: > > Attempt 1: 8.444 > Attempt 2: 8.453 > Attempt 3: 8.391 > Attempt 4: 8.376 > Attempt 5: 7.859 > > real 0m7.859s > user 0m7.309s > sys 0m0.511s > > vs: > > Attempt 1: 8.69 > Attempt 2: 8.735 > Attempt 3: 8.619 > Attempt 4: 8.747 > Attempt 5: 8.695 > > real 0m8.619s > user 0m8.030s > sys 0m0.538s > > Without more profiling, I'm puzzled by why this patch seems to make > things *worse* under this scenario. Hmm. I tried roughly the same setup as the latter of the above two, but this time under perf. On a warm cache, it does look like this patch is an improvement: $ sudo perf diff perf.data perf.data.patch | grep merge_commit 0.11% -0.02% git [.] merge_commit_graph (perf.data is from running 'next', perf.data.patch is from running 'next' with this patch applied on top). So... a slight improvement, but nothing like what I had reported in the original patch. I am puzzled. > So, I'd appreciate your thoughts: does this set-up seem reasonable? Is > it introducing some latency that isn't being accounted for in the > original setup? > > > The patch otherwise still makes sense to me, but I suspect there are > > other similar optimizations nearby that we'll need to do in tandem. > > > > -Peff > > Thanks, > Taylor Thanks, Taylor
On Sat, Mar 21, 2020 at 12:11:41AM -0600, Taylor Blau wrote: > Sure. I'm running best-of-five on the time it takes to re-generate and > merge a commit-graph based on in-pack commits. > > The script is (in linux.git): > > $ best-of-five \ > -p 'rm -rf .git/objects/info/commit-graph{,s/}; git commit-graph write --split=no-merge 2>/dev/null' \ > git commit-graph write --split=merge-all If I build Git without your patch and run the "--split=merge-all" command under a debugger, and then break on parse_object() I find that all of the commits are already parsed. This happens in close_reachable(). So we weren't actually reading all of the commits even under the old code. We were just going into deref_tag(), seeing that the object is already parsed, and then quickly returning when we see that we already have an OBJ_COMMIT. I suspect most of your timing differences are mostly noise. Perhaps a more interesting case is when you're _not_ adding all of the existing packed commits as input. There we'd feed only a few objects to close_reachable(), and it would stop traversing as soon as it hits a parent that's already in a graph file. So most of the existing objects would remain unparsed. I'm not sure how to do that, though. Saying "--input=none" still puts all of those existing graphed objects into the list of oids to include. I think you'd need a case where you were legitimately only adding a few commits, but the merge rules say we need to create one big commit-graph file. I guess --input=stdin-commits is a good way to simulate that. Try this (assuming there's already a split-graph file with all of the commits in it): git rev-parse HEAD >input time git commit-graph write --input=stdin-commits --split=merge-all <input Without your patch on linux.git I get: real 0m11.713s user 0m11.349s sys 0m0.341s but with it I get: real 0m2.305s user 0m2.177s sys 0m0.100s A more realistic case would probably be feeding a new small pack to --input=stdin-packs. > But, here's where things get... Bizarre. I was trying to come up with a > way to do fewer things and spend proportionally more time in > 'merge_commit_graphs', so I did something like: > > - Generate a pack containing a single, empty commit. > - Generate a split commit-graph containing commits in the single large > pack containing all of history. > - Generate a commit-graph of the small pack, and merge it with the > large pack. > > That script is: > > $ git --version > $ git commit -m "empty" --allow-empty > $ pack="pack-$(git rev-parse HEAD | git pack-objects .git/objects/pack/pack).idx" > $ best-of-five \ > -p "rm -rf .git/objects/info/commit-graphs && cp -r .git/objects/info/commit-graphs{.bak,}" \ > sh -c "echo $pack | git commit-graph write --split=merge-all" I think you'd need --stdin-packs in the actual timed command? At any rate, I think there is a demonstrable speedup there. But moreover, I'm pretty sure this existing code is not doing what it expects: /* only add commits if they still exist in the repo */ result = lookup_commit_reference_gently(ctx->r, &oid, 1); That won't look at the object database at all if the commit is already marked as parsed. And that parsing might have come from the commit graph itself, as our earlier attempts showed. So switching to a real has_object_file() call is an important _correctness_ fix, even leaving aside the speed improvements. -Peff
On Sat, Mar 21, 2020 at 03:03:33AM -0400, Jeff King wrote: > On Sat, Mar 21, 2020 at 12:11:41AM -0600, Taylor Blau wrote: > > > Sure. I'm running best-of-five on the time it takes to re-generate and > > merge a commit-graph based on in-pack commits. > > > > The script is (in linux.git): > > > > $ best-of-five \ > > -p 'rm -rf .git/objects/info/commit-graph{,s/}; git commit-graph write --split=no-merge 2>/dev/null' \ > > git commit-graph write --split=merge-all > > If I build Git without your patch and run the "--split=merge-all" > command under a debugger, and then break on parse_object() I find that > all of the commits are already parsed. This happens in > close_reachable(). > > So we weren't actually reading all of the commits even under the old > code. We were just going into deref_tag(), seeing that the object is > already parsed, and then quickly returning when we see that we already > have an OBJ_COMMIT. I suspect most of your timing differences are mostly > noise. Aha. Thanks for reasoning through this. That makes sense, and I think that you're probably right. > Perhaps a more interesting case is when you're _not_ adding all of the > existing packed commits as input. There we'd feed only a few objects to > close_reachable(), and it would stop traversing as soon as it hits a > parent that's already in a graph file. So most of the existing objects > would remain unparsed. > > I'm not sure how to do that, though. Saying "--input=none" still puts > all of those existing graphed objects into the list of oids to include. > I think you'd need a case where you were legitimately only adding a few > commits, but the merge rules say we need to create one big commit-graph > file. You have to be careful, since we're taking the reachability closure over any commits that you do give. So, one thing you could do to ensure that you have an actually small graph is to take something from the output of 'git rev-list --max-parents=0 HEAD'. To try and reproduce your results, I used '1da177e4c3', which is the kernel's first commit in Git. If my interpretation of your setup is faithful, it goes something like: $ graphdir=.git/objects/info/commit-graphs $ git rev-parse 1da177e4c3 | git commit-graph write --split=no-merge --stdin-commits $ cp -r "$graphdir{,.bak}" $ best-of-five -p "rm -rf $graphdir && cp -r $graphdir{.bak,}" \ 'git commit-graph write --split=merge-all' Where the last step is taking all commits listed in any pack, which is cheap to iterate. > I guess --input=stdin-commits is a good way to simulate that. Try this > (assuming there's already a split-graph file with all of the commits in > it): > > git rev-parse HEAD >input > time git commit-graph write --input=stdin-commits --split=merge-all <input > > Without your patch on linux.git I get: > > real 0m11.713s > user 0m11.349s > sys 0m0.341s > > but with it I get: > > real 0m2.305s > user 0m2.177s > sys 0m0.100s In the above setup, I get something like: git version 2.26.0.rc2.221.ge327a58236 Attempt 1: 16.933 Attempt 2: 18.101 Attempt 3: 17.603 Attempt 4: 20.404 Attempt 5: 18.871 real 0m16.933s user 0m16.440s sys 0m0.472s versus: git version 2.26.0.rc2.222.g295e7905ee Attempt 1: 5.34 Attempt 2: 4.623 Attempt 3: 5.263 Attempt 4: 5.268 Attempt 5: 5.606 real 0m4.623s user 0m4.428s sys 0m0.176s which is a best-case savings of ~72.7%, and a savings of ~71.5%. That seems much better. If you think that this is a realistic setup, I'll amend the original patch text to include these updated numbers, along with the reproduction steps. 
> A more realistic case would probably be feeding a new small pack to > --input=stdin-packs. > > > But, here's where things get... Bizarre. I was trying to come up with a > > way to do fewer things and spend proportionally more time in > > 'merge_commit_graphs', so I did something like: > > > > - Generate a pack containing a single, empty commit. > > - Generate a split commit-graph containing commits in the single large > > pack containing all of history. > > - Generate a commit-graph of the small pack, and merge it with the > > large pack. > > > > That script is: > > > > $ git --version > > $ git commit -m "empty" --allow-empty > > $ pack="pack-$(git rev-parse HEAD | git pack-objects .git/objects/pack/pack).idx" > > $ best-of-five \ > > -p "rm -rf .git/objects/info/commit-graphs && cp -r .git/objects/info/commit-graphs{.bak,}" \ > > sh -c "echo $pack | git commit-graph write --split=merge-all" > > I think you'd need --stdin-packs in the actual timed command? > > At any rate, I think there is a demonstrable speedup there. But > moreover, I'm pretty sure this existing code is not doing what it > expects: > > /* only add commits if they still exist in the repo */ > result = lookup_commit_reference_gently(ctx->r, &oid, 1); > > That won't look at the object database at all if the commit is already > marked as parsed. And that parsing might have come from the commit graph > itself, as our earlier attempts showed. So switching to a real > has_object_file() call is an important _correctness_ fix, even leaving > aside the speed improvements. That's a great point. I hadn't thought of this myself, but I agree with your reasoning here, too, and think that this is an important correctness fix even if it yielded no performance benefit (thankfully, we get both :)). > -Peff Thanks, Taylor
Jeff King <peff@peff.net> writes: > So we weren't actually reading all of the commits even under the old > code. We were just going into deref_tag(), seeing that the object is > already parsed, and then quickly returning when we see that we already > have an OBJ_COMMIT. I suspect most of your timing differences are mostly > noise. Noise that contributes largely to ~7% fluctuation? That sounds big. > I guess --input=stdin-commits is a good way to simulate that. Try this > (assuming there's already a split-graph file with all of the commits in > it): > > git rev-parse HEAD >input > time git commit-graph write --input=stdin-commits --split=merge-all <input > ... > A more realistic case would probably be feeding a new small pack to > --input=stdin-packs. Makes sense. > At any rate, I think there is a demonstrable speedup there. But > moreover, I'm pretty sure this existing code is not doing what it > expects: > > /* only add commits if they still exist in the repo */ > result = lookup_commit_reference_gently(ctx->r, &oid, 1); > > That won't look at the object database at all if the commit is already > marked as parsed. And that parsing might have come from the commit graph > itself, as our earlier attempts showed. So switching to a real > has_object_file() call is an important _correctness_ fix, even leaving > aside the speed improvements. True. has_object_file() asks the question without even lookup-replace flag set, so it would try to see if the object is truly there (modulo the "pretended" ones are declared to exist via the cached-object mechanism). Do we need to worry about INFO_QUICK and SKIP_FETCH_OBJECT in this codepath, by the way?
On 3/21/2020 2:50 PM, Junio C Hamano wrote: > Do we need to worry about INFO_QUICK and SKIP_FETCH_OBJECT in this > codepath, by the way? I was coming back to this thread to bring up these exact flags for consideration. The good news is that in a partial clone with any amount of filtering we will still have all reachable commits, which are necessary for the commit-graph to make sense. The only ones that would fail has_object_file() are ones removed by GC, but they may still exist on the remote. So without SKIP_FETCH_OBJECT, we would generate a network call even if the server has GC'd to remove the commits. This gets particularly bad when the server returns all reachable objects from that commit! Thanks, -Stolee
On Sat, Mar 21, 2020 at 08:03:01PM -0400, Derrick Stolee wrote: > On 3/21/2020 2:50 PM, Junio C Hamano wrote: > > Do we need to worry about INFO_QUICK and SKIP_FETCH_OBJECT in this > > codepath, by the way? > > I was coming back to this thread to bring up these exact flags for > consideration. The good news is that in a partial clone with any > amount of filtering we will still have all reachable commits, which > are necessary for the commit-graph to make sense. The only ones that > would fail has_object_file() are ones removed by GC, but they may > still exist on the remote. So without SKIP_FETCH_OBJECT, we would > generate a network call even if the server has GC'd to remove the > commits. This gets particularly bad when the server returns all > reachable objects from that commit! That makes sense. Do you think something like this should be applied? diff --git a/commit-graph.c b/commit-graph.c index c7cfadc786..0097318798 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -1594,6 +1594,7 @@ static void merge_commit_graph(struct write_commit_graph_context *ctx, { uint32_t i; uint32_t offset = g->num_commits_in_base; + int flags = OBJECT_INFO_QUICK | OBJECT_INFO_SKIP_FETCH_OBJECT; ALLOC_GROW(ctx->commits.list, ctx->commits.nr + g->num_commits, ctx->commits.alloc); @@ -1606,7 +1607,7 @@ static void merge_commit_graph(struct write_commit_graph_context *ctx, load_oid_from_graph(g, i + offset, &oid); /* only add commits if they still exist in the repo */ - if (repo_has_object_file(ctx->r, &oid)) { + if (repo_has_object_file_with_flags(ctx->r, &oid, flags)) { result = lookup_commit(ctx->r, &oid); if (repo_parse_commit(ctx->r, result)) result = NULL; > Thanks, > -Stolee Thanks, Taylor
On 3/21/2020 8:20 PM, Taylor Blau wrote: > On Sat, Mar 21, 2020 at 08:03:01PM -0400, Derrick Stolee wrote: >> On 3/21/2020 2:50 PM, Junio C Hamano wrote: >>> Do we need to worry about INFO_QUICK and SKIP_FETCH_OBJECT in this >>> codepath, by the way? >> >> I was coming back to this thread to bring up these exact flags for >> consideration. The good news is that in a partial clone with any >> amount of filtering we will still have all reachable commits, which >> are necessary for the commit-graph to make sense. The only ones that >> would fail has_object_file() are ones removed by GC, but they may >> still exist on the remote. So without SKIP_FETCH_OBJECT, we would >> generate a network call even if the server has GC'd to remove the >> commits. This gets particularly bad when the server returns all >> reachable objects from that commit! > > That makes sense. Do you think something like this should be applied? Yes, I think this is the appropriate update. Thanks! > diff --git a/commit-graph.c b/commit-graph.c > index c7cfadc786..0097318798 100644 > --- a/commit-graph.c > +++ b/commit-graph.c > @@ -1594,6 +1594,7 @@ static void merge_commit_graph(struct write_commit_graph_context *ctx, > { > uint32_t i; > uint32_t offset = g->num_commits_in_base; > + int flags = OBJECT_INFO_QUICK | OBJECT_INFO_SKIP_FETCH_OBJECT; > > ALLOC_GROW(ctx->commits.list, ctx->commits.nr + g->num_commits, ctx->commits.alloc); > > @@ -1606,7 +1607,7 @@ static void merge_commit_graph(struct write_commit_graph_context *ctx, > load_oid_from_graph(g, i + offset, &oid); > > /* only add commits if they still exist in the repo */ > - if (repo_has_object_file(ctx->r, &oid)) { > + if (repo_has_object_file_with_flags(ctx->r, &oid, flags)) { > result = lookup_commit(ctx->r, &oid); > if (repo_parse_commit(ctx->r, result)) > result = NULL;
On Sat, Mar 21, 2020 at 11:27:16AM -0600, Taylor Blau wrote: > > I'm not sure how to do that, though. Saying "--input=none" still puts > > all of those existing graphed objects into the list of oids to include. > > I think you'd need a case where you were legitimately only adding a few > > commits, but the merge rules say we need to create one big commit-graph > > file. > > You have to be careful, since we're taking the reachability closure over > any commits that you do give. So, one thing you could do to ensure that > you have an actually small graph is to take something from the output of > 'git rev-list --max-parents=0 HEAD'. I don't think you need to be that careful, though. In split-mode, close_reachable() will stop traversing when it finds a graphed commit. That's why using the tip of HEAD in my previous example worked. > To try and reproduce your results, I used '1da177e4c3', which is the > kernel's first commit in Git. If my interpretation of your setup is > faithful, it goes something like: > > $ graphdir=.git/objects/info/commit-graphs > $ git rev-parse 1da177e4c3 | > git commit-graph write --split=no-merge --stdin-commits > $ cp -r "$graphdir{,.bak}" > > $ best-of-five -p "rm -rf $graphdir && cp -r $graphdir{.bak,}" \ > 'git commit-graph write --split=merge-all' My case is the opposite, isn't it? Here it looks like you've made a very _tiny_ commit-graph file (with one commit), and then you're going to end up adding in all of the new objects. I don't think it would be improved much by this patch (which makes me very confused by the numbers you got below). I also don't think it's that interesting a real-world case. The more interesting one is where you do already have a big commit-graph, and want to add just a bit more to it. In the real world, that might look something like this: # create a fake server repo git clone --bare . dst.git # imagine the server already has everything in a graph file git -C dst.git commit-graph write --split=no-merge --reachable # and now do a small push git commit --allow-empty -m foo git push dst.git # the server might do an incremental immediately to cover the new # objects; here we'll use --stdin-commits with the new data, but a # real server might feed the new packfile. We'd probably just use # regular --split here in practice, but let's imagine that we're # starting to have a lot of graph files, and that triggers a desire to # merge. We'll force that state with --split=merge-all. git rev-list HEAD^..HEAD | git -C dst.git commit-graph write --split=merge-all --stdin-commits Without your patch, that takes ~11s for me. With it, it takes ~2s. Another equally interesting case is if the per-push generation _doesn't_ merge anything, and just creates a new, tiny graph file. And then later we want to do a real maintenance, merging them all done. I think that would be something like: git -C dst.git commit-graph write --input=none --split=merge-all But that _isn't_ improved by your patch. For the simple reason that all of the commits will already have been parsed. The "--input=none" option isn't "no input"; it's actually "take all existing graphed objects as input" (i.e., it implies --append). So each of those objects will already have been examined in an earlier stage. > Where the last step is taking all commits listed in any pack, which is > cheap to iterate. I'm not sure it's all that cheap. It has to find the type of every object in every pack. And finding types involves walking delta chains. 
That's something like 7s on my machine for linux.git (compared to the 2s in which I can just merge down the existing graph files). > In the above setup, I get something like: > > git version 2.26.0.rc2.221.ge327a58236 > Attempt 1: 16.933 > Attempt 2: 18.101 > Attempt 3: 17.603 > Attempt 4: 20.404 > Attempt 5: 18.871 > > real 0m16.933s > user 0m16.440s > sys 0m0.472s > > versus: > > git version 2.26.0.rc2.222.g295e7905ee > Attempt 1: 5.34 > Attempt 2: 4.623 > Attempt 3: 5.263 > Attempt 4: 5.268 > Attempt 5: 5.606 > > real 0m4.623s > user 0m4.428s > sys 0m0.176s > > which is a best-case savings of ~72.7%, and a savings of ~71.5%. That > seems much better. I'm still puzzled by this. In the setup you showed, hardly anything is graphed. But the change is only in the graph-merge code. -Peff
On Sat, Mar 21, 2020 at 08:23:13PM -0400, Derrick Stolee wrote: > On 3/21/2020 8:20 PM, Taylor Blau wrote: > > On Sat, Mar 21, 2020 at 08:03:01PM -0400, Derrick Stolee wrote: > >> On 3/21/2020 2:50 PM, Junio C Hamano wrote: > >>> Do we need to worry about INFO_QUICK and SKIP_FETCH_OBJECT in this > >>> codepath, by the way? > >> > >> I was coming back to this thread to bring up these exact flags for > >> consideration. The good news is that in a partial clone with any > >> amount of filtering we will still have all reachable commits, which > >> are necessary for the commit-graph to make sense. The only ones that > >> would fail has_object_file() are ones removed by GC, but they may > >> still exist on the remote. So without SKIP_FETCH_OBJECT, we would > >> generate a network call even if the server has GC'd to remove the > >> commits. This gets particularly bad when the server returns all > >> reachable objects from that commit! > > > > That makes sense. Do you think something like this should be applied? > > + int flags = OBJECT_INFO_QUICK | OBJECT_INFO_SKIP_FETCH_OBJECT; > [...] > > Yes, I think this is the appropriate update. Thanks! I'm not so sure. In a non-partial clone, when would expect QUICK to matter? If we find the object is missing, then either: - we are racing with repack, and we would benefit from double-checking that the object really is gone - we used to have it (since it was mentioned in the graph file) but it went away Using QUICK means we won't waste time double-checking in the second case. But it means we won't catch the first case, and we may generate a new graph file that omits the object. They're both optimizations, and I don't think we're impacting correctness[1], but it's not clear to me whether one is a win over the other. We don't generally expect objects we have to go away often. Skipping fetching seems a little more straight-forward to me. If we had it in a graph file, presumably we had the actual object before, too. And either we're in the first case above (we really do have it and just need to double-check) in which case not saying QUICK would be enough. Or we intentionally got rid of it. In which case downloading it just to generate a cache is quite silly. So it seems like SKIP_FETCH_OBJECT but _not_ QUICK might be reasonable here. -Peff [1] I'm actually not quite sure about correctness here. It should be fine to generate a graph file without any given commit; readers will just have to load that commit the old-fashioned way. But at this phase of "commit-graph write", I think we'll already have done the close_reachable() check. What does it mean to throw away a commit at this stage? If we're the parent of another commit, then it will have trouble referring to us by a uint32_t. Will the actual writing phase barf, or will we generate an invalid graph file?
On Sun, Mar 22, 2020 at 01:49:16AM -0400, Jeff King wrote: > [1] I'm actually not quite sure about correctness here. It should be > fine to generate a graph file without any given commit; readers will > just have to load that commit the old-fashioned way. But at this > phase of "commit-graph write", I think we'll already have done the > close_reachable() check. What does it mean to throw away a commit at > this stage? If we're the parent of another commit, then it will have > trouble referring to us by a uint32_t. Will the actual writing phase > barf, or will we generate an invalid graph file? It doesn't seem great. If I instrument Git like this to simulate an object temporarily "missing" (if it were really missing the whole repo would be corrupt; we're trying to see what would happen if a race causes us to momentarily not see it): diff --git a/commit-graph.c b/commit-graph.c index 3da52847e4..71419c2532 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -1596,6 +1596,19 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx) } } +static int pretend_commit_is_missing(const struct object_id *oid) +{ + static int initialized; + static struct object_id missing; + if (!initialized) { + const char *x = getenv("PRETEND_COMMIT_IS_MISSING"); + if (x) + get_oid_hex(x, &missing); + initialized = 1; + } + return oideq(&missing, oid); +} + static void merge_commit_graph(struct write_commit_graph_context *ctx, struct commit_graph *g) { @@ -1612,6 +1625,11 @@ static void merge_commit_graph(struct write_commit_graph_context *ctx, load_oid_from_graph(g, i + offset, &oid); + if (pretend_commit_is_missing(&oid)) { + warning("pretending %s is missing", oid_to_hex(&oid)); + continue; + } + /* only add commits if they still exist in the repo */ result = lookup_commit_reference_gently(ctx->r, &oid, 1); and then I make a fully-graphed repo like this: git init repo cd repo for i in $(seq 10); do git commit --allow-empty -m $i done git commit-graph write --input=reachable --split=no-merge if we pretend a parent is missing, I get a BUG(): $ git rev-parse HEAD | PRETEND_COMMIT_IS_MISSING=$(git rev-parse HEAD^) \ git commit-graph write --stdin-commits --split=merge-all warning: pretending 35e6e15c738cf2bfbe495957b2a941c2efe86dd9 is missing BUG: commit-graph.c:879: missing parent 35e6e15c738cf2bfbe495957b2a941c2efe86dd9 for commit d4141fb57a9bbe26b247f23c790d63d078977833 Aborted So it seems like just skipping here (either with the new patch or without) isn't really a good strategy. -Peff
On Sun, Mar 22, 2020 at 01:36:35AM -0400, Jeff King wrote: > The "--input=none" option > isn't "no input"; it's actually "take all existing graphed objects as > input" (i.e., it implies --append). So each of those objects will > already have been examined in an earlier stage. Yeah, we discussed it earlier that '--input=none' doesn't properly describe what it tries to accomplish, and suggested '--input=exists' (or was it 'existing'?) instead. > > Where the last step is taking all commits listed in any pack, which is > > cheap to iterate. > > I'm not sure it's all that cheap. It has to find the type of every > object in every pack. And finding types involves walking delta chains. Indeed, that's just about the most expensive way to find all commits, in my experience it can be well over a magnitude slower than '--reachable'.
On Sun, Mar 22, 2020 at 01:49:16AM -0400, Jeff King wrote: > On Sat, Mar 21, 2020 at 08:23:13PM -0400, Derrick Stolee wrote: > > > On 3/21/2020 8:20 PM, Taylor Blau wrote: > > > On Sat, Mar 21, 2020 at 08:03:01PM -0400, Derrick Stolee wrote: > > >> On 3/21/2020 2:50 PM, Junio C Hamano wrote: > > >>> Do we need to worry about INFO_QUICK and SKIP_FETCH_OBJECT in this > > >>> codepath, by the way? > > >> > > >> I was coming back to this thread to bring up these exact flags for > > >> consideration. The good news is that in a partial clone with any > > >> amount of filtering we will still have all reachable commits, which > > >> are necessary for the commit-graph to make sense. The only ones that > > >> would fail has_object_file() are ones removed by GC, but they may > > >> still exist on the remote. So without SKIP_FETCH_OBJECT, we would > > >> generate a network call even if the server has GC'd to remove the > > >> commits. This gets particularly bad when the server returns all > > >> reachable objects from that commit! > > > > > > That makes sense. Do you think something like this should be applied? > > > + int flags = OBJECT_INFO_QUICK | OBJECT_INFO_SKIP_FETCH_OBJECT; > > [...] > > > > Yes, I think this is the appropriate update. Thanks! > > I'm not so sure. > > In a non-partial clone, when would expect QUICK to matter? If we find > the object is missing, then either: > > - we are racing with repack, and we would benefit from double-checking > that the object really is gone > > - we used to have it (since it was mentioned in the graph file) but it > went away > > Using QUICK means we won't waste time double-checking in the second > case. But it means we won't catch the first case, and we may generate a > new graph file that omits the object. They're both optimizations, and I > don't think we're impacting correctness[1], but it's not clear to me > whether one is a win over the other. We don't generally expect objects > we have to go away often. > > Skipping fetching seems a little more straight-forward to me. If we had > it in a graph file, presumably we had the actual object before, too. And > either we're in the first case above (we really do have it and just need > to double-check) in which case not saying QUICK would be enough. Or we > intentionally got rid of it. In which case downloading it just to > generate a cache is quite silly. I was going to write that I'm not entirely sure of this, but I tried to talk myself through it below, and I think that the right flag is *only* OBJECT_INFO_SKIP_FETCH_OBJECT. First, I agree with your reasoning that we shouldn't have OBJECT_INFO_QUICK, that much makes sense to me. To consider OBJECT_INFO_SKIP_FETCH_OBJECT, let's say that we used to have some commit, and it got GC'd away. This shouldn't happen unless a commit is unreachable, so if we ever find ourselves in a situation where the parent of some other commit is missing, we know that that represents a true corruption, not the result of a normal 'git gc'. Now, if we do find ourselves in this case, we know that a 'git commit-graph write --input=none --split' *would* work because we would find that the unreachable commit (if it were in the graph) was missing, and 'OBJECT_INFO_SKIP_FETCH_OBJECT' would tell us not to go looking for it. Likewise, if the commit weren't already in the graph, this would work fine, too. But, most of the time we won't even get a chance to write a new commit-graph based on existing ones, since 'gc.writeCommitGraph' will take care of this for us in-process with 'git gc'. 
This uses '--reachable' without '--split', so we won't ever invoke 'merge_commit_graph{s,}'. > So it seems like SKIP_FETCH_OBJECT but _not_ QUICK might be reasonable > here. I agree with your original reasoning that OBJECT_INFO_QUICK is the wrong choice, but I think my reasoning above that OBJECT_INFO_SKIP_FETCH_OBJECT *is* an appropriate flag is sound. > -Peff > > [1] I'm actually not quite sure about correctness here. It should be > fine to generate a graph file without any given commit; readers will > just have to load that commit the old-fashioned way. But at this > phase of "commit-graph write", I think we'll already have done the > close_reachable() check. What does it mean to throw away a commit at > this stage? If we're the parent of another commit, then it will have > trouble referring to us by a uint32_t. Will the actual writing phase > barf, or will we generate an invalid graph file? Thanks, Taylor
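[Editor's note: a rough command-line view of the gc-driven path mentioned above, based only on the message's own description that it uses '--reachable' without '--split'; gc.writeCommitGraph is the relevant config knob.]

$ git -c gc.writeCommitGraph=true gc
# the commit-graph portion of that is roughly equivalent to:
$ git commit-graph write --reachable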
On Sun, Mar 22, 2020 at 02:04:34AM -0400, Jeff King wrote: > On Sun, Mar 22, 2020 at 01:49:16AM -0400, Jeff King wrote: > > > [1] I'm actually not quite sure about correctness here. It should be > > fine to generate a graph file without any given commit; readers will > > just have to load that commit the old-fashioned way. But at this > > phase of "commit-graph write", I think we'll already have done the > > close_reachable() check. What does it mean to throw away a commit at > > this stage? If we're the parent of another commit, then it will have > > trouble referring to us by a uint32_t. Will the actual writing phase > > barf, or will we generate an invalid graph file? > > It doesn't seem great. If I instrument Git like this to simulate an > object temporarily "missing" (if it were really missing the whole repo > would be corrupt; we're trying to see what would happen if a race causes > us to momentarily not see it): This is definitely a problem on either side of this patch, which is demonstrated by the fact that you applied your changes without my patch on top (and that my patch isn't changing anything substantial in this area like removing the 'continue' statement). Should we address this before moving on with my patch? I think that we *could*, but I'd rather go forward with what we have for now, since it's only improving the situation, and not introducing a new bug. > diff --git a/commit-graph.c b/commit-graph.c > index 3da52847e4..71419c2532 100644 > --- a/commit-graph.c > +++ b/commit-graph.c > @@ -1596,6 +1596,19 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx) > } > } > > +static int pretend_commit_is_missing(const struct object_id *oid) > +{ > + static int initialized; > + static struct object_id missing; > + if (!initialized) { > + const char *x = getenv("PRETEND_COMMIT_IS_MISSING"); > + if (x) > + get_oid_hex(x, &missing); > + initialized = 1; > + } > + return oideq(&missing, oid); > +} > + > static void merge_commit_graph(struct write_commit_graph_context *ctx, > struct commit_graph *g) > { > @@ -1612,6 +1625,11 @@ static void merge_commit_graph(struct write_commit_graph_context *ctx, > > load_oid_from_graph(g, i + offset, &oid); > > + if (pretend_commit_is_missing(&oid)) { > + warning("pretending %s is missing", oid_to_hex(&oid)); > + continue; > + } > + > /* only add commits if they still exist in the repo */ > result = lookup_commit_reference_gently(ctx->r, &oid, 1); > > > and then I make a fully-graphed repo like this: > > git init repo > cd repo > for i in $(seq 10); do > git commit --allow-empty -m $i > done > git commit-graph write --input=reachable --split=no-merge > > if we pretend a parent is missing, I get a BUG(): > > $ git rev-parse HEAD | > PRETEND_COMMIT_IS_MISSING=$(git rev-parse HEAD^) \ > git commit-graph write --stdin-commits --split=merge-all > warning: pretending 35e6e15c738cf2bfbe495957b2a941c2efe86dd9 is missing > BUG: commit-graph.c:879: missing parent 35e6e15c738cf2bfbe495957b2a941c2efe86dd9 for commit d4141fb57a9bbe26b247f23c790d63d078977833 > Aborted > > So it seems like just skipping here (either with the new patch or > without) isn't really a good strategy. > > -Peff Thanks, Taylor
On Sun, Mar 22, 2020 at 01:36:35AM -0400, Jeff King wrote: > On Sat, Mar 21, 2020 at 11:27:16AM -0600, Taylor Blau wrote: > > > > I'm not sure how to do that, though. Saying "--input=none" still puts > > > all of those existing graphed objects into the list of oids to include. > > > I think you'd need a case where you were legitimately only adding a few > > > commits, but the merge rules say we need to create one big commit-graph > > > file. > > > > You have to be careful, since we're taking the reachability closure over > > any commits that you do give. So, one thing you could do to ensure that > > you have an actually small graph is to take something from the output of > > 'git rev-list --max-parents=0 HEAD'. > > I don't think you need to be that careful, though. In split-mode, > close_reachable() will stop traversing when it finds a graphed commit. > That's why using the tip of HEAD in my previous example worked. > > > To try and reproduce your results, I used '1da177e4c3', which is the > > kernel's first commit in Git. If my interpretation of your setup is > > faithful, it goes something like: > > > > $ graphdir=.git/objects/info/commit-graphs > > $ git rev-parse 1da177e4c3 | > > git commit-graph write --split=no-merge --stdin-commits > > $ cp -r "$graphdir{,.bak}" > > > > $ best-of-five -p "rm -rf $graphdir && cp -r $graphdir{.bak,}" \ > > 'git commit-graph write --split=merge-all' > > My case is the opposite, isn't it? Here it looks like you've made a very > _tiny_ commit-graph file (with one commit), and then you're going to end > up adding in all of the new objects. I don't think it would be improved > much by this patch (which makes me very confused by the numbers you got > below). > > I also don't think it's that interesting a real-world case. > > The more interesting one is where you do already have a big > commit-graph, and want to add just a bit more to it. In the real world, > that might look something like this: > > # create a fake server repo > git clone --bare . dst.git > > # imagine the server already has everything in a graph file > git -C dst.git commit-graph write --split=no-merge --reachable > > # and now do a small push > git commit --allow-empty -m foo > git push dst.git > > # the server might do an incremental immediately to cover the new > # objects; here we'll use --stdin-commits with the new data, but a > # real server might feed the new packfile. We'd probably just use > # regular --split here in practice, but let's imagine that we're > # starting to have a lot of graph files, and that triggers a desire to > # merge. We'll force that state with --split=merge-all. > git rev-list HEAD^..HEAD | > git -C dst.git commit-graph write --split=merge-all --stdin-commits > > Without your patch, that takes ~11s for me. With it, it takes ~2s. > > Another equally interesting case is if the per-push generation _doesn't_ > merge anything, and just creates a new, tiny graph file. And then later > we want to do a real maintenance, merging them all done. I think that > would be something like: > > git -C dst.git commit-graph write --input=none --split=merge-all > > But that _isn't_ improved by your patch. For the simple reason that all > of the commits will already have been parsed. The "--input=none" option > isn't "no input"; it's actually "take all existing graphed objects as > input" (i.e., it implies --append). So each of those objects will > already have been examined in an earlier stage. 
> > > Where the last step is taking all commits listed in any pack, which is > > cheap to iterate. > > I'm not sure it's all that cheap. It has to find the type of every > object in every pack. And finding types involves walking delta chains. > That's something like 7s on my machine for linux.git (compared to the 2s > in which I can just merge down the existing graph files). > > > In the above setup, I get something like: > > > > git version 2.26.0.rc2.221.ge327a58236 > > Attempt 1: 16.933 > > Attempt 2: 18.101 > > Attempt 3: 17.603 > > Attempt 4: 20.404 > > Attempt 5: 18.871 > > > > real 0m16.933s > > user 0m16.440s > > sys 0m0.472s > > > > versus: > > > > git version 2.26.0.rc2.222.g295e7905ee > > Attempt 1: 5.34 > > Attempt 2: 4.623 > > Attempt 3: 5.263 > > Attempt 4: 5.268 > > Attempt 5: 5.606 > > > > real 0m4.623s > > user 0m4.428s > > sys 0m0.176s > > > > which is a best-case savings of ~72.7%, and a savings of ~71.5%. That > > seems much better. > > I'm still puzzled by this. In the setup you showed, hardly anything is > graphed. But the change is only in the graph-merge code. I agree with your reasoning. I tried to recreate these timings this morning, and was unable to do so. Here's my setup (where 'git' is built on the tip of 'next'): $ graphdir=".git/objects/info/commit-graphs" $ pack="pack-b10a98360eacb1f5a5ff67cb8d8113d8b3d0b39f.idx" $ rm -rf "$graphdir" $ git rev-list --max-parents=0 HEAD | tail -1 | git commit-graph write --input=stdin-commits --split=no-merge $ cp -r "$graphdir" "$graphdir.bak" $ git version git version 2.26.0.rc2.221.ge327a58236 # repeat the below twice to ensure a warm cache $ rm -rf "$graphdir" && cp -r "$graphdir.bak" "$graphdir" && echo "$pack" | sudo perf record -F 997 ~ttaylorr/bin/git commit-graph write \ --split=merge-all --input=stdin-packs $ mv perf.data{,.base} $ git version git version 2.26.0.rc2.222.g1931b73955 # do the above again, and mv perf.data{,.patch} $ sudo perf diff perf.data.base perf.data.patch # Event 'cpu-clock' # # Baseline Delta Shared Object Symbol # ........ ....... .................. .......................................... # 18.42% +0.34% libc-2.24.so [.] __memcmp_sse4_1 15.50% -0.45% libz.so.1.2.8 [.] 0x00000000000083c9 8.42% -0.01% libz.so.1.2.8 [.] inflate 6.33% +0.30% libc-2.24.so [.] __memmove_avx_unaligned_erms 4.67% -0.36% git [.] lookup_object 3.62% -0.28% git [.] unpack_object_header_buffer 2.64% +0.20% git [.] commit_to_sha1 2.41% -0.03% git [.] get_delta_base 2.05% -0.17% git [.] sha1_compression_states 2.00% +0.06% git [.] use_pack 1.66% +0.14% git [.] nth_packed_object_offset 1.63% +0.11% git [.] write_graph_chunk_extra_edges 1.44% +0.21% libz.so.1.2.8 [.] adler32 1.42% -0.03% git [.] sort_revindex 1.42% -0.06% git [.] parse_commit_date 1.41% -0.15% git [.] oidhash 1.37% +0.07% git [.] commit_list_count 1.11% -0.02% git [.] in_window 1.08% -0.05% git [.] packed_to_object_type 0.90% -0.18% git [.] ubc_check 0.82% -0.23% git [.] hex2chr 0.75% -0.02% git [.] hashcmp 0.71% +0.17% libc-2.24.so [.] msort_with_tmp.part.0 0.70% +0.25% git [.] sha1_pos 0.70% +0.12% git [.] repo_parse_commit_internal 0.66% -0.06% git [.] bsearch_hash 0.65% -0.11% git [.] compute_generation_numbers 0.64% +0.03% git [.] unpack_object_header 0.60% -0.06% git [.] find_entry_ptr 0.59% +0.09% libc-2.24.so [.] _int_malloc 0.55% -0.01% git [.] hexval 0.51% +0.07% git [.] entry_equals 0.44% +0.12% git [.] get_sha1_hex 0.41% -0.13% git [.] 
packed_object_info ...and many other functions that have a delta that can be attributed to noise, all comfortably less than 1%. So, I'm not sure what was going on with my original measurements, other than that I took them around midnight, so the chance for error was probably high. I'll go with your measurements, since they are both closer to what we'd see in the real world, and actually representative of the change here. I'll wait for some of the discussion in the sub-thread about 'OBJECT_INFO_SKIP_FETCH' to settle, and then send v2 in the next couple of days or so. > -Peff Thanks, Taylor
On Sun, Mar 22, 2020 at 12:04:24PM +0100, SZEDER Gábor wrote: > > > Where the last step is taking all commits listed in any pack, which is > > > cheap to iterate. > > > > I'm not sure it's all that cheap. It has to find the type of every > > object in every pack. And finding types involves walking delta chains. > > Indeed, that's just about the most expensive way to find all commits, > in my experience it can be well over a magnitude slower than > '--reachable'. They don't necessarily produce the same set of commits, of course, but for the purposes of "commit-graph write" I think they can generally be considered interchangeable. The speed depends on a couple of factors. One is just the makeup of the repository: tree size, how many are deltas, etc. But a big one is whether commit-graphs are already enabled. :) On linux.git without commit-graphs, both of these take about 7.5s for me: git cat-file --batch-all-objects --unordered --batch-check='%(objecttype) %(objectname)' git rev-list --all But with graphs, the latter takes about 700ms. I was coincidentally thinking about this problem ("show me all commits") for something unrelated recently, and I do think that cat-file invocation could be sped up in some cases. Here are a few ideas: 1. If the pack has a .bitmap file, it contains 4 "type" bitmaps that you can use to quickly determine the type of a given object. But: - the current code for loading bitmaps files is slower than we'd need; it indexes all of the commit bitmaps, but we wouldn't use them. So we'd want something that just loads the type bitmaps and stops. - you have to know the bit position of the object you're interested in, which implies computing the revidx. We could provide a "bit position" cache in the bitmap file (basically one uint32_t for each object giving the packfile order position, but stored in pack .idx order). Neither is too hard a problem, but it's clearly not just a small patch to consult the bitmap. It could be patched directly into oid_object_info_extended(), which could do individual type lookups this way. But it would be faster still if callers could say "I'm only interested in commits", in which case we could just quickly walk the commit-type bitmap in one pass (to get the commits in just that packfile, and then go through the other objects the old-fashioned way). 2. The commit-graph file can quickly tell us whether an object is a commit. But unfortunately it can give us a definite "yes", but not a definite "no" (because it might be a commit that just isn't (yet) in the graph file). So we'd have to find the real type of any non-commits, and I think that's where we spend the bulk of our time anyway. 3. To find the type of a delta object, we have to walk back to a base object with a concrete type. When we access the actual object contents, we cache intermediate bases to avoid digging down the same chain over and over. But I don't think there's any such cache for the type lookup. So if you have a 50-deep delta chain and want the type of each object, you'll walk down 49 links, then 48 links, and so on. It probably wouldn't be that hard to have a small type cache (or just allow the existing cache to have entries for "I know the type, but don't have the contents"). I suspect (1) would give the biggest speedup, but is more work and requires bitmaps. I think (3) is likely to give a moderate win and probably isn't that big a patch. -Peff
On Sun, Mar 22, 2020 at 02:45:19PM -0400, Jeff King wrote:
> 3. To find the type of a delta object, we have to walk back to a base
>    object with a concrete type. When we access the actual object
>    contents, we cache intermediate bases to avoid digging down the same
>    chain over and over. But I don't think there's any such cache for the
>    type lookup. So if you have a 50-deep delta chain and want the type
>    of each object, you'll walk down 49 links, then 48 links, and so on.
>    It probably wouldn't be that hard to have a small type cache (or just
>    allow the existing cache to have entries for "I know the type, but
>    don't have the contents").
>
> I suspect (1) would give the biggest speedup, but is more work and
> requires bitmaps. I think (3) is likely to give a moderate win and
> probably isn't that big a patch.

Sadly, it doesn't seem to help. Here's a hacky implementation (I tested it
only on my limited cat-file case; I wouldn't be at all surprised if it
causes problems for users of the delta cache that want the actual object
contents).

After some rudimentary profiling (which I _should_ have done in the first
place before opening my mouth), it looks like finding the types isn't
actually the most expensive thing anyway: it's just dealing with the sheer
number of objects in the pack. That implies that teaching
packed_object_info() to use bitmaps to find the individual types wouldn't
help much either (but being able to ask "just give me the commits" and
iterating over the commit-type bitmap probably _would_).

Anyway, here's the patch for the curious.

---
diff --git a/packfile.c b/packfile.c
index f4e752996d..482733689a 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1273,6 +1273,13 @@ static int retry_bad_packed_offset(struct repository *r,
 
 #define POI_STACK_PREALLOC 64
 
+/* some reordering of static functions could avoid these */
+static enum object_type get_delta_base_cache_type(struct packed_git *p,
+						  off_t curpos);
+static void add_delta_base_cache(struct packed_git *p, off_t base_offset,
+				 void *base, unsigned long base_size,
+				 enum object_type type);
+
 static enum object_type packed_to_object_type(struct repository *r,
 					      struct packed_git *p,
 					      off_t obj_offset,
@@ -1287,6 +1294,14 @@ static enum object_type packed_to_object_type(struct repository *r,
 	while (type == OBJ_OFS_DELTA || type == OBJ_REF_DELTA) {
 		off_t base_offset;
 		unsigned long size;
+		enum object_type cached_type;
+
+		cached_type = get_delta_base_cache_type(p, obj_offset);
+		if (cached_type > 0) {
+			type = cached_type;
+			goto out;
+		}
+
 		/* Push the object we're going to leave behind */
 		if (poi_stack_nr >= poi_stack_alloc && poi_stack == small_poi_stack) {
 			poi_stack_alloc = alloc_nr(poi_stack_nr);
@@ -1326,6 +1341,11 @@ static enum object_type packed_to_object_type(struct repository *r,
 	}
 
 out:
+	if (type != OBJ_BAD) {
+		size_t i;
+		for (i = 0; i < poi_stack_nr; i++)
+			add_delta_base_cache(p, poi_stack[i], NULL, 0, type);
+	}
 	if (poi_stack != small_poi_stack)
 		free(poi_stack);
 	return type;
@@ -1385,6 +1405,13 @@ get_delta_base_cache_entry(struct packed_git *p, off_t base_offset)
 	return e ? container_of(e, struct delta_base_cache_entry, ent) : NULL;
 }
 
+static enum object_type get_delta_base_cache_type(struct packed_git *p, off_t curpos)
+{
+	struct delta_base_cache_entry *ent =
+		get_delta_base_cache_entry(p, curpos);
+	return ent ? ent->type : OBJ_BAD;
+}
+
 static int delta_base_cache_key_eq(const struct delta_base_cache_key *a,
 				   const struct delta_base_cache_key *b)
 {
@@ -1433,7 +1460,7 @@ static void *cache_or_unpack_entry(struct repository *r, struct packed_git *p,
 	struct delta_base_cache_entry *ent;
 
 	ent = get_delta_base_cache_entry(p, base_offset);
-	if (!ent)
+	if (!ent || !ent->data)
 		return unpack_entry(r, p, base_offset, type, base_size);
 
 	if (type)
@@ -1677,7 +1704,7 @@ void *unpack_entry(struct repository *r, struct packed_git *p, off_t obj_offset,
 	struct delta_base_cache_entry *ent;
 
 	ent = get_delta_base_cache_entry(p, curpos);
-	if (ent) {
+	if (ent && ent->data) {
 		type = ent->type;
 		data = ent->data;
 		size = ent->size;
On Sun, Mar 22, 2020 at 12:04:24PM +0100, SZEDER Gábor wrote:
> On Sun, Mar 22, 2020 at 01:36:35AM -0400, Jeff King wrote:
> > The "--input=none" option
> > isn't "no input"; it's actually "take all existing graphed objects as
> > input" (i.e., it implies --append). So each of those objects will
> > already have been examined in an earlier stage.
>
> Yeah, we discussed it earlier that '--input=none' doesn't properly
> describe what it tries to accomplish, and suggested '--input=exists'
> (or was it 'existing'?) instead.

Good idea. I don't remember discussing '--input=exist{s,ing}' specifically,
but I do think that '--input=none' is not as clear as it could be. I sent
[1], which I think may be a step in the right direction. I'll look forward
to your thoughts.

> > > Where the last step is taking all commits listed in any pack, which is
> > > cheap to iterate.
> >
> > I'm not sure it's all that cheap. It has to find the type of every
> > object in every pack. And finding types involves walking delta chains.
>
> Indeed, that's just about the most expensive way to find all commits,
> in my experience it can be well over a magnitude slower than
> '--reachable'.

Thanks,
Taylor

[1]: https://lore.kernel.org/git/e0f42a2f3c0162a5d43bb2bce0f69264b59f92e9.1584994172.git.me@ttaylorr.com/
On Sun, Mar 22, 2020 at 10:45:31AM -0600, Taylor Blau wrote:
> So, I'm not sure what was going on with my original measurements, other
> than that I took them around midnight, so the chance for error was
> probably high. I'll go with your measurements, since they are both
> closer to what we'd see in the real world, and actually representative
> of the change here.

I wondered if perhaps you had stray commit-graph files sitting around. I
don't think we'd look at them if they aren't mentioned in the
commit-graph-chain file, though. And anyway, feeding all of the packed
objects would mean those would all be parsed in the earlier stage, so even
if we _did_ have a bunch of stuff to merge, it would have been fast even
with the old code. So... I dunno.

One other thing that's bitten me in the past is accidentally compiling with
the wrong optimization settings. I have my config.mak set up to compile
with -O0 by default (because it's faster and because I'm frequently
debugging anyway), and then I do a -O2 build for my daily driver. But when
performance testing, I sometimes accidentally compare -O0 and -O2 builds to
each other.

That's all very specific to my config.mak setup, but I know you've picked
up some of it, so it's possible you inherited that subtlety. :)

-Peff
On Sun, Mar 22, 2020 at 09:47:49AM -0600, Taylor Blau wrote:
> > > [1] I'm actually not quite sure about correctness here. It should be
> > > fine to generate a graph file without any given commit; readers will
> > > just have to load that commit the old-fashioned way. But at this
> > > phase of "commit-graph write", I think we'll already have done the
> > > close_reachable() check. What does it mean to throw away a commit at
> > > this stage? If we're the parent of another commit, then it will have
> > > trouble referring to us by a uint32_t. Will the actual writing phase
> > > barf, or will we generate an invalid graph file?
> >
> > It doesn't seem great. If I instrument Git like this to simulate an
> > object temporarily "missing" (if it were really missing the whole repo
> > would be corrupt; we're trying to see what would happen if a race causes
> > us to momentarily not see it):
>
> This is definitely a problem on either side of this patch, which is
> demonstrated by the fact that you applied your changes without my patch
> on top (and that my patch isn't changing anything substantial in this
> area like removing the 'continue' statement).
>
> Should we address this before moving on with my patch? I think that we
> *could*, but I'd rather go forward with what we have for now, since it's
> only improving the situation, and not introducing a new bug.

I do agree it's a problem before your patch. But I think your patch may
make it a lot more common, if only because it means we'd _actually_ be
dropping entries for objects that went away, instead of accidentally
keeping them due to re-using the graph result. So it probably is worth
trying to deal with it now, or at least thinking hard about it.

The trouble is that I'm not sure what _should_ happen. Aborting the
whole commit-graph generation seems overboard (and would be annoying for
cases where whole swaths of history became unreachable and went away;
presumably we'd be dropping _all_ of those objects, and the write phase
would be just fine). I have a feeling the correct solution is to do this
merging pass earlier, before we check close_reachable(). Or if not a
true merging pass, at least a pass to check which existing entries are
still valid. But I don't know how painful that reordering would be.

-Peff
On Sun, Mar 22, 2020 at 09:44:39AM -0600, Taylor Blau wrote:
> > Using QUICK means we won't waste time double-checking in the second
> > case. But it means we won't catch the first case, and we may generate a
> > new graph file that omits the object. They're both optimizations, and I
> > don't think we're impacting correctness[1], but it's not clear to me
> > whether one is a win over the other. We don't generally expect objects
> > we have to go away often.
> >
> > Skipping fetching seems a little more straight-forward to me. If we had
> > it in a graph file, presumably we had the actual object before, too. And
> > either we're in the first case above (we really do have it and just need
> > to double-check) in which case not saying QUICK would be enough. Or we
> > intentionally got rid of it. In which case downloading it just to
> > generate a cache is quite silly.
>
> I was going to write that I'm not entirely sure of this, but I tried to
> talk myself through it below, and I think that the right flag is *only*
> OBJECT_INFO_SKIP_FETCH_OBJECT.

Re-reading what I wrote, I think I didn't say it very well. But yes, that's
what I think we ought to do, too: only use SKIP_FETCH_OBJECT.

-Peff
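As a sketch of the flag choice being discussed (not the patch that was
ultimately applied, which sticks with plain repo_has_object_file()), an
existence check that passes only SKIP_FETCH_OBJECT might look like the
following; the helper name is made up for illustration:

#include "repository.h"
#include "object-store.h"

/*
 * Sketch only: in a partial clone, SKIP_FETCH_OBJECT keeps us from lazily
 * fetching an object just to rebuild a cache of it. Leaving QUICK unset
 * preserves the "double-check" on a miss (re-scanning the pack directory),
 * so an object that was merely repacked out from under us is still found
 * rather than dropped from the new graph.
 */
static int object_still_exists(struct repository *r,
			       const struct object_id *oid)
{
	return repo_has_object_file_with_flags(r, oid,
					       OBJECT_INFO_SKIP_FETCH_OBJECT);
}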
On Tue, Mar 24, 2020 at 02:11:59AM -0400, Jeff King wrote:
> On Sun, Mar 22, 2020 at 09:47:49AM -0600, Taylor Blau wrote:
> > > > [1] I'm actually not quite sure about correctness here. It should be
> > > > fine to generate a graph file without any given commit; readers will
> > > > just have to load that commit the old-fashioned way. But at this
> > > > phase of "commit-graph write", I think we'll already have done the
> > > > close_reachable() check. What does it mean to throw away a commit at
> > > > this stage? If we're the parent of another commit, then it will have
> > > > trouble referring to us by a uint32_t. Will the actual writing phase
> > > > barf, or will we generate an invalid graph file?
> > >
> > > It doesn't seem great. If I instrument Git like this to simulate an
> > > object temporarily "missing" (if it were really missing the whole repo
> > > would be corrupt; we're trying to see what would happen if a race causes
> > > us to momentarily not see it):
> >
> > This is definitely a problem on either side of this patch, which is
> > demonstrated by the fact that you applied your changes without my patch
> > on top (and that my patch isn't changing anything substantial in this
> > area like removing the 'continue' statement).
> >
> > Should we address this before moving on with my patch? I think that we
> > *could*, but I'd rather go forward with what we have for now, since it's
> > only improving the situation, and not introducing a new bug.
>
> I do agree it's a problem before your patch. But I think your patch may
> make it a lot more common, if only because it means we'd _actually_ be
> dropping entries for objects that went away, instead of accidentally
> keeping them due to re-using the graph result. So it probably is worth
> trying to deal with it now, or at least thinking hard about it.
>
> The trouble is that I'm not sure what _should_ happen. Aborting the
> whole commit-graph generation seems overboard (and would be annoying for
> cases where whole swaths of history became unreachable and went away;
> presumably we'd be dropping _all_ of those objects, and the write phase
> would be just fine). I have a feeling the correct solution is to do this
> merging pass earlier, before we check close_reachable(). Or if not a
> true merging pass, at least a pass to check which existing entries are
> still valid. But I don't know how painful that reordering would be.

Maybe, but I'm not sure that moving 'close_reachable' up would
necessarily solve all of our problems. Or, at least, that it wouldn't
create a set of new problems :).

Let's say that you have (1) some connected component of your history
that is written to a commit-graph, where (2) part of that cluster has
been dropped (e.g., due to corruption, a rogue 'git gc', etc), and (3)
you're writing a new graph with '--input=graphed'.

What is 'close_reachable' supposed to do? If some of the now-missing
commits are in the reachability closure of the commits that still exist
in the repository, then we're back to where we started. We *could* have
'close_reachable' take all missing commits and drop their ancestors and
descendants, but this creates quite the headache for 'close_reachable',
which now needs to keep track of such things.

I'm hopeful that this isn't so common in practice, but I'm also not
entirely sure one way or another. I can plan to deploy this patch to
GitHub's servers for a ~month and see if we experience it. If it turns
out to be common, I'll assume that others have this problem, too, in
which case we can go back and think more about it. But I'm hopeful that
this problem is rare enough that it isn't worth worrying about for
anybody, GitHub or not.

> -Peff

Thanks,
Taylor
On Tue, Mar 24, 2020 at 05:08:26PM -0600, Taylor Blau wrote:
> > The trouble is that I'm not sure what _should_ happen. Aborting the
> > whole commit-graph generation seems overboard (and would be annoying for
> > cases where whole swaths of history became unreachable and went away;
> > presumably we'd be dropping _all_ of those objects, and the write phase
> > would be just fine). I have a feeling the correct solution is to do this
> > merging pass earlier, before we check close_reachable(). Or if not a
> > true merging pass, at least a pass to check which existing entries are
> > still valid. But I don't know how painful that reordering would be.
>
> Maybe, but I'm not sure that moving 'close_reachable' up would
> necessarily solve all of our problems. Or, at least, that it wouldn't
> create a set of new problems :).
>
> Let's say that you have (1) some connected component of your history
> that is written to a commit-graph, where (2) part of that cluster has
> been dropped (e.g., due to corruption, a rogue 'git gc', etc), and (3)
> you're writing a new graph with '--input=graphed'.
>
> What is 'close_reachable' supposed to do? If some of the now-missing
> commits are in the reachability closure of the commits that still exist
> in the repository, then we're back to where we started. We *could* have
> 'close_reachable' take all missing commits and drop their ancestors and
> descendants, but this creates quite the headache for 'close_reachable',
> which now needs to keep track of such things.

Yeah, I think my suggestion was dumb. I was thinking that close_reachable()
might somehow fix things up, but all it ever does with the current code is
_add_ commits to the graph. And we would want the opposite: removing ones
which can no longer be present, because their ancestors are gone. That
would be possible, but would be quite a pain (you'd have to walk an
inversion of the parent DAG).

> I'm hopeful that this isn't so common in practice, but I'm also not
> entirely sure one way or another. I can plan to deploy this patch to
> GitHub's servers for a ~month and see if we experience it. If it turns
> out to be common, I'll assume that others have this problem, too, in
> which case we can go back and think more about it.

It _shouldn't_ be too common, because it implies that we have commit X but
not its ancestor X^. And the pruning and repacking code tries to avoid
throwing away X^ in such a case. It could still happen due to a race,
though.

So yeah, I think we should mostly ignore it. It's not a new problem with
your series (and the only reason it is more common with your series is that
the old code was actually incorrectly handling some cases). It might be
worth turning the "missing parent" BUG() into a die(), since we know it's
reachable. But that's largely academic, since in either case commit-graph
is going to barf and fail to write.

As a follow-up to this part for anyone else following the discussion:

> I can plan to deploy this patch to GitHub's servers for a ~month and
> see if we experience it.

...I don't think we'll actually generate good data here. We're probably
going to end up doing our "big maintenance" commit-graph roll-ups by
just feeding --reachable as input, and dropping all of the existing
graphs.

-Peff
On Fri, Mar 27, 2020 at 04:42:48AM -0400, Jeff King wrote:
> On Tue, Mar 24, 2020 at 05:08:26PM -0600, Taylor Blau wrote:
> > I can plan to deploy this patch to GitHub's servers for a ~month and
> > see if we experience it.
>
> ...I don't think we'll actually generate good data here. We're probably
> going to end up doing our "big maintenance" commit-graph roll-ups by
> just feeding --reachable as input, and dropping all of the existing
> graphs.

For what it's worth (and I'm not sure that it's worth much), this has only
been true for the last day or so. Before, we were running:

  $ git commit-graph write --split=merge-all --input=none

which *did* exercise this code quite frequently (and thus would have been
helped by this patch). But now, we are running something like:

  $ git commit-graph write --split=replace --input=reachable

...where '--split=replace' means "write a split commit-graph, but drop all
existing layers before writing it". This case is obviously not helped by
this patch, although I think the patch is worthwhile for callers who do the
first thing. I'll post patches about that shortly, after they've been a
little better vetted (we're only generating split commit-graphs on a
percentage of repositories for now).

> -Peff

Thanks,
Taylor
diff --git a/commit-graph.c b/commit-graph.c
index f013a84e29..c7cfadc786 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1599,14 +1599,18 @@ static void merge_commit_graph(struct write_commit_graph_context *ctx,
 
 	for (i = 0; i < g->num_commits; i++) {
 		struct object_id oid;
-		struct commit *result;
+		struct commit *result = NULL;
 
 		display_progress(ctx->progress, i + 1);
 
 		load_oid_from_graph(g, i + offset, &oid);
 
 		/* only add commits if they still exist in the repo */
-		result = lookup_commit_reference_gently(ctx->r, &oid, 1);
+		if (repo_has_object_file(ctx->r, &oid)) {
+			result = lookup_commit(ctx->r, &oid);
+			if (repo_parse_commit(ctx->r, result))
+				result = NULL;
+		}
 
 		if (result) {
 			ctx->commits.list[ctx->commits.nr] = result;
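The net effect of the hunk above is easier to picture factored out as a
stand-alone helper. This is a sketch only, not part of the patch, and the
function name is made up: unlike lookup_commit_reference_gently(), it never
calls deref_tag(), because every oid recorded in an existing graph layer is
already known to name a commit.

#include "repository.h"
#include "object-store.h"
#include "commit.h"

/*
 * Sketch only: return the commit named by oid if it still exists in the
 * object store, or NULL if it has been pruned or cannot be parsed. No tag
 * dereferencing is attempted, since graphed oids always name commits.
 */
static struct commit *graphed_commit_if_present(struct repository *r,
						const struct object_id *oid)
{
	struct commit *c;

	if (!repo_has_object_file(r, oid))
		return NULL;	/* pruned since the graph layer was written */

	c = lookup_commit(r, oid);
	if (!c || repo_parse_commit(r, c))
		return NULL;	/* corrupt, or unexpectedly not a commit */
	return c;
}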