Message ID | 20181204224238.50966-1-jonathantanmy@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [on,sb/more-repo-in-api] revision: use commit graph in get_reference() | expand |
On Tue, Dec 4, 2018 at 2:42 PM Jonathan Tan <jonathantanmy@google.com> wrote: > > When fetching into a repository, a connectivity check is first made by > check_exist_and_connected() in builtin/fetch.c that runs: > > git rev-list --objects --stdin --not --all --quiet <(list of objects) > > If the client repository has many refs, this command can be slow, > regardless of the nature of the server repository or what is being > fetched. A profiler reveals that most of the time is spent in > setup_revisions() (approx. 60/63), and of the time spent in > setup_revisions(), most of it is spent in parse_object() (approx. > 49/60). This is because setup_revisions() parses the target of every ref > (from "--all"), and parse_object() reads the buffer of the object. > > Reading the buffer is unnecessary if the repository has a commit graph > and if the ref points to a commit (which is typically the case). This > patch uses the commit graph wherever possible; on my computer, when I > run the above command with a list of 1 object on a many-ref repository, > I get a speedup from 1.8s to 1.0s. > > Another way to accomplish this effect would be to modify parse_object() > to use the commit graph if possible; however, I did not want to change > parse_object()'s current behavior of always checking the object > signature of the returned object. > > Signed-off-by: Jonathan Tan <jonathantanmy@google.com> > --- > This is on sb/more-repo-in-api because I'm using the repo_parse_commit() > function. This is a mere nicety, not strictly required. Before we had parse_commit(struct commit *) which would accomplish the same, (and we'd still have that afterwards as a #define falling back onto the_repository). As the function get_reference() is not the_repository safe as it contains a call to is_promisor_object() that is repository agnostic, I think it would be fair game to not depend on that series. I am not complaining, though. > A colleague noticed this issue when handling a mirror clone. > > Looking at the bigger picture, the speed of the connectivity check > during a fetch might be further improved by passing only the negotiation > tips (obtained through --negotiation-tip) instead of "--all". This patch > just handles the low-hanging fruit first. > --- > revision.c | 15 ++++++++++++++- > 1 file changed, 14 insertions(+), 1 deletion(-) > > diff --git a/revision.c b/revision.c > index b5108b75ab..e7da2c57ab 100644 > --- a/revision.c > +++ b/revision.c > @@ -212,7 +212,20 @@ static struct object *get_reference(struct rev_info *revs, const char *name, > { > struct object *object; > > - object = parse_object(revs->repo, oid); > + /* > + * If the repository has commit graphs, repo_parse_commit() avoids > + * reading the object buffer, so use it whenever possible. > + */ > + if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT) { > + struct commit *c = lookup_commit(revs->repo, oid); > + if (!repo_parse_commit(revs->repo, c)) > + object = (struct object *) c; > + else > + object = NULL; Would it make sense in this case to rely on parse_object below instead of assigning NULL? The reason for that would be that when lookup_commit returns NULL, we would try more broadly. AFAICT oid_object_info doesn't take advantage of the commit graph, but just looks up the object header, which is still less than completely parsing it. Then lookup_commit is overly strict, as it may return NULL as when there still is a type mismatch (I don't think a mismatch could happen here, as both rely on just the object store, and not the commit graph.), so this would be just defensive programming for the sake of it. I dunno. struct commit *c; if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT && (c = lookup_commit(revs->repo, oid)) && !repo_parse_commit(revs->repo, c)) object = (struct object *) c; else object = parse_object(revs->repo, oid); So with all that said, I still think this is a good patch. Thanks, Stefan > + } else { > + object = parse_object(revs->repo, oid); > + } > + > if (!object) { > if (revs->ignore_missing) > return object; > -- > 2.19.0.271.gfe8321ec05.dirty >
On Tue, Dec 04, 2018 at 02:42:38PM -0800, Jonathan Tan wrote: > diff --git a/revision.c b/revision.c > index b5108b75ab..e7da2c57ab 100644 > --- a/revision.c > +++ b/revision.c > @@ -212,7 +212,20 @@ static struct object *get_reference(struct rev_info *revs, const char *name, > { > struct object *object; > > - object = parse_object(revs->repo, oid); > + /* > + * If the repository has commit graphs, repo_parse_commit() avoids > + * reading the object buffer, so use it whenever possible. > + */ > + if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT) { > + struct commit *c = lookup_commit(revs->repo, oid); > + if (!repo_parse_commit(revs->repo, c)) > + object = (struct object *) c; > + else > + object = NULL; > + } else { > + object = parse_object(revs->repo, oid); > + } This seems like a reasonable thing to do, but I have sort of a meta-comment. In several places we've started doing this kind of "if it's this type of object, do X, otherwise do Y" optimization (e.g., handling large blobs for streaming). And in the many cases we end up doubling the effort to do object lookups: here we do one lookup to get the type, and then if it's not a commit (or if we don't have a commit graph) we end up parsing it anyway. I wonder if we could do better. In this instance, it might make sense to first see if we actually have a commit graph available (it might not have this object, of course, but at least we'd expect it to have most commits). In general, it would be nice if we had a more incremental API for accessing objects: open, get metadata, then read the data. That would make these kinds of optimizations "free". I don't have numbers for how much the extra lookups cost. The lookups are probably dwarfed by parse_object() in general, so even if we save only a few full object loads, it may be a win. It just seems a shame that we may be making the "slow" paths (when our type-specific check doesn't match) even slower. -Peff
Jonathan Tan <jonathantanmy@google.com> writes: > Looking at the bigger picture, the speed of the connectivity check > during a fetch might be further improved by passing only the negotiation > tips (obtained through --negotiation-tip) instead of "--all". This patch > just handles the low-hanging fruit first. That sounds like a good direction, when having to list excessive number of refs is the primary problem. When fetching their 'master' into our 'remotes/origin/master' and doing nothing else, we may end up showing only the latter, which will miss optimization opportunity a lot if the latest change made over there is to merge in the change we asked them to pull earlier (which would be greatly helped if we let them know about the tip of the topic they earlier pulled from us), but also avoids having to send irrelevant refs that point at tags addded to months old states. So there is a subtle trade-off between sending more refs to reduce the resulting packfile, and sending fewer refs to reduce the cost of the "have" exchange. Changing the way to throw each object pointed at by a ref into the queue to be emitted in the "have" exchange from regular object parsing to peeking of precomputed data would reduce the local cost of "have" exchange, but it does not reduce the network cost at all, though. As to the change being specific to get_reference() and not to parse_object(), I think what we see here is probably better, simply because parse_object() is not in the position to asssume that it is likely to be asked to parse commits, but I think get_reference() is, after looking at its callsites in revision.c. I do share the meta-comment concern with Peff, though. > --- > revision.c | 15 ++++++++++++++- > 1 file changed, 14 insertions(+), 1 deletion(-) > > diff --git a/revision.c b/revision.c > index b5108b75ab..e7da2c57ab 100644 > --- a/revision.c > +++ b/revision.c > @@ -212,7 +212,20 @@ static struct object *get_reference(struct rev_info *revs, const char *name, > { > struct object *object; > > - object = parse_object(revs->repo, oid); > + /* > + * If the repository has commit graphs, repo_parse_commit() avoids > + * reading the object buffer, so use it whenever possible. > + */ > + if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT) { > + struct commit *c = lookup_commit(revs->repo, oid); > + if (!repo_parse_commit(revs->repo, c)) > + object = (struct object *) c; > + else > + object = NULL; > + } else { > + object = parse_object(revs->repo, oid); > + } > + > if (!object) { > if (revs->ignore_missing) > return object;
> > This is on sb/more-repo-in-api because I'm using the repo_parse_commit() > > function. > > This is a mere nicety, not strictly required. > Before we had parse_commit(struct commit *) which would accomplish the > same, (and we'd still have that afterwards as a #define falling back onto > the_repository). As the function get_reference() is not the_repository safe > as it contains a call to is_promisor_object() that is repository > agnostic, I think > it would be fair game to not depend on that series. I am not > complaining, though. Good point - I'll base the next version on master (and add a TODO explaining which functions are not yet converted). > AFAICT oid_object_info doesn't take advantage of the commit graph, > but just looks up the object header, which is still less than completely > parsing it. Then lookup_commit is overly strict, as it may return > NULL as when there still is a type mismatch (I don't think a mismatch > could happen here, as both rely on just the object store, and not the > commit graph.), so this would be just defensive programming for > the sake of it. I dunno. > > struct commit *c; > > if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT && > (c = lookup_commit(revs->repo, oid)) && > !repo_parse_commit(revs->repo, c)) > object = (struct object *) c; > else > object = parse_object(revs->repo, oid); I like this way better - I'll do it in the next version.
Also CC-ing Stolee since I mention multi-pack indices at the end. > This seems like a reasonable thing to do, but I have sort of a > meta-comment. In several places we've started doing this kind of "if > it's this type of object, do X, otherwise do Y" optimization (e.g., > handling large blobs for streaming). > > And in the many cases we end up doubling the effort to do object > lookups: here we do one lookup to get the type, and then if it's not a > commit (or if we don't have a commit graph) we end up parsing it anyway. > > I wonder if we could do better. In this instance, it might make sense > to first see if we actually have a commit graph available (it might not > have this object, of course, but at least we'd expect it to have most > commits). This makes sense - I thought I shouldn't mention the commit graph in the code since it seems like a layering violation, but I felt the need to mention commit graph in a comment, so maybe the need to mention commit graph in the code is there too. Subsequently, maybe the lookup-for-type could be replaced by a lookup-in-commit-graph (maybe by using parse_commit_in_graph() directly), which should be at least slightly faster. > In general, it would be nice if we had a more incremental API > for accessing objects: open, get metadata, then read the data. That > would make these kinds of optimizations "free". Would this be assuming that to read the data, you would (1) first need to read the metadata, and (2) there would be no redundancy in reading the two? It seems to me that for loose objects, you would want to perform all your reads at once, since any read requires opening the file, and for commit graphs, you just want to read what you want, since the metadata and the data are in separate places. > I don't have numbers for how much the extra lookups cost. The lookups > are probably dwarfed by parse_object() in general, so even if we save > only a few full object loads, it may be a win. It just seems a shame > that we may be making the "slow" paths (when our type-specific check > doesn't match) even slower. I agree. I think it will always remain a tradeoff when we have multiple data sources of objects (loose, packed, commit graph - and we can't unify them all, since they each have their uses). Unless the multi-pack index can reference commit graphs as well...then it could be our first point of reference without introducing any inefficiencies...
On Thu, Dec 06, 2018 at 03:54:46PM -0800, Jonathan Tan wrote: > This makes sense - I thought I shouldn't mention the commit graph in the > code since it seems like a layering violation, but I felt the need to > mention commit graph in a comment, so maybe the need to mention commit > graph in the code is there too. Subsequently, maybe the lookup-for-type > could be replaced by a lookup-in-commit-graph (maybe by using > parse_commit_in_graph() directly), which should be at least slightly > faster. That makes more sense to me. If we don't have a commit graph at all, it's a quick noop. If we do, we might binary search in the list of commits for a non-commit. But that's strictly faster than finding the object's type (which involves a binary search of a larger list, followed by actually accessing the type info). > > In general, it would be nice if we had a more incremental API > > for accessing objects: open, get metadata, then read the data. That > > would make these kinds of optimizations "free". > > Would this be assuming that to read the data, you would (1) first need to > read the metadata, and (2) there would be no redundancy in reading the > two? It seems to me that for loose objects, you would want to perform > all your reads at once, since any read requires opening the file, and > for commit graphs, you just want to read what you want, since the > metadata and the data are in separate places. By metadata here, I don't mean the commit-graph data, but just the object type and size. So I'm imagining an interface more like: - object_open() locates the object, and stores either the pack file/offset or a descriptor to a loose path in an opaque handle struct - object_size() and object_type() on that handle would do what you expect. For loose objects, these would parse the header (the equivalent of unpack_sha1_header()). For packed ones, they'd use the object header in the pack (and chase down the delta bits as needed). - object_contents() would return the full content - object_read() could sequentially read a subset of the file (this could replace the streaming interface we currently have) We have most of the low-level bits for this already, if you poke into what object_info_extended() is doing. We just don't have them packaged in an interface which can persist across multiple calls. With an interface like that, parse_object()'s large-blob check could be something like the patch below. But your case here is a bit more interesting. If we have a commit graph, then we can avoid opening (or even finding!) the on-disk object at all. So I actually think it makes sense to just check the commit-graph first, as discussed above. --- diff --git a/object.c b/object.c index e54160550c..afce58c0bc 100644 --- a/object.c +++ b/object.c @@ -254,23 +254,31 @@ struct object *parse_object(struct repository *r, const struct object_id *oid) const struct object_id *repl = lookup_replace_object(r, oid); void *buffer; struct object *obj; + struct object_handle oh; obj = lookup_object(r, oid->hash); if (obj && obj->parsed) return obj; - if ((obj && obj->type == OBJ_BLOB && has_object_file(oid)) || - (!obj && has_object_file(oid) && - oid_object_info(r, oid, NULL) == OBJ_BLOB)) { - if (check_object_signature(repl, NULL, 0, NULL) < 0) { + if (object_open(&oh, oid) < 0) + return NULL; /* missing object */ + + if (object_type(&oh) == OBJ_BLOB) { + /* this will call object_read() on 4k chunks */ + if (check_object_signature_stream(&oh, oid)) { error(_("sha1 mismatch %s"), oid_to_hex(oid)); return NULL; } + object_close(&oh); /* we don't care about contents */ parse_blob_buffer(lookup_blob(r, oid), NULL, 0); return lookup_object(r, oid->hash); } - buffer = read_object_file(oid, &type, &size); + type = object_type(&oh); + size = object_size(&oh); + buffer = object_contents(&oh); + object_close(&oh); + if (buffer) { if (check_object_signature(repl, buffer, size, type_name(type)) < 0) { free(buffer);
On 12/6/2018 6:36 PM, Jonathan Tan wrote: >> AFAICT oid_object_info doesn't take advantage of the commit graph, >> but just looks up the object header, which is still less than completely >> parsing it. Then lookup_commit is overly strict, as it may return >> NULL as when there still is a type mismatch (I don't think a mismatch >> could happen here, as both rely on just the object store, and not the >> commit graph.), so this would be just defensive programming for >> the sake of it. I dunno. >> >> struct commit *c; >> >> if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT && >> (c = lookup_commit(revs->repo, oid)) && >> !repo_parse_commit(revs->repo, c)) >> object = (struct object *) c; >> else >> object = parse_object(revs->repo, oid); > I like this way better - I'll do it in the next version. If we do _not_ have a commit-graph or if the commit-graph does not have that commit, this will have the same performance problem, right? Should we instead create a direct dependence on the commit-graph, and try to parse the oid from the graph directly? If it succeeds, then we learn that the object is a commit, in addition to all of the parsing work. This means we could avoid oid_object_info() loading data if we succeed. We would fall back to parse_object() if it fails. I was thinking this should be a simple API call to parse_commit_in_graph(), but that requires a struct commit filled with an oid, which is not the best idea if we don't actually know it is a commit yet. The approach I recommend would then be more detailed: 1. Modify find_commit_in_graph() to take a struct object_id instead of a struct commit. This helps find the integer position in the graph. That position can be used in fill_commit_in_graph() to load the commit contents. Keep find_commit_in_graph() static as it should not be a public function. 2. Create a public function with prototype struct commit *try_parse_commit_from_graph(struct repository *r, struct object_id *oid) that returns a commit struct fully parsed if and only if the repository has that oid. It can call find_commit_in_graph(), then lookup_commit() and fill_commit_in_graph() to create the commit and parse the data. 3. In replace of the snippet above, do: struct commit *c; if ((c = try_parse_commit_from_graph(revs->repo, oid)) object = (struct object *)c; else object = parse_object(revs->repo, oid); A similar pattern _could_ be used in parse_object(), but I don't recommend doing this pattern unless we have a reasonable suspicion that we are going to parse commits more often than other objects. (It adds an O(log(# commits)) binary search to each object.) A final thought: consider making this "try the commit graph first, but fall back to parse_object()" a library function with a name like struct object *parse_probably_commit(struct repository *r, struct object_id *oid) so other paths that are parsing a lot of commits (but also maybe tags) could use the logic. Thanks! -Stolee
On Tue, Dec 04, 2018 at 02:42:38PM -0800, Jonathan Tan wrote: > When fetching into a repository, a connectivity check is first made by > check_exist_and_connected() in builtin/fetch.c that runs: > > git rev-list --objects --stdin --not --all --quiet <(list of objects) > > If the client repository has many refs, this command can be slow, > regardless of the nature of the server repository or what is being > fetched. A profiler reveals that most of the time is spent in > setup_revisions() (approx. 60/63), and of the time spent in > setup_revisions(), most of it is spent in parse_object() (approx. > 49/60). This is because setup_revisions() parses the target of every ref > (from "--all"), and parse_object() reads the buffer of the object. > > Reading the buffer is unnecessary if the repository has a commit graph > and if the ref points to a commit (which is typically the case). This > patch uses the commit graph wherever possible; on my computer, when I > run the above command with a list of 1 object on a many-ref repository, > I get a speedup from 1.8s to 1.0s. > > Another way to accomplish this effect would be to modify parse_object() > to use the commit graph if possible; however, I did not want to change > parse_object()'s current behavior of always checking the object > signature of the returned object. > > Signed-off-by: Jonathan Tan <jonathantanmy@google.com> > --- > This is on sb/more-repo-in-api because I'm using the repo_parse_commit() > function. > > A colleague noticed this issue when handling a mirror clone. > > Looking at the bigger picture, the speed of the connectivity check > during a fetch might be further improved by passing only the negotiation > tips (obtained through --negotiation-tip) instead of "--all". This patch > just handles the low-hanging fruit first. > --- I stumbled upon a regression that bisects down to this commit ec0c5798ee (revision: use commit graph in get_reference(), 2018-12-04): $ ~/src/git/bin-wrappers/git version git version 2.19.1.566.gec0c5798ee $ ~/src/git/bin-wrappers/git commit-graph write --reachable Computing commit graph generation numbers: 100% (58994/58994), done. $ ~/src/git/bin-wrappers/git status HEAD detached at origin/pu nothing to commit, working tree clean $ ~/src/git/bin-wrappers/git -c core.commitGraph=false describe --dirty v2.20.1-833-gcb3b9e7ee3 $ ~/src/git/bin-wrappers/git -c core.commitGraph=true describe --dirty v2.20.1-833-gcb3b9e7ee3 It's all good with only '--dirty', but watch this with '--all --dirty': $ ~/src/git/bin-wrappers/git -c core.commitGraph=false describe --all --dirty remotes/origin/pu $ ~/src/git/bin-wrappers/git -c core.commitGraph=true describe --all --dirty remotes/origin/pu-dirty IOW if the commit-graph is enabled, then my clean worktree is reported as dirty. And to add a cherry on top of my confusion: $ git checkout v2.20.0 Previous HEAD position was cb3b9e7ee3 Merge branch 'jh/trace2' into pu HEAD is now at 5d826e9729 Git 2.20 $ ~/src/git/bin-wrappers/git -c core.commitGraph=true describe --all --dirty tags/v2.20.0 It's clean even with '--all' and commit-graph enabled, but watch this: $ git branch this-will-screw-it-up $ ~/src/git/bin-wrappers/git -c core.commitGraph=true describe --all --dirty tags/v2.20.0-dirty Have fun! :) > revision.c | 15 ++++++++++++++- > 1 file changed, 14 insertions(+), 1 deletion(-) > > diff --git a/revision.c b/revision.c > index b5108b75ab..e7da2c57ab 100644 > --- a/revision.c > +++ b/revision.c > @@ -212,7 +212,20 @@ static struct object *get_reference(struct rev_info *revs, const char *name, > { > struct object *object; > > - object = parse_object(revs->repo, oid); > + /* > + * If the repository has commit graphs, repo_parse_commit() avoids > + * reading the object buffer, so use it whenever possible. > + */ > + if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT) { > + struct commit *c = lookup_commit(revs->repo, oid); > + if (!repo_parse_commit(revs->repo, c)) > + object = (struct object *) c; > + else > + object = NULL; > + } else { > + object = parse_object(revs->repo, oid); > + } > + > if (!object) { > if (revs->ignore_missing) > return object; > -- > 2.19.0.271.gfe8321ec05.dirty >
> Have fun! :)
$ git gc
...
Computing commit graph generation numbers: 100% (164264/164264), done.
$ ./git version
git version 2.20.1.775.g2313a6b87fe.dirty
# pu + one commit addressing
# https://public-inbox.org/git/CAGZ79kaUg3NTRPRi5mLk6ag87iDB_Ltq_kEiLwZ2HGZ+-Vsd8w@mail.gmail.com/
$ ./git -c core.commitGraph=false describe --dirty --all
remotes/gitgitgadget/pu-1-g03745a36e6
$ ./git -c core.commitGraph=true describe --dirty --all
remotes/gitgitgadget/pu-1-g03745a36e6
$ ./git -c core.commitGraph=true describe --dirty
v2.20.1-776-g03745a36e6
$ ./git -c core.commitGraph=false describe --dirty
v2.20.1-776-g03745a36e6
it looks like it is working correctly here?
Or did I miss some hint as in how to setup the reproduction properly?
> > Have fun! :) > > $ git gc > ... > Computing commit graph generation numbers: 100% (164264/164264), done. > $ ./git version > git version 2.20.1.775.g2313a6b87fe.dirty > # pu + one commit addressing > # https://public-inbox.org/git/CAGZ79kaUg3NTRPRi5mLk6ag87iDB_Ltq_kEiLwZ2HGZ+-Vsd8w@mail.gmail.com/ > > $ ./git -c core.commitGraph=false describe --dirty --all > remotes/gitgitgadget/pu-1-g03745a36e6 > $ ./git -c core.commitGraph=true describe --dirty --all > remotes/gitgitgadget/pu-1-g03745a36e6 > $ ./git -c core.commitGraph=true describe --dirty > v2.20.1-776-g03745a36e6 > $ ./git -c core.commitGraph=false describe --dirty > v2.20.1-776-g03745a36e6 > > it looks like it is working correctly here? > Or did I miss some hint as in how to setup the reproduction properly? I could reproduce it with version ec0c5798ee (as stated in Szeder's original email) - as stated by Szeder, it doesn't work, but its parent does. I'm looking into this, but any help is appreciated.
On Fri, Jan 25, 2019 at 11:56:38AM -0800, Stefan Beller wrote: > > Have fun! :) > > $ git gc > ... > Computing commit graph generation numbers: 100% (164264/164264), done. > $ ./git version > git version 2.20.1.775.g2313a6b87fe.dirty > # pu + one commit addressing > # https://public-inbox.org/git/CAGZ79kaUg3NTRPRi5mLk6ag87iDB_Ltq_kEiLwZ2HGZ+-Vsd8w@mail.gmail.com/ > > $ ./git -c core.commitGraph=false describe --dirty --all > remotes/gitgitgadget/pu-1-g03745a36e6 > $ ./git -c core.commitGraph=true describe --dirty --all > remotes/gitgitgadget/pu-1-g03745a36e6 > $ ./git -c core.commitGraph=true describe --dirty > v2.20.1-776-g03745a36e6 > $ ./git -c core.commitGraph=false describe --dirty > v2.20.1-776-g03745a36e6 > > it looks like it is working correctly here? > Or did I miss some hint as in how to setup the reproduction properly? How many refs are pointing to the commits you tried to describe? In the git repo, with an all-encompassing commit-graph it seems to be important that more than one refs point there. I could reproduce the issue in a fresh git.git clone with Git built from commit 2313a6b87fe: $ git clone https://github.com/git/git Cloning into 'git'... <...> $ git commit-graph write --reachable Computing commit graph generation numbers: 100% (56722/56722), done. # 'HOME=.' makes sure that this command doesn't read my global # gitconfig. $ HOME=. ~/src/git/git describe --all --dirty heads/master-dirty $ git checkout origin/pu HEAD is now at cb3b9e7ee3 Merge branch 'jh/trace2' into pu $ HOME=. ~/src/git/git -c core.commitGraph=true describe --all --dirty remotes/origin/pu $ git branch a-second-ref-pointing-at-pu buzz ~/src/tmp/git $ HOME=. ~/src/git/git -c core.commitGraph=true describe --all --dirty heads/a-second-ref-pointing-at-pu-dirty I could also reproduce it in other repositories lying around here, but could not manage to reproduce it in a minimal repository. The smallest I could get is the test script below, where the last test fails, i.e. the clean worktree is described as '-dirty', when the to-be-described HEAD is not in the commit-graph. I suspect this is the same issue, because it bisects down to this same commit. --- >8 --- Subject: [PATCH] test --- t/t9999-test.sh | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) create mode 100755 t/t9999-test.sh diff --git a/t/t9999-test.sh b/t/t9999-test.sh new file mode 100755 index 0000000000..cd1286e157 --- /dev/null +++ b/t/t9999-test.sh @@ -0,0 +1,26 @@ +#!/bin/sh + +test_description='test' + +. ./test-lib.sh + +test_expect_success 'setup' ' + test_commit one && + test_commit two && + # Two refs point there. + git for-each-ref --points-at=two && + git config core.commitGraph true +' + +test_expect_success 'full commit-graph' ' + git commit-graph write --reachable && + verbose test "$(git describe --all --dirty)" = tags/two +' + +test_expect_success 'partial commit-graph, described HEAD is not in C-G' ' + git rev-parse one | git commit-graph write --stdin-commits && + git status && + verbose test "$(git describe --all --dirty)" = tags/two +' + +test_done
On Fri, Jan 25, 2019 at 11:14:14PM +0100, SZEDER Gábor wrote: > On Fri, Jan 25, 2019 at 11:56:38AM -0800, Stefan Beller wrote: > > > Have fun! :) > > > > $ git gc > > ... > > Computing commit graph generation numbers: 100% (164264/164264), done. > > $ ./git version > > git version 2.20.1.775.g2313a6b87fe.dirty > > # pu + one commit addressing > > # https://public-inbox.org/git/CAGZ79kaUg3NTRPRi5mLk6ag87iDB_Ltq_kEiLwZ2HGZ+-Vsd8w@mail.gmail.com/ > > > > $ ./git -c core.commitGraph=false describe --dirty --all > > remotes/gitgitgadget/pu-1-g03745a36e6 > > $ ./git -c core.commitGraph=true describe --dirty --all > > remotes/gitgitgadget/pu-1-g03745a36e6 > > $ ./git -c core.commitGraph=true describe --dirty > > v2.20.1-776-g03745a36e6 > > $ ./git -c core.commitGraph=false describe --dirty > > v2.20.1-776-g03745a36e6 > > > > it looks like it is working correctly here? > > Or did I miss some hint as in how to setup the reproduction properly? > > How many refs are pointing to the commits you tried to describe? In > the git repo, with an all-encompassing commit-graph it seems to be > important that more than one refs point there. Erm, let me try to clarify this sentence. In general it seems to be important that more than one refs point to the described HEAD. In the git repo (and in other non-toy repos) I could reproduce the issue with a commit-graph file containing all commits, but in a minimal repo only when the described HEAD was not in the commit-graph. > I could reproduce the > issue in a fresh git.git clone with Git built from commit 2313a6b87fe: > > $ git clone https://github.com/git/git > Cloning into 'git'... > <...> > $ git commit-graph write --reachable > Computing commit graph generation numbers: 100% (56722/56722), done. > # 'HOME=.' makes sure that this command doesn't read my global > # gitconfig. > $ HOME=. ~/src/git/git describe --all --dirty > heads/master-dirty > $ git checkout origin/pu > HEAD is now at cb3b9e7ee3 Merge branch 'jh/trace2' into pu > $ HOME=. ~/src/git/git -c core.commitGraph=true describe --all --dirty > remotes/origin/pu > $ git branch a-second-ref-pointing-at-pu buzz ~/src/tmp/git > $ HOME=. ~/src/git/git -c core.commitGraph=true describe --all --dirty > heads/a-second-ref-pointing-at-pu-dirty > > I could also reproduce it in other repositories lying around here, but > could not manage to reproduce it in a minimal repository. > > The smallest I could get is the test script below, where the last test > fails, i.e. the clean worktree is described as '-dirty', when the > to-be-described HEAD is not in the commit-graph. I suspect this is > the same issue, because it bisects down to this same commit. > > --- >8 --- > > Subject: [PATCH] test > > --- > t/t9999-test.sh | 26 ++++++++++++++++++++++++++ > 1 file changed, 26 insertions(+) > create mode 100755 t/t9999-test.sh > > diff --git a/t/t9999-test.sh b/t/t9999-test.sh > new file mode 100755 > index 0000000000..cd1286e157 > --- /dev/null > +++ b/t/t9999-test.sh > @@ -0,0 +1,26 @@ > +#!/bin/sh > + > +test_description='test' > + > +. ./test-lib.sh > + > +test_expect_success 'setup' ' > + test_commit one && > + test_commit two && > + # Two refs point there. > + git for-each-ref --points-at=two && > + git config core.commitGraph true > +' > + > +test_expect_success 'full commit-graph' ' > + git commit-graph write --reachable && > + verbose test "$(git describe --all --dirty)" = tags/two > +' > + > +test_expect_success 'partial commit-graph, described HEAD is not in C-G' ' > + git rev-parse one | git commit-graph write --stdin-commits && > + git status && > + verbose test "$(git describe --all --dirty)" = tags/two > +' > + > +test_done > -- > 2.20.1.642.gc55a771460 >
diff --git a/revision.c b/revision.c index b5108b75ab..e7da2c57ab 100644 --- a/revision.c +++ b/revision.c @@ -212,7 +212,20 @@ static struct object *get_reference(struct rev_info *revs, const char *name, { struct object *object; - object = parse_object(revs->repo, oid); + /* + * If the repository has commit graphs, repo_parse_commit() avoids + * reading the object buffer, so use it whenever possible. + */ + if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT) { + struct commit *c = lookup_commit(revs->repo, oid); + if (!repo_parse_commit(revs->repo, c)) + object = (struct object *) c; + else + object = NULL; + } else { + object = parse_object(revs->repo, oid); + } + if (!object) { if (revs->ignore_missing) return object;
When fetching into a repository, a connectivity check is first made by check_exist_and_connected() in builtin/fetch.c that runs: git rev-list --objects --stdin --not --all --quiet <(list of objects) If the client repository has many refs, this command can be slow, regardless of the nature of the server repository or what is being fetched. A profiler reveals that most of the time is spent in setup_revisions() (approx. 60/63), and of the time spent in setup_revisions(), most of it is spent in parse_object() (approx. 49/60). This is because setup_revisions() parses the target of every ref (from "--all"), and parse_object() reads the buffer of the object. Reading the buffer is unnecessary if the repository has a commit graph and if the ref points to a commit (which is typically the case). This patch uses the commit graph wherever possible; on my computer, when I run the above command with a list of 1 object on a many-ref repository, I get a speedup from 1.8s to 1.0s. Another way to accomplish this effect would be to modify parse_object() to use the commit graph if possible; however, I did not want to change parse_object()'s current behavior of always checking the object signature of the returned object. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> --- This is on sb/more-repo-in-api because I'm using the repo_parse_commit() function. A colleague noticed this issue when handling a mirror clone. Looking at the bigger picture, the speed of the connectivity check during a fetch might be further improved by passing only the negotiation tips (obtained through --negotiation-tip) instead of "--all". This patch just handles the low-hanging fruit first. --- revision.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-)