diff mbox series

[on,sb/more-repo-in-api] revision: use commit graph in get_reference()

Message ID 20181204224238.50966-1-jonathantanmy@google.com (mailing list archive)
State New, archived
Headers show
Series [on,sb/more-repo-in-api] revision: use commit graph in get_reference() | expand

Commit Message

Jonathan Tan Dec. 4, 2018, 10:42 p.m. UTC
When fetching into a repository, a connectivity check is first made by
check_exist_and_connected() in builtin/fetch.c that runs:

  git rev-list --objects --stdin --not --all --quiet <(list of objects)

If the client repository has many refs, this command can be slow,
regardless of the nature of the server repository or what is being
fetched. A profiler reveals that most of the time is spent in
setup_revisions() (approx. 60/63), and of the time spent in
setup_revisions(), most of it is spent in parse_object() (approx.
49/60). This is because setup_revisions() parses the target of every ref
(from "--all"), and parse_object() reads the buffer of the object.

Reading the buffer is unnecessary if the repository has a commit graph
and if the ref points to a commit (which is typically the case). This
patch uses the commit graph wherever possible; on my computer, when I
run the above command with a list of 1 object on a many-ref repository,
I get a speedup from 1.8s to 1.0s.

Another way to accomplish this effect would be to modify parse_object()
to use the commit graph if possible; however, I did not want to change
parse_object()'s current behavior of always checking the object
signature of the returned object.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
This is on sb/more-repo-in-api because I'm using the repo_parse_commit()
function.

A colleague noticed this issue when handling a mirror clone.

Looking at the bigger picture, the speed of the connectivity check
during a fetch might be further improved by passing only the negotiation
tips (obtained through --negotiation-tip) instead of "--all". This patch
just handles the low-hanging fruit first.
---
 revision.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

Comments

Stefan Beller Dec. 4, 2018, 11:12 p.m. UTC | #1
On Tue, Dec 4, 2018 at 2:42 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> When fetching into a repository, a connectivity check is first made by
> check_exist_and_connected() in builtin/fetch.c that runs:
>
>   git rev-list --objects --stdin --not --all --quiet <(list of objects)
>
> If the client repository has many refs, this command can be slow,
> regardless of the nature of the server repository or what is being
> fetched. A profiler reveals that most of the time is spent in
> setup_revisions() (approx. 60/63), and of the time spent in
> setup_revisions(), most of it is spent in parse_object() (approx.
> 49/60). This is because setup_revisions() parses the target of every ref
> (from "--all"), and parse_object() reads the buffer of the object.
>
> Reading the buffer is unnecessary if the repository has a commit graph
> and if the ref points to a commit (which is typically the case). This
> patch uses the commit graph wherever possible; on my computer, when I
> run the above command with a list of 1 object on a many-ref repository,
> I get a speedup from 1.8s to 1.0s.
>
> Another way to accomplish this effect would be to modify parse_object()
> to use the commit graph if possible; however, I did not want to change
> parse_object()'s current behavior of always checking the object
> signature of the returned object.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> This is on sb/more-repo-in-api because I'm using the repo_parse_commit()
> function.

This is a mere nicety, not strictly required.
Before we had parse_commit(struct commit *) which would accomplish the
same, (and we'd still have that afterwards as a #define falling back onto
the_repository). As the function get_reference() is not the_repository safe
as it contains a call to is_promisor_object() that is repository
agnostic, I think
it would be fair game to not depend on that series. I am not
complaining, though.

> A colleague noticed this issue when handling a mirror clone.
>
> Looking at the bigger picture, the speed of the connectivity check
> during a fetch might be further improved by passing only the negotiation
> tips (obtained through --negotiation-tip) instead of "--all". This patch
> just handles the low-hanging fruit first.
> ---
>  revision.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/revision.c b/revision.c
> index b5108b75ab..e7da2c57ab 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -212,7 +212,20 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
>  {
>         struct object *object;
>
> -       object = parse_object(revs->repo, oid);
> +       /*
> +        * If the repository has commit graphs, repo_parse_commit() avoids
> +        * reading the object buffer, so use it whenever possible.
> +        */
> +       if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT) {
> +               struct commit *c = lookup_commit(revs->repo, oid);
> +               if (!repo_parse_commit(revs->repo, c))
> +                       object = (struct object *) c;
> +               else
> +                       object = NULL;

Would it make sense in this case to rely on parse_object below
instead of assigning NULL? The reason for that would be that
when lookup_commit returns NULL, we would try more broadly.

AFAICT oid_object_info doesn't take advantage of the commit graph,
but just looks up the object header, which is still less than completely
parsing it. Then lookup_commit is overly strict, as it may return
NULL as when there still is a type mismatch (I don't think a mismatch
could happen here, as both rely on just the object store, and not the
commit graph.), so this would be just defensive programming for
the sake of it. I dunno.

    struct commit *c;

    if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT &&
        (c = lookup_commit(revs->repo, oid)) &&
        !repo_parse_commit(revs->repo, c))
            object = (struct object *) c;
    else
        object = parse_object(revs->repo, oid);


So with all that said, I still think this is a good patch.

Thanks,
Stefan

> +       } else {
> +               object = parse_object(revs->repo, oid);
> +       }
> +
>         if (!object) {
>                 if (revs->ignore_missing)
>                         return object;
> --
> 2.19.0.271.gfe8321ec05.dirty
>
Jeff King Dec. 5, 2018, 4:54 a.m. UTC | #2
On Tue, Dec 04, 2018 at 02:42:38PM -0800, Jonathan Tan wrote:

> diff --git a/revision.c b/revision.c
> index b5108b75ab..e7da2c57ab 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -212,7 +212,20 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
>  {
>  	struct object *object;
>  
> -	object = parse_object(revs->repo, oid);
> +	/*
> +	 * If the repository has commit graphs, repo_parse_commit() avoids
> +	 * reading the object buffer, so use it whenever possible.
> +	 */
> +	if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT) {
> +		struct commit *c = lookup_commit(revs->repo, oid);
> +		if (!repo_parse_commit(revs->repo, c))
> +			object = (struct object *) c;
> +		else
> +			object = NULL;
> +	} else {
> +		object = parse_object(revs->repo, oid);
> +	}

This seems like a reasonable thing to do, but I have sort of a
meta-comment. In several places we've started doing this kind of "if
it's this type of object, do X, otherwise do Y" optimization (e.g.,
handling large blobs for streaming).

And in the many cases we end up doubling the effort to do object
lookups: here we do one lookup to get the type, and then if it's not a
commit (or if we don't have a commit graph) we end up parsing it anyway.

I wonder if we could do better. In this instance, it might make sense
to first see if we actually have a commit graph available (it might not
have this object, of course, but at least we'd expect it to have most
commits). In general, it would be nice if we had a more incremental API
for accessing objects: open, get metadata, then read the data. That
would make these kinds of optimizations "free".

I don't have numbers for how much the extra lookups cost. The lookups
are probably dwarfed by parse_object() in general, so even if we save
only a few full object loads, it may be a win. It just seems a shame
that we may be making the "slow" paths (when our type-specific check
doesn't match) even slower.

-Peff
Junio C Hamano Dec. 5, 2018, 11:15 p.m. UTC | #3
Jonathan Tan <jonathantanmy@google.com> writes:

> Looking at the bigger picture, the speed of the connectivity check
> during a fetch might be further improved by passing only the negotiation
> tips (obtained through --negotiation-tip) instead of "--all". This patch
> just handles the low-hanging fruit first.

That sounds like a good direction, when having to list excessive
number of refs is the primary problem.  When fetching their 'master'
into our 'remotes/origin/master' and doing nothing else, we may end
up showing only the latter, which will miss optimization opportunity
a lot if the latest change made over there is to merge in the change
we asked them to pull earlier (which would be greatly helped if we
let them know about the tip of the topic they earlier pulled from
us), but also avoids having to send irrelevant refs that point at
tags addded to months old states.  So there is a subtle trade-off
between sending more refs to reduce the resulting packfile, and
sending fewer refs to reduce the cost of the "have" exchange.

Changing the way to throw each object pointed at by a ref into the
queue to be emitted in the "have" exchange from regular object
parsing to peeking of precomputed data would reduce the local cost
of "have" exchange, but it does not reduce the network cost at all,
though.

As to the change being specific to get_reference() and not to
parse_object(), I think what we see here is probably better, simply
because parse_object() is not in the position to asssume that it is
likely to be asked to parse commits, but I think get_reference() is,
after looking at its callsites in revision.c.

I do share the meta-comment concern with Peff, though.

> ---
>  revision.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/revision.c b/revision.c
> index b5108b75ab..e7da2c57ab 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -212,7 +212,20 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
>  {
>  	struct object *object;
>  
> -	object = parse_object(revs->repo, oid);
> +	/*
> +	 * If the repository has commit graphs, repo_parse_commit() avoids
> +	 * reading the object buffer, so use it whenever possible.
> +	 */
> +	if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT) {
> +		struct commit *c = lookup_commit(revs->repo, oid);
> +		if (!repo_parse_commit(revs->repo, c))
> +			object = (struct object *) c;
> +		else
> +			object = NULL;
> +	} else {
> +		object = parse_object(revs->repo, oid);
> +	}
> +
>  	if (!object) {
>  		if (revs->ignore_missing)
>  			return object;
Jonathan Tan Dec. 6, 2018, 11:36 p.m. UTC | #4
> > This is on sb/more-repo-in-api because I'm using the repo_parse_commit()
> > function.
> 
> This is a mere nicety, not strictly required.
> Before we had parse_commit(struct commit *) which would accomplish the
> same, (and we'd still have that afterwards as a #define falling back onto
> the_repository). As the function get_reference() is not the_repository safe
> as it contains a call to is_promisor_object() that is repository
> agnostic, I think
> it would be fair game to not depend on that series. I am not
> complaining, though.

Good point - I'll base the next version on master (and add a TODO
explaining which functions are not yet converted).

> AFAICT oid_object_info doesn't take advantage of the commit graph,
> but just looks up the object header, which is still less than completely
> parsing it. Then lookup_commit is overly strict, as it may return
> NULL as when there still is a type mismatch (I don't think a mismatch
> could happen here, as both rely on just the object store, and not the
> commit graph.), so this would be just defensive programming for
> the sake of it. I dunno.
> 
>     struct commit *c;
> 
>     if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT &&
>         (c = lookup_commit(revs->repo, oid)) &&
>         !repo_parse_commit(revs->repo, c))
>             object = (struct object *) c;
>     else
>         object = parse_object(revs->repo, oid);

I like this way better - I'll do it in the next version.
Jonathan Tan Dec. 6, 2018, 11:54 p.m. UTC | #5
Also CC-ing Stolee since I mention multi-pack indices at the end.

> This seems like a reasonable thing to do, but I have sort of a
> meta-comment. In several places we've started doing this kind of "if
> it's this type of object, do X, otherwise do Y" optimization (e.g.,
> handling large blobs for streaming).
> 
> And in the many cases we end up doubling the effort to do object
> lookups: here we do one lookup to get the type, and then if it's not a
> commit (or if we don't have a commit graph) we end up parsing it anyway.
> 
> I wonder if we could do better. In this instance, it might make sense
> to first see if we actually have a commit graph available (it might not
> have this object, of course, but at least we'd expect it to have most
> commits).

This makes sense - I thought I shouldn't mention the commit graph in the
code since it seems like a layering violation, but I felt the need to
mention commit graph in a comment, so maybe the need to mention commit
graph in the code is there too. Subsequently, maybe the lookup-for-type
could be replaced by a lookup-in-commit-graph (maybe by using
parse_commit_in_graph() directly), which should be at least slightly
faster.

> In general, it would be nice if we had a more incremental API
> for accessing objects: open, get metadata, then read the data. That
> would make these kinds of optimizations "free".

Would this be assuming that to read the data, you would (1) first need to
read the metadata, and (2) there would be no redundancy in reading the
two? It seems to me that for loose objects, you would want to perform
all your reads at once, since any read requires opening the file, and
for commit graphs, you just want to read what you want, since the
metadata and the data are in separate places.

> I don't have numbers for how much the extra lookups cost. The lookups
> are probably dwarfed by parse_object() in general, so even if we save
> only a few full object loads, it may be a win. It just seems a shame
> that we may be making the "slow" paths (when our type-specific check
> doesn't match) even slower.

I agree. I think it will always remain a tradeoff when we have multiple
data sources of objects (loose, packed, commit graph - and we can't
unify them all, since they each have their uses). Unless the multi-pack
index can reference commit graphs as well...then it could be our first
point of reference without introducing any inefficiencies...
Jeff King Dec. 7, 2018, 8:53 a.m. UTC | #6
On Thu, Dec 06, 2018 at 03:54:46PM -0800, Jonathan Tan wrote:

> This makes sense - I thought I shouldn't mention the commit graph in the
> code since it seems like a layering violation, but I felt the need to
> mention commit graph in a comment, so maybe the need to mention commit
> graph in the code is there too. Subsequently, maybe the lookup-for-type
> could be replaced by a lookup-in-commit-graph (maybe by using
> parse_commit_in_graph() directly), which should be at least slightly
> faster.

That makes more sense to me. If we don't have a commit graph at all,
it's a quick noop. If we do, we might binary search in the list of
commits for a non-commit. But that's strictly faster than finding the
object's type (which involves a binary search of a larger list, followed
by actually accessing the type info).

> > In general, it would be nice if we had a more incremental API
> > for accessing objects: open, get metadata, then read the data. That
> > would make these kinds of optimizations "free".
> 
> Would this be assuming that to read the data, you would (1) first need to
> read the metadata, and (2) there would be no redundancy in reading the
> two? It seems to me that for loose objects, you would want to perform
> all your reads at once, since any read requires opening the file, and
> for commit graphs, you just want to read what you want, since the
> metadata and the data are in separate places.

By metadata here, I don't mean the commit-graph data, but just the
object type and size. So I'm imagining an interface more like:

  - object_open() locates the object, and stores either the pack
    file/offset or a descriptor to a loose path in an opaque handle
    struct

  - object_size() and object_type() on that handle would do what you
    expect. For loose objects, these would parse the header (the
    equivalent of unpack_sha1_header()). For packed ones, they'd use the
    object header in the pack (and chase down the delta bits as needed).

  - object_contents() would return the full content

  - object_read() could sequentially read a subset of the file (this
    could replace the streaming interface we currently have)

We have most of the low-level bits for this already, if you poke into
what object_info_extended() is doing. We just don't have them packaged
in an interface which can persist across multiple calls.

With an interface like that, parse_object()'s large-blob check could be
something like the patch below.

But your case here is a bit more interesting. If we have a commit graph,
then we can avoid opening (or even finding!) the on-disk object at all.
So I actually think it makes sense to just check the commit-graph first,
as discussed above.

---
diff --git a/object.c b/object.c
index e54160550c..afce58c0bc 100644
--- a/object.c
+++ b/object.c
@@ -254,23 +254,31 @@ struct object *parse_object(struct repository *r, const struct object_id *oid)
 	const struct object_id *repl = lookup_replace_object(r, oid);
 	void *buffer;
 	struct object *obj;
+	struct object_handle oh;
 
 	obj = lookup_object(r, oid->hash);
 	if (obj && obj->parsed)
 		return obj;
 
-	if ((obj && obj->type == OBJ_BLOB && has_object_file(oid)) ||
-	    (!obj && has_object_file(oid) &&
-	     oid_object_info(r, oid, NULL) == OBJ_BLOB)) {
-		if (check_object_signature(repl, NULL, 0, NULL) < 0) {
+	if (object_open(&oh, oid) < 0)
+		return NULL; /* missing object */
+
+	if (object_type(&oh) == OBJ_BLOB) {
+		/* this will call object_read() on 4k chunks */
+		if (check_object_signature_stream(&oh, oid)) {
 			error(_("sha1 mismatch %s"), oid_to_hex(oid));
 			return NULL;
 		}
+		object_close(&oh); /* we don't care about contents */
 		parse_blob_buffer(lookup_blob(r, oid), NULL, 0);
 		return lookup_object(r, oid->hash);
 	}
 
-	buffer = read_object_file(oid, &type, &size);
+	type = object_type(&oh);
+	size = object_size(&oh);
+	buffer = object_contents(&oh);
+	object_close(&oh);
+
 	if (buffer) {
 		if (check_object_signature(repl, buffer, size, type_name(type)) < 0) {
 			free(buffer);
Derrick Stolee Dec. 7, 2018, 1:49 p.m. UTC | #7
On 12/6/2018 6:36 PM, Jonathan Tan wrote:
>> AFAICT oid_object_info doesn't take advantage of the commit graph,
>> but just looks up the object header, which is still less than completely
>> parsing it. Then lookup_commit is overly strict, as it may return
>> NULL as when there still is a type mismatch (I don't think a mismatch
>> could happen here, as both rely on just the object store, and not the
>> commit graph.), so this would be just defensive programming for
>> the sake of it. I dunno.
>>
>>      struct commit *c;
>>
>>      if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT &&
>>          (c = lookup_commit(revs->repo, oid)) &&
>>          !repo_parse_commit(revs->repo, c))
>>              object = (struct object *) c;
>>      else
>>          object = parse_object(revs->repo, oid);
> I like this way better - I'll do it in the next version.

If we do _not_ have a commit-graph or if the commit-graph does not have
that commit, this will have the same performance problem, right?

Should we instead create a direct dependence on the commit-graph, and try
to parse the oid from the graph directly? If it succeeds, then we learn
that the object is a commit, in addition to all of the parsing work. This
means we could avoid oid_object_info() loading data if we succeed. We
would fall back to parse_object() if it fails.

I was thinking this should be a simple API call to parse_commit_in_graph(),
but that requires a struct commit filled with an oid, which is not the
best idea if we don't actually know it is a commit yet.

The approach I recommend would then be more detailed:

1. Modify find_commit_in_graph() to take a struct object_id instead of a
    struct commit. This helps find the integer position in the graph. That
    position can be used in fill_commit_in_graph() to load the commit
    contents. Keep find_commit_in_graph() static as it should not be a
    public function.

2. Create a public function with prototype

struct commit *try_parse_commit_from_graph(struct repository *r, struct 
object_id *oid)

    that returns a commit struct fully parsed if and only if the repository
    has that oid. It can call find_commit_in_graph(), then 
lookup_commit() and
    fill_commit_in_graph() to create the commit and parse the data.

3. In replace of the snippet above, do:

     struct commit *c;

     if ((c = try_parse_commit_from_graph(revs->repo, oid))
         object = (struct object *)c;
     else
         object = parse_object(revs->repo, oid);

A similar pattern _could_ be used in parse_object(), but I don't recommend
doing this pattern unless we have a reasonable suspicion that we are going
to parse commits more often than other objects. (It adds an O(log(# 
commits))
binary search to each object.)

A final thought: consider making this "try the commit graph first, but fall
back to parse_object()" a library function with a name like

     struct object *parse_probably_commit(struct repository *r, struct 
object_id *oid)

so other paths that are parsing a lot of commits (but also maybe tags) could
use the logic.

Thanks!
-Stolee
SZEDER Gábor Jan. 25, 2019, 3:33 p.m. UTC | #8
On Tue, Dec 04, 2018 at 02:42:38PM -0800, Jonathan Tan wrote:
> When fetching into a repository, a connectivity check is first made by
> check_exist_and_connected() in builtin/fetch.c that runs:
> 
>   git rev-list --objects --stdin --not --all --quiet <(list of objects)
> 
> If the client repository has many refs, this command can be slow,
> regardless of the nature of the server repository or what is being
> fetched. A profiler reveals that most of the time is spent in
> setup_revisions() (approx. 60/63), and of the time spent in
> setup_revisions(), most of it is spent in parse_object() (approx.
> 49/60). This is because setup_revisions() parses the target of every ref
> (from "--all"), and parse_object() reads the buffer of the object.
> 
> Reading the buffer is unnecessary if the repository has a commit graph
> and if the ref points to a commit (which is typically the case). This
> patch uses the commit graph wherever possible; on my computer, when I
> run the above command with a list of 1 object on a many-ref repository,
> I get a speedup from 1.8s to 1.0s.
> 
> Another way to accomplish this effect would be to modify parse_object()
> to use the commit graph if possible; however, I did not want to change
> parse_object()'s current behavior of always checking the object
> signature of the returned object.
> 
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> This is on sb/more-repo-in-api because I'm using the repo_parse_commit()
> function.
> 
> A colleague noticed this issue when handling a mirror clone.
> 
> Looking at the bigger picture, the speed of the connectivity check
> during a fetch might be further improved by passing only the negotiation
> tips (obtained through --negotiation-tip) instead of "--all". This patch
> just handles the low-hanging fruit first.
> ---

I stumbled upon a regression that bisects down to this commit
ec0c5798ee (revision: use commit graph in get_reference(),
2018-12-04):

  $ ~/src/git/bin-wrappers/git version
  git version 2.19.1.566.gec0c5798ee
  $ ~/src/git/bin-wrappers/git commit-graph write --reachable
  Computing commit graph generation numbers: 100% (58994/58994), done.
  $ ~/src/git/bin-wrappers/git status
  HEAD detached at origin/pu
  nothing to commit, working tree clean
  $ ~/src/git/bin-wrappers/git -c core.commitGraph=false describe --dirty
  v2.20.1-833-gcb3b9e7ee3
  $ ~/src/git/bin-wrappers/git -c core.commitGraph=true describe --dirty
  v2.20.1-833-gcb3b9e7ee3

It's all good with only '--dirty', but watch this with '--all
--dirty':

  $ ~/src/git/bin-wrappers/git -c core.commitGraph=false describe --all --dirty
  remotes/origin/pu
  $ ~/src/git/bin-wrappers/git -c core.commitGraph=true describe --all --dirty
  remotes/origin/pu-dirty

IOW if the commit-graph is enabled, then my clean worktree is reported
as dirty.

And to add a cherry on top of my confusion:

  $ git checkout v2.20.0
  Previous HEAD position was cb3b9e7ee3 Merge branch 'jh/trace2' into pu
  HEAD is now at 5d826e9729 Git 2.20
  $ ~/src/git/bin-wrappers/git -c core.commitGraph=true describe --all --dirty
  tags/v2.20.0

It's clean even with '--all' and commit-graph enabled, but watch this:

  $ git branch this-will-screw-it-up
  $ ~/src/git/bin-wrappers/git -c core.commitGraph=true describe --all --dirty
  tags/v2.20.0-dirty

Have fun! :)


>  revision.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/revision.c b/revision.c
> index b5108b75ab..e7da2c57ab 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -212,7 +212,20 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
>  {
>  	struct object *object;
>  
> -	object = parse_object(revs->repo, oid);
> +	/*
> +	 * If the repository has commit graphs, repo_parse_commit() avoids
> +	 * reading the object buffer, so use it whenever possible.
> +	 */
> +	if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT) {
> +		struct commit *c = lookup_commit(revs->repo, oid);
> +		if (!repo_parse_commit(revs->repo, c))
> +			object = (struct object *) c;
> +		else
> +			object = NULL;
> +	} else {
> +		object = parse_object(revs->repo, oid);
> +	}
> +
>  	if (!object) {
>  		if (revs->ignore_missing)
>  			return object;
> -- 
> 2.19.0.271.gfe8321ec05.dirty
>
Stefan Beller Jan. 25, 2019, 7:56 p.m. UTC | #9
> Have fun! :)

$ git gc
...
Computing commit graph generation numbers: 100% (164264/164264), done.
$ ./git version
git version 2.20.1.775.g2313a6b87fe.dirty
# pu + one commit addressing
# https://public-inbox.org/git/CAGZ79kaUg3NTRPRi5mLk6ag87iDB_Ltq_kEiLwZ2HGZ+-Vsd8w@mail.gmail.com/

$ ./git -c core.commitGraph=false describe --dirty --all
remotes/gitgitgadget/pu-1-g03745a36e6
$ ./git -c core.commitGraph=true describe --dirty --all
remotes/gitgitgadget/pu-1-g03745a36e6
$ ./git -c core.commitGraph=true describe --dirty
v2.20.1-776-g03745a36e6
$ ./git -c core.commitGraph=false describe --dirty
v2.20.1-776-g03745a36e6

it looks like it is working correctly here?
Or did I miss some hint as in how to setup the reproduction properly?
Jonathan Tan Jan. 25, 2019, 10:01 p.m. UTC | #10
> > Have fun! :)
> 
> $ git gc
> ...
> Computing commit graph generation numbers: 100% (164264/164264), done.
> $ ./git version
> git version 2.20.1.775.g2313a6b87fe.dirty
> # pu + one commit addressing
> # https://public-inbox.org/git/CAGZ79kaUg3NTRPRi5mLk6ag87iDB_Ltq_kEiLwZ2HGZ+-Vsd8w@mail.gmail.com/
> 
> $ ./git -c core.commitGraph=false describe --dirty --all
> remotes/gitgitgadget/pu-1-g03745a36e6
> $ ./git -c core.commitGraph=true describe --dirty --all
> remotes/gitgitgadget/pu-1-g03745a36e6
> $ ./git -c core.commitGraph=true describe --dirty
> v2.20.1-776-g03745a36e6
> $ ./git -c core.commitGraph=false describe --dirty
> v2.20.1-776-g03745a36e6
> 
> it looks like it is working correctly here?
> Or did I miss some hint as in how to setup the reproduction properly?

I could reproduce it with version ec0c5798ee (as stated in Szeder's
original email) - as stated by Szeder, it doesn't work, but its parent
does. I'm looking into this, but any help is appreciated.
SZEDER Gábor Jan. 25, 2019, 10:14 p.m. UTC | #11
On Fri, Jan 25, 2019 at 11:56:38AM -0800, Stefan Beller wrote:
> > Have fun! :)
> 
> $ git gc
> ...
> Computing commit graph generation numbers: 100% (164264/164264), done.
> $ ./git version
> git version 2.20.1.775.g2313a6b87fe.dirty
> # pu + one commit addressing
> # https://public-inbox.org/git/CAGZ79kaUg3NTRPRi5mLk6ag87iDB_Ltq_kEiLwZ2HGZ+-Vsd8w@mail.gmail.com/
> 
> $ ./git -c core.commitGraph=false describe --dirty --all
> remotes/gitgitgadget/pu-1-g03745a36e6
> $ ./git -c core.commitGraph=true describe --dirty --all
> remotes/gitgitgadget/pu-1-g03745a36e6
> $ ./git -c core.commitGraph=true describe --dirty
> v2.20.1-776-g03745a36e6
> $ ./git -c core.commitGraph=false describe --dirty
> v2.20.1-776-g03745a36e6
> 
> it looks like it is working correctly here?
> Or did I miss some hint as in how to setup the reproduction properly?

How many refs are pointing to the commits you tried to describe?  In
the git repo, with an all-encompassing commit-graph it seems to be
important that more than one refs point there.  I could reproduce the
issue in a fresh git.git clone with Git built from commit 2313a6b87fe:

  $ git clone https://github.com/git/git
  Cloning into 'git'...
  <...>
  $ git commit-graph write --reachable
  Computing commit graph generation numbers: 100% (56722/56722), done.
  # 'HOME=.' makes sure that this command doesn't read my global
  # gitconfig.
  $ HOME=. ~/src/git/git describe --all --dirty
  heads/master-dirty
  $ git checkout origin/pu 
  HEAD is now at cb3b9e7ee3 Merge branch 'jh/trace2' into pu
  $ HOME=. ~/src/git/git -c core.commitGraph=true describe --all --dirty
  remotes/origin/pu
  $ git branch a-second-ref-pointing-at-pu buzz ~/src/tmp/git
  $ HOME=. ~/src/git/git -c core.commitGraph=true describe --all --dirty
  heads/a-second-ref-pointing-at-pu-dirty

I could also reproduce it in other repositories lying around here, but
could not manage to reproduce it in a minimal repository.

The smallest I could get is the test script below, where the last test
fails, i.e. the clean worktree is described as '-dirty', when the
to-be-described HEAD is not in the commit-graph.  I suspect this is
the same issue, because it bisects down to this same commit.

  --- >8 ---

Subject: [PATCH] test

---
 t/t9999-test.sh | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)
 create mode 100755 t/t9999-test.sh

diff --git a/t/t9999-test.sh b/t/t9999-test.sh
new file mode 100755
index 0000000000..cd1286e157
--- /dev/null
+++ b/t/t9999-test.sh
@@ -0,0 +1,26 @@
+#!/bin/sh
+
+test_description='test'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	test_commit one &&
+	test_commit two &&
+	# Two refs point there.
+	git for-each-ref --points-at=two &&
+	git config core.commitGraph true
+'
+
+test_expect_success 'full commit-graph' '
+	git commit-graph write --reachable &&
+	verbose test "$(git describe --all --dirty)" = tags/two
+'
+
+test_expect_success 'partial commit-graph, described HEAD is not in C-G' '
+	git rev-parse one | git commit-graph write --stdin-commits &&
+	git status &&
+	verbose test "$(git describe --all --dirty)" = tags/two
+'
+
+test_done
SZEDER Gábor Jan. 25, 2019, 10:21 p.m. UTC | #12
On Fri, Jan 25, 2019 at 11:14:14PM +0100, SZEDER Gábor wrote:
> On Fri, Jan 25, 2019 at 11:56:38AM -0800, Stefan Beller wrote:
> > > Have fun! :)
> > 
> > $ git gc
> > ...
> > Computing commit graph generation numbers: 100% (164264/164264), done.
> > $ ./git version
> > git version 2.20.1.775.g2313a6b87fe.dirty
> > # pu + one commit addressing
> > # https://public-inbox.org/git/CAGZ79kaUg3NTRPRi5mLk6ag87iDB_Ltq_kEiLwZ2HGZ+-Vsd8w@mail.gmail.com/
> > 
> > $ ./git -c core.commitGraph=false describe --dirty --all
> > remotes/gitgitgadget/pu-1-g03745a36e6
> > $ ./git -c core.commitGraph=true describe --dirty --all
> > remotes/gitgitgadget/pu-1-g03745a36e6
> > $ ./git -c core.commitGraph=true describe --dirty
> > v2.20.1-776-g03745a36e6
> > $ ./git -c core.commitGraph=false describe --dirty
> > v2.20.1-776-g03745a36e6
> > 
> > it looks like it is working correctly here?
> > Or did I miss some hint as in how to setup the reproduction properly?
> 
> How many refs are pointing to the commits you tried to describe?  In
> the git repo, with an all-encompassing commit-graph it seems to be
> important that more than one refs point there.

Erm, let me try to clarify this sentence.

In general it seems to be important that more than one refs point to
the described HEAD.  In the git repo (and in other non-toy repos) I
could reproduce the issue with a commit-graph file containing all
commits, but in a minimal repo only when the described HEAD was not in
the commit-graph.

> I could reproduce the
> issue in a fresh git.git clone with Git built from commit 2313a6b87fe:
> 
>   $ git clone https://github.com/git/git
>   Cloning into 'git'...
>   <...>
>   $ git commit-graph write --reachable
>   Computing commit graph generation numbers: 100% (56722/56722), done.
>   # 'HOME=.' makes sure that this command doesn't read my global
>   # gitconfig.
>   $ HOME=. ~/src/git/git describe --all --dirty
>   heads/master-dirty
>   $ git checkout origin/pu 
>   HEAD is now at cb3b9e7ee3 Merge branch 'jh/trace2' into pu
>   $ HOME=. ~/src/git/git -c core.commitGraph=true describe --all --dirty
>   remotes/origin/pu
>   $ git branch a-second-ref-pointing-at-pu buzz ~/src/tmp/git
>   $ HOME=. ~/src/git/git -c core.commitGraph=true describe --all --dirty
>   heads/a-second-ref-pointing-at-pu-dirty
> 
> I could also reproduce it in other repositories lying around here, but
> could not manage to reproduce it in a minimal repository.
> 
> The smallest I could get is the test script below, where the last test
> fails, i.e. the clean worktree is described as '-dirty', when the
> to-be-described HEAD is not in the commit-graph.  I suspect this is
> the same issue, because it bisects down to this same commit.
> 
>   --- >8 ---
> 
> Subject: [PATCH] test
> 
> ---
>  t/t9999-test.sh | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
>  create mode 100755 t/t9999-test.sh
> 
> diff --git a/t/t9999-test.sh b/t/t9999-test.sh
> new file mode 100755
> index 0000000000..cd1286e157
> --- /dev/null
> +++ b/t/t9999-test.sh
> @@ -0,0 +1,26 @@
> +#!/bin/sh
> +
> +test_description='test'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'setup' '
> +	test_commit one &&
> +	test_commit two &&
> +	# Two refs point there.
> +	git for-each-ref --points-at=two &&
> +	git config core.commitGraph true
> +'
> +
> +test_expect_success 'full commit-graph' '
> +	git commit-graph write --reachable &&
> +	verbose test "$(git describe --all --dirty)" = tags/two
> +'
> +
> +test_expect_success 'partial commit-graph, described HEAD is not in C-G' '
> +	git rev-parse one | git commit-graph write --stdin-commits &&
> +	git status &&
> +	verbose test "$(git describe --all --dirty)" = tags/two
> +'
> +
> +test_done
> -- 
> 2.20.1.642.gc55a771460
>
diff mbox series

Patch

diff --git a/revision.c b/revision.c
index b5108b75ab..e7da2c57ab 100644
--- a/revision.c
+++ b/revision.c
@@ -212,7 +212,20 @@  static struct object *get_reference(struct rev_info *revs, const char *name,
 {
 	struct object *object;
 
-	object = parse_object(revs->repo, oid);
+	/*
+	 * If the repository has commit graphs, repo_parse_commit() avoids
+	 * reading the object buffer, so use it whenever possible.
+	 */
+	if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT) {
+		struct commit *c = lookup_commit(revs->repo, oid);
+		if (!repo_parse_commit(revs->repo, c))
+			object = (struct object *) c;
+		else
+			object = NULL;
+	} else {
+		object = parse_object(revs->repo, oid);
+	}
+
 	if (!object) {
 		if (revs->ignore_missing)
 			return object;