Message ID | 08519b8ab6f395cffbcd5e530bfba6aaf64241a2.1628085347.git.ps@pks.im (mailing list archive)
State      | Accepted
Commit     | 3e5e6c6e94f09973d3a72049aad09fd2131a3648
Series     | fetch-pack: speed up loading of refs via commit graph
On 8/4/2021 9:56 AM, Patrick Steinhardt wrote:
> When doing reference negotiation, git-fetch-pack(1) is loading all refs
> from disk in order to determine which commits it has in common with the
> remote repository. This can be quite expensive in repositories with many
> references though: in a real-world repository with around 2.2 million
> refs, fetching a single commit by its ID takes around 44 seconds.
>
> Dominating the loading time is decompression and parsing of the objects
> which are referenced by commits. Given the fact that we only care about
> commits (or tags which can be peeled to one) in this context, there is
> thus an easy performance win by switching the parsing logic to make use
> of the commit graph in case we have one available.

Nice find!

> Like this, we avoid
> hitting the object database to parse these commits but instead only load
> them from the commit-graph. This results in a significant performance
> boost when executing git-fetch in said repository with 2.2 million refs:
>
> Benchmark #1: HEAD~: git fetch $remote $commit
>   Time (mean ± σ):     44.168 s ±  0.341 s    [User: 42.985 s, System: 1.106 s]
>   Range (min … max):   43.565 s … 44.577 s    10 runs
>
> Benchmark #2: HEAD: git fetch $remote $commit
>   Time (mean ± σ):     19.498 s ±  0.724 s    [User: 18.751 s, System: 0.690 s]
>   Range (min … max):   18.629 s … 20.454 s    10 runs
>
> Summary
>   'HEAD: git fetch $remote $commit' ran
>     2.27 ± 0.09 times faster than 'HEAD~: git fetch $remote $commit'

That's a great improvement. I'm sure that the remaining time is
dominated by ref parsing.

> -	if (type == OBJ_COMMIT)
> -		return (struct commit *) parse_object(the_repository, oid);
> +
> +	if (type == OBJ_COMMIT) {
> +		struct commit *commit = lookup_commit(the_repository, oid);
> +		if (!commit || repo_parse_commit(the_repository, commit))
> +			return NULL;
> +		return commit;
> +	}
> +

And this change looks obviously correct to me. I'm glad that the
implementation actually doesn't care about the commit-graph, but instead
cares about using the "standard" parsing approach instead of
side-stepping the commit-graph via parse_object().

I took a quick look for other instances where we use parse_object() but
also know to expect a commit, but did not find one.

Thanks,
-Stolee
Patrick Steinhardt <ps@pks.im> writes:

> -	if (type == OBJ_COMMIT)
> -		return (struct commit *) parse_object(the_repository, oid);
> +
> +	if (type == OBJ_COMMIT) {
> +		struct commit *commit = lookup_commit(the_repository, oid);
> +		if (!commit || repo_parse_commit(the_repository, commit))
> +			return NULL;
> +		return commit;
> +	}

OK. Asking parse_object() to turn an object name into an in-core
instance of the object is pretty much the standard and generic idiom,
but in this codepath we have already asked for the type and know it must
be a commit, so asking the more specialized parse_commit() makes quite a
lot of sense.

Thanks.
On Wed, Aug 04, 2021 at 03:56:11PM +0200, Patrick Steinhardt wrote:

> When doing reference negotiation, git-fetch-pack(1) is loading all refs
> from disk in order to determine which commits it has in common with the
> remote repository. This can be quite expensive in repositories with many
> references though: in a real-world repository with around 2.2 million
> refs, fetching a single commit by its ID takes around 44 seconds.
>
> Dominating the loading time is decompression and parsing of the objects
> which are referenced by commits. Given the fact that we only care about
> commits (or tags which can be peeled to one) in this context, there is
> thus an easy performance win by switching the parsing logic to make use
> of the commit graph in case we have one available. Like this, we avoid
> hitting the object database to parse these commits but instead only load
> them from the commit-graph. This results in a significant performance
> boost when executing git-fetch in said repository with 2.2 million refs:
>
> Benchmark #1: HEAD~: git fetch $remote $commit
>   Time (mean ± σ):     44.168 s ±  0.341 s    [User: 42.985 s, System: 1.106 s]
>   Range (min … max):   43.565 s … 44.577 s    10 runs
>
> Benchmark #2: HEAD: git fetch $remote $commit
>   Time (mean ± σ):     19.498 s ±  0.724 s    [User: 18.751 s, System: 0.690 s]
>   Range (min … max):   18.629 s … 20.454 s    10 runs
>
> Summary
>   'HEAD: git fetch $remote $commit' ran
>     2.27 ± 0.09 times faster than 'HEAD~: git fetch $remote $commit'

Nice. I've sometimes wondered if parse_object() should be doing this
optimization itself. Though we'd possibly still want callers (like this
one) to give us more hints, since we already know the type is
OBJ_COMMIT. Whereas parse_object() would have to discover that itself
(though we already incur the extra type lookup there to handle blobs).

I wonder where the remaining 20s is going. Do you have a lot of tags in
your repository? We'll still parse all of those, which could be
expensive. There might be some benefit to using peel_iterated_ref(),
which will make use of packed-refs' peel hints, but:

  - you'd want to double check that we always call this during ref
    iteration (it looks like we do, and I think peel_iterated_ref()
    falls back to a normal peel otherwise)

  - for a tag-of-tag-of-X, that will give us the complete peel to X. But
    it looks like deref_without_lazy_fetch() marks intermediate tags
    with the COMPLETE flag, too. I'm not sure how important that is
    (i.e., is it necessary for correctness, or just an optimization, in
    which case we might be better off guessing that tags are
    single-layer, as it's by far the common case).

If we don't go that route, there's another possible speedup: after
parsing a tag, the type of tag->tagged (if it is not NULL) will be known
from the tag's contents, and we can avoid the oid_object_info_extended()
type lookup. It might need some extra surgery to convince the tag-parse
not to fetch promisor objects, though.

I'm not sure it would make that big a difference, though. If we save one
type-lookup per parsed tag, then the tag parsing is likely to dwarf it.
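To make the tag->tagged idea concrete, here is a rough sketch modeled on
the deref_without_lazy_fetch() loop as it existed at the time. It is
illustrative only, not code from the thread; the _sketch suffix marks it
as hypothetical:

/*
 * Rough sketch (not from the thread): peel tags using the type already
 * recorded in the tag object, so oid_object_info_extended() runs only
 * once per chain. Two caveats from the thread apply: parse_object() may
 * lazily fetch promisor objects, and a tag's recorded type can lie; the
 * repo_parse_commit() failure path at the end catches the latter.
 */
static struct commit *deref_without_lazy_fetch_sketch(const struct object_id *oid,
						      int mark_tags_complete)
{
	enum object_type type;
	struct object_info info = { .typep = &type };

	if (oid_object_info_extended(the_repository, oid, &info,
				     OBJECT_INFO_SKIP_FETCH_OBJECT |
				     OBJECT_INFO_QUICK))
		return NULL;

	while (type == OBJ_TAG) {
		struct tag *tag = (struct tag *)
			parse_object(the_repository, oid);

		if (!tag || !tag->tagged)
			return NULL;
		if (mark_tags_complete)
			tag->object.flags |= COMPLETE;
		oid = &tag->tagged->oid;
		/* the tag names its target's type; skip the odb lookup */
		type = tag->tagged->type;
	}

	if (type == OBJ_COMMIT) {
		struct commit *commit = lookup_commit(the_repository, oid);
		if (!commit || repo_parse_commit(the_repository, commit))
			return NULL;
		return commit;
	}
	return NULL;
}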
> diff --git a/fetch-pack.c b/fetch-pack.c
> index b0c7be717c..0bf7ed7e47 100644
> --- a/fetch-pack.c
> +++ b/fetch-pack.c
> @@ -137,8 +137,14 @@ static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
>  			break;
>  		}
>  	}
> -	if (type == OBJ_COMMIT)
> -		return (struct commit *) parse_object(the_repository, oid);
> +
> +	if (type == OBJ_COMMIT) {
> +		struct commit *commit = lookup_commit(the_repository, oid);
> +		if (!commit || repo_parse_commit(the_repository, commit))
> +			return NULL;
> +		return commit;
> +	}

Looks correct. You're using lookup_commit(), so we'll auto-create the
struct as necessary. If there's any kind of type mismatch (say,
previously we saw that oid as a non-commit), we'll get NULL there and
bail, which makes sense.

I think the original code could produce undefined behavior there if
parse_object() found something other than "type", though in practice
that is quite unlikely (since oid_object_info() would have just gone to
the on-disk odb to get the type itself).

-Peff
Jeff King <peff@peff.net> writes:

> Nice. I've sometimes wondered if parse_object() should be doing this
> optimization itself. Though we'd possibly still want callers (like this
> one) to give us more hints, since we already know the type is
> OBJ_COMMIT. Whereas parse_object() would have to discover that itself
> (though we already incur the extra type lookup there to handle blobs).

Ahh, you read one step further than I did ;-) Yes, if we are already
inspecting the type there, doing this optimization inside parse_object()
becomes much easier to justify.
On Wed, Aug 04, 2021 at 04:59:40PM -0400, Jeff King wrote:
> On Wed, Aug 04, 2021 at 03:56:11PM +0200, Patrick Steinhardt wrote:
>
> > When doing reference negotiation, git-fetch-pack(1) is loading all refs
> > from disk in order to determine which commits it has in common with the
> > remote repository. This can be quite expensive in repositories with many
> > references though: in a real-world repository with around 2.2 million
> > refs, fetching a single commit by its ID takes around 44 seconds.
> >
> > Dominating the loading time is decompression and parsing of the objects
> > which are referenced by commits. Given the fact that we only care about
> > commits (or tags which can be peeled to one) in this context, there is
> > thus an easy performance win by switching the parsing logic to make use
> > of the commit graph in case we have one available. Like this, we avoid
> > hitting the object database to parse these commits but instead only load
> > them from the commit-graph. This results in a significant performance
> > boost when executing git-fetch in said repository with 2.2 million refs:
> >
> > Benchmark #1: HEAD~: git fetch $remote $commit
> >   Time (mean ± σ):     44.168 s ±  0.341 s    [User: 42.985 s, System: 1.106 s]
> >   Range (min … max):   43.565 s … 44.577 s    10 runs
> >
> > Benchmark #2: HEAD: git fetch $remote $commit
> >   Time (mean ± σ):     19.498 s ±  0.724 s    [User: 18.751 s, System: 0.690 s]
> >   Range (min … max):   18.629 s … 20.454 s    10 runs
> >
> > Summary
> >   'HEAD: git fetch $remote $commit' ran
> >     2.27 ± 0.09 times faster than 'HEAD~: git fetch $remote $commit'
>
> Nice. I've sometimes wondered if parse_object() should be doing this
> optimization itself. Though we'd possibly still want callers (like this
> one) to give us more hints, since we already know the type is
> OBJ_COMMIT. Whereas parse_object() would have to discover that itself
> (though we already incur the extra type lookup there to handle blobs).

Would certainly make it much harder to hit this pitfall. The only thing
one needs to be cautious about is that we need to somehow assert that
the object still exists in our ODB. Otherwise you may look up a commit
via the commit-graph even though the commit doesn't exist anymore.

> Do you have a lot of tags in your repository?

No, it's only about 2000 tags.

> I wonder where the remaining 20s is going.

Rebasing this commit on top of my git-rev-list(1) series [1] for the
connectivity check gives another 25% speedup, going down from 20s to 14s
(numbers are a bit different given that I'm on a different machine right
now). From here on, it's multiple things which take time:

  - 20% of the time is spent sorting the refs in
    `mark_complete_and_common_ref()`. This time around I feel less
    comfortable just disabling sorting, given that it may impact
    correctness.

  - 30% of the time is spent looking up object types via
    `oid_object_info_extended()`, where 75% of these lookups come from
    `deref_without_lazy_fetch()`. This can be improved a bit by doing
    the `lookup_unknown_object()` dance, buying a modest speedup of
    ~8%. But this again has memory tradeoffs, given that we must
    allocate the object such that all types would fit.

Other than that I don't see any obvious things in the flame graphs. In
case anybody is interested, I've posted flame graphs in our GitLab issue
at [2], with the state before this patch, with this patch, and in
combination with [1].

[1]: http://public-inbox.org/git/cover.1627896460.git.ps@pks.im/
[2]: https://gitlab.com/gitlab-org/gitlab/-/issues/336657#note_642957933

> - you'd want to double check that we always call this during ref
>   iteration (it looks like we do, and I think peel_iterated_ref()
>   falls back to a normal peel otherwise)
>
> - for a tag-of-tag-of-X, that will give us the complete peel to X. But
>   it looks like deref_without_lazy_fetch() marks intermediate tags
>   with the COMPLETE flag, too. I'm not sure how important that is
>   (i.e., is it necessary for correctness, or just an optimization, in
>   which case we might be better off guessing that tags are
>   single-layer, as it's by far the common case).
>
> If we don't go that route, there's another possible speedup: after
> parsing a tag, the type of tag->tagged (if it is not NULL) will be known
> from the tag's contents, and we can avoid the oid_object_info_extended()
> type lookup. It might need some extra surgery to convince the tag-parse
> not to fetch promisor objects, though.
>
> I'm not sure it would make that big a difference, though. If we save one
> type-lookup per parsed tag, then the tag parsing is likely to dwarf it.

Yeah, I'd assume the same. And in any case, our repo doesn't really have
any problems with tags given that there's so few of them. So I wouldn't
really have the data to back up any performance improvements here.

Patrick
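Taken together, the parse_object() idea and the ODB-existence caveat
might look like the rough sketch below. This assumes a hypothetical
commit_graph_contains() helper that merely reports whether an oid has a
commit-graph entry; no such public API existed at the time:

/*
 * Rough sketch, not a real patch: let parse_object() itself take the
 * commit-graph fast path. commit_graph_contains() is hypothetical, and
 * it must be checked *before* lookup_commit(), since lookup_commit()
 * would otherwise pre-create an in-core commit for an oid that may turn
 * out to be a blob or tree.
 */
struct object *parse_object_sketch(struct repository *r,
				   const struct object_id *oid)
{
	if (commit_graph_contains(r, oid)) {	/* hypothetical helper */
		struct commit *commit = lookup_commit(r, oid);

		/* Guard against commits pruned from the ODB but still
		 * listed in a stale commit-graph file. */
		if (!commit || repo_parse_commit(r, commit) ||
		    !repo_has_object_file(r, oid))
			return NULL;
		return &commit->object;
	}

	return parse_object(r, oid);	/* unchanged slow path */
}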
On Thu, Aug 05, 2021 at 08:04:53AM +0200, Patrick Steinhardt wrote:
> On Wed, Aug 04, 2021 at 04:59:40PM -0400, Jeff King wrote:
> > On Wed, Aug 04, 2021 at 03:56:11PM +0200, Patrick Steinhardt wrote:

[snip]

> > I wonder where the remaining 20s is going.
>
> Rebasing this commit on top of my git-rev-list(1) series [1] for the
> connectivity check gives another 25% speedup, going down from 20s to 14s
> (numbers are a bit different given that I'm on a different machine right
> now). From here on, it's multiple things which take time:
>
>   - 20% of the time is spent sorting the refs in
>     `mark_complete_and_common_ref()`. This time around I feel less
>     comfortable just disabling sorting, given that it may impact
>     correctness.
>
>   - 30% of the time is spent looking up object types via
>     `oid_object_info_extended()`, where 75% of these lookups come from
>     `deref_without_lazy_fetch()`. This can be improved a bit by doing
>     the `lookup_unknown_object()` dance, buying a modest speedup of
>     ~8%. But this again has memory tradeoffs, given that we must
>     allocate the object such that all types would fit.
>
> Other than that I don't see any obvious things in the flame graphs. In
> case anybody is interested, I've posted flame graphs in our GitLab issue
> at [2], with the state before this patch, with this patch, and in
> combination with [1].
>
> [1]: http://public-inbox.org/git/cover.1627896460.git.ps@pks.im/
> [2]: https://gitlab.com/gitlab-org/gitlab/-/issues/336657#note_642957933

I've put some more time into this. If rebased on top of v4 of [1], then
we can also use `parse_commit_in_graph_gently()` to further speed this
up from 15.8 seconds to 11.6 seconds with the patch below. It's the same
memory/speed tradeoff as I'm doing in [1].

I guess I'd still like to treat both series separately for now, given
that [1] is more involved compared to this patch series here. I'll then
do a follow-up when (if?) both series have landed.

Patrick

diff --git a/fetch-pack.c b/fetch-pack.c
index 0bf7ed7e47..cc8b2ffa6c 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -119,6 +119,17 @@ static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
 {
 	enum object_type type;
 	struct object_info info = { .typep = &type };
+	struct object *object = lookup_unknown_object(the_repository, oid);
+
+	if (object->type == OBJ_COMMIT)
+		return (struct commit *) object;
+
+	if (object->type == OBJ_NONE &&
+	    parse_commit_in_graph_gently(the_repository, object) == 0) {
+		if (!repo_has_object_file(the_repository, oid))
+			return NULL;
+		return (struct commit *) object;
+	}
 
 	while (1) {
 		if (oid_object_info_extended(the_repository, oid, &info,
Patrick Steinhardt <ps@pks.im> writes:

> I've put some more time into this. If rebased on top of v4 of [1], then
> we can also use `parse_commit_in_graph_gently()` to further speed this
> up from 15.8 seconds to 11.6 seconds with the patch below. It's the same
> memory/speed tradeoff as I'm doing in [1].

Sounds good.

> I guess I'd still like to treat both series separately for now, given
> that [1] is more involved compared to this patch series here. I'll then
> do a follow-up when (if?) both series have landed.

During the pre-release freeze (cf. https://tinyurl.com/gitCal), new
topics will not "land" anywhere, so you have two weeks or so to
stabilize the base topic. I would prefer not to have to worry about
inter-dependent multiple topics anyway, especially the ones that are not
relevant to the upcoming release during a pre-release period.

Thanks.
On Wed, Aug 04 2021, Jeff King wrote:

> [...]
> If we don't go that route, there's another possible speedup: after
> parsing a tag, the type of tag->tagged (if it is not NULL) will be known
> from the tag's contents, and we can avoid the oid_object_info_extended()
> type lookup. It might need some extra surgery to convince the tag-parse
> not to fetch promisor objects, though.
>
> I'm not sure it would make that big a difference, though. If we save one
> type-lookup per parsed tag, then the tag parsing is likely to dwarf it.

Except that the tag type can be a lie, as our tests and some recent
object/fsck series (not yet integrated) from me show. It can be relied
upon for >99.99% of cases, but code like that needs to make sure it's
not segfaulting or whatever if that relationship doesn't hold.
On Thu, Aug 05, 2021 at 09:05:59PM +0200, Ævar Arnfjörð Bjarmason wrote:

> On Wed, Aug 04 2021, Jeff King wrote:
>
> > [...]
> > If we don't go that route, there's another possible speedup: after
> > parsing a tag, the type of tag->tagged (if it is not NULL) will be known
> > from the tag's contents, and we can avoid the oid_object_info_extended()
> > type lookup. It might need some extra surgery to convince the tag-parse
> > not to fetch promisor objects, though.
> >
> > I'm not sure it would make that big a difference, though. If we save one
> > type-lookup per parsed tag, then the tag parsing is likely to dwarf it.
>
> Except that the tag type can be a lie, as our tests and some recent
> object/fsck series (not yet integrated) from me show. It can be relied
> upon for >99.99% of cases, but code like that needs to make sure it's
> not segfaulting or whatever if that relationship doesn't hold.

Yes, but we'd discover that when we called repo_parse_commit(), and
Patrick's patch already handles failure there.

-Peff
On Thu, Aug 05, 2021 at 08:04:51AM +0200, Patrick Steinhardt wrote:

> > Nice. I've sometimes wondered if parse_object() should be doing this
> > optimization itself. Though we'd possibly still want callers (like this
> > one) to give us more hints, since we already know the type is
> > OBJ_COMMIT. Whereas parse_object() would have to discover that itself
> > (though we already incur the extra type lookup there to handle blobs).
>
> Would certainly make it much harder to hit this pitfall. The only thing
> one needs to be cautious about is that we need to somehow assert that
> the object still exists in our ODB. Otherwise you may look up a commit
> via the commit-graph even though the commit doesn't exist anymore.

True, though what I really wonder is what exactly people are depending
on parse_object() for. I.e., how many callers care about making sure it
exists, and how many would be happy to have things magically go faster?
I'm not sure of a good way to answer that question, but I agree it's the
sticking point on pushing the optimization down to that lower level.

(In case it's not clear, this is all a question for the future, and
shouldn't hold up your patch.)

> > Do you have a lot of tags in your repository?
>
> No, it's only about 2000 tags.

Hmm. In that case, I wonder if we could be using the ref peel data in
the opposite direction:

  0. With modern packed-refs, the file tells us definitively whether an
     object was peel-able or not.

  1. If it is peel-able, then do the same tag-peeling we do now. This
     handles the intermediate stages, etc., and the optimization doesn't
     help us.

  2. If it isn't peel-able, then we know it's not a tag. We can _guess_
     that it's a commit, because most refs that point to non-tags are,
     and try to look it up in the commit graph.

  3. If we do find it in the commit graph, we win. We saved having to
     call oid_object_info() to get the type.

  4. If we didn't, then we have to get the real type (it might simply be
     a commit that isn't in the graph files), and we lose. We did a
     pointless lookup in the graph file.

I think in general we'd win on balance, because most refs do point to
commits. But I'm not sure how big the win would be. You'd have to time
it on your weird 2M-ref case (though your numbers below suggest it may
be a few seconds).

(And I know I'm spouting a lot of micro-optimization ideas here; I won't
be offended if you don't feel like following up on them.)

> > I wonder where the remaining 20s is going.
>
> Rebasing this commit on top of my git-rev-list(1) series [1] for the
> connectivity check gives another 25% speedup, going down from 20s to 14s
> (numbers are a bit different given that I'm on a different machine right
> now). From here on, it's multiple things which take time:
>
>   - 20% of the time is spent sorting the refs in
>     `mark_complete_and_common_ref()`. This time around I feel less
>     comfortable just disabling sorting, given that it may impact
>     correctness.
>
>   - 30% of the time is spent looking up object types via
>     `oid_object_info_extended()`, where 75% of these lookups come from
>     `deref_without_lazy_fetch()`. This can be improved a bit by doing
>     the `lookup_unknown_object()` dance, buying a modest speedup of
>     ~8%. But this again has memory tradeoffs, given that we must
>     allocate the object such that all types would fit.

Thanks, that all makes sense. I think the suggestion I made above is
similar to what you're thinking with lookup_unknown_object(), but would
avoid the extra allocations.

-Peff
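To make steps 2-4 concrete, a rough sketch follows. parse_commit_in_graph()
is the real commit-graph API of the time; the function itself and its place
in fetch-pack.c are assumptions, not code from the thread:

/*
 * Sketch of steps 2-4 above (not from the thread). Only reached when
 * the packed-refs peel data says the ref is not peelable, i.e. it
 * cannot be a tag. Note the same hazard as with parse_object() applies:
 * lookup_commit() pre-creates an in-core commit even if the oid later
 * turns out to name a tree or blob.
 */
static struct commit *guess_commit_via_graph(const struct object_id *oid)
{
	/* step 2: speculatively treat the oid as a commit */
	struct commit *commit = lookup_commit(the_repository, oid);

	/* step 3: a commit-graph hit spares us the odb type lookup */
	if (commit && parse_commit_in_graph(the_repository, commit))
		return commit;

	/*
	 * step 4: we lost the bet; the caller falls back to
	 * oid_object_info_extended() to learn the real type (it may
	 * still be a commit that simply is not in the graph files).
	 */
	return NULL;
}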
On Thu, Aug 05, 2021 at 01:53:50PM +0200, Patrick Steinhardt wrote:

> I've put some more time into this. If rebased on top of v4 of [1], then
> we can also use `parse_commit_in_graph_gently()` to further speed this
> up from 15.8 seconds to 11.6 seconds with the patch below. It's the same
> memory/speed tradeoff as I'm doing in [1].
>
> I guess I'd still like to treat both series separately for now, given
> that [1] is more involved compared to this patch series here. I'll then
> do a follow-up when (if?) both series have landed.

Heh, I should have read this before writing my other response. Your
strategy here is what I imagined. If you split the find/fill steps from
parse_commit_in_graph(), then you should be able to speculatively ask
"if this is a commit in the graph, fill it in, otherwise do nothing".
Which would solve the memory tradeoff.

-Peff
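One shape such a find/fill split might take is sketched below. Both
helper signatures are hypothetical; no such public API existed in
git.git at the time of this thread:

/*
 * Hypothetical split of parse_commit_in_graph() into a find step and a
 * fill step, so callers can probe the graph without pre-allocating an
 * in-core commit for an oid that may not be a commit at all.
 */
int find_commit_pos_in_graph(struct repository *r,
			     const struct object_id *oid,
			     uint32_t *pos);	/* assumed: probe only */
int fill_commit_in_graph_at(struct repository *r,
			    struct commit *commit,
			    uint32_t pos);	/* assumed: fill step */

static struct commit *try_graph_commit(struct repository *r,
				       const struct object_id *oid)
{
	uint32_t pos;
	struct commit *commit;

	if (!find_commit_pos_in_graph(r, oid, &pos))
		return NULL;		/* not in the graph; no allocation */
	commit = lookup_commit(r, oid);	/* safe: the graph says it's a commit */
	if (!commit || !fill_commit_in_graph_at(r, commit, pos))
		return NULL;
	return commit;
}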
diff --git a/fetch-pack.c b/fetch-pack.c
index b0c7be717c..0bf7ed7e47 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -137,8 +137,14 @@ static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
 			break;
 		}
 	}
-	if (type == OBJ_COMMIT)
-		return (struct commit *) parse_object(the_repository, oid);
+
+	if (type == OBJ_COMMIT) {
+		struct commit *commit = lookup_commit(the_repository, oid);
+		if (!commit || repo_parse_commit(the_repository, commit))
+			return NULL;
+		return commit;
+	}
+
 	return NULL;
 }
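For context on why this small change reaches the commit-graph at all,
here is an abridged view of the parsing call chain as of this patch
(the function names are from git.git; the abridgement is an editorial
sketch, not part of the patch):

/*
 * Abridged call chain showing where the commit-graph gets consulted:
 *
 *   repo_parse_commit(r, commit)
 *     -> repo_parse_commit_gently(r, commit, 0)
 *       -> parse_commit_internal(r, commit, 0, 1)
 *         -> parse_commit_in_graph(r, commit)  // hit: no odb access
 *         -> repo_read_object_file(...)        // miss: decompress + parse
 *
 * parse_object(), by contrast, always decompresses and parses the full
 * object from the odb (and verifies its hash) -- the cost the commit
 * message describes as dominating the 44-second case.
 */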
When doing reference negotiation, git-fetch-pack(1) is loading all refs
from disk in order to determine which commits it has in common with the
remote repository. This can be quite expensive in repositories with many
references though: in a real-world repository with around 2.2 million
refs, fetching a single commit by its ID takes around 44 seconds.

Dominating the loading time is decompression and parsing of the objects
which are referenced by commits. Given the fact that we only care about
commits (or tags which can be peeled to one) in this context, there is
thus an easy performance win by switching the parsing logic to make use
of the commit graph in case we have one available. Like this, we avoid
hitting the object database to parse these commits but instead only load
them from the commit-graph. This results in a significant performance
boost when executing git-fetch in said repository with 2.2 million refs:

Benchmark #1: HEAD~: git fetch $remote $commit
  Time (mean ± σ):     44.168 s ±  0.341 s    [User: 42.985 s, System: 1.106 s]
  Range (min … max):   43.565 s … 44.577 s    10 runs

Benchmark #2: HEAD: git fetch $remote $commit
  Time (mean ± σ):     19.498 s ±  0.724 s    [User: 18.751 s, System: 0.690 s]
  Range (min … max):   18.629 s … 20.454 s    10 runs

Summary
  'HEAD: git fetch $remote $commit' ran
    2.27 ± 0.09 times faster than 'HEAD~: git fetch $remote $commit'

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 fetch-pack.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)