Message ID | b9897e102afbcab3bfee58ed8bda24257d8b54fb.1627896460.git.ps@pks.im (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Speed up connectivity checks | expand |
On Mon, Aug 02 2021, Patrick Steinhardt wrote: > [[PGP Signed Part:Undecided]] > When loading references, we try to optimize loading of commits by using > the commit graph. To do so, we first need to determine whether the > object actually is a commit or not, which is why we always execute > `oid_object_info()` first. Like this, we'll unpack the object header of > each object first. > > This pattern can be quite inefficient in case many references point to > the same commit: if the object didn't end up in the cached objects, then > we'll repeatedly unpack the same object header, even if we've already > seen the object before. > > Optimize this pattern by using `lookup_unknown_object()` first in order > to determine whether we've seen the object before. If so, then we don't > need to re-parse the header but can directly use its object information > and thus gain a modest performance improvement. Executed in a real-world > repository with around 2.2 million references: > > Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev > Time (mean ± σ): 4.771 s ± 0.238 s [User: 4.440 s, System: 0.330 s] > Range (min … max): 4.539 s … 5.219 s 10 runs > > Benchmark #2: HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev > Time (mean ± σ): 4.454 s ± 0.037 s [User: 4.122 s, System: 0.332 s] > Range (min … max): 4.375 s … 4.496 s 10 runs > > Summary > 'HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' ran > 1.07 ± 0.05 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' > > The downside is that `lookup_unknown_object()` is forced to always > allocate an object such that it's big enough to host all object types' > structs and thus we may waste memory here. This tradeoff is probably > worth it though considering the following struct sizes: > > - commit: 72 bytes > - tree: 56 bytes > - blob: 40 bytes > - tag: 64 bytes > > Assuming that in almost all repositories, most references will point to > either a tag or a commit, we'd have a modest increase in memory > consumption of about 12.5% here. > > Signed-off-by: Patrick Steinhardt <ps@pks.im> > --- > revision.c | 15 ++++++++++++--- > 1 file changed, 12 insertions(+), 3 deletions(-) > > diff --git a/revision.c b/revision.c > index f06a5d63a3..671b6d6513 100644 > --- a/revision.c > +++ b/revision.c > @@ -359,14 +359,22 @@ static struct object *get_reference(struct rev_info *revs, const char *name, > const struct object_id *oid, > unsigned int flags) > { > - struct object *object; > + struct object *object = lookup_unknown_object(revs->repo, oid); > + > + if (object->type == OBJ_NONE) { > + int type = oid_object_info(revs->repo, oid, NULL); > + if (type < 0 || !object_as_type(object, type, 1)) { Let's s/int type/enum object_type, personally I think we should never do "type < 0" either, and check OBJ_BAD explicitly, but I've seemingly lost that discussion on-list before. But I think the consensus is that we should not do !type, but rather type == OBJ_NONE.
Patrick Steinhardt <ps@pks.im> writes: > When loading references, we try to optimize loading of commits by using > the commit graph. To do so, we first need to determine whether the > object actually is a commit or not, which is why we always execute > `oid_object_info()` first. Like this, we'll unpack the object header of > each object first. > > This pattern can be quite inefficient in case many references point to > the same commit: if the object didn't end up in the cached objects, then > we'll repeatedly unpack the same object header, even if we've already > seen the object before. > ... > Assuming that in almost all repositories, most references will point to > either a tag or a commit, we'd have a modest increase in memory > consumption of about 12.5% here. I wonder if we can also say almost all repositories, the majority of refs point at the same object. If that holds, this would certainly be a win, but otherwise, it is not so clear.
On Mon, Aug 02, 2021 at 12:40:56PM -0700, Junio C Hamano wrote: > Patrick Steinhardt <ps@pks.im> writes: > > > When loading references, we try to optimize loading of commits by using > > the commit graph. To do so, we first need to determine whether the > > object actually is a commit or not, which is why we always execute > > `oid_object_info()` first. Like this, we'll unpack the object header of > > each object first. > > > > This pattern can be quite inefficient in case many references point to > > the same commit: if the object didn't end up in the cached objects, then > > we'll repeatedly unpack the same object header, even if we've already > > seen the object before. > > ... > > Assuming that in almost all repositories, most references will point to > > either a tag or a commit, we'd have a modest increase in memory > > consumption of about 12.5% here. > > I wonder if we can also say almost all repositories, the majority of > refs point at the same object. If that holds, this would certainly > be a win, but otherwise, it is not so clear. I doubt that's the case in general. I rather assume that it's typically going to be a smallish subset that points to the same commit, but for these cases we at least avoid doing the lookup multiple times. As I said, it's definitely a tradeoff between memory and performance: in the worst case (all references point to different blobs) we allocate 33% more memory without having any speedups. A more realistic scenario would probably be something like a trunk-based development repo, where there's a single branch only and the rest is tags. There we'd allocate 11% more memory without any speedups. In general, it's going to be various shades of gray, where we allocate something from 0% to 11% more memory while getting some modest speedups in some cases. So if we only inspect this commit as a standalone it's definitely debatable whether we'd want to take it or not. But one important thing is that it's a prerequisite for patch 4/4: in order to not parse commits in case they're part of the commit-graph, we need to first obtain an object such that we can fill it in via the graph. So we have to call `lookup_unknown_object()` anyway. Might be sensible to document this as part of the commit message. Patrick
On Mon, Aug 02, 2021 at 02:55:09PM +0200, Ævar Arnfjörð Bjarmason wrote: > On Mon, Aug 02 2021, Patrick Steinhardt wrote: [snip] > > diff --git a/revision.c b/revision.c > > index f06a5d63a3..671b6d6513 100644 > > --- a/revision.c > > +++ b/revision.c > > @@ -359,14 +359,22 @@ static struct object *get_reference(struct rev_info *revs, const char *name, > > const struct object_id *oid, > > unsigned int flags) > > { > > - struct object *object; > > + struct object *object = lookup_unknown_object(revs->repo, oid); > > + > > + if (object->type == OBJ_NONE) { > > + int type = oid_object_info(revs->repo, oid, NULL); > > + if (type < 0 || !object_as_type(object, type, 1)) { > > Let's s/int type/enum object_type, personally I think we should never do > "type < 0" either, and check OBJ_BAD explicitly, but I've seemingly lost > that discussion on-list before. > > But I think the consensus is that we should not do !type, but rather > type == OBJ_NONE. `oid_object_info()` does return an `int` though, so it feels kind of weird to me to stuff it into an enum right away. Furthermore, while `OBJ_BAD` does map to `-1`, the documentation of `oid_object_info()` currently asks for what I'm doing: """returns enum object_type or negative""". Patrick
On Tue, Aug 03, 2021 at 11:07:29AM +0200, Patrick Steinhardt wrote: > On Mon, Aug 02, 2021 at 12:40:56PM -0700, Junio C Hamano wrote: > > Patrick Steinhardt <ps@pks.im> writes: > > > > > When loading references, we try to optimize loading of commits by using > > > the commit graph. To do so, we first need to determine whether the > > > object actually is a commit or not, which is why we always execute > > > `oid_object_info()` first. Like this, we'll unpack the object header of > > > each object first. > > > > > > This pattern can be quite inefficient in case many references point to > > > the same commit: if the object didn't end up in the cached objects, then > > > we'll repeatedly unpack the same object header, even if we've already > > > seen the object before. > > > ... > > > Assuming that in almost all repositories, most references will point to > > > either a tag or a commit, we'd have a modest increase in memory > > > consumption of about 12.5% here. > > > > I wonder if we can also say almost all repositories, the majority of > > refs point at the same object. If that holds, this would certainly > > be a win, but otherwise, it is not so clear. > > I doubt that's the case in general. I rather assume that it's typically > going to be a smallish subset that points to the same commit, but for > these cases we at least avoid doing the lookup multiple times. As I > said, it's definitely a tradeoff between memory and performance: in the > worst case (all references point to different blobs) we allocate 33% > more memory without having any speedups. A more realistic scenario would > probably be something like a trunk-based development repo, where there's > a single branch only and the rest is tags. There we'd allocate 11% more > memory without any speedups. In general, it's going to be various shades > of gray, where we allocate something from 0% to 11% more memory while > getting some modest speedups in some cases. > > So if we only inspect this commit as a standalone it's definitely > debatable whether we'd want to take it or not. But one important thing > is that it's a prerequisite for patch 4/4: in order to not parse commits > in case they're part of the commit-graph, we need to first obtain an > object such that we can fill it in via the graph. So we have to call > `lookup_unknown_object()` anyway. Might be sensible to document this as > part of the commit message. Scratch that: we can just rewrite this to use `lookup_object()` to check whether we already know the object, and then we can rewrite the `parse_commit_in_graph_gently()` to be `lookup_commit_in_graph()`, which does a `lookup_commit()` to create the object in case the OID was found in the graph. That's also got nicer calling semantics.. It's negligibly slower (3.021s instead of 2.983s), but it doesn't need me arguing about the performance/memory tradeoff. Which is a good thing I guess: there will always be that one repo that's completely different and where such assumptions don't hold. Patrick
diff --git a/revision.c b/revision.c index f06a5d63a3..671b6d6513 100644 --- a/revision.c +++ b/revision.c @@ -359,14 +359,22 @@ static struct object *get_reference(struct rev_info *revs, const char *name, const struct object_id *oid, unsigned int flags) { - struct object *object; + struct object *object = lookup_unknown_object(revs->repo, oid); + + if (object->type == OBJ_NONE) { + int type = oid_object_info(revs->repo, oid, NULL); + if (type < 0 || !object_as_type(object, type, 1)) { + object = NULL; + goto out; + } + } /* * If the repository has commit graphs, repo_parse_commit() avoids * reading the object buffer, so use it whenever possible. */ - if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT) { - struct commit *c = lookup_commit(revs->repo, oid); + if (object->type == OBJ_COMMIT) { + struct commit *c = (struct commit *) object; if (!repo_parse_commit(revs->repo, c)) object = (struct object *) c; else @@ -375,6 +383,7 @@ static struct object *get_reference(struct rev_info *revs, const char *name, object = parse_object(revs->repo, oid); } +out: if (!object) { if (revs->ignore_missing) return object;
When loading references, we try to optimize loading of commits by using the commit graph. To do so, we first need to determine whether the object actually is a commit or not, which is why we always execute `oid_object_info()` first. Like this, we'll unpack the object header of each object first. This pattern can be quite inefficient in case many references point to the same commit: if the object didn't end up in the cached objects, then we'll repeatedly unpack the same object header, even if we've already seen the object before. Optimize this pattern by using `lookup_unknown_object()` first in order to determine whether we've seen the object before. If so, then we don't need to re-parse the header but can directly use its object information and thus gain a modest performance improvement. Executed in a real-world repository with around 2.2 million references: Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev Time (mean ± σ): 4.771 s ± 0.238 s [User: 4.440 s, System: 0.330 s] Range (min … max): 4.539 s … 5.219 s 10 runs Benchmark #2: HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev Time (mean ± σ): 4.454 s ± 0.037 s [User: 4.122 s, System: 0.332 s] Range (min … max): 4.375 s … 4.496 s 10 runs Summary 'HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' ran 1.07 ± 0.05 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' The downside is that `lookup_unknown_object()` is forced to always allocate an object such that it's big enough to host all object types' structs and thus we may waste memory here. This tradeoff is probably worth it though considering the following struct sizes: - commit: 72 bytes - tree: 56 bytes - blob: 40 bytes - tag: 64 bytes Assuming that in almost all repositories, most references will point to either a tag or a commit, we'd have a modest increase in memory consumption of about 12.5% here. Signed-off-by: Patrick Steinhardt <ps@pks.im> --- revision.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-)