[v2,1/5] speed up alt_odb_usable() with many alternates

Message ID	20210629205305.7100-2-e@80x24.org (mailing list archive)
State	Superseded
Headers	show Return-Path: <git-owner@kernel.org> From: Eric Wong <e@80x24.org> To: git@vger.kernel.org Cc: Jeff King <peff@peff.net>, =?utf-8?q?Ren=C3=A9_Scharfe?= <l.s.r@web.de> Subject: [PATCH v2 1/5] speed up alt_odb_usable() with many alternates Date: Tue, 29 Jun 2021 20:53:01 +0000 Message-Id: <20210629205305.7100-2-e@80x24.org> In-Reply-To: <20210627024718.25383-1-e@80x24.org> References: <20210627024718.25383-1-e@80x24.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	[v2,1/5] speed up alt_odb_usable() with many alternates \| expand [v2,1/5] speed up alt_odb_usable() with many alternates

Message ID

20210629205305.7100-2-e@80x24.org (mailing list archive)

State

Superseded

Headers

From: Eric Wong <e@80x24.org>
To: git@vger.kernel.org
Cc: Jeff King <peff@peff.net>, =?utf-8?q?Ren=C3=A9_Scharfe?= <l.s.r@web.de>
Subject: [PATCH v2 1/5] speed up alt_odb_usable() with many alternates
Date: Tue, 29 Jun 2021 20:53:01 +0000
Message-Id: <20210629205305.7100-2-e@80x24.org>
In-Reply-To: <20210627024718.25383-1-e@80x24.org>
References: <20210627024718.25383-1-e@80x24.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

[v2,1/5] speed up alt_odb_usable() with many alternates | expand

Commit Message

Eric Wong June 29, 2021, 8:53 p.m. UTC

With many alternates, the duplicate check in alt_odb_usable()
wastes many cycles doing repeated fspathcmp() on every existing
alternate.  Use a khash to speed up lookups by odb->path.

Since the kh_put_* API uses the supplied key without
duplicating it, we also take advantage of it to replace both
xstrdup() and strbuf_release() in link_alt_odb_entry() with
strbuf_detach() to avoid the allocation and copy.

In a test repository with 50K alternates and each of those 50K
alternates having one alternate each (for a total of 100K total
alternates); this speeds up lookup of a non-existent blob from
over 16 minutes to roughly 2.7 seconds on my busy workstation.

Note: all underlying git object directories were small and
unpacked with only loose objects and no packs.  Having to load
packs increases times significantly.

Signed-off-by: Eric Wong <e@80x24.org>
---
 object-file.c  | 33 ++++++++++++++++++++++-----------
 object-store.h | 17 +++++++++++++++++
 object.c       |  2 ++
 3 files changed, 41 insertions(+), 11 deletions(-)

Comments

René Scharfe July 3, 2021, 10:05 a.m. UTC | #1

Am 29.06.21 um 22:53 schrieb Eric Wong:
> With many alternates, the duplicate check in alt_odb_usable()
> wastes many cycles doing repeated fspathcmp() on every existing
> alternate.  Use a khash to speed up lookups by odb->path.
>
> Since the kh_put_* API uses the supplied key without
> duplicating it, we also take advantage of it to replace both
> xstrdup() and strbuf_release() in link_alt_odb_entry() with
> strbuf_detach() to avoid the allocation and copy.
>
> In a test repository with 50K alternates and each of those 50K
> alternates having one alternate each (for a total of 100K total
> alternates); this speeds up lookup of a non-existent blob from
> over 16 minutes to roughly 2.7 seconds on my busy workstation.

Yay for hashmaps! :)

> Note: all underlying git object directories were small and
> unpacked with only loose objects and no packs.  Having to load
> packs increases times significantly.
>
> Signed-off-by: Eric Wong <e@80x24.org>
> ---
>  object-file.c  | 33 ++++++++++++++++++++++-----------
>  object-store.h | 17 +++++++++++++++++
>  object.c       |  2 ++
>  3 files changed, 41 insertions(+), 11 deletions(-)
>
> diff --git a/object-file.c b/object-file.c
> index f233b440b2..304af3a172 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -517,9 +517,9 @@ const char *loose_object_path(struct repository *r, struct strbuf *buf,
>   */
>  static int alt_odb_usable(struct raw_object_store *o,
>  			  struct strbuf *path,
> -			  const char *normalized_objdir)
> +			  const char *normalized_objdir, khiter_t *pos)
>  {
> -	struct object_directory *odb;
> +	int r;
>
>  	/* Detect cases where alternate disappeared */
>  	if (!is_directory(path->buf)) {
> @@ -533,14 +533,22 @@ static int alt_odb_usable(struct raw_object_store *o,
>  	 * Prevent the common mistake of listing the same
>  	 * thing twice, or object directory itself.
>  	 */
> -	for (odb = o->odb; odb; odb = odb->next) {
> -		if (!fspathcmp(path->buf, odb->path))
> -			return 0;
> +	if (!o->odb_by_path) {
> +		khiter_t p;
> +
> +		o->odb_by_path = kh_init_odb_path_map();
> +		assert(!o->odb->next);
> +		p = kh_put_odb_path_map(o->odb_by_path, o->odb->path, &r);

So on the first run you not just create the hashmap, but you also
pre-populate it with the main object directory.  Makes sense.  The
hashmap wouldn't even be created in repositories without alternates.

> +		if (r < 0) die_errno(_("kh_put_odb_path_map"));

Our other callers don't handle a negative return code because it would
indicate an allocation failure, and in our version we use ALLOC_ARRAY,
which dies on error.  So you don't need that check here, but we better
clarify that in khash.h.

> +		assert(r == 1); /* never used */
> +		kh_value(o->odb_by_path, p) = o->odb;
>  	}
>  	if (!fspathcmp(path->buf, normalized_objdir))
>  		return 0;
> -
> -	return 1;
> +	*pos = kh_put_odb_path_map(o->odb_by_path, path->buf, &r);
> +	if (r < 0) die_errno(_("kh_put_odb_path_map"));

Dito.

> +	/* r: 0 = exists, 1 = never used, 2 = deleted */
> +	return r == 0 ? 0 : 1;

The comment indicates that khash would be nicer to use if it had an
enum for the kh_put return values.  Perhaps, but that should be done in
another series.

I like the solution in oidset.c to make this more readable, though: Call
the return value "added" instead of "r" and then a "return !added;"
makes sense without additional comments.

>  }
>
>  /*
> @@ -566,6 +574,7 @@ static int link_alt_odb_entry(struct repository *r, const char *entry,
>  {
>  	struct object_directory *ent;
>  	struct strbuf pathbuf = STRBUF_INIT;
> +	khiter_t pos;
>
>  	if (!is_absolute_path(entry) && relative_base) {
>  		strbuf_realpath(&pathbuf, relative_base, 1);
> @@ -587,23 +596,25 @@ static int link_alt_odb_entry(struct repository *r, const char *entry,
>  	while (pathbuf.len && pathbuf.buf[pathbuf.len - 1] == '/')
>  		strbuf_setlen(&pathbuf, pathbuf.len - 1);
>
> -	if (!alt_odb_usable(r->objects, &pathbuf, normalized_objdir)) {
> +	if (!alt_odb_usable(r->objects, &pathbuf, normalized_objdir, &pos)) {
>  		strbuf_release(&pathbuf);
>  		return -1;
>  	}
>
>  	CALLOC_ARRAY(ent, 1);
> -	ent->path = xstrdup(pathbuf.buf);
> +	/* pathbuf.buf is already in r->objects->odb_by_path */

Tricky stuff (to me), important comment.

> +	ent->path = strbuf_detach(&pathbuf, NULL);
>
>  	/* add the alternate entry */
>  	*r->objects->odb_tail = ent;
>  	r->objects->odb_tail = &(ent->next);
>  	ent->next = NULL;
> +	assert(r->objects->odb_by_path);
> +	kh_value(r->objects->odb_by_path, pos) = ent;
>
>  	/* recursively add alternates */
> -	read_info_alternates(r, pathbuf.buf, depth + 1);
> +	read_info_alternates(r, ent->path, depth + 1);
>
> -	strbuf_release(&pathbuf);
>  	return 0;
>  }
>
> diff --git a/object-store.h b/object-store.h
> index ec32c23dcb..20c1cedb75 100644
> --- a/object-store.h
> +++ b/object-store.h
> @@ -7,6 +7,8 @@
>  #include "oid-array.h"
>  #include "strbuf.h"
>  #include "thread-utils.h"
> +#include "khash.h"
> +#include "dir.h"
>
>  struct object_directory {
>  	struct object_directory *next;
> @@ -30,6 +32,19 @@ struct object_directory {
>  	char *path;
>  };
>
> +static inline int odb_path_eq(const char *a, const char *b)
> +{
> +	return !fspathcmp(a, b);
> +}

This is not specific to the object store.  It could be called fspatheq
and live in dir.h.  Or dir.c -- a surprising amount of code seems to
necessary for that negation (https://godbolt.org/z/MY7Wda3a7).  Anyway,
it's just an idea for another series.

> +
> +static inline int odb_path_hash(const char *str)
> +{
> +	return ignore_case ? strihash(str) : __ac_X31_hash_string(str);
> +}

The internal Attractive Chaos (__ac_*) macros should be left confined
to khash.h, I think.  Its alias kh_str_hash_func would be better
suited here.

Do we want to use the K&R hash function here at all, though?  If we
use FNV-1 when ignoring case, why not also use it (i.e. strhash) when
respecting it?  At least that's done in builtin/sparse-checkout.c,
dir.c and merge-recursive.c.  This is just handwaving and yammering
about lack of symmetry, but I do wonder how your performance numbers
look with strhash.  If it's fine then we could package this up as
fspathhash..

And I also wonder how it looks if you use strihash unconditionally.
I guess case collisions are usually rare and branching based on a
global variable may be more expensive than case folding..

Anyway, just ideas; kh_str_hash_func would be OK as well.

> +
> +KHASH_INIT(odb_path_map, const char * /* key: odb_path */,
> +	struct object_directory *, 1, odb_path_hash, odb_path_eq);
> +
>  void prepare_alt_odb(struct repository *r);
>  char *compute_alternate_path(const char *path, struct strbuf *err);
>  typedef int alt_odb_fn(struct object_directory *, void *);
> @@ -116,6 +131,8 @@ struct raw_object_store {
>  	 */
>  	struct object_directory *odb;
>  	struct object_directory **odb_tail;
> +	kh_odb_path_map_t *odb_by_path;
> +
>  	int loaded_alternates;
>
>  	/*
> diff --git a/object.c b/object.c
> index 14188453c5..2b3c075a15 100644
> --- a/object.c
> +++ b/object.c
> @@ -511,6 +511,8 @@ static void free_object_directories(struct raw_object_store *o)
>  		free_object_directory(o->odb);
>  		o->odb = next;
>  	}
> +	kh_destroy_odb_path_map(o->odb_by_path);
> +	o->odb_by_path = NULL;
>  }
>
>  void raw_object_store_clear(struct raw_object_store *o)
>

René Scharfe July 4, 2021, 9:02 a.m. UTC | #2

Am 03.07.21 um 12:05 schrieb René Scharfe:
> Am 29.06.21 um 22:53 schrieb Eric Wong:
>> +	*pos = kh_put_odb_path_map(o->odb_by_path, path->buf, &r);
>> +	if (r < 0) die_errno(_("kh_put_odb_path_map"));

>> +	/* r: 0 = exists, 1 = never used, 2 = deleted */
>> +	return r == 0 ? 0 : 1;

> I like the solution in oidset.c to make this more readable, though: Call
> the return value "added" instead of "r" and then a "return !added;"
> makes sense without additional comments.

That's probably because I wrote that part; see 8b2f8cbcb1 (oidset: use
khash, 2018-10-04) -- I had somehow forgotten about that. o_O

And here we wouldn't negate.  Passing on the value verbatim, without
normalizing 2 to 1, would work fine.

alt_odb_usable() and its caller become quite entangled due to the
hashmap insert operation being split between them.  I suspect the code
would improve by inlining the function in a follow-up patch, making
return code considerations moot.  The improvement is not significant
enough to hold up this series in case you don't like the idea, though.

Rough demo:

 object-file.c | 82 +++++++++++++++++++++++++++--------------------------------
 1 file changed, 37 insertions(+), 45 deletions(-)

diff --git a/object-file.c b/object-file.c
index 304af3a172..a5e91091ee 100644
--- a/object-file.c
+++ b/object-file.c
@@ -512,45 +512,6 @@ const char *loose_object_path(struct repository *r, struct strbuf *buf,
 	return odb_loose_path(r->objects->odb, buf, oid);
 }

-/*
- * Return non-zero iff the path is usable as an alternate object database.
- */
-static int alt_odb_usable(struct raw_object_store *o,
-			  struct strbuf *path,
-			  const char *normalized_objdir, khiter_t *pos)
-{
-	int r;
-
-	/* Detect cases where alternate disappeared */
-	if (!is_directory(path->buf)) {
-		error(_("object directory %s does not exist; "
-			"check .git/objects/info/alternates"),
-		      path->buf);
-		return 0;
-	}
-
-	/*
-	 * Prevent the common mistake of listing the same
-	 * thing twice, or object directory itself.
-	 */
-	if (!o->odb_by_path) {
-		khiter_t p;
-
-		o->odb_by_path = kh_init_odb_path_map();
-		assert(!o->odb->next);
-		p = kh_put_odb_path_map(o->odb_by_path, o->odb->path, &r);
-		if (r < 0) die_errno(_("kh_put_odb_path_map"));
-		assert(r == 1); /* never used */
-		kh_value(o->odb_by_path, p) = o->odb;
-	}
-	if (!fspathcmp(path->buf, normalized_objdir))
-		return 0;
-	*pos = kh_put_odb_path_map(o->odb_by_path, path->buf, &r);
-	if (r < 0) die_errno(_("kh_put_odb_path_map"));
-	/* r: 0 = exists, 1 = never used, 2 = deleted */
-	return r == 0 ? 0 : 1;
-}
-
 /*
  * Prepare alternate object database registry.
  *
@@ -575,6 +536,9 @@ static int link_alt_odb_entry(struct repository *r, const char *entry,
 	struct object_directory *ent;
 	struct strbuf pathbuf = STRBUF_INIT;
 	khiter_t pos;
+	int ret = -1;
+	int added;
+	struct raw_object_store *o = r->objects;

 	if (!is_absolute_path(entry) && relative_base) {
 		strbuf_realpath(&pathbuf, relative_base, 1);
@@ -585,8 +549,7 @@ static int link_alt_odb_entry(struct repository *r, const char *entry,
 	if (strbuf_normalize_path(&pathbuf) < 0 && relative_base) {
 		error(_("unable to normalize alternate object path: %s"),
 		      pathbuf.buf);
-		strbuf_release(&pathbuf);
-		return -1;
+		goto out;
 	}

 	/*
@@ -596,11 +559,37 @@ static int link_alt_odb_entry(struct repository *r, const char *entry,
 	while (pathbuf.len && pathbuf.buf[pathbuf.len - 1] == '/')
 		strbuf_setlen(&pathbuf, pathbuf.len - 1);

-	if (!alt_odb_usable(r->objects, &pathbuf, normalized_objdir, &pos)) {
-		strbuf_release(&pathbuf);
-		return -1;
+	/* Detect cases where alternate disappeared */
+	if (!is_directory(pathbuf.buf)) {
+		error(_("object directory %s does not exist; "
+			"check .git/objects/info/alternates"),
+		      pathbuf.buf);
+		goto out;
+	}
+
+	/*
+	 * Prevent the common mistake of listing the same
+	 * thing twice, or object directory itself.
+	 */
+	if (!o->odb_by_path) {
+		khiter_t p;
+
+		o->odb_by_path = kh_init_odb_path_map();
+		assert(!o->odb->next);
+		p = kh_put_odb_path_map(o->odb_by_path, o->odb->path, &added);
+		if (added < 0) die_errno(_("kh_put_odb_path_map"));
+		assert(added);
+		kh_value(o->odb_by_path, p) = o->odb;
 	}

+	if (!fspathcmp(pathbuf.buf, normalized_objdir))
+		goto out;
+
+	pos = kh_put_odb_path_map(o->odb_by_path, pathbuf.buf, &added);
+	if (added < 0) die_errno(_("kh_put_odb_path_map"));
+	if (!added)
+		goto out;
+
 	CALLOC_ARRAY(ent, 1);
 	/* pathbuf.buf is already in r->objects->odb_by_path */
 	ent->path = strbuf_detach(&pathbuf, NULL);
@@ -615,7 +604,10 @@ static int link_alt_odb_entry(struct repository *r, const char *entry,
 	/* recursively add alternates */
 	read_info_alternates(r, ent->path, depth + 1);

-	return 0;
+	ret = 0;
+out:
+	strbuf_release(&pathbuf);
+	return ret;
 }

 static const char *parse_alt_odb_entry(const char *string,

Eric Wong July 6, 2021, 11:01 p.m. UTC | #3

René Scharfe <l.s.r@web.de> wrote:
> Am 29.06.21 um 22:53 schrieb Eric Wong:
> > With many alternates, the duplicate check in alt_odb_usable()
> > wastes many cycles doing repeated fspathcmp() on every existing
> > alternate.  Use a khash to speed up lookups by odb->path.
> >
> > Since the kh_put_* API uses the supplied key without
> > duplicating it, we also take advantage of it to replace both
> > xstrdup() and strbuf_release() in link_alt_odb_entry() with
> > strbuf_detach() to avoid the allocation and copy.
> >
> > In a test repository with 50K alternates and each of those 50K
> > alternates having one alternate each (for a total of 100K total
> > alternates); this speeds up lookup of a non-existent blob from
> > over 16 minutes to roughly 2.7 seconds on my busy workstation.
> 
> Yay for hashmaps! :)
> 
> > Note: all underlying git object directories were small and
> > unpacked with only loose objects and no packs.  Having to load
> > packs increases times significantly.
> >
> > Signed-off-by: Eric Wong <e@80x24.org>
> > ---
> >  object-file.c  | 33 ++++++++++++++++++++++-----------
> >  object-store.h | 17 +++++++++++++++++
> >  object.c       |  2 ++
> >  3 files changed, 41 insertions(+), 11 deletions(-)
> >
> > diff --git a/object-file.c b/object-file.c
> > index f233b440b2..304af3a172 100644
> > --- a/object-file.c
> > +++ b/object-file.c
> > @@ -517,9 +517,9 @@ const char *loose_object_path(struct repository *r, struct strbuf *buf,
> >   */
> >  static int alt_odb_usable(struct raw_object_store *o,
> >  			  struct strbuf *path,
> > -			  const char *normalized_objdir)
> > +			  const char *normalized_objdir, khiter_t *pos)
> >  {
> > -	struct object_directory *odb;
> > +	int r;
> >
> >  	/* Detect cases where alternate disappeared */
> >  	if (!is_directory(path->buf)) {
> > @@ -533,14 +533,22 @@ static int alt_odb_usable(struct raw_object_store *o,
> >  	 * Prevent the common mistake of listing the same
> >  	 * thing twice, or object directory itself.
> >  	 */
> > -	for (odb = o->odb; odb; odb = odb->next) {
> > -		if (!fspathcmp(path->buf, odb->path))
> > -			return 0;
> > +	if (!o->odb_by_path) {
> > +		khiter_t p;
> > +
> > +		o->odb_by_path = kh_init_odb_path_map();
> > +		assert(!o->odb->next);
> > +		p = kh_put_odb_path_map(o->odb_by_path, o->odb->path, &r);
> 
> So on the first run you not just create the hashmap, but you also
> pre-populate it with the main object directory.  Makes sense.  The
> hashmap wouldn't even be created in repositories without alternates.
> 
> > +		if (r < 0) die_errno(_("kh_put_odb_path_map"));
> 
> Our other callers don't handle a negative return code because it would
> indicate an allocation failure, and in our version we use ALLOC_ARRAY,
> which dies on error.  So you don't need that check here, but we better
> clarify that in khash.h.
> 
> > +		assert(r == 1); /* never used */
> > +		kh_value(o->odb_by_path, p) = o->odb;
> >  	}
> >  	if (!fspathcmp(path->buf, normalized_objdir))
> >  		return 0;
> > -
> > -	return 1;
> > +	*pos = kh_put_odb_path_map(o->odb_by_path, path->buf, &r);
> > +	if (r < 0) die_errno(_("kh_put_odb_path_map"));
> 
> Dito.
> 
> > +	/* r: 0 = exists, 1 = never used, 2 = deleted */
> > +	return r == 0 ? 0 : 1;
> 
> The comment indicates that khash would be nicer to use if it had an
> enum for the kh_put return values.  Perhaps, but that should be done in
> another series.

Agreed for another series.  I've also found myself wishing khash
used enums.  But I'm also not sure how much changing of 3rd
party code we should be doing...

> I like the solution in oidset.c to make this more readable, though: Call
> the return value "added" instead of "r" and then a "return !added;"
> makes sense without additional comments.
> 
> >  }
> >
> >  /*
> > diff --git a/object-store.h b/object-store.h
> > index ec32c23dcb..20c1cedb75 100644
> > --- a/object-store.h
> > +++ b/object-store.h
> > @@ -7,6 +7,8 @@
> >  #include "oid-array.h"
> >  #include "strbuf.h"
> >  #include "thread-utils.h"
> > +#include "khash.h"
> > +#include "dir.h"
> >
> >  struct object_directory {
> >  	struct object_directory *next;
> > @@ -30,6 +32,19 @@ struct object_directory {
> >  	char *path;
> >  };
> >
> > +static inline int odb_path_eq(const char *a, const char *b)
> > +{
> > +	return !fspathcmp(a, b);
> > +}
> 
> This is not specific to the object store.  It could be called fspatheq
> and live in dir.h.  Or dir.c -- a surprising amount of code seems to
> necessary for that negation (https://godbolt.org/z/MY7Wda3a7).  Anyway,
> it's just an idea for another series.

No JS here for godbolt, but there's also a bunch of "!fspathcmp"
here that could probably be changed to fspatheq.

> > +
> > +static inline int odb_path_hash(const char *str)
> > +{
> > +	return ignore_case ? strihash(str) : __ac_X31_hash_string(str);
> > +}
> 
> The internal Attractive Chaos (__ac_*) macros should be left confined
> to khash.h, I think.  Its alias kh_str_hash_func would be better
> suited here.
> 
> Do we want to use the K&R hash function here at all, though?  If we
> use FNV-1 when ignoring case, why not also use it (i.e. strhash) when
> respecting it?  At least that's done in builtin/sparse-checkout.c,
> dir.c and merge-recursive.c.  This is just handwaving and yammering
> about lack of symmetry, but I do wonder how your performance numbers
> look with strhash.  If it's fine then we could package this up as
> fspathhash..

Yeah, I think fspathhash should be path_hash in merge-recursive.c
(and path_hash eliminated).

I don't have performance numbers, and I doubt hash function
performance is much overhead, here.  I used X31 since it was
local to khash.

I would prefer we only have one non-cryptographic hash
implementation to reduce cognitive overhead, so maybe we can
drop X31 entirely for FNV-1.  I'd also prefer we only have khash
or hashmap, not both.

> And I also wonder how it looks if you use strihash unconditionally.
> I guess case collisions are usually rare and branching based on a
> global variable may be more expensive than case folding.

*shrug* I'll let somebody with more appropriate systems do
benchmarks, there.  But it could be an easy switch once
fspathhash is in place.

diff --git a/object-file.c b/object-file.c
index f233b440b2..304af3a172 100644
--- a/object-file.c
+++ b/object-file.c
@@ -517,9 +517,9 @@  const char *loose_object_path(struct repository *r, struct strbuf *buf,
  */
 static int alt_odb_usable(struct raw_object_store *o,
 			  struct strbuf *path,
-			  const char *normalized_objdir)
+			  const char *normalized_objdir, khiter_t *pos)
 {
-	struct object_directory *odb;
+	int r;
 
 	/* Detect cases where alternate disappeared */
 	if (!is_directory(path->buf)) {
@@ -533,14 +533,22 @@  static int alt_odb_usable(struct raw_object_store *o,
 	 * Prevent the common mistake of listing the same
 	 * thing twice, or object directory itself.
 	 */
-	for (odb = o->odb; odb; odb = odb->next) {
-		if (!fspathcmp(path->buf, odb->path))
-			return 0;
+	if (!o->odb_by_path) {
+		khiter_t p;
+
+		o->odb_by_path = kh_init_odb_path_map();
+		assert(!o->odb->next);
+		p = kh_put_odb_path_map(o->odb_by_path, o->odb->path, &r);
+		if (r < 0) die_errno(_("kh_put_odb_path_map"));
+		assert(r == 1); /* never used */
+		kh_value(o->odb_by_path, p) = o->odb;
 	}
 	if (!fspathcmp(path->buf, normalized_objdir))
 		return 0;
-
-	return 1;
+	*pos = kh_put_odb_path_map(o->odb_by_path, path->buf, &r);
+	if (r < 0) die_errno(_("kh_put_odb_path_map"));
+	/* r: 0 = exists, 1 = never used, 2 = deleted */
+	return r == 0 ? 0 : 1;
 }
 
 /*
@@ -566,6 +574,7 @@  static int link_alt_odb_entry(struct repository *r, const char *entry,
 {
 	struct object_directory *ent;
 	struct strbuf pathbuf = STRBUF_INIT;
+	khiter_t pos;
 
 	if (!is_absolute_path(entry) && relative_base) {
 		strbuf_realpath(&pathbuf, relative_base, 1);
@@ -587,23 +596,25 @@  static int link_alt_odb_entry(struct repository *r, const char *entry,
 	while (pathbuf.len && pathbuf.buf[pathbuf.len - 1] == '/')
 		strbuf_setlen(&pathbuf, pathbuf.len - 1);
 
-	if (!alt_odb_usable(r->objects, &pathbuf, normalized_objdir)) {
+	if (!alt_odb_usable(r->objects, &pathbuf, normalized_objdir, &pos)) {
 		strbuf_release(&pathbuf);
 		return -1;
 	}
 
 	CALLOC_ARRAY(ent, 1);
-	ent->path = xstrdup(pathbuf.buf);
+	/* pathbuf.buf is already in r->objects->odb_by_path */
+	ent->path = strbuf_detach(&pathbuf, NULL);
 
 	/* add the alternate entry */
 	*r->objects->odb_tail = ent;
 	r->objects->odb_tail = &(ent->next);
 	ent->next = NULL;
+	assert(r->objects->odb_by_path);
+	kh_value(r->objects->odb_by_path, pos) = ent;
 
 	/* recursively add alternates */
-	read_info_alternates(r, pathbuf.buf, depth + 1);
+	read_info_alternates(r, ent->path, depth + 1);
 
-	strbuf_release(&pathbuf);
 	return 0;
 }
 
diff --git a/object-store.h b/object-store.h
index ec32c23dcb..20c1cedb75 100644
--- a/object-store.h
+++ b/object-store.h
@@ -7,6 +7,8 @@ 
 #include "oid-array.h"
 #include "strbuf.h"
 #include "thread-utils.h"
+#include "khash.h"
+#include "dir.h"
 
 struct object_directory {
 	struct object_directory *next;
@@ -30,6 +32,19 @@  struct object_directory {
 	char *path;
 };
 
+static inline int odb_path_eq(const char *a, const char *b)
+{
+	return !fspathcmp(a, b);
+}
+
+static inline int odb_path_hash(const char *str)
+{
+	return ignore_case ? strihash(str) : __ac_X31_hash_string(str);
+}
+
+KHASH_INIT(odb_path_map, const char * /* key: odb_path */,
+	struct object_directory *, 1, odb_path_hash, odb_path_eq);
+
 void prepare_alt_odb(struct repository *r);
 char *compute_alternate_path(const char *path, struct strbuf *err);
 typedef int alt_odb_fn(struct object_directory *, void *);
@@ -116,6 +131,8 @@  struct raw_object_store {
 	 */
 	struct object_directory *odb;
 	struct object_directory **odb_tail;
+	kh_odb_path_map_t *odb_by_path;
+
 	int loaded_alternates;
 
 	/*
diff --git a/object.c b/object.c
index 14188453c5..2b3c075a15 100644
--- a/object.c
+++ b/object.c
@@ -511,6 +511,8 @@  static void free_object_directories(struct raw_object_store *o)
 		free_object_directory(o->odb);
 		o->odb = next;
 	}
+	kh_destroy_odb_path_map(o->odb_by_path);
+	o->odb_by_path = NULL;
 }
 
 void raw_object_store_clear(struct raw_object_store *o)

[v2,1/5] speed up alt_odb_usable() with many alternates

Commit Message

Comments

Patch