diff mbox series

[1/8] midx: expose 'write_midx_file_only()' publicly

Message ID 4afa03b972a1885c60fbf3716f22a7ab58056383.1631331139.git.me@ttaylorr.com (mailing list archive)
State Superseded
Headers show
Series repack: introduce `--write-midx` | expand

Commit Message

Taylor Blau Sept. 11, 2021, 3:32 a.m. UTC
Expose a variant of the write_midx_file() function which ignores packs
that aren't included in an explicit "allow" list.

This will be used in an upcoming patch to power a new `--stdin-packs`
mode of `git multi-pack-index write` for callers that only want to
include certain packs in a MIDX (and ignore any packs which may have
happened to enter the repository independently, e.g., from pushes).

Those patches will provide test coverage for this new function.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 50 +++++++++++++++++++++++++++++++++++++++-----------
 midx.h |  2 ++
 2 files changed, 41 insertions(+), 11 deletions(-)

Comments

Junio C Hamano Sept. 11, 2021, 5 a.m. UTC | #1
Taylor Blau <me@ttaylorr.com> writes:

>  	if (ends_with(file_name, ".idx")) {
>  		display_progress(ctx->progress, ++ctx->pack_paths_checked);
> -		if (ctx->m && midx_contains_pack(ctx->m, file_name))
> -			return;
> +		if (ctx->m) {
> +			if (midx_contains_pack(ctx->m, file_name))
> +				return;
> +		} else if (ctx->to_include) {
> +			if (!string_list_has_string(ctx->to_include, file_name))
> +				return;

What's the expected number of elements on the to_include list?  I am
wondering about the performance implications of using linear search
over the string-list, of course.  Is it about the same order of the
number of packfiles in a repository (up to several dozens, or 1000
at most unless you are insane, or something like that)?
Ævar Arnfjörð Bjarmason Sept. 11, 2021, 10:07 a.m. UTC | #2
On Fri, Sep 10 2021, Taylor Blau wrote:

> -		if (ctx->m && midx_contains_pack(ctx->m, file_name))
> -			return;
> +		if (ctx->m) {
> +			if (midx_contains_pack(ctx->m, file_name))
> +				return;
> +		} else if (ctx->to_include) {
> +			if (!string_list_has_string(ctx->to_include, file_name))
> +				return;
> +		}

I think this is equivalent to:

		if (ctx->m && midx_contains_pack(ctx->m, file_name))
			return;
		else if (!ctx->m && string_list_has_string(...))
			return;

Just a suggestion/tip to reduce the diff size/nested if/if blocks.


> +int write_midx_file_only(const char *object_dir, struct string_list *packs_to_include, const char *preferred_pack_name, unsigned flags);

Let's also line-wrap header changes like in the *.c file.
Taylor Blau Sept. 11, 2021, 4:17 p.m. UTC | #3
On Fri, Sep 10, 2021 at 10:00:26PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> >  	if (ends_with(file_name, ".idx")) {
> >  		display_progress(ctx->progress, ++ctx->pack_paths_checked);
> > -		if (ctx->m && midx_contains_pack(ctx->m, file_name))
> > -			return;
> > +		if (ctx->m) {
> > +			if (midx_contains_pack(ctx->m, file_name))
> > +				return;
> > +		} else if (ctx->to_include) {
> > +			if (!string_list_has_string(ctx->to_include, file_name))
> > +				return;
>
> What's the expected number of elements on the to_include list?  I am
> wondering about the performance implications of using linear search
> over the string-list, of course.  Is it about the same order of the
> number of packfiles in a repository (up to several dozens, or 1000
> at most unless you are insane, or something like that)?

You're definitely in the right ballpark. It depends on the repack
settings and size of repository, of course, but I imagine that roughly
1,000 entries would be the most anybody could ever pass (e.g., during a
`--geometric` repack, the biggest pack would have to contain 2^1000
times as many objects as the smallest pack).

Of course, you could just constantly be adding packs and doing
incremental `git repack -d --write-midx`. Seems unlikely to me, but if
it does become a problem we could easily read the values into a hashmap
and constant-ize the lookup.

But the scan is logarithmic, not linear, since the string list is
sorted.

Thanks,
Taylor
Taylor Blau Sept. 11, 2021, 4:21 p.m. UTC | #4
On Sat, Sep 11, 2021 at 12:07:57PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
> On Fri, Sep 10 2021, Taylor Blau wrote:
>
> > -		if (ctx->m && midx_contains_pack(ctx->m, file_name))
> > -			return;
> > +		if (ctx->m) {
> > +			if (midx_contains_pack(ctx->m, file_name))
> > +				return;
> > +		} else if (ctx->to_include) {
> > +			if (!string_list_has_string(ctx->to_include, file_name))
> > +				return;
> > +		}
>
> I think this is equivalent to:
>
> 		if (ctx->m && midx_contains_pack(ctx->m, file_name))
> 			return;
> 		else if (!ctx->m && string_list_has_string(...))
> 			return;

They are equivalent, but it took me a minute to convince myself ;). They
are only equivalent because *either* ctx->m or ctx->to_include is
non-NULL.

So it wouldn't be the case that ctx->m being set and the subsequent
midx_contains_pack() call returning zero would cause us to check the
to_include list, because to_include will be NULL if ctx->m isn't.

Anyway, that makes them equivalent, so I'm happy to clean it up and
decrease the nested-ness.

> > +int write_midx_file_only(const char *object_dir, struct string_list *packs_to_include, const char *preferred_pack_name, unsigned flags);
>
> Let's also line-wrap header changes like in the *.c file.

Sure.

Thanks,
Taylor
diff mbox series

Patch

diff --git a/midx.c b/midx.c
index 864034a6ad..29d1d107b3 100644
--- a/midx.c
+++ b/midx.c
@@ -475,6 +475,8 @@  struct write_midx_context {
 	uint32_t num_large_offsets;
 
 	int preferred_pack_idx;
+
+	struct string_list *to_include;
 };
 
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -484,8 +486,13 @@  static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 
 	if (ends_with(file_name, ".idx")) {
 		display_progress(ctx->progress, ++ctx->pack_paths_checked);
-		if (ctx->m && midx_contains_pack(ctx->m, file_name))
-			return;
+		if (ctx->m) {
+			if (midx_contains_pack(ctx->m, file_name))
+				return;
+		} else if (ctx->to_include) {
+			if (!string_list_has_string(ctx->to_include, file_name))
+				return;
+		}
 
 		ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);
 
@@ -1058,6 +1065,7 @@  static int write_midx_bitmap(char *midx_name, unsigned char *midx_hash,
 }
 
 static int write_midx_internal(const char *object_dir,
+			       struct string_list *packs_to_include,
 			       struct string_list *packs_to_drop,
 			       const char *preferred_pack_name,
 			       unsigned flags)
@@ -1082,10 +1090,17 @@  static int write_midx_internal(const char *object_dir,
 		die_errno(_("unable to create leading directories of %s"),
 			  midx_name);
 
-	for (cur = get_multi_pack_index(the_repository); cur; cur = cur->next) {
-		if (!strcmp(object_dir, cur->object_dir)) {
-			ctx.m = cur;
-			break;
+	if (!packs_to_include) {
+		/*
+		 * Only reference an existing MIDX when not filtering which
+		 * packs to include, since all packs and objects are copied
+		 * blindly from an existing MIDX if one is present.
+		 */
+		for (cur = get_multi_pack_index(the_repository); cur; cur = cur->next) {
+			if (!strcmp(object_dir, cur->object_dir)) {
+				ctx.m = cur;
+				break;
+			}
 		}
 	}
 
@@ -1136,10 +1151,13 @@  static int write_midx_internal(const char *object_dir,
 	else
 		ctx.progress = NULL;
 
+	ctx.to_include = packs_to_include;
+
 	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &ctx);
 	stop_progress(&ctx.progress);
 
-	if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop) {
+	if ((ctx.m && ctx.nr == ctx.m->num_packs) &&
+	    !(packs_to_include || packs_to_drop)) {
 		struct bitmap_index *bitmap_git;
 		int bitmap_exists;
 		int want_bitmap = flags & MIDX_WRITE_BITMAP;
@@ -1237,7 +1255,7 @@  static int write_midx_internal(const char *object_dir,
 
 	QSORT(ctx.info, ctx.nr, pack_info_compare);
 
-	if (packs_to_drop && packs_to_drop->nr) {
+	if (ctx.m && packs_to_drop && packs_to_drop->nr) {
 		int drop_index = 0;
 		int missing_drops = 0;
 
@@ -1380,7 +1398,17 @@  int write_midx_file(const char *object_dir,
 		    const char *preferred_pack_name,
 		    unsigned flags)
 {
-	return write_midx_internal(object_dir, NULL, preferred_pack_name, flags);
+	return write_midx_internal(object_dir, NULL, NULL, preferred_pack_name,
+				   flags);
+}
+
+int write_midx_file_only(const char *object_dir,
+			 struct string_list *packs_to_include,
+			 const char *preferred_pack_name,
+			 unsigned flags)
+{
+	return write_midx_internal(object_dir, packs_to_include, NULL,
+				   preferred_pack_name, flags);
 }
 
 struct clear_midx_data {
@@ -1660,7 +1688,7 @@  int expire_midx_packs(struct repository *r, const char *object_dir, unsigned fla
 	free(count);
 
 	if (packs_to_drop.nr) {
-		result = write_midx_internal(object_dir, &packs_to_drop, NULL, flags);
+		result = write_midx_internal(object_dir, NULL, &packs_to_drop, NULL, flags);
 		m = NULL;
 	}
 
@@ -1851,7 +1879,7 @@  int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 		goto cleanup;
 	}
 
-	result = write_midx_internal(object_dir, NULL, NULL, flags);
+	result = write_midx_internal(object_dir, NULL, NULL, NULL, flags);
 	m = NULL;
 
 cleanup:
diff --git a/midx.h b/midx.h
index aa3da557bb..aefa371c90 100644
--- a/midx.h
+++ b/midx.h
@@ -2,6 +2,7 @@ 
 #define MIDX_H
 
 #include "repository.h"
+#include "string-list.h"
 
 struct object_id;
 struct pack_entry;
@@ -62,6 +63,7 @@  int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local);
 
 int write_midx_file(const char *object_dir, const char *preferred_pack_name, unsigned flags);
+int write_midx_file_only(const char *object_dir, struct string_list *packs_to_include, const char *preferred_pack_name, unsigned flags);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(struct repository *r, const char *object_dir, unsigned flags);
 int expire_midx_packs(struct repository *r, const char *object_dir, unsigned flags);