Message ID | 4afa03b972a1885c60fbf3716f22a7ab58056383.1631331139.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | repack: introduce `--write-midx` | expand |
Taylor Blau <me@ttaylorr.com> writes: > if (ends_with(file_name, ".idx")) { > display_progress(ctx->progress, ++ctx->pack_paths_checked); > - if (ctx->m && midx_contains_pack(ctx->m, file_name)) > - return; > + if (ctx->m) { > + if (midx_contains_pack(ctx->m, file_name)) > + return; > + } else if (ctx->to_include) { > + if (!string_list_has_string(ctx->to_include, file_name)) > + return; What's the expected number of elements on the to_include list? I am wondering about the performance implications of using linear search over the string-list, of course. Is it about the same order of the number of packfiles in a repository (up to several dozens, or 1000 at most unless you are insane, or something like that)?
On Fri, Sep 10 2021, Taylor Blau wrote: > - if (ctx->m && midx_contains_pack(ctx->m, file_name)) > - return; > + if (ctx->m) { > + if (midx_contains_pack(ctx->m, file_name)) > + return; > + } else if (ctx->to_include) { > + if (!string_list_has_string(ctx->to_include, file_name)) > + return; > + } I think this is equivalent to: if (ctx->m && midx_contains_pack(ctx->m, file_name)) return; else if (!ctx->m && string_list_has_string(...)) return; Just a suggestion/tip to reduce the diff size/nested if/if blocks. > +int write_midx_file_only(const char *object_dir, struct string_list *packs_to_include, const char *preferred_pack_name, unsigned flags); Let's also line-wrap header changes like in the *.c file.
On Fri, Sep 10, 2021 at 10:00:26PM -0700, Junio C Hamano wrote: > Taylor Blau <me@ttaylorr.com> writes: > > > if (ends_with(file_name, ".idx")) { > > display_progress(ctx->progress, ++ctx->pack_paths_checked); > > - if (ctx->m && midx_contains_pack(ctx->m, file_name)) > > - return; > > + if (ctx->m) { > > + if (midx_contains_pack(ctx->m, file_name)) > > + return; > > + } else if (ctx->to_include) { > > + if (!string_list_has_string(ctx->to_include, file_name)) > > + return; > > What's the expected number of elements on the to_include list? I am > wondering about the performance implications of using linear search > over the string-list, of course. Is it about the same order of the > number of packfiles in a repository (up to several dozens, or 1000 > at most unless you are insane, or something like that)? You're definitely in the right ballpark. It depends on the repack settings and size of repository, of course, but I imagine that roughly 1,000 entries would be the most anybody could ever pass (e.g., during a `--geometric` repack, the biggest pack would have to contain 2^1000 times as many objects as the smallest pack). Of course, you could just constantly be adding packs and doing incremental `git repack -d --write-midx`. Seems unlikely to me, but if it does become a problem we could easily read the values into a hashmap and constant-ize the lookup. But the scan is logarithmic, not linear, since the string list is sorted. Thanks, Taylor
On Sat, Sep 11, 2021 at 12:07:57PM +0200, Ævar Arnfjörð Bjarmason wrote: > > On Fri, Sep 10 2021, Taylor Blau wrote: > > > - if (ctx->m && midx_contains_pack(ctx->m, file_name)) > > - return; > > + if (ctx->m) { > > + if (midx_contains_pack(ctx->m, file_name)) > > + return; > > + } else if (ctx->to_include) { > > + if (!string_list_has_string(ctx->to_include, file_name)) > > + return; > > + } > > I think this is equivalent to: > > if (ctx->m && midx_contains_pack(ctx->m, file_name)) > return; > else if (!ctx->m && string_list_has_string(...)) > return; They are equivalent, but it took me a minute to convince myself ;). They are only equivalent because *either* ctx->m or ctx->to_include is non-NULL. So it wouldn't be the case that ctx->m being set and the subsequent midx_contains_pack() call returning zero would cause us to check the to_include list, because to_include will be NULL if ctx->m isn't. Anyway, that makes them equivalent, so I'm happy to clean it up and decrease the nested-ness. > > +int write_midx_file_only(const char *object_dir, struct string_list *packs_to_include, const char *preferred_pack_name, unsigned flags); > > Let's also line-wrap header changes like in the *.c file. Sure. Thanks, Taylor
diff --git a/midx.c b/midx.c index 864034a6ad..29d1d107b3 100644 --- a/midx.c +++ b/midx.c @@ -475,6 +475,8 @@ struct write_midx_context { uint32_t num_large_offsets; int preferred_pack_idx; + + struct string_list *to_include; }; static void add_pack_to_midx(const char *full_path, size_t full_path_len, @@ -484,8 +486,13 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len, if (ends_with(file_name, ".idx")) { display_progress(ctx->progress, ++ctx->pack_paths_checked); - if (ctx->m && midx_contains_pack(ctx->m, file_name)) - return; + if (ctx->m) { + if (midx_contains_pack(ctx->m, file_name)) + return; + } else if (ctx->to_include) { + if (!string_list_has_string(ctx->to_include, file_name)) + return; + } ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc); @@ -1058,6 +1065,7 @@ static int write_midx_bitmap(char *midx_name, unsigned char *midx_hash, } static int write_midx_internal(const char *object_dir, + struct string_list *packs_to_include, struct string_list *packs_to_drop, const char *preferred_pack_name, unsigned flags) @@ -1082,10 +1090,17 @@ static int write_midx_internal(const char *object_dir, die_errno(_("unable to create leading directories of %s"), midx_name); - for (cur = get_multi_pack_index(the_repository); cur; cur = cur->next) { - if (!strcmp(object_dir, cur->object_dir)) { - ctx.m = cur; - break; + if (!packs_to_include) { + /* + * Only reference an existing MIDX when not filtering which + * packs to include, since all packs and objects are copied + * blindly from an existing MIDX if one is present. + */ + for (cur = get_multi_pack_index(the_repository); cur; cur = cur->next) { + if (!strcmp(object_dir, cur->object_dir)) { + ctx.m = cur; + break; + } } } @@ -1136,10 +1151,13 @@ static int write_midx_internal(const char *object_dir, else ctx.progress = NULL; + ctx.to_include = packs_to_include; + for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &ctx); stop_progress(&ctx.progress); - if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop) { + if ((ctx.m && ctx.nr == ctx.m->num_packs) && + !(packs_to_include || packs_to_drop)) { struct bitmap_index *bitmap_git; int bitmap_exists; int want_bitmap = flags & MIDX_WRITE_BITMAP; @@ -1237,7 +1255,7 @@ static int write_midx_internal(const char *object_dir, QSORT(ctx.info, ctx.nr, pack_info_compare); - if (packs_to_drop && packs_to_drop->nr) { + if (ctx.m && packs_to_drop && packs_to_drop->nr) { int drop_index = 0; int missing_drops = 0; @@ -1380,7 +1398,17 @@ int write_midx_file(const char *object_dir, const char *preferred_pack_name, unsigned flags) { - return write_midx_internal(object_dir, NULL, preferred_pack_name, flags); + return write_midx_internal(object_dir, NULL, NULL, preferred_pack_name, + flags); +} + +int write_midx_file_only(const char *object_dir, + struct string_list *packs_to_include, + const char *preferred_pack_name, + unsigned flags) +{ + return write_midx_internal(object_dir, packs_to_include, NULL, + preferred_pack_name, flags); } struct clear_midx_data { @@ -1660,7 +1688,7 @@ int expire_midx_packs(struct repository *r, const char *object_dir, unsigned fla free(count); if (packs_to_drop.nr) { - result = write_midx_internal(object_dir, &packs_to_drop, NULL, flags); + result = write_midx_internal(object_dir, NULL, &packs_to_drop, NULL, flags); m = NULL; } @@ -1851,7 +1879,7 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size, goto cleanup; } - result = write_midx_internal(object_dir, NULL, NULL, flags); + result = write_midx_internal(object_dir, NULL, NULL, NULL, flags); m = NULL; cleanup: diff --git a/midx.h b/midx.h index aa3da557bb..aefa371c90 100644 --- a/midx.h +++ b/midx.h @@ -2,6 +2,7 @@ #define MIDX_H #include "repository.h" +#include "string-list.h" struct object_id; struct pack_entry; @@ -62,6 +63,7 @@ int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name) int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local); int write_midx_file(const char *object_dir, const char *preferred_pack_name, unsigned flags); +int write_midx_file_only(const char *object_dir, struct string_list *packs_to_include, const char *preferred_pack_name, unsigned flags); void clear_midx_file(struct repository *r); int verify_midx_file(struct repository *r, const char *object_dir, unsigned flags); int expire_midx_packs(struct repository *r, const char *object_dir, unsigned flags);
Expose a variant of the write_midx_file() function which ignores packs that aren't included in an explicit "allow" list. This will be used in an upcoming patch to power a new `--stdin-packs` mode of `git multi-pack-index write` for callers that only want to include certain packs in a MIDX (and ignore any packs which may have happened to enter the repository independently, e.g., from pushes). Those patches will provide test coverage for this new function. Signed-off-by: Taylor Blau <me@ttaylorr.com> --- midx.c | 50 +++++++++++++++++++++++++++++++++++++++----------- midx.h | 2 ++ 2 files changed, 41 insertions(+), 11 deletions(-)