Message ID | 4218d9e08aba629d8f64b5a999f60d12e5d8785b.1663706401.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | midx: use `--stdin-packs` to implement `repack` | expand |
On 9/20/2022 4:40 PM, Taylor Blau wrote: > + for (i = 0; i < m->num_packs; i++) { > + struct repack_info *info = &pack_info[i]; > > - if (!include_pack[pack_int_id]) > - continue; ... > + fprintf(cmd_in, "%s%s\n", include_pack[info->pack_int_id] ? "" : "^", scratch.buf); Outside of how the object set is provided (a list of objects or a list of packs) it seems this is equivalent. The only difference is that we are writing a line for _every_ pack in the multi-pack-index, but we preface the line with "^" to say "not this pack". Do you know if there is any reason to do this explicitly? Does this change the set of objects in any way (perhaps by not including duplicates that are tracked in those other packs)? Personally, I would have kept the "if (...) continue;" pattern, so maybe you had a concrete idea why this approach is better. Thanks, -Stolee
On Fri, Sep 23, 2022 at 02:23:30PM -0400, Derrick Stolee wrote: > Do you know if there is any reason to do this explicitly? Does this > change the set of objects in any way (perhaps by not including > duplicates that are tracked in those other packs)? Yes. The "^" lines become excluded packs from the perspective of the follow-on reachability traversal to discover the namehashes. So as soon as we hit an object contained in one of the packs marked as excluded, we'll halt the traversal along that direction, since we know that we're not going to pick up any objects in those packs. So you could omit them, and you'd get the same resulting pack, but it would take longer to generate since we wouldn't be stopping the follow-on traversal as early as possible. Thanks, Taylor
On 9/23/2022 2:28 PM, Taylor Blau wrote: > On Fri, Sep 23, 2022 at 02:23:30PM -0400, Derrick Stolee wrote: >> Do you know if there is any reason to do this explicitly? Does this >> change the set of objects in any way (perhaps by not including >> duplicates that are tracked in those other packs)? > > Yes. The "^" lines become excluded packs from the perspective of the > follow-on reachability traversal to discover the namehashes. So as soon > as we hit an object contained in one of the packs marked as excluded, > we'll halt the traversal along that direction, since we know that we're > not going to pick up any objects in those packs. > > So you could omit them, and you'd get the same resulting pack, but it > would take longer to generate since we wouldn't be stopping the > follow-on traversal as early as possible. Excellent. Thanks for explaining that! -Stolee
diff --git a/midx.c b/midx.c index db0a70c5af..5fbf964bba 100644 --- a/midx.c +++ b/midx.c @@ -1955,6 +1955,7 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size, struct child_process cmd = CHILD_PROCESS_INIT; FILE *cmd_in; struct strbuf base_name = STRBUF_INIT; + struct strbuf scratch = STRBUF_INIT; struct multi_pack_index *m = lookup_multi_pack_index(r, object_dir); struct repack_info *pack_info = NULL; @@ -1996,7 +1997,7 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size, repo_config_get_bool(r, "repack.usedeltabaseoffset", &delta_base_offset); repo_config_get_bool(r, "repack.usedeltaislands", &use_delta_islands); - strvec_push(&cmd.args, "pack-objects"); + strvec_pushl(&cmd.args, "pack-objects", "--stdin-packs", NULL); strbuf_addstr(&base_name, object_dir); strbuf_addstr(&base_name, "/pack/pack"); @@ -2025,17 +2026,19 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size, cmd_in = xfdopen(cmd.in, "w"); - for (i = 0; i < m->num_objects; i++) { - struct object_id oid; - uint32_t pack_int_id = nth_midxed_pack_int_id(m, i); + for (i = 0; i < m->num_packs; i++) { + struct repack_info *info = &pack_info[i]; - if (!include_pack[pack_int_id]) - continue; + strbuf_reset(&scratch); - nth_midxed_object_oid(&oid, m, i); - fprintf(cmd_in, "%s\n", oid_to_hex(&oid)); + strbuf_addstr(&scratch, m->pack_names[info->pack_int_id]); + strbuf_strip_suffix(&scratch, ".idx"); + strbuf_addstr(&scratch, ".pack"); + + fprintf(cmd_in, "%s%s\n", include_pack[info->pack_int_id] ? "" : "^", scratch.buf); } fclose(cmd_in); + strbuf_release(&scratch); if (finish_command(&cmd)) { error(_("could not finish pack-objects"));
The `git multi-pack-index repack` command aggregates all objects in certain packs contained in the MIDX, largely dictated by the value of its `--batch-size` parameter. To create the new pack, the MIDX machinery spawns a `pack-objects` sub-process, which it feeds the object IDs in each of the pack(s) that it wants to aggregate. This implementation, which dates back to ce1e4a105b (midx: implement midx_repack(), 2019-06-10), predates the `--stdin-packs` mode in `pack-objects`. This mode (which was introduced a couple of years later in 339bce27f4 (builtin/pack-objects.c: add '--stdin-packs' option, 2021-02-22)) allows `pack-objects` callers to specify the contents of the output pack by indicating which packs to aggregate. This patch replaces the pre-`--stdin-packs` invocation (where each object is given to `pack-objects` one by one) with the more modern `--stdin-packs` option. This allows us to avoid some CPU cycles serializing and deserializing every object ID in all of the packs we're aggregating. It also avoids us having to send a potentially large amount of data down to `pack-objects`. But more importantly, it generates slightly higher quality (read: more tightly compressed) packs, because of the reachability traversal that `--stdin-packs` does after the fact in order to gather namehash values which seed the delta selection process. In practice, this seems to add a slight amount of overhead (on the order of a few seconds for git.git broken up into ~100 packs), in exchange for a modest reduction (on the order of ~3.5%) in the resulting pack size. Signed-off-by: Taylor Blau <me@ttaylorr.com> --- midx.c | 19 +++++++++++-------- 1 file changed, 11 insertions(+), 8 deletions(-)