Message ID | 0dd26bb584bb7e8b9616eb32f7b1235462df77fa.1598380599.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Maintenance II: prefetch, loose-objects, incremental-repack tasks | expand |
> +incremental-repack:: > + The `incremental-repack` job repacks the object directory > + using the `multi-pack-index` feature. In order to prevent race > + conditions with concurrent Git commands, it follows a two-step > + process. [snip] > First, it deletes any pack-files included in the > + `multi-pack-index` where none of the objects in the > + `multi-pack-index` reference those pack-files; this only happens > + if all objects in the pack-file are also stored in a newer > + pack-file. Second, it selects a group of pack-files whose "expected > + size" is below the batch size until the group has total expected > + size at least the batch size; see the `--batch-size` option for > + the `repack` subcommand in linkgit:git-multi-pack-index[1]. The > + default batch-size is zero, which is a special case that attempts > + to repack all pack-files into a single pack-file. This lacks the detail of what happens to the selected group of packfiles (in the second step) - in particular, that a new packfile is created and the MIDX is rewritten so that all references to the selected group are updated to refer to the new packfile instead, thus making it possible to delete the selected group of packfiles in a subsequent first step. All this is explained in the documentation of git-multi-pack-index (expire and repack), though, so it might be better to refer to that. E.g. First, it calls `multi-pack-index expire` to delete packfiles unreferenced by the MIDX file. Second, it calls `multi-pack-index repack` to select several small packfiles and repack them into a bigger one, and then update the MIDX entries that refer to the small packfiles to refer to the big one instead, thus preparing it for deletion upon a subsequent `multi-pack-index expire` invocation. The selection of the small packfiles is such that the expected size of the big packfile is at least the batch size; see the ... > diff --git a/midx.c b/midx.c > index aa37d5da86..66d7053d83 100644 > --- a/midx.c > +++ b/midx.c > @@ -37,7 +37,7 @@ > > #define PACK_EXPIRED UINT_MAX > > -static char *get_midx_filename(const char *object_dir) > +char *get_midx_filename(const char *object_dir) > { > return xstrfmt("%s/pack/multi-pack-index", object_dir); > } > diff --git a/midx.h b/midx.h > index b18cf53bc4..baeecc70c9 100644 > --- a/midx.h > +++ b/midx.h > @@ -37,6 +37,7 @@ struct multi_pack_index { > > #define MIDX_PROGRESS (1 << 0) > > +char *get_midx_filename(const char *object_dir); > struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local); > int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id); > int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result); Do we need get_midx_filename() to be global?
On 9/22/2020 7:26 PM, Jonathan Tan wrote: >> +incremental-repack:: >> + The `incremental-repack` job repacks the object directory >> + using the `multi-pack-index` feature. In order to prevent race >> + conditions with concurrent Git commands, it follows a two-step >> + process. > > [snip] > >> First, it deletes any pack-files included in the >> + `multi-pack-index` where none of the objects in the >> + `multi-pack-index` reference those pack-files; this only happens >> + if all objects in the pack-file are also stored in a newer >> + pack-file. Second, it selects a group of pack-files whose "expected >> + size" is below the batch size until the group has total expected >> + size at least the batch size; see the `--batch-size` option for >> + the `repack` subcommand in linkgit:git-multi-pack-index[1]. The >> + default batch-size is zero, which is a special case that attempts >> + to repack all pack-files into a single pack-file. > > This lacks the detail of what happens to the selected group of packfiles > (in the second step) - in particular, that a new packfile is created and > the MIDX is rewritten so that all references to the selected group are > updated to refer to the new packfile instead, thus making it possible to > delete the selected group of packfiles in a subsequent first step. All > this is explained in the documentation of git-multi-pack-index (expire > and repack), though, so it might be better to refer to that. E.g. Here is my attempt to incorporate your recommendations into this doc: incremental-repack:: The `incremental-repack` job repacks the object directory using the `multi-pack-index` feature. In order to prevent race conditions with concurrent Git commands, it follows a two-step process. First, it calls `git multi-pack-index expire` to delete pack-files unreferenced by the `multi-pack-index` file. Second, it calls `git multi-pack-index repack` to select several small pack-files and repack them into a bigger one, and then update the `multi-pack-index` entries that refer to the small pack-files to refer to the new pack-file. This prepares those small pack-files for deletion upon the next run of `git multi-pack-index expire`. The selection of the small pack-files is such that the expected size of the big pack-file is at least the batch size; see the `--batch-size` option for the `repack` subcommand in linkgit:git-multi-pack-index[1]. The default batch-size is zero, which is a special case that attempts to repack all pack-files into a single pack-file. > Do we need get_midx_filename() to be global? No, that does not appear to be important (to this patch). It _was_ necessary when doing the "verify, then delete if problematic" mode. Thanks for catching it now that it is not necessary. Thanks, -Stolee
> Here is my attempt to incorporate your recommendations into this doc: > > incremental-repack:: > The `incremental-repack` job repacks the object directory > using the `multi-pack-index` feature. In order to prevent race > conditions with concurrent Git commands, it follows a two-step > process. First, it calls `git multi-pack-index expire` to delete > pack-files unreferenced by the `multi-pack-index` file. Second, it > calls `git multi-pack-index repack` to select several small > pack-files and repack them into a bigger one, and then update the > `multi-pack-index` entries that refer to the small pack-files to > refer to the new pack-file. This prepares those small pack-files > for deletion upon the next run of `git multi-pack-index expire`. > The selection of the small pack-files is such that the expected > size of the big pack-file is at least the batch size; see the > `--batch-size` option for the `repack` subcommand in > linkgit:git-multi-pack-index[1]. The default batch-size is zero, > which is a special case that attempts to repack all pack-files > into a single pack-file. Thanks, this looks good.
diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt index fc95eb594f..b44efb05a3 100644 --- a/Documentation/git-maintenance.txt +++ b/Documentation/git-maintenance.txt @@ -85,6 +85,21 @@ loose-objects:: advisable to enable both the `loose-objects` and `gc` tasks at the same time. +incremental-repack:: + The `incremental-repack` job repacks the object directory + using the `multi-pack-index` feature. In order to prevent race + conditions with concurrent Git commands, it follows a two-step + process. First, it deletes any pack-files included in the + `multi-pack-index` where none of the objects in the + `multi-pack-index` reference those pack-files; this only happens + if all objects in the pack-file are also stored in a newer + pack-file. Second, it selects a group of pack-files whose "expected + size" is below the batch size until the group has total expected + size at least the batch size; see the `--batch-size` option for + the `repack` subcommand in linkgit:git-multi-pack-index[1]. The + default batch-size is zero, which is a special case that attempts + to repack all pack-files into a single pack-file. + OPTIONS ------- --auto:: diff --git a/builtin/gc.c b/builtin/gc.c index 25245bcc10..fbf84996fa 100644 --- a/builtin/gc.c +++ b/builtin/gc.c @@ -30,6 +30,7 @@ #include "promisor-remote.h" #include "refs.h" #include "remote.h" +#include "midx.h" #define FAILED_RUN "failed to run %s" @@ -1001,6 +1002,77 @@ static int maintenance_task_loose_objects(struct maintenance_run_opts *opts) return prune_packed(opts) || pack_loose(opts); } +static int multi_pack_index_write(struct maintenance_run_opts *opts) +{ + struct child_process child = CHILD_PROCESS_INIT; + + child.git_cmd = 1; + strvec_pushl(&child.args, "multi-pack-index", "write", NULL); + + if (opts->quiet) + strvec_push(&child.args, "--no-progress"); + + if (run_command(&child)) + return error(_("failed to write multi-pack-index")); + + return 0; +} + +static int multi_pack_index_expire(struct maintenance_run_opts *opts) +{ + struct child_process child = CHILD_PROCESS_INIT; + + child.git_cmd = 1; + strvec_pushl(&child.args, "multi-pack-index", "expire", NULL); + + if (opts->quiet) + strvec_push(&child.args, "--no-progress"); + + close_object_store(the_repository->objects); + + if (run_command(&child)) + return error(_("'git multi-pack-index expire' failed")); + + return 0; +} + +static int multi_pack_index_repack(struct maintenance_run_opts *opts) +{ + struct child_process child = CHILD_PROCESS_INIT; + + child.git_cmd = 1; + strvec_pushl(&child.args, "multi-pack-index", "repack", NULL); + + if (opts->quiet) + strvec_push(&child.args, "--no-progress"); + + strvec_push(&child.args, "--batch-size=0"); + + close_object_store(the_repository->objects); + + if (run_command(&child)) + return error(_("'git multi-pack-index repack' failed")); + + return 0; +} + +static int maintenance_task_incremental_repack(struct maintenance_run_opts *opts) +{ + prepare_repo_settings(the_repository); + if (!the_repository->settings.core_multi_pack_index) { + warning(_("skipping incremental-repack task because core.multiPackIndex is disabled")); + return 0; + } + + if (multi_pack_index_write(opts)) + return 1; + if (multi_pack_index_expire(opts)) + return 1; + if (multi_pack_index_repack(opts)) + return 1; + return 0; +} + typedef int maintenance_task_fn(struct maintenance_run_opts *opts); /* @@ -1023,6 +1095,7 @@ struct maintenance_task { enum maintenance_task_label { TASK_PREFETCH, TASK_LOOSE_OBJECTS, + TASK_INCREMENTAL_REPACK, TASK_GC, TASK_COMMIT_GRAPH, @@ -1040,6 +1113,10 @@ static struct maintenance_task tasks[] = { maintenance_task_loose_objects, loose_object_auto_condition, }, + [TASK_INCREMENTAL_REPACK] = { + "incremental-repack", + maintenance_task_incremental_repack, + }, [TASK_GC] = { "gc", maintenance_task_gc, diff --git a/midx.c b/midx.c index aa37d5da86..66d7053d83 100644 --- a/midx.c +++ b/midx.c @@ -37,7 +37,7 @@ #define PACK_EXPIRED UINT_MAX -static char *get_midx_filename(const char *object_dir) +char *get_midx_filename(const char *object_dir) { return xstrfmt("%s/pack/multi-pack-index", object_dir); } diff --git a/midx.h b/midx.h index b18cf53bc4..baeecc70c9 100644 --- a/midx.h +++ b/midx.h @@ -37,6 +37,7 @@ struct multi_pack_index { #define MIDX_PROGRESS (1 << 0) +char *get_midx_filename(const char *object_dir); struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local); int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id); int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result); diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index ec87f616c6..2f942ee1fa 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -3,6 +3,7 @@ test_description='multi-pack-indexes' . ./test-lib.sh +GIT_TEST_MULTI_PACK_INDEX=0 objdir=.git/objects midx_read_expect () { diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh index efda1cf69b..dde28cf837 100755 --- a/t/t7900-maintenance.sh +++ b/t/t7900-maintenance.sh @@ -5,6 +5,7 @@ test_description='git maintenance builtin' . ./test-lib.sh GIT_TEST_COMMIT_GRAPH=0 +GIT_TEST_MULTI_PACK_INDEX=0 test_expect_success 'help text' ' test_expect_code 129 git maintenance -h 2>err && @@ -150,4 +151,41 @@ test_expect_success 'maintenance.loose-objects.auto' ' done ' +test_expect_success 'incremental-repack task' ' + packDir=.git/objects/pack && + for i in $(test_seq 1 5) + do + test_commit $i || return 1 + done && + + # Create three disjoint pack-files with size BIG, small, small. + echo HEAD~2 | git pack-objects --revs $packDir/test-1 && + test_tick && + git pack-objects --revs $packDir/test-2 <<-\EOF && + HEAD~1 + ^HEAD~2 + EOF + test_tick && + git pack-objects --revs $packDir/test-3 <<-\EOF && + HEAD + ^HEAD~1 + EOF + rm -f $packDir/pack-* && + rm -f $packDir/loose-* && + ls $packDir/*.pack >packs-before && + test_line_count = 3 packs-before && + + # the job repacks the two into a new pack, but does not + # delete the old ones. + git maintenance run --task=incremental-repack && + ls $packDir/*.pack >packs-between && + test_line_count = 4 packs-between && + + # the job deletes the two old packs, and does not write + # a new one because only one pack remains. + git maintenance run --task=incremental-repack && + ls .git/objects/pack/*.pack >packs-after && + test_line_count = 1 packs-after +' + test_done