Message ID | pull.1818.git.1730356023.gitgitgadget@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | PATH WALK I: The path-walk API | expand |
On 10/31/24 2:26 AM, Derrick Stolee via GitGitGadget wrote: > This is a new series that rerolls the initial "path-walk API" patches of my > RFC [1] "Path-walk API and applications". This new API (in path-walk.c and > path-walk.h) presents a new way to walk objects such that trees and blobs > are walked in batches according to their path. > > This also replaces the previous version of ds/path-walk that was being > reviewed in [2]. The consensus was that the series was too long/dense and > could use some reduction in size. This series takes the first few patches, > but also makes some updates (which will be described later). > > [1] > https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/ > [RFC] Path-walk API and applications > > [2] > https://lore.kernel.org/git/pull.1813.v2.git.1729431810.gitgitgadget@gmail.com/ > [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas ... > I will include a full range diff relative to the previous versions of these > patches in [2] in a reply to this cover letter. Here is the promised range-diff: 1: 98bdc94a773 ! 1: c71f0a0e361 path-walk: introduce an object walk by path @@ Commit message In anticipation of a few planned applications, introduce the most basic form of a path-walk API. It currently assumes that there are no UNINTERESTING - objects and does not include any complicated filters. It calls a function + objects, and does not include any complicated filters. It calls a function pointer on groups of tree and blob objects as grouped by path. This only includes objects the first time they are discovered, so an object that appears at multiple paths will not be included in two batches. + These batches are collected in 'struct type_and_oid_list' objects, which + store an object type and an oid_array of objects. + + The data structures are documented in 'struct path_walk_context', but in + summary the most important are: + + * 'paths_to_lists' is a strmap that connects a path to a + type_and_oid_list for that path. To avoid conflicts in path names, + we make sure that tree paths end in "/" (except the root path with + is an empty string) and blob paths do not end in "/". + + * 'path_stack' is a string list that is added to in an append-only + way. This stores the stack of our depth-first search on the heap + instead of using recursion. + + * 'path_stack_pushed' is a strmap that stores path names that were + already added to 'path_stack', to avoid repeating paths in the + stack. Mostly, this saves us from quadratic lookups from doing + unsorted checks into the string_list. + + The coupling of 'path_stack' and 'path_stack_pushed' is protected by the + push_to_stack() method. Call this instead of inserting into these + structures directly. + + The walk_objects_by_path() method initializes these structures and + starts walking commits from the given rev_info struct. The commits are + used to find the list of root trees which populate the start of our + depth-first search. + + The core of our depth-first search is in a while loop that continues + while we have not indicated an early exit and our 'path_stack' still has + entries in it. The loop body pops a path off of the stack and "visits" + the path via the walk_path() method. + + The walk_path() method gets the list of OIDs from the 'path_to_lists' + strmap and executes the callback method on that list with the given path + and type. If the OIDs correspond to tree objects, then iterate over all + trees in the list and run add_children() to add the child objects to + their own lists, adding new entries to the stack if necessary. + + In testing, this depth-first search approach was the one that used the + least memory while iterating over the object lists. There is still a + chance that repositories with too-wide path patterns could cause memory + pressure issues. Limiting the stack size could be done in the future by + limiting how many objects are being considered in-progress, or by + visiting blob paths earlier than trees. + There are many future adaptations that could be made, but they are left for future updates when consumers are ready to take advantage of those features. @@ Documentation/technical/api-path-walk.txt (new) +multiple paths possible to reach the same object, then only one of those +paths is used to visit the object. + -+When walking a range of commits with some `UNINTERESTING` objects, the -+objects with the `UNINTERESTING` flag are included in these batches. In -+order to walk `UNINTERESTING` objects, the `--boundary` option must be -+used in the commit walk in order to visit `UNINTERESTING` commits. -+ +Basics +------ + @@ Documentation/technical/api-path-walk.txt (new) +`revs` struct. The revision walk should only be used to walk commits, and +the objects will be walked in a separate way based on those starting +commits. -++ -+If you want the path-walk API to emit `UNINTERESTING` objects based on the -+commit walk's boundary, be sure to set `revs.boundary` so the boundary -+commits are emitted. + +Examples +-------- @@ path-walk.c (new) + /** + * Store the current list of paths in a stack, to + * facilitate depth-first-search without recursion. ++ * ++ * Use path_stack_pushed to indicate whether a path ++ * was previously added to path_stack. + */ + struct string_list path_stack; ++ struct strset path_stack_pushed; +}; + ++static void push_to_stack(struct path_walk_context *ctx, ++ const char *path) ++{ ++ if (strset_contains(&ctx->path_stack_pushed, path)) ++ return; ++ ++ strset_add(&ctx->path_stack_pushed, path); ++ string_list_append(&ctx->path_stack, path); ++} ++ +static int add_children(struct path_walk_context *ctx, + const char *base_path, + struct object_id *oid) @@ path-walk.c (new) + if (!o) /* report error?*/ + continue; + -+ /* Skip this object if already seen. */ -+ if (o->flags & SEEN) -+ continue; -+ o->flags |= SEEN; -+ + strbuf_setlen(&path, base_len); + strbuf_add(&path, entry.path, entry.pathlen); + @@ path-walk.c (new) + CALLOC_ARRAY(list, 1); + list->type = type; + strmap_put(&ctx->paths_to_lists, path.buf, list); -+ string_list_append(&ctx->path_stack, path.buf); + } ++ push_to_stack(ctx, path.buf); ++ ++ /* Skip this object if already seen. */ ++ if (o->flags & SEEN) ++ continue; ++ o->flags |= SEEN; + oid_array_append(&list->oids, &entry.oid); + } + @@ path-walk.c (new) + .revs = info->revs, + .info = info, + .path_stack = STRING_LIST_INIT_DUP, ++ .path_stack_pushed = STRSET_INIT, + .paths_to_lists = STRMAP_INIT + }; + @@ path-walk.c (new) + CALLOC_ARRAY(root_tree_list, 1); + root_tree_list->type = OBJ_TREE; + strmap_put(&ctx.paths_to_lists, root_path, root_tree_list); ++ push_to_stack(&ctx, root_path); + + if (prepare_revision_walk(info->revs)) + die(_("failed to setup revision walk")); + + while ((c = get_revision(info->revs))) { + struct object_id *oid = get_commit_tree_oid(c); -+ struct tree *t = lookup_tree(info->revs->repo, oid); ++ struct tree *t; + commits_nr++; + -+ if (t) { -+ if (t->object.flags & SEEN) -+ continue; -+ t->object.flags |= SEEN; -+ oid_array_append(&root_tree_list->oids, oid); -+ } else { ++ oid = get_commit_tree_oid(c); ++ t = lookup_tree(info->revs->repo, oid); ++ ++ if (!t) { + warning("could not find tree %s", oid_to_hex(oid)); ++ continue; + } ++ ++ if (t->object.flags & SEEN) ++ continue; ++ t->object.flags |= SEEN; ++ oid_array_append(&root_tree_list->oids, oid); + } + + trace2_data_intmax("path-walk", ctx.repo, "commits", commits_nr); + trace2_region_leave("path-walk", "commit-walk", info->revs->repo); + -+ string_list_append(&ctx.path_stack, root_path); -+ + trace2_region_enter("path-walk", "path-walk", info->revs->repo); + while (!ret && ctx.path_stack.nr) { + char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string; @@ path-walk.c (new) + trace2_region_leave("path-walk", "path-walk", info->revs->repo); + + clear_strmap(&ctx.paths_to_lists); ++ strset_clear(&ctx.path_stack_pushed); + string_list_clear(&ctx.path_stack, 0); + return ret; +} 5: 6e89fb219b5 ! 2: 4f9f898fec1 revision: create mark_trees_uninteresting_dense() @@ Metadata Author: Derrick Stolee <stolee@gmail.com> ## Commit message ## - revision: create mark_trees_uninteresting_dense() + test-lib-functions: add test_cmp_sorted - The sparse tree walk algorithm was created in d5d2e93577e (revision: - implement sparse algorithm, 2019-01-16) and involves using the - mark_trees_uninteresting_sparse() method. This method takes a repository - and an oidset of tree IDs, some of which have the UNINTERESTING flag and - some of which do not. - - Create a method that has an equivalent set of preconditions but uses a - "dense" walk (recursively visits all reachable trees, as long as they - have not previously been marked UNINTERESTING). This is an important - difference from mark_tree_uninteresting(), which short-circuits if the - given tree has the UNINTERESTING flag. - - A use of this method will be added in a later change, with a condition - set whether the sparse or dense approach should be used. + This test helper will be helpful to reduce repeated logic in + t6601-path-walk.sh, but may be helpful elsewhere, too. Signed-off-by: Derrick Stolee <stolee@gmail.com> - ## revision.c ## -@@ revision.c: static void add_children_by_path(struct repository *r, - free_tree_buffer(tree); + ## t/test-lib-functions.sh ## +@@ t/test-lib-functions.sh: test_cmp () { + eval "$GIT_TEST_CMP" '"$@"' } -+void mark_trees_uninteresting_dense(struct repository *r, -+ struct oidset *trees) -+{ -+ struct object_id *oid; -+ struct oidset_iter iter; -+ -+ oidset_iter_init(trees, &iter); -+ while ((oid = oidset_iter_next(&iter))) { -+ struct tree *tree = lookup_tree(r, oid); ++# test_cmp_sorted runs test_cmp on sorted versions of the two ++# input files. Uses "$1.sorted" and "$2.sorted" as temp files. + -+ if (tree && (tree->object.flags & UNINTERESTING)) -+ mark_tree_contents_uninteresting(r, tree); -+ } ++test_cmp_sorted () { ++ sort <"$1" >"$1.sorted" && ++ sort <"$2" >"$2.sorted" && ++ test_cmp "$1.sorted" "$2.sorted" && ++ rm "$1.sorted" "$2.sorted" +} + - void mark_trees_uninteresting_sparse(struct repository *r, - struct oidset *trees) - { - - ## revision.h ## -@@ revision.h: void put_revision_mark(const struct rev_info *revs, - - void mark_parents_uninteresting(struct rev_info *revs, struct commit *commit); - void mark_tree_uninteresting(struct repository *r, struct tree *tree); -+void mark_trees_uninteresting_dense(struct repository *r, struct oidset *trees); - void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *trees); - - void show_object_with_name(FILE *, struct object *, const char *); + # Check that the given config key has the expected value. + # + # test_cmp_config [-C <dir>] <expected-value> 2: a00ab0c62c9 ! 3: 6f93dff88e7 t6601: add helper for testing path-walk API @@ Commit message sets a baseline for the behavior and we can extend it as new options are introduced. + It is important to mention that the behavior of the API will change soon as + we start to handle UNINTERESTING objects differently, but these tests will + demonstrate the change in behavior. + Signed-off-by: Derrick Stolee <stolee@gmail.com> ## Documentation/technical/api-path-walk.txt ## -@@ Documentation/technical/api-path-walk.txt: commits are emitted. +@@ Documentation/technical/api-path-walk.txt: commits. Examples -------- @@ t/t6601-path-walk.sh (new) + blobs:6 + EOF + -+ sort expect >expect.sorted && -+ sort out >out.sorted && -+ -+ test_cmp expect.sorted out.sorted ++ test_cmp_sorted expect out +' + +test_expect_success 'topic only' ' @@ t/t6601-path-walk.sh (new) + blobs:5 + EOF + -+ sort expect >expect.sorted && -+ sort out >out.sorted && -+ -+ test_cmp expect.sorted out.sorted ++ test_cmp_sorted expect out +' + +test_expect_success 'topic, not base' ' @@ t/t6601-path-walk.sh (new) + blobs:4 + EOF + -+ sort expect >expect.sorted && -+ sort out >out.sorted && -+ -+ test_cmp expect.sorted out.sorted ++ test_cmp_sorted expect out +' + +test_expect_success 'topic, not base, boundary' ' @@ t/t6601-path-walk.sh (new) + blobs:5 + EOF + -+ sort expect >expect.sorted && -+ sort out >out.sorted && -+ -+ test_cmp expect.sorted out.sorted ++ test_cmp_sorted expect out +' + +test_done 3: 14375d19392 ! 4: f4bf8be30b5 path-walk: allow consumer to specify object types @@ Commit message We add the ability to filter the object types in the path-walk API so the callback function is called fewer times. - This adds the ability to ask for the commits in a list, as well. Future - changes will add the ability to visit annotated tags. + This adds the ability to ask for the commits in a list, as well. We + re-use the empty string for this set of objects because these are passed + directly to the callback function instead of being part of the + 'path_stack'. + + Future changes will add the ability to visit annotated tags. Signed-off-by: Derrick Stolee <stolee@gmail.com> ## Documentation/technical/api-path-walk.txt ## -@@ Documentation/technical/api-path-walk.txt: If you want the path-walk API to emit `UNINTERESTING` objects based on the - commit walk's boundary, be sure to set `revs.boundary` so the boundary - commits are emitted. +@@ Documentation/technical/api-path-walk.txt: It is also important that you do not specify the `--objects` flag for the + the objects will be walked in a separate way based on those starting + commits. +`commits`, `blobs`, `trees`:: + By default, these members are enabled and signal that the path-walk @@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info) /* Insert a single list for the root tree into the paths. */ CALLOC_ARRAY(root_tree_list, 1); root_tree_list->type = OBJ_TREE; - strmap_put(&ctx.paths_to_lists, root_path, root_tree_list); -- - if (prepare_revision_walk(info->revs)) +@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info) die(_("failed to setup revision walk")); while ((c = get_revision(info->revs))) { - struct object_id *oid = get_commit_tree_oid(c); -- struct tree *t = lookup_tree(info->revs->repo, oid); + struct object_id *oid; -+ struct tree *t; + struct tree *t; commits_nr++; + if (info->commits) @@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info) + if (!info->trees && !info->blobs) + continue; + -+ oid = get_commit_tree_oid(c); -+ t = lookup_tree(info->revs->repo, oid); -+ - if (t) { - if (t->object.flags & SEEN) - continue; + oid = get_commit_tree_oid(c); + t = lookup_tree(info->revs->repo, oid); + @@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info) trace2_data_intmax("path-walk", ctx.repo, "commits", commits_nr); trace2_region_leave("path-walk", "commit-walk", info->revs->repo); @@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info) + oid_array_clear(&commit_list->oids); + free(commit_list); + - string_list_append(&ctx.path_stack, root_path); - trace2_region_enter("path-walk", "path-walk", info->revs->repo); + while (!ret && ctx.path_stack.nr) { + char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string; ## path-walk.h ## @@ path-walk.h: struct path_walk_info { */ path_fn path_fn; void *path_fn_data; ++ + /** + * Initialize which object types the path_fn should be called on. This + * could also limit the walk to skip blobs if not set. @@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' ' TREE:left/:$(git rev-parse topic:left) TREE:right/:$(git rev-parse topic:right) @@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' ' - test_cmp expect.sorted out.sorted + test_cmp_sorted expect out ' +test_expect_success 'topic, not base, only blobs' ' @@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' ' + blobs:4 + EOF + -+ sort expect >expect.sorted && -+ sort out >out.sorted && -+ -+ test_cmp expect.sorted out.sorted ++ test_cmp_sorted expect out +' + +# No, this doesn't make a lot of sense for the path-walk API, @@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' ' + blobs:0 + EOF + -+ sort expect >expect.sorted && -+ sort out >out.sorted && -+ -+ test_cmp expect.sorted out.sorted ++ test_cmp_sorted expect out +' + +test_expect_success 'topic, not base, only trees' ' @@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' ' + blobs:0 + EOF + -+ sort expect >expect.sorted && -+ sort out >out.sorted && -+ -+ test_cmp expect.sorted out.sorted ++ test_cmp_sorted expect out +' + test_expect_success 'topic, not base, boundary' ' 4: c321f58c62d ! 5: dfd00b2bf0c path-walk: allow visiting tags @@ Metadata Author: Derrick Stolee <stolee@gmail.com> ## Commit message ## - path-walk: allow visiting tags + path-walk: visit tags and cached objects - In anticipation of using the path-walk API to analyze tags or include - them in a pack-file, add the ability to walk the tags that were included - in the revision walk. + The rev_info that is specified for a path-walk traversal may specify + visiting tag refs (both lightweight and annotated) and also may specify + indexed objects (blobs and trees). Update the path-walk API to walk + these objects as well. - When these tag objects point to blobs or trees, we need to make sure - those objects are also visited. Treat tagged trees as root trees, but - put the tagged blobs in their own category. + When walking tags, we need to peel the annotated objects until reaching + a non-tag object. If we reach a commit, then we can add it to the + pending objects to make sure we visit in the commit walk portion. If we + reach a tree, then we will assume that it is a root tree. If we reach a + blob, then we have no good path name and so add it to a new list of + "tagged blobs". - Be careful about objects that are referred to by multiple references. + When the rev_info includes the "--indexed-objects" flag, then the + pending set includes blobs and trees found in the cache entries and + cache-tree. The cache entries are usually blobs, though they could be + trees in the case of a sparse index. The cache-tree stores + previously-hashed tree objects but these are cleared out when staging + objects below those paths. We add tests that demonstrate this. + + The indexed objects come with a non-NULL 'path' value in the pending + item. This allows us to prepopulate the 'path_to_lists' strmap with + lists for these paths. + + The tricky thing about this walk is that we will want to combine the + indexed objects walk with the commit walk, especially in the future case + of walking objects during a command like 'git repack'. + + Whenever possible, we want the objects from the index to be grouped with + similar objects in history. We don't want to miss any paths that appear + only in the index and not in the commit history. + + Thus, we need to be careful to let the path stack be populated initially + with only the root tree path (and possibly tags and tagged blobs) and go + through the normal depth-first search. Afterwards, if there are other + paths that are remaining in the paths_to_lists strmap, we should then + iterate through the stack and visit those objects recursively. - Co-authored-by: Johannes Schindelin <Johannes.Schindelin@gmx.de> - Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de> Signed-off-by: Derrick Stolee <stolee@gmail.com> ## Documentation/technical/api-path-walk.txt ## -@@ Documentation/technical/api-path-walk.txt: If you want the path-walk API to emit `UNINTERESTING` objects based on the - commit walk's boundary, be sure to set `revs.boundary` so the boundary - commits are emitted. +@@ Documentation/technical/api-path-walk.txt: It is also important that you do not specify the `--objects` flag for the + the objects will be walked in a separate way based on those starting + commits. -`commits`, `blobs`, `trees`:: +`commits`, `blobs`, `trees`, `tags`:: @@ path-walk.c #include "trace2.h" #include "tree.h" #include "tree-walk.h" + ++static const char *root_path = ""; ++ + struct type_and_oid_list + { + enum object_type type; +@@ path-walk.c: static int walk_path(struct path_walk_context *ctx, + + list = strmap_get(&ctx->paths_to_lists, path); + ++ if (!list) ++ BUG("provided path '%s' that had no associated list", path); ++ + /* Evaluate function pointer on this data, if requested. */ + if ((list->type == OBJ_TREE && ctx->info->trees) || +- (list->type == OBJ_BLOB && ctx->info->blobs)) ++ (list->type == OBJ_BLOB && ctx->info->blobs)|| ++ (list->type == OBJ_TAG && ctx->info->tags)) + ret = ctx->info->path_fn(path, &list->oids, list->type, + ctx->info->path_fn_data); + +@@ path-walk.c: static void clear_strmap(struct strmap *map) + strmap_init(map); + } + ++static void setup_pending_objects(struct path_walk_info *info, ++ struct path_walk_context *ctx) ++{ ++ struct type_and_oid_list *tags = NULL; ++ struct type_and_oid_list *tagged_blobs = NULL; ++ struct type_and_oid_list *root_tree_list = NULL; ++ ++ if (info->tags) ++ CALLOC_ARRAY(tags, 1); ++ if (info->blobs) ++ CALLOC_ARRAY(tagged_blobs, 1); ++ if (info->trees) ++ root_tree_list = strmap_get(&ctx->paths_to_lists, root_path); ++ ++ /* ++ * Pending objects include: ++ * * Commits at branch tips. ++ * * Annotated tags at tag tips. ++ * * Any kind of object at lightweight tag tips. ++ * * Trees and blobs in the index (with an associated path). ++ */ ++ for (size_t i = 0; i < info->revs->pending.nr; i++) { ++ struct object_array_entry *pending = info->revs->pending.objects + i; ++ struct object *obj = pending->item; ++ ++ /* Commits will be picked up by revision walk. */ ++ if (obj->type == OBJ_COMMIT) ++ continue; ++ ++ /* Navigate annotated tag object chains. */ ++ while (obj->type == OBJ_TAG) { ++ struct tag *tag = lookup_tag(info->revs->repo, &obj->oid); ++ if (!tag) ++ break; ++ if (tag->object.flags & SEEN) ++ break; ++ tag->object.flags |= SEEN; ++ ++ if (tags) ++ oid_array_append(&tags->oids, &obj->oid); ++ obj = tag->tagged; ++ } ++ ++ if (obj->type == OBJ_TAG) ++ continue; ++ ++ /* We are now at a non-tag object. */ ++ if (obj->flags & SEEN) ++ continue; ++ obj->flags |= SEEN; ++ ++ switch (obj->type) { ++ case OBJ_TREE: ++ if (!info->trees) ++ continue; ++ if (pending->path) { ++ struct type_and_oid_list *list; ++ char *path = *pending->path ? xstrfmt("%s/", pending->path) ++ : xstrdup(""); ++ if (!(list = strmap_get(&ctx->paths_to_lists, path))) { ++ CALLOC_ARRAY(list, 1); ++ list->type = OBJ_TREE; ++ strmap_put(&ctx->paths_to_lists, path, list); ++ } ++ oid_array_append(&list->oids, &obj->oid); ++ free(path); ++ } else { ++ /* assume a root tree, such as a lightweight tag. */ ++ oid_array_append(&root_tree_list->oids, &obj->oid); ++ } ++ break; ++ ++ case OBJ_BLOB: ++ if (!info->blobs) ++ continue; ++ if (pending->path) { ++ struct type_and_oid_list *list; ++ char *path = pending->path; ++ if (!(list = strmap_get(&ctx->paths_to_lists, path))) { ++ CALLOC_ARRAY(list, 1); ++ list->type = OBJ_BLOB; ++ strmap_put(&ctx->paths_to_lists, path, list); ++ } ++ oid_array_append(&list->oids, &obj->oid); ++ } else { ++ /* assume a root tree, such as a lightweight tag. */ ++ oid_array_append(&tagged_blobs->oids, &obj->oid); ++ } ++ break; ++ ++ case OBJ_COMMIT: ++ /* Make sure it is in the object walk */ ++ if (obj != pending->item) ++ add_pending_object(info->revs, obj, ""); ++ break; ++ ++ default: ++ BUG("should not see any other type here"); ++ } ++ } ++ ++ /* ++ * Add tag objects and tagged blobs if they exist. ++ */ ++ if (tagged_blobs) { ++ if (tagged_blobs->oids.nr) { ++ const char *tagged_blob_path = "/tagged-blobs"; ++ tagged_blobs->type = OBJ_BLOB; ++ push_to_stack(ctx, tagged_blob_path); ++ strmap_put(&ctx->paths_to_lists, tagged_blob_path, tagged_blobs); ++ } else { ++ oid_array_clear(&tagged_blobs->oids); ++ free(tagged_blobs); ++ } ++ } ++ if (tags) { ++ if (tags->oids.nr) { ++ const char *tag_path = "/tags"; ++ tags->type = OBJ_TAG; ++ push_to_stack(ctx, tag_path); ++ strmap_put(&ctx->paths_to_lists, tag_path, tags); ++ } else { ++ oid_array_clear(&tags->oids); ++ free(tags); ++ } ++ } ++} ++ + /** + * Given the configuration of 'info', walk the commits based on 'info->revs' and + * call 'info->path_fn' on each discovered path. +@@ path-walk.c: static void clear_strmap(struct strmap *map) + */ + int walk_objects_by_path(struct path_walk_info *info) + { +- const char *root_path = ""; + int ret = 0; + size_t commits_nr = 0, paths_nr = 0; + struct commit *c; @@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info) CALLOC_ARRAY(commit_list, 1); commit_list->type = OBJ_COMMIT; @@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info) CALLOC_ARRAY(root_tree_list, 1); root_tree_list->type = OBJ_TREE; strmap_put(&ctx.paths_to_lists, root_path, root_tree_list); -+ + push_to_stack(&ctx, root_path); + + /* + * Set these values before preparing the walk to catch -+ * lightweight tags pointing to non-commits. ++ * lightweight tags pointing to non-commits and indexed objects. + */ + info->revs->blob_objects = info->blobs; + info->revs->tree_objects = info->trees; @@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info) + info->revs->blob_objects = info->revs->tree_objects = 0; + -+ if (info->tags) { -+ struct oid_array tagged_blob_list = OID_ARRAY_INIT; -+ struct oid_array tags = OID_ARRAY_INIT; -+ -+ trace2_region_enter("path-walk", "tag-walk", info->revs->repo); -+ -+ /* -+ * Walk any pending objects at this point, but they should only -+ * be tags. -+ */ -+ for (size_t i = 0; i < info->revs->pending.nr; i++) { -+ struct object_array_entry *pending = info->revs->pending.objects + i; -+ struct object *obj = pending->item; -+ -+ if (obj->type == OBJ_COMMIT || obj->flags & SEEN) -+ continue; -+ -+ while (obj->type == OBJ_TAG) { -+ struct tag *tag = lookup_tag(info->revs->repo, -+ &obj->oid); -+ if (!(obj->flags & SEEN)) { -+ obj->flags |= SEEN; -+ oid_array_append(&tags, &obj->oid); -+ } -+ obj = tag->tagged; -+ } -+ -+ if ((obj->flags & SEEN)) -+ continue; -+ obj->flags |= SEEN; ++ trace2_region_enter("path-walk", "pending-walk", info->revs->repo); ++ setup_pending_objects(info, &ctx); ++ trace2_region_leave("path-walk", "pending-walk", info->revs->repo); + -+ switch (obj->type) { -+ case OBJ_TREE: -+ if (info->trees) -+ oid_array_append(&root_tree_list->oids, &obj->oid); -+ break; -+ -+ case OBJ_BLOB: -+ if (info->blobs) -+ oid_array_append(&tagged_blob_list, &obj->oid); -+ break; + while ((c = get_revision(info->revs))) { + struct object_id *oid; + struct tree *t; +@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info) + + free(path); + } + -+ case OBJ_COMMIT: -+ /* Make sure it is in the object walk */ -+ add_pending_object(info->revs, obj, ""); -+ break; ++ /* Are there paths remaining? Likely they are from indexed objects. */ ++ if (!strmap_empty(&ctx.paths_to_lists)) { ++ struct hashmap_iter iter; ++ struct strmap_entry *entry; + -+ default: -+ BUG("should not see any other type here"); -+ } ++ strmap_for_each_entry(&ctx.paths_to_lists, &iter, entry) { ++ push_to_stack(&ctx, entry->key); + } + -+ info->path_fn("", &tags, OBJ_TAG, info->path_fn_data); ++ while (!ret && ctx.path_stack.nr) { ++ char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string; ++ ctx.path_stack.nr--; ++ paths_nr++; + -+ if (tagged_blob_list.nr && info->blobs) -+ info->path_fn("/tagged-blobs", &tagged_blob_list, OBJ_BLOB, -+ info->path_fn_data); ++ ret = walk_path(&ctx, path); + -+ trace2_data_intmax("path-walk", ctx.repo, "tags", tags.nr); -+ trace2_region_leave("path-walk", "tag-walk", info->revs->repo); -+ oid_array_clear(&tags); -+ oid_array_clear(&tagged_blob_list); ++ free(path); ++ } + } + - while ((c = get_revision(info->revs))) { - struct object_id *oid; - struct tree *t; + trace2_data_intmax("path-walk", ctx.repo, "paths", paths_nr); + trace2_region_leave("path-walk", "path-walk", info->revs->repo); + ## path-walk.h ## @@ path-walk.h: struct path_walk_info { @@ t/t6601-path-walk.sh: test_expect_success 'all' ' + BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{}) + BLOB:child/file:$(git rev-parse refs/tags/tree-tag^{}:child/file) + blobs:10 -+ TAG::$(git rev-parse refs/tags/first) -+ TAG::$(git rev-parse refs/tags/second.1) -+ TAG::$(git rev-parse refs/tags/second.2) -+ TAG::$(git rev-parse refs/tags/third) -+ TAG::$(git rev-parse refs/tags/fourth) -+ TAG::$(git rev-parse refs/tags/tree-tag) -+ TAG::$(git rev-parse refs/tags/blob-tag) ++ TAG:/tags:$(git rev-parse refs/tags/first) ++ TAG:/tags:$(git rev-parse refs/tags/second.1) ++ TAG:/tags:$(git rev-parse refs/tags/second.2) ++ TAG:/tags:$(git rev-parse refs/tags/third) ++ TAG:/tags:$(git rev-parse refs/tags/fourth) ++ TAG:/tags:$(git rev-parse refs/tags/tree-tag) ++ TAG:/tags:$(git rev-parse refs/tags/blob-tag) + tags:7 ++ EOF ++ ++ test_cmp_sorted expect out ++' ++ ++test_expect_success 'indexed objects' ' ++ test_when_finished git reset --hard && ++ ++ # stage change into index, adding a blob but ++ # also invalidating the cache-tree for the root ++ # and the "left" directory. ++ echo bogus >left/c && ++ git add left && ++ ++ test-tool path-walk -- --indexed-objects >out && ++ ++ cat >expect <<-EOF && ++ commits:0 ++ TREE:right/:$(git rev-parse topic:right) ++ trees:1 ++ BLOB:a:$(git rev-parse HEAD:a) ++ BLOB:left/b:$(git rev-parse HEAD:left/b) ++ BLOB:left/c:$(git rev-parse :left/c) ++ BLOB:right/c:$(git rev-parse HEAD:right/c) ++ BLOB:right/d:$(git rev-parse HEAD:right/d) ++ blobs:5 ++ tags:0 ++ EOF ++ ++ test_cmp_sorted expect out ++' ++ ++test_expect_success 'branches and indexed objects mix well' ' ++ test_when_finished git reset --hard && ++ ++ # stage change into index, adding a blob but ++ # also invalidating the cache-tree for the root ++ # and the "right" directory. ++ echo fake >right/d && ++ git add right && ++ ++ test-tool path-walk -- --indexed-objects --branches >out && ++ ++ cat >expect <<-EOF && ++ COMMIT::$(git rev-parse topic) ++ COMMIT::$(git rev-parse base) ++ COMMIT::$(git rev-parse base~1) ++ COMMIT::$(git rev-parse base~2) ++ commits:4 ++ TREE::$(git rev-parse topic^{tree}) ++ TREE::$(git rev-parse base^{tree}) ++ TREE::$(git rev-parse base~1^{tree}) ++ TREE::$(git rev-parse base~2^{tree}) ++ TREE:a/:$(git rev-parse base:a) ++ TREE:left/:$(git rev-parse base:left) ++ TREE:left/:$(git rev-parse base~2:left) ++ TREE:right/:$(git rev-parse topic:right) ++ TREE:right/:$(git rev-parse base~1:right) ++ TREE:right/:$(git rev-parse base~2:right) ++ trees:10 ++ BLOB:a:$(git rev-parse base~2:a) ++ BLOB:left/b:$(git rev-parse base:left/b) ++ BLOB:left/b:$(git rev-parse base~2:left/b) ++ BLOB:right/c:$(git rev-parse base~2:right/c) ++ BLOB:right/c:$(git rev-parse topic:right/c) ++ BLOB:right/d:$(git rev-parse base~1:right/d) ++ BLOB:right/d:$(git rev-parse :right/d) ++ blobs:7 ++ tags:0 EOF - sort expect >expect.sorted && + test_cmp_sorted expect out @@ t/t6601-path-walk.sh: test_expect_success 'topic only' ' BLOB:right/c:$(git rev-parse topic:right/c) BLOB:right/d:$(git rev-parse base~1:right/d) @@ t/t6601-path-walk.sh: test_expect_success 'topic only' ' + tags:0 EOF - sort expect >expect.sorted && + test_cmp_sorted expect out @@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' ' BLOB:right/c:$(git rev-parse topic:right/c) BLOB:right/d:$(git rev-parse topic:right/d) @@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' ' + tags:0 EOF - sort expect >expect.sorted && + test_cmp_sorted expect out @@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only blobs' ' BLOB:right/c:$(git rev-parse topic:right/c) BLOB:right/d:$(git rev-parse topic:right/d) @@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only blobs' ' + tags:0 EOF - sort expect >expect.sorted && + test_cmp_sorted expect out @@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only commits' ' commits:1 trees:0 @@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only commits' ' + tags:0 EOF - sort expect >expect.sorted && + test_cmp_sorted expect out @@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only trees' ' TREE:right/:$(git rev-parse topic:right) trees:3 @@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only trees' ' + tags:0 EOF - sort expect >expect.sorted && + test_cmp_sorted expect out @@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, boundary' ' BLOB:right/c:$(git rev-parse topic:right/c) BLOB:right/d:$(git rev-parse base~1:right/d) @@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, boundary' ' + tags:0 EOF - sort expect >expect.sorted && -@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, boundary' ' - test_cmp expect.sorted out.sorted + test_cmp_sorted expect out ' +test_expect_success 'trees are reported exactly once' ' 6: 238d7d95715 ! 6: 5252076d556 path-walk: add prune_all_uninteresting option @@ Metadata Author: Derrick Stolee <stolee@gmail.com> ## Commit message ## - path-walk: add prune_all_uninteresting option + path-walk: mark trees and blobs as UNINTERESTING - This option causes the path-walk API to act like the sparse tree-walk - algorithm implemented by mark_trees_uninteresting_sparse() in - list-objects.c. + When the input rev_info has UNINTERESTING starting points, we want to be + sure that the UNINTERESTING flag is passed appropriately through the + objects. To match how this is done in places such as 'git pack-objects', we + use the mark_edges_uninteresting() method. - Starting from the commits marked as UNINTERESTING, their root trees and - all objects reachable from those trees are UNINTERSTING, at least as we - walk path-by-path. When we reach a path where all objects associated - with that path are marked UNINTERESTING, then do no continue walking the - children of that path. + This method has an option for using the "sparse" walk, which is similar in + spirit to the path-walk API's walk. To be sure to keep it independent, add a + new 'prune_all_uninteresting' option to the path_walk_info struct. - We need to be careful to pass the UNINTERESTING flag in a deep way on - the UNINTERESTING objects before we start the path-walk, or else the - depth-first search for the path-walk API may accidentally report some - objects as interesting. + To check how the UNINTERSTING flag is spread through our objects, extend the + 'test-tool path-walk' command to output whether or not an object has that + flag. This changes our tests significantly, including the removal of some + objects that were previously visited due to the incomplete implementation. Signed-off-by: Derrick Stolee <stolee@gmail.com> ## Documentation/technical/api-path-walk.txt ## -@@ Documentation/technical/api-path-walk.txt: commits are emitted. +@@ Documentation/technical/api-path-walk.txt: commits. While it is possible to walk only commits in this way, consumers would be better off using the revision walk API instead. @@ Documentation/technical/api-path-walk.txt: commits are emitted. ## path-walk.c ## +@@ + #include "dir.h" + #include "hashmap.h" + #include "hex.h" ++#include "list-objects.h" + #include "object.h" + #include "oid-array.h" + #include "revision.h" @@ path-walk.c: struct type_and_oid_list { enum object_type type; @@ path-walk.c: struct type_and_oid_list #define TYPE_AND_OID_LIST_INIT { \ @@ path-walk.c: static int add_children(struct path_walk_context *ctx, - strmap_put(&ctx->paths_to_lists, path.buf, list); - string_list_append(&ctx->path_stack, path.buf); - } + if (o->flags & SEEN) + continue; + o->flags |= SEEN; ++ + if (!(o->flags & UNINTERESTING)) + list->maybe_interesting = 1; oid_array_append(&list->oids, &entry.oid); } @@ path-walk.c: static int walk_path(struct path_walk_context *ctx, - - list = strmap_get(&ctx->paths_to_lists, path); + if (!list) + BUG("provided path '%s' that had no associated list", path); + if (ctx->info->prune_all_uninteresting) { + /* @@ path-walk.c: static int walk_path(struct path_walk_context *ctx, + &list->oids.oid[i]); + if (t && !(t->object.flags & UNINTERESTING)) + list->maybe_interesting = 1; -+ } else { ++ } else if (list->type == OBJ_BLOB) { + struct blob *b = lookup_blob(ctx->repo, + &list->oids.oid[i]); + if (b && !(b->object.flags & UNINTERESTING)) + list->maybe_interesting = 1; ++ } else { ++ /* Tags are always interesting if visited. */ ++ list->maybe_interesting = 1; + } + } + @@ path-walk.c: static int walk_path(struct path_walk_context *ctx, + /* Evaluate function pointer on this data, if requested. */ if ((list->type == OBJ_TREE && ctx->info->trees) || - (list->type == OBJ_BLOB && ctx->info->blobs)) + (list->type == OBJ_BLOB && ctx->info->blobs)|| @@ path-walk.c: static void clear_strmap(struct strmap *map) - int walk_objects_by_path(struct path_walk_info *info) - { - const char *root_path = ""; -- int ret = 0; -+ int ret = 0, has_uninteresting = 0; - size_t commits_nr = 0, paths_nr = 0; - struct commit *c; - struct type_and_oid_list *root_tree_list; -@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info) - .path_stack = STRING_LIST_INIT_DUP, - .paths_to_lists = STRMAP_INIT - }; -+ struct oidset root_tree_set = OIDSET_INIT; - - trace2_region_enter("path-walk", "commit-walk", info->revs->repo); + strmap_init(map); + } ++static struct repository *edge_repo; ++static struct type_and_oid_list *edge_tree_list; ++ ++static void show_edge(struct commit *commit) ++{ ++ struct tree *t = repo_get_commit_tree(edge_repo, commit); ++ ++ if (!t) ++ return; ++ ++ if (commit->object.flags & UNINTERESTING) ++ t->object.flags |= UNINTERESTING; ++ ++ if (t->object.flags & SEEN) ++ return; ++ t->object.flags |= SEEN; ++ ++ oid_array_append(&edge_tree_list->oids, &t->object.oid); ++} ++ + static void setup_pending_objects(struct path_walk_info *info, + struct path_walk_context *ctx) + { +@@ path-walk.c: static void setup_pending_objects(struct path_walk_info *info, + if (tagged_blobs->oids.nr) { + const char *tagged_blob_path = "/tagged-blobs"; + tagged_blobs->type = OBJ_BLOB; ++ tagged_blobs->maybe_interesting = 1; + push_to_stack(ctx, tagged_blob_path); + strmap_put(&ctx->paths_to_lists, tagged_blob_path, tagged_blobs); + } else { +@@ path-walk.c: static void setup_pending_objects(struct path_walk_info *info, + if (tags->oids.nr) { + const char *tag_path = "/tags"; + tags->type = OBJ_TAG; ++ tags->maybe_interesting = 1; + push_to_stack(ctx, tag_path); + strmap_put(&ctx->paths_to_lists, tag_path, tags); + } else { @@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info) /* Insert a single list for the root tree into the paths. */ CALLOC_ARRAY(root_tree_list, 1); root_tree_list->type = OBJ_TREE; + root_tree_list->maybe_interesting = 1; strmap_put(&ctx.paths_to_lists, root_path, root_tree_list); + push_to_stack(&ctx, root_path); - /* @@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info) - t = lookup_tree(info->revs->repo, oid); + if (prepare_revision_walk(info->revs)) + die(_("failed to setup revision walk")); - if (t) { -+ if ((c->object.flags & UNINTERESTING)) { -+ t->object.flags |= UNINTERESTING; -+ has_uninteresting = 1; -+ } ++ /* Walk trees to mark them as UNINTERESTING. */ ++ edge_repo = info->revs->repo; ++ edge_tree_list = root_tree_list; ++ mark_edges_uninteresting(info->revs, show_edge, ++ info->prune_all_uninteresting); ++ edge_repo = NULL; ++ edge_tree_list = NULL; + - if (t->object.flags & SEEN) - continue; - t->object.flags |= SEEN; -- oid_array_append(&root_tree_list->oids, oid); -+ if (!oidset_insert(&root_tree_set, oid)) -+ oid_array_append(&root_tree_list->oids, oid); - } else { - warning("could not find tree %s", oid_to_hex(oid)); - } -@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info) - oid_array_clear(&commit_list->oids); - free(commit_list); + info->revs->blob_objects = info->revs->tree_objects = 0; -+ /* -+ * Before performing a DFS of our paths and emitting them as interesting, -+ * do a full walk of the trees to distribute the UNINTERESTING bit. Use -+ * the sparse algorithm if prune_all_uninteresting was set. -+ */ -+ if (has_uninteresting) { -+ trace2_region_enter("path-walk", "uninteresting-walk", info->revs->repo); -+ if (info->prune_all_uninteresting) -+ mark_trees_uninteresting_sparse(ctx.repo, &root_tree_set); -+ else -+ mark_trees_uninteresting_dense(ctx.repo, &root_tree_set); -+ trace2_region_leave("path-walk", "uninteresting-walk", info->revs->repo); -+ } -+ oidset_clear(&root_tree_set); -+ - string_list_append(&ctx.path_stack, root_path); - - trace2_region_enter("path-walk", "path-walk", info->revs->repo); + trace2_region_enter("path-walk", "pending-walk", info->revs->repo); ## path-walk.h ## @@ path-walk.h: struct path_walk_info { @@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv) ## t/t6601-path-walk.sh ## +@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' ' + COMMIT::$(git rev-parse topic) + commits:1 + TREE::$(git rev-parse topic^{tree}) +- TREE:left/:$(git rev-parse topic:left) ++ TREE:left/:$(git rev-parse base~1:left):UNINTERESTING + TREE:right/:$(git rev-parse topic:right) + trees:3 +- BLOB:a:$(git rev-parse topic:a) +- BLOB:left/b:$(git rev-parse topic:left/b) ++ BLOB:a:$(git rev-parse base~1:a):UNINTERESTING ++ BLOB:left/b:$(git rev-parse base~1:left/b):UNINTERESTING + BLOB:right/c:$(git rev-parse topic:right/c) +- BLOB:right/d:$(git rev-parse topic:right/d) ++ BLOB:right/d:$(git rev-parse base~1:right/d):UNINTERESTING + blobs:4 + tags:0 + EOF +@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' ' + test_cmp_sorted expect out + ' + ++test_expect_success 'fourth, blob-tag2, not base' ' ++ test-tool path-walk -- fourth blob-tag2 --not base >out && ++ ++ cat >expect <<-EOF && ++ COMMIT::$(git rev-parse topic) ++ commits:1 ++ TREE::$(git rev-parse topic^{tree}) ++ TREE:left/:$(git rev-parse base~1:left):UNINTERESTING ++ TREE:right/:$(git rev-parse topic:right) ++ trees:3 ++ BLOB:a:$(git rev-parse base~1:a):UNINTERESTING ++ BLOB:left/b:$(git rev-parse base~1:left/b):UNINTERESTING ++ BLOB:right/c:$(git rev-parse topic:right/c) ++ BLOB:right/d:$(git rev-parse base~1:right/d):UNINTERESTING ++ BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{}) ++ blobs:5 ++ TAG:/tags:$(git rev-parse fourth) ++ tags:1 ++ EOF ++ ++ test_cmp_sorted expect out ++' ++ + test_expect_success 'topic, not base, only blobs' ' + test-tool path-walk --no-trees --no-commits \ + -- topic --not base >out && +@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only blobs' ' + cat >expect <<-EOF && + commits:0 + trees:0 +- BLOB:a:$(git rev-parse topic:a) +- BLOB:left/b:$(git rev-parse topic:left/b) ++ BLOB:a:$(git rev-parse base~1:a):UNINTERESTING ++ BLOB:left/b:$(git rev-parse base~1:left/b):UNINTERESTING + BLOB:right/c:$(git rev-parse topic:right/c) +- BLOB:right/d:$(git rev-parse topic:right/d) ++ BLOB:right/d:$(git rev-parse base~1:right/d):UNINTERESTING + blobs:4 + tags:0 + EOF +@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only trees' ' + cat >expect <<-EOF && + commits:0 + TREE::$(git rev-parse topic^{tree}) +- TREE:left/:$(git rev-parse topic:left) ++ TREE:left/:$(git rev-parse base~1:left):UNINTERESTING + TREE:right/:$(git rev-parse topic:right) + trees:3 + blobs:0 @@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, boundary' ' cat >expect <<-EOF && @@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, boundary' ' tags:0 EOF @@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, boundary' ' - test_cmp expect.sorted out.sorted + test_cmp_sorted expect out ' +-test_expect_success 'trees are reported exactly once' ' +- test_when_finished "rm -rf unique-trees" && +- test_create_repo unique-trees && +- ( +- cd unique-trees && +- mkdir initial && +- test_commit initial/file && +- +- git switch -c move-to-top && +- git mv initial/file.t ./ && +- test_tick && +- git commit -m moved && +- +- git update-ref refs/heads/other HEAD +- ) && +- +- test-tool -C unique-trees path-walk -- --all >out && +- tree=$(git -C unique-trees rev-parse HEAD:) && +- grep "$tree" out >out-filtered && +- test_line_count = 1 out-filtered +test_expect_success 'topic, not base, boundary with pruning' ' + test-tool path-walk --prune -- --boundary topic --not base >out && + @@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, boundary' ' + tags:0 + EOF + -+ sort expect >expect.sorted && -+ sort out >out.sorted && -+ -+ test_cmp expect.sorted out.sorted -+' -+ - test_expect_success 'trees are reported exactly once' ' - test_when_finished "rm -rf unique-trees" && - test_create_repo unique-trees && ++ test_cmp_sorted expect out + ' + + test_done
Hi Stolee, On Thu, Oct 31, 2024 at 06:26:57AM +0000, Derrick Stolee via GitGitGadget wrote: > > Introduction and relation to prior series > ========================================= > > This is a new series that rerolls the initial "path-walk API" patches of my > RFC [1] "Path-walk API and applications". This new API (in path-walk.c and > path-walk.h) presents a new way to walk objects such that trees and blobs > are walked in batches according to their path. > > This also replaces the previous version of ds/path-walk that was being > reviewed in [2]. The consensus was that the series was too long/dense and > could use some reduction in size. This series takes the first few patches, > but also makes some updates (which will be described later). > > [1] > https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/ > [RFC] Path-walk API and applications > > [2] > https://lore.kernel.org/git/pull.1813.v2.git.1729431810.gitgitgadget@gmail.com/ > [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas I apologize for not having a better place to start discussing a topic which pertains to more than just this immediate patch series, but I figure here is as good a place as any to do so. From our earlier discussion, it seems to stand that the path-walk API is fundamentally incompatible with reachability bitmaps and delta-islands, making the series a non-starter in environments that rely significantly one or both of those features. My understanding as a result is that the path-walk API and feature are more targeted towards improving client-side repacks and push performance, where neither of the aforementioned two features are used quite as commonly. I was discussing this a bit off-list with Peff (who I hope will join the thread and share his own thoughts), but I wonder if it was a mistake to discard your '--full-name-hash' idea (or something similar, which I'll discuss in a bit below) from earlier. (Repeating a few things that I am sure are obvious to you out loud so that I can get a grasp on them for my own understanding): It seems that the problems you've identified which result in poor repack performance occur when you have files at the same path, but they get poorly sorted in the delta selection window due to other paths having the same final 16 characters, so Git doesn't see that much better delta opportunities exist. Your series takes into account the full name when hashing, which seems to produce a clear win in many cases. I'm sure that there are some cases where it presents a modest regression in pack sizes, but I think that's fine and probably par for the course when making any changes like this, as there is probably no easy silver bullet here that uniformly improves all cases. I suspect that you could go even further and intern the full path at which each object occurs, and sort lexically by that. Just stringing together all of the paths in linux.git only takes 3.099 MiB on my clone. (Of course, that's unbounded in the number of objects and length of their pathnames, but you could at least bound the latter by taking only the last, say, 128 characters, which would be more than good enough for the kernel, whose longest path is only 102 characters). Some of the repositories that you've tested on I don't have easy access to, so I wonder if either doing (a) that, or (b) using some fancier context-sensitive hash (like SimHash or MinHash) would be beneficial. I realize that this is taking us back to an idea you've already presented to the list, but I think (to me, at least) the benefit and simplicity of that approach has only become clear to me in hindsight when seeing some alternatives. I would like to apologize for the time you spent reworking this series back and forth to have the response be "maybe we should have just done the first thing you suggested". Like I said, I think to me it was really only clear in hindsight. In any event, the major benefit to doing --full-name-hash would be that *all* environments could benefit from the size reduction, not just those that don't rely on certain other features. Perhaps just --full-name-hash isn't quite as good by itself as the --path-walk implementation that this series starts us off implementing. So in that sense, maybe we want both, which I understand was the original approach. I see a couple of options here: - We take both, because doing --path-walk on top represents a significant enough improvement that we are collectively OK with taking on more code to improve a more narrow (but common) use-case. - Or we decide that either the benefit isn't significant enough to warrant an additional and relatively complex implementation, or in other words that --full-name-hash by itself is good enough. Again, I apologize for not having a clearer picture of this all to start with, and I want to tell you specifically and sincerely that I appreciate your patience as I wrap my head around all of this. I think the benefit of --full-name-hash is much clearer and appealing to me now having had both more time and seeing the series approached in a couple of different ways. Let me know what you think. Thanks, Taylor
On 11/1/24 3:23 PM, Taylor Blau wrote: > Hi Stolee, > > On Thu, Oct 31, 2024 at 06:26:57AM +0000, Derrick Stolee via GitGitGadget wrote: >> >> Introduction and relation to prior series >> ========================================= >> >> This is a new series that rerolls the initial "path-walk API" patches of my >> RFC [1] "Path-walk API and applications". This new API (in path-walk.c and >> path-walk.h) presents a new way to walk objects such that trees and blobs >> are walked in batches according to their path. >> >> This also replaces the previous version of ds/path-walk that was being >> reviewed in [2]. The consensus was that the series was too long/dense and >> could use some reduction in size. This series takes the first few patches, >> but also makes some updates (which will be described later). >> >> [1] >> https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/ >> [RFC] Path-walk API and applications >> >> [2] >> https://lore.kernel.org/git/pull.1813.v2.git.1729431810.gitgitgadget@gmail.com/ >> [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas > > I apologize for not having a better place to start discussing a topic > which pertains to more than just this immediate patch series, but I > figure here is as good a place as any to do so. > > From our earlier discussion, it seems to stand that the path-walk API > is fundamentally incompatible with reachability bitmaps and > delta-islands, making the series a non-starter in environments that > rely significantly one or both of those features. My understanding as a > result is that the path-walk API and feature are more targeted towards > improving client-side repacks and push performance, where neither of the > aforementioned two features are used quite as commonly. This is correct. I would go even farther to say that this approach was designed first and foremost for Git clients and specifically their performance while computing a thin packfile during "git push". The same logic to help the push case happens to also help the "git repack" case significantly. > I was discussing this a bit off-list with Peff (who I hope will join the > thread and share his own thoughts), but I wonder if it was a mistake to > discard your '--full-name-hash' idea (or something similar, which I'll > discuss in a bit below) from earlier. I'd be happy to resurrect that series, adding in the learnings from working on the path-walk feature. It helps that the current series adds the path-walk API and has no conflicting changes in the pack-objects or repack builtins. (I can handle those conflicts as things merge down.) > (Repeating a few things that I am sure are obvious to you out loud so > that I can get a grasp on them for my own understanding): > > It seems that the problems you've identified which result in poor repack > performance occur when you have files at the same path, but they get > poorly sorted in the delta selection window due to other paths having > the same final 16 characters, so Git doesn't see that much better delta > opportunities exist. > > Your series takes into account the full name when hashing, which seems > to produce a clear win in many cases. I'm sure that there are some cases > where it presents a modest regression in pack sizes, but I think that's > fine and probably par for the course when making any changes like this, > as there is probably no easy silver bullet here that uniformly improves > all cases. > > I suspect that you could go even further and intern the full path at > which each object occurs, and sort lexically by that. Just stringing > together all of the paths in linux.git only takes 3.099 MiB on my clone. > (Of course, that's unbounded in the number of objects and length of > their pathnames, but you could at least bound the latter by taking only > the last, say, 128 characters, which would be more than good enough for > the kernel, whose longest path is only 102 characters). When the optimization idea is to focus on the full path and not care about the "locality" of the path name by its later bits, storing the full name in a list and storing an index into that list would have a very similar effect. I'd be interested to explore the idea of storing the full path name. Based on my exploration with the 'test-tool name-hash' test helper in that series, I'm not sure that we will make significant gains by doing so. Worth trying. > Some of the repositories that you've tested on I don't have easy access > to, so I wonder if either doing (a) that, or (b) using some fancier > context-sensitive hash (like SimHash or MinHash) would be beneficial. I don't know too much about SimHash or MinHash, but based on what I could gather from some initial reading I'm not sure that they would be effective without increasing the hash length. We'd also get a different kind of locality, such as the appearance of a common word would be more likely to affect the locality than the end of the path. The size of the hash could probably be mitigated by storing it in the list of all full paths and accessing them from the index stored on each to-pack object. > I realize that this is taking us back to an idea you've already > presented to the list, but I think (to me, at least) the benefit and > simplicity of that approach has only become clear to me in hindsight > when seeing some alternatives. I would like to apologize for the time > you spent reworking this series back and forth to have the response be > "maybe we should have just done the first thing you suggested". Like I > said, I think to me it was really only clear in hindsight. I always assumed that we'd come back to it eventually. There is also the extra bit about making the change to the name-hash compatible with the way name-hashes are stored in the reachability bitmaps. That will need some work before it is ready for prime time. > In any event, the major benefit to doing --full-name-hash would be that > *all* environments could benefit from the size reduction, not just those > that don't rely on certain other features. I disagree that all environments will prefer the --full-name-hash. I'm currently repeating the performance tests right now, and I've added one. The issues are: 1. The --full-name-hash approach sometimes leads to a larger pack when using "git push" on the client, especially when the name-hash is already effective for compressing across paths. 2. A depth 1 shallow clone cannot use previous versions of a path, so those situations will want to use the normal name hash. This can be accomplished simply by disabling the --full-name-hash option when the --shallow option is present; a more detailed version could be used to check for a large depth before disabling it. This case also disables bitmaps, so that isn't something to worry about. > Perhaps just --full-name-hash isn't quite as good by itself as the > --path-walk implementation that this series starts us off implementing. > So in that sense, maybe we want both, which I understand was the > original approach. I see a couple of options here: > > - We take both, because doing --path-walk on top represents a > significant enough improvement that we are collectively OK with > taking on more code to improve a more narrow (but common) use-case. Doing both doesn't help at all, since the --path-walk approach already batches by the full path name. The --path-walk approach has a significant benefit by doing a second pass by the standard name-hash to pick up on the cross-path deltas. This is why the --path-walk approach with the standard name hash as consistently provided the most-compact pack-files in all tests. Aside: there were some initial tests that showed the --path-walk option led to slightly larger packfiles, but I've since discovered that those cases were due to an incorrect walking of indexed paths. This is fixed by the code in patch 5 of the current series and my WIP patches in [3] have the performance numbers with this change. [3] https://github.com/gitgitgadget/git/pull/1819 PATH WALK II: Add --path-walk option to 'git pack-objects' > - Or we decide that either the benefit isn't significant enough to > warrant an additional and relatively complex implementation, or in > other words that --full-name-hash by itself is good enough. I hope that I've sufficiently communicated that --full-name-hash is not good enough by itself. The point I was trying to make by submitting it first was that I believed it was likely the easiest way for Git servers to gain 90% of the benefits that the --path-walk approach provides while making it relatively easy to integrate with other server-side features such as bitmaps and delta islands. (Maybe the --path-walk approach could also be extended to be compatible with those features, but it would be a significant investment that rebuilds those features within the context of the new object walk instead of relying on the existing implementations. That could easily be a blocker.) > Again, I apologize for not having a clearer picture of this all to start > with, and I want to tell you specifically and sincerely that I > appreciate your patience as I wrap my head around all of this. I think > the benefit of --full-name-hash is much clearer and appealing to me now > having had both more time and seeing the series approached in a couple > of different ways. Let me know what you think. Thanks for taking the time to engage with the patches. I'm currently rerunning my performance tests on a rebased copy of the --full-name-hash patches and will submit a new version when it's ready. Thanks, -Stolee
On Mon, Nov 04, 2024 at 10:48:49AM -0500, Derrick Stolee wrote: > > I was discussing this a bit off-list with Peff (who I hope will join the > > thread and share his own thoughts), but I wonder if it was a mistake to > > discard your '--full-name-hash' idea (or something similar, which I'll > > discuss in a bit below) from earlier. > > I'd be happy to resurrect that series, adding in the learnings from > working on the path-walk feature. It helps that the current series adds > the path-walk API and has no conflicting changes in the pack-objects or > repack builtins. (I can handle those conflicts as things merge down.) Adding my two cents, the discussion we had came after reading this post: https://www.jonathancreamer.com/how-we-shrunk-our-git-repo-size-by-94-percent/ I think a few of the low-level details in there are confusing, but it seemed to me that most of the improvement he mentions is just about finding better delta candidates. And it seems obvious that our current pack_name_hash() is pretty rudimentary as context-sensitive hashing goes and won't do well for long paths with similar endings. So just swapping that out for something better seems like an easy thing to do regardless of whether we pursue --path-walk. It doesn't drastically change how we choose delta pairs so it's not much code and it shouldn't conflict with other features. And the risk of making anything worse should be pretty low. I wouldn't at all be surprised if --path-walk can do better, but if we do the easy thing first then I think it gives us a better idea of the cost/benefit it's providing. I suspect there's room for both in the long run. You seem to be focused on push size and cost, whereas I think Taylor and I are more interested in overall repo size and cost of serving bitmapped fetches. > When the optimization idea is to focus on the full path and not care > about the "locality" of the path name by its later bits, storing the > full name in a list and storing an index into that list would have a > very similar effect. > > I'd be interested to explore the idea of storing the full path name. > Based on my exploration with the 'test-tool name-hash' test helper in > that series, I'm not sure that we will make significant gains by doing > so. Worth trying. The way I look at it is a possible continuum. We want to use pathnames as a way to sort delta candidates near each other, since we expect them to have high locality with respect to delta-able contents. The current name_hash uses a very small bit of that path information and throws away most of it. The other extreme end is holding the whole path. We may want to end up in the middle for two reasons: 1. Dealing with whole paths might be costly (though I'm not yet convinced of that; outside of pathological cases, the number of paths in a repo tends to pale in comparison to the number of objects, and the per-object costs dominate during repacking). 2. It's possible that over-emphasizing the path might be a slightly worse heuristic (and I think this is a potential danger of --path-walk, too). We still want to find candidate pairs that were copied or renamed, for example, or that substantially share content found in different parts of the tree. So it would be interesting to be able to see the performance of various points on that line, from full path down to partial paths down to longer hashes down to the current hash. The true extreme end of course is no path info at all, but I think we know that sucks; that's why we implemented the bitmap name-hash extension in the first place. > I don't know too much about SimHash or MinHash, but based on what I > could gather from some initial reading I'm not sure that they would be > effective without increasing the hash length. We'd also get a different > kind of locality, such as the appearance of a common word would be more > likely to affect the locality than the end of the path. Good point. This is all heuristics, of course, but I suspect that the order of the path is important, and that foo/bar.c and bar/foo.c are unlikely to be good matches. > > I realize that this is taking us back to an idea you've already > > presented to the list, but I think (to me, at least) the benefit and > > simplicity of that approach has only become clear to me in hindsight > > when seeing some alternatives. I would like to apologize for the time > > you spent reworking this series back and forth to have the response be > > "maybe we should have just done the first thing you suggested". Like I > > said, I think to me it was really only clear in hindsight. > > I always assumed that we'd come back to it eventually. There is also the > extra bit about making the change to the name-hash compatible with the > way name-hashes are stored in the reachability bitmaps. That will need > some work before it is ready for prime time. Having worked on that feature of bitmaps, I'm not too worried about it. I think we'd just need to: - introduce a new bitmap ext with a flag (HASH_CACHE_V2 or something, either with the new hash, or with a "version" byte at the start for extensibility). - when bitmaps are not in use, we're free to use whichever hash we want internally. If the new hash is consistently better, we'd probably just enable it by default. - when packing using on-disk bitmaps, use internally whichever format the on-disk file provided. Technically the format could even provide both (in which case we'd prefer the new hash), but I don't see much point. - when writing bitmaps, use whichever hash the command-line options asked for. There's an off chance somebody might want to generate a .bitmap file whose hashes will be understood by an older version of git, in which case they'd use --no-full-name-hash or whatever while repacking. If we're considering full paths, then that is potentially a bit more involved, just because we'd want the format to avoid repeating duplicate paths for each object (plus they're no longer fixed-size). So probably an extension with packed NUL-terminated path strings, plus a set of fixed-length offsets into that block, one per object. > I disagree that all environments will prefer the --full-name-hash. I'm > currently repeating the performance tests right now, and I've added one. > The issues are: > > 1. The --full-name-hash approach sometimes leads to a larger pack when > using "git push" on the client, especially when the name-hash is > already effective for compressing across paths. That's interesting. I wonder which cases get worse, and if a larger window size might help. I.e., presumably we are pushing the candidates further away in the sorted delta list. > 2. A depth 1 shallow clone cannot use previous versions of a path, so > those situations will want to use the normal name hash. This can be > accomplished simply by disabling the --full-name-hash option when > the --shallow option is present; a more detailed version could be > used to check for a large depth before disabling it. This case also > disables bitmaps, so that isn't something to worry about. I'm not sure why a larger hash would be worse in a shallow clone. As you note, with only one version of each path the name-similarity heuristic is not likely to buy you much. But I'd have thought that would be true for the existing name hash as well as a longer one. Maybe this is the "over-emphasizing" case. -Peff
Jeff King <peff@peff.net> writes: > On Mon, Nov 04, 2024 at 10:48:49AM -0500, Derrick Stolee wrote: >> I disagree that all environments will prefer the --full-name-hash. I'm >> currently repeating the performance tests right now, and I've added one. >> The issues are: >> >> 1. The --full-name-hash approach sometimes leads to a larger pack when >> using "git push" on the client, especially when the name-hash is >> already effective for compressing across paths. > > That's interesting. I wonder which cases get worse, and if a larger > window size might help. I.e., presumably we are pushing the candidates > further away in the sorted delta list. > >> 2. A depth 1 shallow clone cannot use previous versions of a path, so >> those situations will want to use the normal name hash. This can be >> accomplished simply by disabling the --full-name-hash option when >> the --shallow option is present; a more detailed version could be >> used to check for a large depth before disabling it. This case also >> disables bitmaps, so that isn't something to worry about. > > I'm not sure why a larger hash would be worse in a shallow clone. As you > note, with only one version of each path the name-similarity heuristic > is not likely to buy you much. But I'd have thought that would be true > for the existing name hash as well as a longer one. Maybe this is the > "over-emphasizing" case. I too am curious to hear Derrick explain the above points and what was learned from the performance tests. The original hash was designed to place files that are renamed across directories closer to each other in the list sorted by the name hash, so a/Makefile and b/Makefile would likely be treated as delta-base candidates while foo/bar.c and bar/foo.c are treated as unrelated things. A push of a handful of commits that rename paths would likely place the rename source of older commits and rename destination of newer commits into the same delta chain, even with a smaller delta window. In such a history, uniformly-distributed-without-regard-to-renames hash is likely to make them into two distinct delta chains, leading to less optimal delta-base selection. A whole-repository packing, or a large push or fetch, of the same history with renamed files are affected a lot less by such negative effects of full-name hash. When generating a pack with more commits than the "--window", use of the original hash would mean blobs from paths that share similar names (e.g., "Makefile"s everywhere in the directory hierarchy) are placed close to each other, but full-name hash will likely group the blobs from exactly the same path and nothing else together, and the resulting delta-chain for identical (and not similar) paths would be sufficiently long. A long delta chain has to be broken into multiple chains _anyway_ due to finite "--depth" setting, so placing blobs from each path into its own (initial) delta chain, completely ignoring renamed paths, would likely to give us long enough (initial) delta chain to be split at the depth limit. It would lead to a good delta-base selection with smaller window size quite efficiently with full-name hash. I think a full-name hash forces a single-commit pack of a wide tree to give up on deltified blobs, but with the original hash, at least similar and common files (e.g. Makefile and COPYING) would sit close together in the delta queue and can be deltified with each other, which may be where the inefficiency comes from when full-name hash is used.
On 11/4/24 7:11 PM, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > >> On Mon, Nov 04, 2024 at 10:48:49AM -0500, Derrick Stolee wrote: >>> I disagree that all environments will prefer the --full-name-hash. I'm >>> currently repeating the performance tests right now, and I've added one. >>> The issues are: >>> >>> 1. The --full-name-hash approach sometimes leads to a larger pack when >>> using "git push" on the client, especially when the name-hash is >>> already effective for compressing across paths. >> >> That's interesting. I wonder which cases get worse, and if a larger >> window size might help. I.e., presumably we are pushing the candidates >> further away in the sorted delta list. I think the cases that make things get worse with --full-name-hash are: 1. The presence of renames, partitioning objects that used to fit into the same bucket (in the case of directory renames). 2. Some standard kinds of files may appear several times across the tree but do not change very often and are similar across path. 3. Common patterns across similar file types, such as similar includes in .c and .h files or other kinds of boilerplate in different languages. A larger window size will always expand the range of possible deltas, at a cost to the time taken to compute the deltas. My first experiment in these repositories was to increase the window size to 250 (from the default 10). This caused a very slow repack, but the repositories shrunk. For example, the big Javascript monorepo that repacked to ~100 GB with default settings would repack to ~30 GB with --window=250. This was an indicator that delta compression would work if we can find the right pairs to use for deltas. The point of the two strategies (--full-name-hash and --path-walk) is about putting objects close to each other in better ways than the name hash sort. >>> 2. A depth 1 shallow clone cannot use previous versions of a path, so >>> those situations will want to use the normal name hash. This can be >>> accomplished simply by disabling the --full-name-hash option when >>> the --shallow option is present; a more detailed version could be >>> used to check for a large depth before disabling it. This case also >>> disables bitmaps, so that isn't something to worry about. >> >> I'm not sure why a larger hash would be worse in a shallow clone. As you >> note, with only one version of each path the name-similarity heuristic >> is not likely to buy you much. But I'd have thought that would be true >> for the existing name hash as well as a longer one. Maybe this is the >> "over-emphasizing" case. I'm confused by your wording of "larger hash" because the hash size is the exact same: 32 bits. It's just that the --full-name-hash option has fewer collisions by abandoning the hope of locality. In a depth 1 shallow clone, there are no repeated paths, so any hash collisions are true collisions instead of good candidates for deltas. The full name hash is essentially random, so the delta compression algorithm basically says: 1. Sort by type. 2. Within each type, sort the objects randomly. With that sort, the delta compression scan is less effective than the standard name hash. > I too am curious to hear Derrick explain the above points and what > was learned from the performance tests. The original hash was > designed to place files that are renamed across directories closer > to each other in the list sorted by the name hash, so a/Makefile and > b/Makefile would likely be treated as delta-base candidates while > foo/bar.c and bar/foo.c are treated as unrelated things. A push > of a handful of commits that rename paths would likely place the > rename source of older commits and rename destination of newer > commits into the same delta chain, even with a smaller delta window. > In such a history, uniformly-distributed-without-regard-to-renames > hash is likely to make them into two distinct delta chains, leading > to less optimal delta-base selection. Yes. This is the downside of the --full-name-hash compared to the standard name hash. When repacking an entire repository, the effect of these renames is typically not important in the long run as it's basically a single break in the delta chain. The downside comes in when doing a small fetch or push where the rename has more impact. The --path-walk approach does not suffer from this problem because it has a second pass that sorts by the name hash and looks for better deltas than the ones that already exist. Thus, it gets the best of both worlds. The performance impact of the two passes of the --path-walk approach is interesting, as you'd typically expect this to always be slower. However: 1. The delta compression within each batch only compares the objects within that batch. We do not compare these objects to unrelated objects, which can be expensive and wasteful. This also means that small batches may even be smaller than the delta window, reducing the number of comparisons. 2. In the second pass, the delta calculation can short-circuit if the computed delta would be larger than the current-best delta. Thus, the good deltas from the first pass make the second pass faster. > A whole-repository packing, or a large push or fetch, of the same > history with renamed files are affected a lot less by such negative > effects of full-name hash. When generating a pack with more commits > than the "--window", use of the original hash would mean blobs from > paths that share similar names (e.g., "Makefile"s everywhere in the > directory hierarchy) are placed close to each other, but full-name > hash will likely group the blobs from exactly the same path and > nothing else together, and the resulting delta-chain for identical > (and not similar) paths would be sufficiently long. A long delta > chain has to be broken into multiple chains _anyway_ due to finite > "--depth" setting, so placing blobs from each path into its own > (initial) delta chain, completely ignoring renamed paths, would > likely to give us long enough (initial) delta chain to be split at > the depth limit. > > It would lead to a good delta-base selection with smaller window > size quite efficiently with full-name hash.> > I think a full-name hash forces a single-commit pack of a wide tree > to give up on deltified blobs, but with the original hash, at least > similar and common files (e.g. Makefile and COPYING) would sit close > together in the delta queue and can be deltified with each other, > which may be where the inefficiency comes from when full-name hash > is used. Yes, this is a good summary of why this works for the data efficiency in long histories. Your earlier observations are why the full-name hash has demonstrated issues with smaller time scales. These numbers are carefully detailed in the performance tests in the refreshed series [1]. The series also has a way to disable the full-name hash when serving a shallow clone for this reason. [1] https://lore.kernel.org/git/pull.1823.git.1730775907.gitgitgadget@gmail.com/ Thanks, -Stolee
Derrick Stolee <stolee@gmail.com> writes: >> Jeff King <peff@peff.net> writes: >> >>> That's interesting. I wonder which cases get worse, and if a larger >>> window size might help. I.e., presumably we are pushing the candidates >>> further away in the sorted delta list. > > I think the cases that make things get worse with --full-name-hash are: > > 1. The presence of renames, partitioning objects that used to fit into > the same bucket (in the case of directory renames). > > 2. Some standard kinds of files may appear several times across the > tree but do not change very often and are similar across path. > > 3. Common patterns across similar file types, such as similar includes > in .c and .h files or other kinds of boilerplate in different > languages. > ... > In a depth 1 shallow clone, there are no repeated paths, so any hash > collisions are true collisions instead of good candidates for deltas. Or #2 and #3 above, where large boilerplates are shared across similarly named files. > Yes. This is the downside of the --full-name-hash compared to the > standard name hash. When repacking an entire repository, the effect > of these renames is typically not important in the long run as it's > basically a single break in the delta chain. The downside comes in > when doing a small fetch or push where the rename has more impact. Yes. Due to --depth limit, we need to break delta chains somewhere anyway, and a rename boundary is just as good place as any other in a sufficiently long chain. > The --path-walk approach does not suffer from this problem because > it has a second pass that sorts by the name hash and looks for > better deltas than the ones that already exist. Thus, it gets the > best of both worlds. Yes, at the cost of being more complex :-) > The performance impact of the two passes of the --path-walk > approach is interesting, as you'd typically expect this to always > be slower. However: > > 1. The delta compression within each batch only compares the > objects within that batch. We do not compare these objects to > unrelated objects, which can be expensive and wasteful. This > also means that small batches may even be smaller than the > delta window, reducing the number of comparisons. Interesting. > 2. In the second pass, the delta calculation can short-circuit if > the computed delta would be larger than the current-best delta. > Thus, the good deltas from the first pass make the second pass > faster. Yes, the early give-up codepath helps when you already found something semi-decently good delta base. Thanks for a good summary.
On 11/10/24 9:56 PM, Junio C Hamano wrote: > Derrick Stolee <stolee@gmail.com> writes: >> The --path-walk approach does not suffer from this problem because >> it has a second pass that sorts by the name hash and looks for >> better deltas than the ones that already exist. Thus, it gets the >> best of both worlds. > > Yes, at the cost of being more complex :-) True. I hope that the results sufficiently justify the complexity. Further, the later uses of the path-walk API (git backfill and git survey) build upon this up-front investment. Thanks, -Stolee
On Mon, Nov 11, 2024 at 11:56:01AM +0900, Junio C Hamano wrote: > > Yes. This is the downside of the --full-name-hash compared to the > > standard name hash. When repacking an entire repository, the effect > > of these renames is typically not important in the long run as it's > > basically a single break in the delta chain. The downside comes in > > when doing a small fetch or push where the rename has more impact. > > Yes. Due to --depth limit, we need to break delta chains somewhere > anyway, and a rename boundary is just as good place as any other in > a sufficiently long chain. We don't necessarily have to break the chains due to depth limits, because they are not always linear. They can end up as bushy trees, like: A <- B <- C \ <- D <- E \ <- F We might get that way because it mirrors the shape of history (e.g., if D and E happened on a side branch from B and C). But we can also get there from a linear history by choosing a delta that balances size versus depth. In the above example, the smallest sequence might be: A <- B <- C <- D <- E <- F but if we have a limit of 3 and A-B-C already exists, then we might reject the C-D delta and choose A-D instead (or I guess if it really is linear, probably B-D is more likely). That may be a larger delta by itself, but it is still better than storing a full copy of the object. And we find it by having those candidates close together in the sorted list, reject C-D based on depth, and then moving on to the next candidate. I'm not sure in practice how often we find these kinds of deltas. If you look at, say, all the deltas for "Makefile" in git.git like this: git rev-list --objects --all --reflog --full-history -- Makefile | perl -lne 'print $1 if /(.*) Makefile/' | git cat-file --batch-check='%(objectsize:disk) %(objectname) %(deltabase)' my repo has 33 full copies (you can see non-deltas by grepping for "0{40}$" in the output) out of 4473 total. So it's not like we never break chains. But we can use graphviz to visualize it by piping the above through: perl -alne ' BEGIN { print "digraph {" } print "node_$F[1] [label=$F[0]]"; print "node_$F[1] -> node_$F[2]" if $F[2] !~ /^0+$/; END { print "}" } ' and then piping it through "dot" to make an svg, or using an interactive viewer like "xdot" (the labels are the on-disk size of each object). I see a lot of wide parts of the graph in the output. Of course this may all depend on packing patterns, too. I did my investigations after running "git repack -adf" to generate what should be a pretty reasonable pack. You might see something different from incremental repacking over time. I'm not sure what any of this means for --path-walk, of course. ;) Ultimately we care about resulting size and time to compute, so if it can do better on those metrics then it doesn't matter what the graph looks like. But maybe those tools can help us understand where things might go better (or worse) with various approaches. -Peff
On Fri, Nov 08, 2024 at 10:17:24AM -0500, Derrick Stolee wrote: > > > I'm not sure why a larger hash would be worse in a shallow clone. As you > > > note, with only one version of each path the name-similarity heuristic > > > is not likely to buy you much. But I'd have thought that would be true > > > for the existing name hash as well as a longer one. Maybe this is the > > > "over-emphasizing" case. > > I'm confused by your wording of "larger hash" because the hash size > is the exact same: 32 bits. It's just that the --full-name-hash option > has fewer collisions by abandoning the hope of locality. > > In a depth 1 shallow clone, there are no repeated paths, so any hash > collisions are true collisions instead of good candidates for deltas. > The full name hash is essentially random, so the delta compression > algorithm basically says: > > 1. Sort by type. > 2. Within each type, sort the objects randomly. > > With that sort, the delta compression scan is less effective than the > standard name hash. Ah, OK. I'm sorry, I had not really investigated the full-name-hash and misunderstood what it was doing. I thought we were using a larger hash in order to give more locality hints. I.e., to let us distinguish "foo/long-path.c" from "bar/long-path.c", but still retain some locality between them. But it is throwing away all locality outside of the exact name. So yes, it would never find a rename from "foo/" to "bar/", as those mean the name-hash is effectively random. So I guess getting back to what Taylor and I had talked about off-list: we were wondering if there was a way to provide a better "slider" for locality as part of the normal delta candidate sorting process. I think you could store the full pathname and then doing a sort based on the reverse of the string (so "foo.c" and "bar.c" would compare "c", then ".", etc). And that would let you tell the difference between "foo/long-path.c" and "bar/long-path.c" (preferring the identical filenames over the merely similar ones), but still sort them together versus "some-unrelated-path.c". And what I wondered (and what I had initially thought full-name-hash was doing) was whether we could store some fixed-size hash that produces a similar distribution (modulo collisions) to what that reverse-name sort would do. -Peff
Jeff King <peff@peff.net> writes: >> Yes. Due to --depth limit, we need to break delta chains somewhere >> anyway, and a rename boundary is just as good place as any other in >> a sufficiently long chain. > > We don't necessarily have to break the chains due to depth limits, > because they are not always linear. They can end up as bushy trees, True. And being able to pair blobs before and after a rename will give us more candidates to place in a single bushy tree, so in that sense, with a short segment of history, it is understandable that the full-name hash fails to have as many candidates as the original hash gives us. But with sufficiently large number of blobs at the same path that are similar (i.e. not a "pushing a short segment of history", but an initial clone), splitting what could be one delta family into two delta families at the rename boundary is not too bad, as long as both halves have enough blobs to deltify against each other. > I'm not sure in practice how often we find these kinds of deltas. If you > look at, say, all the deltas for "Makefile" in git.git like this: > > git rev-list --objects --all --reflog --full-history -- Makefile | > perl -lne 'print $1 if /(.*) Makefile/' | > git cat-file --batch-check='%(objectsize:disk) %(objectname) %(deltabase)' > > my repo has 33 full copies (you can see non-deltas by grepping for > "0{40}$" in the output) out of 4473 total. So it's not like we never > break chains. But we can use graphviz to visualize it by piping the > above through: > > perl -alne ' > BEGIN { print "digraph {" } > print "node_$F[1] [label=$F[0]]"; > print "node_$F[1] -> node_$F[2]" if $F[2] !~ /^0+$/; > END { print "}" } > ' > > and then piping it through "dot" to make an svg, or using an interactive > viewer like "xdot" (the labels are the on-disk size of each object). I > see a lot of wide parts of the graph in the output. > > Of course this may all depend on packing patterns, too. I did my > investigations after running "git repack -adf" to generate what should > be a pretty reasonable pack. You might see something different from > incremental repacking over time. That is very true. I forgot that we do things to encourage bushy delta-base selection. One thing I also am happy to see is the effect of our "clever" delta-base selection, where the algorithm does not blindly favor the delta-base that makes the resulting delta absolutely minimal, but takes the depth of the delta-base into account (i.e. a base at a much shallower depth is preferred over a base near the depth limit, even if it results in a slightly larger delta data). > I'm not sure what any of this means for --path-walk, of course. ;) > Ultimately we care about resulting size and time to compute, so if it > can do better on those metrics then it doesn't matter what the graph > looks like. True, too. Another thing that we care about is the time to access data, and favoring shallow delta chain, even with the help of the in-core delta-base cache, has merit. Thanks.