mbox series

[0/5] PATH WALK III: Add 'git backfill' command

Message ID pull.1820.git.1733515638.gitgitgadget@gmail.com (mailing list archive)
Headers show
Series PATH WALK III: Add 'git backfill' command | expand

Message

Christoph Sommer via GitGitGadget Dec. 6, 2024, 8:07 p.m. UTC
This is based on v3 of ds/path-walk-1 [1] and an earlier version was part of
my initial path-walk RFC [2].

[1]
https://lore.kernel.org/git/pull.1818.v3.git.1733514358.gitgitgadget@gmail.com/

[2]
https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/

This series adds a new 'git backfill' command that uses the path-walk API to
download missing blobs in a blobless partial clone. Users can specify
interaction with the sparse-checkout using '--[no-]sparse' but the
'--sparse' option is implied by the existence of a sparse-checkout.

The reason to use the path-walk API is to make sure that the missing objects
are grouped by a common path, giving a reasonable process for batching
requests and expecting the server to compress the resulting packfile nicely
together.

I first prototyped this feature in June 2024 as an exploration and created
the path-walk algorithm for this purpose. It was only my intuition that led
me to believe that batching by path would lead to better packfiles. This has
been proven out as a very important feature due to recent investigations to
compressing full repositories by doing a better job of grouping objects by
path. See the --name-hash-version series [3] or the 'git pack-objects
--path-walk' series [4] (currently on hold as it conflicts with the
--name-hash-version series).

[3]
https://lore.kernel.org/git/pull.1823.v2.git.1733181682.gitgitgadget@gmail.com/

[4]
https://lore.kernel.org/git/pull.1813.v2.git.1729431810.gitgitgadget@gmail.com/

This idea can be further demonstrated by the evidence in testing this
feature: by downloading objects in small batch sizes, the client can force
the server to repack things more efficiently than a full repack.

The example repository I have used in multiple places is the
microsoft/fluentui repo [5] as it has many CHANGELOG.md files that cause
name hash collisions that make the full repack inefficient.

[5] https://github.com/microsoft/fluentui

If we create a blobless clone of the fluentui repo, then this downloads 105
MB across two packfiles (the commits and trees pack, followed by the blobs
needed for an initial checkout). Running 'git backfill --batch-size=' for
different sizes leads to some interesting results:

| Batch Size      | Pack Count | Pack Size | Time   |
|-----------------|------------|-----------|--------|
| (Initial clone) | 2          | 105 MB    |        |
| 5K              | 53         | 348 MB    | 2m 26s |
| 10K             | 28         | 365 MB    | 2m 22s |
| 15K             | 19         | 407 MB    | 2m 21s |
| 20K             | 15         | 393 MB    | 2m 28s |
| 25K             | 13         | 417 MB    | 2m 06s |
| 50K             | 8          | 509 MB    | 1m 34s |
| 100K            | 5          | 535 MB    | 1m 56s |
| 250K            | 4          | 698 MB    | 1m 33s |
| 500K            | 3          | 696 MB    | 1m 42s |


The smaller batches cause the server to realize that the existing deltas
cannot be reused and it finds better deltas. This takes some extra time for
the small batches, but halves the size of the repo. Even in the 500K batch
size, we get less data than the 738 MB of a full clone.

Implementing the --sparse feature is best done by augmenting the path-walk
API to be aware of a pattern list. This works for both cone and non-cone
mode sparse-checkouts.

There are future directions we could take this command, especially to run
the command with a user-specified pathspec. The tricky case for that
additional feature is trying to make the path-walk more efficient by
skipping tree paths that would not lead to a match of the pathspec. It would
likely need optimization in a small subset of pathspec features (such as
prefix matches) to work as efficiently as possible. I did prototype a
version that puts the pathspec match in the callback function within
builtin/backfill.c, but I found that uninspiring and unnecessary for now.

Thanks, -Stolee

Derrick Stolee (5):
  backfill: add builtin boilerplate
  backfill: basic functionality and tests
  backfill: add --batch-size=<n> option
  backfill: add --sparse option
  backfill: assume --sparse when sparse-checkout is enabled

 .gitignore                                |   1 +
 Documentation/git-backfill.txt            |  60 ++++++++
 Documentation/technical/api-path-walk.txt |  11 +-
 Makefile                                  |   1 +
 builtin.h                                 |   1 +
 builtin/backfill.c                        | 147 ++++++++++++++++++
 command-list.txt                          |   1 +
 dir.c                                     |  10 +-
 dir.h                                     |   3 +
 git.c                                     |   1 +
 path-walk.c                               |  18 +++
 path-walk.h                               |  11 ++
 t/helper/test-path-walk.c                 |  22 ++-
 t/t5620-backfill.sh                       | 178 ++++++++++++++++++++++
 t/t6601-path-walk.sh                      |  32 ++++
 15 files changed, 488 insertions(+), 9 deletions(-)
 create mode 100644 Documentation/git-backfill.txt
 create mode 100644 builtin/backfill.c
 create mode 100755 t/t5620-backfill.sh


base-commit: e716672c041473dd2bf257b7532b86696fef32a0
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1820%2Fderrickstolee%2Fbackfill-upstream-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1820/derrickstolee/backfill-upstream-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1820

Comments

Junio C Hamano Dec. 8, 2024, 10:53 a.m. UTC | #1
It seems that the result of calling init_revisions() from backfill
is leaked?

https://github.com/git/git/actions/runs/12218430154/job/34083929479 

I did not dig further but the below is from my local leaksanitizer
run.

Thanks.

=================================================================
==git==182342==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 576 byte(s) in 1 object(s) allocated from:
    #0 0x55d4b0e42915 in __interceptor_realloc (git+0x84915) (BuildId: c861e65ec43b0a3ef46b9555a81d6ddfc2358e8e)
    #1 0x55d4b119ce6d in xrealloc wrapper.c:140:8
    #2 0x55d4b11204d4 in add_rev_cmdline revision.c:1563:2
    #3 0x55d4b11187a1 in handle_revision_arg_1 revision.c:2263:2
    #4 0x55d4b1118398 in handle_revision_arg revision.c:2275:12
    #5 0x55d4b0e5233b in do_backfill builtin/backfill.c:100:2
    #6 0x55d4b0e52253 in cmd_backfill builtin/backfill.c:146:9
    #7 0x55d4b0e46b80 in run_builtin git.c:480:11
    #8 0x55d4b0e45502 in handle_builtin git.c:741:9
    #9 0x55d4b0e4649c in run_argv git.c:808:4
    #10 0x55d4b0e45274 in cmd_main git.c:948:19
    #11 0x55d4b0f6da8a in main common-main.c:9:11
    #12 0x7ff25f97ac89 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #13 0x7ff25f97ad44 in __libc_start_main csu/../csu/libc-start.c:360:3
    #14 0x55d4b0e171e0 in _start (git+0x591e0) (BuildId: c861e65ec43b0a3ef46b9555a81d6ddfc2358e8e)

DEDUP_TOKEN: __interceptor_realloc--xrealloc--add_rev_cmdline--handle_revision_arg_1--handle_revision_arg--do_backfill--cmd_backfill--run_builtin--handle_builtin--run_argv--cmd_main--main--__libc_start_call_main--__libc_start_main--_start
Indirect leak of 5 byte(s) in 1 object(s) allocated from:
    #0 0x55d4b0e424b6 in __interceptor_malloc (git+0x844b6) (BuildId: c861e65ec43b0a3ef46b9555a81d6ddfc2358e8e)
    #1 0x7ff25f9f34f9 in strdup string/strdup.c:42:15
    #2 0x55d4b119cb14 in xstrdup wrapper.c:43:14
    #3 0x55d4b1120506 in add_rev_cmdline revision.c:1565:23
    #4 0x55d4b11187a1 in handle_revision_arg_1 revision.c:2263:2
    #5 0x55d4b1118398 in handle_revision_arg revision.c:2275:12
    #6 0x55d4b0e5233b in do_backfill builtin/backfill.c:100:2
    #7 0x55d4b0e52253 in cmd_backfill builtin/backfill.c:146:9
    #8 0x55d4b0e46b80 in run_builtin git.c:480:11
    #9 0x55d4b0e45502 in handle_builtin git.c:741:9
    #10 0x55d4b0e4649c in run_argv git.c:808:4
    #11 0x55d4b0e45274 in cmd_main git.c:948:19
    #12 0x55d4b0f6da8a in main common-main.c:9:11
    #13 0x7ff25f97ac89 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #14 0x7ff25f97ad44 in __libc_start_main csu/../csu/libc-start.c:360:3
    #15 0x55d4b0e171e0 in _start (git+0x591e0) (BuildId: c861e65ec43b0a3ef46b9555a81d6ddfc2358e8e)

DEDUP_TOKEN: __interceptor_malloc--strdup--xstrdup--add_rev_cmdline--handle_revision_arg_1--handle_revision_arg--do_backfill--cmd_backfill--run_builtin--handle_builtin--run_argv--cmd_main--main--__libc_start_call_main--__libc_start_main--_start
SUMMARY: LeakSanitizer: 581 byte(s) leaked in 2 allocation(s).
Our logs revealed a memory leak...
++ rmdir /home/gitster/w/git.git/t/test-results/t5620-backfill.leak
++ :
++ exit 134
++ eval_ret=134
Junio C Hamano Dec. 9, 2024, 12:34 a.m. UTC | #2
Junio C Hamano <gitster@pobox.com> writes:

> It seems that the result of calling init_revisions() from backfill
> is leaked?
>
> https://github.com/git/git/actions/runs/12218430154/job/34083929479 
>
> I did not dig further but the below is from my local leaksanitizer
> run.

The attached patch seems to plug the leaks observed by your backfill
tests.  If you agree with the implementation of the change, you are
welcome to squash it in.  I may be missing a better mechanism in the
path-walk API that I could use to plug the leaks, in which case, of
course a fix using such a better mechanism is very much welcomed.

A few things I noticed about path-walk API during my poking are:

 - struct path_walk_info has PATH_WALK_INFO_INIT() macro, but not a
   matching path_walk_info_init() helper function to help initialize
   any heap-allocated instances.  In general, anything that requires
   _INIT() macro by definition wants to be initialized with more
   than nul-initialized (otherwise, it would be fine to leave it
   uninitialized in the BSS, clear with = {0} initialization on the
   stack), so lack of matching path_walk_info_init() raises an
   eyebrow.

 - struct path_walk_info seems to lack a destructor.  In general,
   anything complex enough to require initializer should have one.

 - lack of destructor for path_walk_info requires users to release
   resources held by "struct pattern_list" instance at pwi.pl; if we
   had a destructor, we wouldn't have 2 of the three leaks we had to
   fix here.

Thanks.


diff --git c/builtin/backfill.c w/builtin/backfill.c
index 225764f17e..1cf2df9303 100644
--- c/builtin/backfill.c
+++ w/builtin/backfill.c
@@ -92,8 +92,11 @@ static int do_backfill(struct backfill_context *ctx)
 
 	if (ctx->sparse) {
 		CALLOC_ARRAY(info.pl, 1);
-		if (get_sparse_checkout_patterns(info.pl))
+		if (get_sparse_checkout_patterns(info.pl)) {
+			clear_pattern_list(info.pl);
+			free(info.pl);
 			return error(_("problem loading sparse-checkout"));
+		}
 	}
 
 	repo_init_revisions(ctx->repo, &revs, "");
@@ -113,6 +116,11 @@ static int do_backfill(struct backfill_context *ctx)
 		download_batch(ctx);
 
 	clear_backfill_context(ctx);
+	release_revisions(&revs);
+	if (info.pl) {
+		clear_pattern_list(info.pl);
+		free(info.pl);
+	}
 	return ret;
 }