diff mbox series

[1/2] builtin/repack.c: simplify cruft pack aggregation

Message ID 8564f98259727225391edcb5ab3b47dd53f00e48.1740680964.git.me@ttaylorr.com (mailing list archive)
State New
Headers show
Series pack-objects: freshen objects with multi-cruft packs | expand

Commit Message

Taylor Blau Feb. 27, 2025, 6:29 p.m. UTC
In 37dc6d8104 (builtin/repack.c: implement support for
`--max-cruft-size`, 2023-10-02), 'git repack' built on support for
multiple cruft packs in Git by instructing 'git pack-objects --cruft'
how to aggregate smaller cruft packs up to the provided threshold.

The implementation in 37dc6d8104 worked something like the following
pseudo-code:

    total_size = 0;

    for (p in cruft packs) {
      if (p->pack_size + total_size < max_size) {
        total_size += p->pack_size;
        collapse(p)
      } else {
        retain(p);
      }
    }

The original idea behind this approach was that smaller cruft packs
would get combined together until the sum of their sizes was no larger
than the given max pack size.

There is a much simpler way to achieve this, however, which is to simply
combine *all* cruft packs which are smaller than the threshold,
regardless of what their sum is. With '--max-pack-size', 'pack-objects'
will split out the resulting pack into individual pack(s) if necessary
to ensure that the written pack(s) are each no larger than the provided
threshold.

This yields a slight behavior change, which is reflected in the removed
test. Previous to this change, we would aggregate smaller cruft packs
first, whereas now we will opportunistically combine as many cruft packs
as possible. As as result, that test is no longer relevant, and can be
deleted.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c        | 38 ++-----------------------------------
 t/t7704-repack-cruft.sh | 42 -----------------------------------------
 2 files changed, 2 insertions(+), 78 deletions(-)

Comments

Elijah Newren Feb. 27, 2025, 7:23 p.m. UTC | #1
On Thu, Feb 27, 2025 at 10:29 AM Taylor Blau <me@ttaylorr.com> wrote:
>
> In 37dc6d8104 (builtin/repack.c: implement support for
> `--max-cruft-size`, 2023-10-02), 'git repack' built on support for
> multiple cruft packs in Git by instructing 'git pack-objects --cruft'
> how to aggregate smaller cruft packs up to the provided threshold.
>
> The implementation in 37dc6d8104 worked something like the following
> pseudo-code:
>
>     total_size = 0;
>
>     for (p in cruft packs) {
>       if (p->pack_size + total_size < max_size) {
>         total_size += p->pack_size;
>         collapse(p)
>       } else {
>         retain(p);
>       }
>     }
>
> The original idea behind this approach was that smaller cruft packs
> would get combined together until the sum of their sizes was no larger
> than the given max pack size.
>
> There is a much simpler way to achieve this, however, which is to simply
> combine *all* cruft packs which are smaller than the threshold,
> regardless of what their sum is. With '--max-pack-size', 'pack-objects'
> will split out the resulting pack into individual pack(s) if necessary
> to ensure that the written pack(s) are each no larger than the provided
> threshold.

That doesn't really "achieve this" though, unless the antecedent of
"this" isn't what was described in the previous paragraph but
something elsewhere.  I suspect your actual meaning was something
along the lines of "There is a much simpler way to combine cruft
packs, however, which..." ?

> This yields a slight behavior change, which is reflected in the removed
> test. Previous to this change, we would aggregate smaller cruft packs
> first, whereas now we will opportunistically combine as many cruft packs
> as possible. As as result, that test is no longer relevant, and can be
> deleted.

I like the idea, since it sounds like it should be simpler...

> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  builtin/repack.c        | 38 ++-----------------------------------
>  t/t7704-repack-cruft.sh | 42 -----------------------------------------
>  2 files changed, 2 insertions(+), 78 deletions(-)
>
> diff --git a/builtin/repack.c b/builtin/repack.c
> index 75e3752353a..4d83d40f39f 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -1022,29 +1022,13 @@ static int write_filtered_pack(const struct pack_objects_args *args,
>         return finish_pack_objects_cmd(&cmd, names, local);
>  }
>
> -static int existing_cruft_pack_cmp(const void *va, const void *vb)
> -{
> -       struct packed_git *a = *(struct packed_git **)va;
> -       struct packed_git *b = *(struct packed_git **)vb;
> -
> -       if (a->pack_size < b->pack_size)
> -               return -1;
> -       if (a->pack_size > b->pack_size)
> -               return 1;
> -       return 0;
> -}
> -
>  static void collapse_small_cruft_packs(FILE *in, size_t max_size,
>                                        struct existing_packs *existing)
>  {
> -       struct packed_git **existing_cruft, *p;
> +       struct packed_git *p;
>         struct strbuf buf = STRBUF_INIT;
> -       size_t total_size = 0;
> -       size_t existing_cruft_nr = 0;
>         size_t i;
>
> -       ALLOC_ARRAY(existing_cruft, existing->cruft_packs.nr);
> -
>         for (p = get_all_packs(the_repository); p; p = p->next) {
>                 if (!(p->is_cruft && p->pack_local))
>                         continue;
> @@ -1056,24 +1040,7 @@ static void collapse_small_cruft_packs(FILE *in, size_t max_size,
>                 if (!string_list_has_string(&existing->cruft_packs, buf.buf))
>                         continue;
>
> -               if (existing_cruft_nr >= existing->cruft_packs.nr)
> -                       BUG("too many cruft packs (found %"PRIuMAX", but knew "
> -                           "of %"PRIuMAX")",
> -                           (uintmax_t)existing_cruft_nr + 1,
> -                           (uintmax_t)existing->cruft_packs.nr);
> -               existing_cruft[existing_cruft_nr++] = p;
> -       }
> -
> -       QSORT(existing_cruft, existing_cruft_nr, existing_cruft_pack_cmp);
> -
> -       for (i = 0; i < existing_cruft_nr; i++) {
> -               size_t proposed;
> -
> -               p = existing_cruft[i];
> -               proposed = st_add(total_size, p->pack_size);
> -
> -               if (proposed <= max_size) {
> -                       total_size = proposed;
> +               if (p->pack_size < max_size) {

Look at all that deleted code.  Always nice to see a simplification in
action.  :-)

>                         fprintf(in, "-%s\n", pack_basename(p));
>                 } else {
>                         retain_cruft_pack(existing, p);
> @@ -1086,7 +1053,6 @@ static void collapse_small_cruft_packs(FILE *in, size_t max_size,
>                         existing->non_kept_packs.items[i].string);
>
>         strbuf_release(&buf);
> -       free(existing_cruft);
>  }
>
>  static int write_cruft_pack(const struct pack_objects_args *args,
> diff --git a/t/t7704-repack-cruft.sh b/t/t7704-repack-cruft.sh
> index 959e6e26488..5a76b541ddd 100755
> --- a/t/t7704-repack-cruft.sh
> +++ b/t/t7704-repack-cruft.sh
> @@ -194,48 +194,6 @@ test_expect_success '--max-cruft-size combines existing packs when below thresho
>         )
>  '
>
> -test_expect_success '--max-cruft-size combines smaller packs first' '
> -       git init max-cruft-size-consume-small &&
> -       (
> -               cd max-cruft-size-consume-small &&
> -
> -               test_commit base &&
> -               git repack -ad &&
> -
> -               cruft_foo="$(generate_cruft_pack foo 524288)" &&    # 0.5 MiB
> -               cruft_bar="$(generate_cruft_pack bar 524288)" &&    # 0.5 MiB
> -               cruft_baz="$(generate_cruft_pack baz 1048576)" &&   # 1.0 MiB
> -               cruft_quux="$(generate_cruft_pack quux 1572864)" && # 1.5 MiB
> -
> -               test-tool pack-mtimes "$(basename $cruft_foo)" >expect.raw &&
> -               test-tool pack-mtimes "$(basename $cruft_bar)" >>expect.raw &&
> -               sort expect.raw >expect.objects &&
> -
> -               # repacking with `--max-cruft-size=2M` should combine
> -               # both 0.5 MiB packs together, instead of, say, one of
> -               # the 0.5 MiB packs with the 1.0 MiB pack
> -               ls $packdir/pack-*.mtimes | sort >cruft.before &&
> -               git repack -d --cruft --max-cruft-size=2M &&
> -               ls $packdir/pack-*.mtimes | sort >cruft.after &&
> -
> -               comm -13 cruft.before cruft.after >cruft.new &&
> -               comm -23 cruft.before cruft.after >cruft.removed &&
> -
> -               test_line_count = 1 cruft.new &&
> -               test_line_count = 2 cruft.removed &&
> -
> -               # the two smaller packs should be rolled up first
> -               printf "%s\n" $cruft_foo $cruft_bar | sort >expect.removed &&
> -               test_cmp expect.removed cruft.removed &&
> -
> -               # ...and contain the set of objects rolled up
> -               test-tool pack-mtimes "$(basename $(cat cruft.new))" >actual.raw &&
> -               sort actual.raw >actual.objects &&
> -
> -               test_cmp expect.objects actual.objects
> -       )
> -'
> -
>  test_expect_success 'setup --max-cruft-size with freshened objects' '
>         git init max-cruft-size-freshen &&
>         (
> --
> 2.49.0.rc0.2.gc0c926adde2

Looks good to me, other than the misleading/ambiguous wording in that
one paragraph of the commit message.
Taylor Blau Feb. 27, 2025, 10:53 p.m. UTC | #2
On Thu, Feb 27, 2025 at 11:23:02AM -0800, Elijah Newren wrote:
> > The original idea behind this approach was that smaller cruft packs
> > would get combined together until the sum of their sizes was no larger
> > than the given max pack size.
> >
> > There is a much simpler way to achieve this, however, which is to simply
> > combine *all* cruft packs which are smaller than the threshold,
> > regardless of what their sum is. With '--max-pack-size', 'pack-objects'
> > will split out the resulting pack into individual pack(s) if necessary
> > to ensure that the written pack(s) are each no larger than the provided
> > threshold.
>
> That doesn't really "achieve this" though, unless the antecedent of
> "this" isn't what was described in the previous paragraph but
> something elsewhere.  I suspect your actual meaning was something
> along the lines of "There is a much simpler way to combine cruft
> packs, however, which..." ?

Great suggestion, thanks. I swapped out "achieve this" for your
recommendation.

> > This yields a slight behavior change, which is reflected in the removed
> > test. Previous to this change, we would aggregate smaller cruft packs
> > first, whereas now we will opportunistically combine as many cruft packs
> > as possible. As as result, that test is no longer relevant, and can be
> > deleted.
>
> I like the idea, since it sounds like it should be simpler...

Heh. I don't know why I wrote it the way it was originally. I wrote the
second patch in this series first, and when I was trying to explain how
multi-cruft pack aggregation works I paused and then wrote what is now
the first patch.

Hindsight is 20/20, I suppose ;-).

Thanks,
Taylor
Patrick Steinhardt Feb. 28, 2025, 7:52 a.m. UTC | #3
On Thu, Feb 27, 2025 at 01:29:28PM -0500, Taylor Blau wrote:
> In 37dc6d8104 (builtin/repack.c: implement support for
> `--max-cruft-size`, 2023-10-02), 'git repack' built on support for
> multiple cruft packs in Git by instructing 'git pack-objects --cruft'
> how to aggregate smaller cruft packs up to the provided threshold.
> 
> The implementation in 37dc6d8104 worked something like the following
> pseudo-code:
> 
>     total_size = 0;
> 
>     for (p in cruft packs) {
>       if (p->pack_size + total_size < max_size) {
>         total_size += p->pack_size;
>         collapse(p)
>       } else {
>         retain(p);
>       }
>     }
> 
> The original idea behind this approach was that smaller cruft packs
> would get combined together until the sum of their sizes was no larger
> than the given max pack size.
> 
> There is a much simpler way to achieve this, however, which is to simply
> combine *all* cruft packs which are smaller than the threshold,
> regardless of what their sum is. With '--max-pack-size', 'pack-objects'
> will split out the resulting pack into individual pack(s) if necessary
> to ensure that the written pack(s) are each no larger than the provided
> threshold.

Hm. So the result would be a new set of packfiles where each of them is
smaller than the threshold, right? Wouldn't that mean that the next time
we'll again do the same thing and try to combine the new set of cruft
packs into one, and basically never arrive at a state where we don't
touch the cruft packs anymore?

Patrick
diff mbox series

Patch

diff --git a/builtin/repack.c b/builtin/repack.c
index 75e3752353a..4d83d40f39f 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -1022,29 +1022,13 @@  static int write_filtered_pack(const struct pack_objects_args *args,
 	return finish_pack_objects_cmd(&cmd, names, local);
 }
 
-static int existing_cruft_pack_cmp(const void *va, const void *vb)
-{
-	struct packed_git *a = *(struct packed_git **)va;
-	struct packed_git *b = *(struct packed_git **)vb;
-
-	if (a->pack_size < b->pack_size)
-		return -1;
-	if (a->pack_size > b->pack_size)
-		return 1;
-	return 0;
-}
-
 static void collapse_small_cruft_packs(FILE *in, size_t max_size,
 				       struct existing_packs *existing)
 {
-	struct packed_git **existing_cruft, *p;
+	struct packed_git *p;
 	struct strbuf buf = STRBUF_INIT;
-	size_t total_size = 0;
-	size_t existing_cruft_nr = 0;
 	size_t i;
 
-	ALLOC_ARRAY(existing_cruft, existing->cruft_packs.nr);
-
 	for (p = get_all_packs(the_repository); p; p = p->next) {
 		if (!(p->is_cruft && p->pack_local))
 			continue;
@@ -1056,24 +1040,7 @@  static void collapse_small_cruft_packs(FILE *in, size_t max_size,
 		if (!string_list_has_string(&existing->cruft_packs, buf.buf))
 			continue;
 
-		if (existing_cruft_nr >= existing->cruft_packs.nr)
-			BUG("too many cruft packs (found %"PRIuMAX", but knew "
-			    "of %"PRIuMAX")",
-			    (uintmax_t)existing_cruft_nr + 1,
-			    (uintmax_t)existing->cruft_packs.nr);
-		existing_cruft[existing_cruft_nr++] = p;
-	}
-
-	QSORT(existing_cruft, existing_cruft_nr, existing_cruft_pack_cmp);
-
-	for (i = 0; i < existing_cruft_nr; i++) {
-		size_t proposed;
-
-		p = existing_cruft[i];
-		proposed = st_add(total_size, p->pack_size);
-
-		if (proposed <= max_size) {
-			total_size = proposed;
+		if (p->pack_size < max_size) {
 			fprintf(in, "-%s\n", pack_basename(p));
 		} else {
 			retain_cruft_pack(existing, p);
@@ -1086,7 +1053,6 @@  static void collapse_small_cruft_packs(FILE *in, size_t max_size,
 			existing->non_kept_packs.items[i].string);
 
 	strbuf_release(&buf);
-	free(existing_cruft);
 }
 
 static int write_cruft_pack(const struct pack_objects_args *args,
diff --git a/t/t7704-repack-cruft.sh b/t/t7704-repack-cruft.sh
index 959e6e26488..5a76b541ddd 100755
--- a/t/t7704-repack-cruft.sh
+++ b/t/t7704-repack-cruft.sh
@@ -194,48 +194,6 @@  test_expect_success '--max-cruft-size combines existing packs when below thresho
 	)
 '
 
-test_expect_success '--max-cruft-size combines smaller packs first' '
-	git init max-cruft-size-consume-small &&
-	(
-		cd max-cruft-size-consume-small &&
-
-		test_commit base &&
-		git repack -ad &&
-
-		cruft_foo="$(generate_cruft_pack foo 524288)" &&    # 0.5 MiB
-		cruft_bar="$(generate_cruft_pack bar 524288)" &&    # 0.5 MiB
-		cruft_baz="$(generate_cruft_pack baz 1048576)" &&   # 1.0 MiB
-		cruft_quux="$(generate_cruft_pack quux 1572864)" && # 1.5 MiB
-
-		test-tool pack-mtimes "$(basename $cruft_foo)" >expect.raw &&
-		test-tool pack-mtimes "$(basename $cruft_bar)" >>expect.raw &&
-		sort expect.raw >expect.objects &&
-
-		# repacking with `--max-cruft-size=2M` should combine
-		# both 0.5 MiB packs together, instead of, say, one of
-		# the 0.5 MiB packs with the 1.0 MiB pack
-		ls $packdir/pack-*.mtimes | sort >cruft.before &&
-		git repack -d --cruft --max-cruft-size=2M &&
-		ls $packdir/pack-*.mtimes | sort >cruft.after &&
-
-		comm -13 cruft.before cruft.after >cruft.new &&
-		comm -23 cruft.before cruft.after >cruft.removed &&
-
-		test_line_count = 1 cruft.new &&
-		test_line_count = 2 cruft.removed &&
-
-		# the two smaller packs should be rolled up first
-		printf "%s\n" $cruft_foo $cruft_bar | sort >expect.removed &&
-		test_cmp expect.removed cruft.removed &&
-
-		# ...and contain the set of objects rolled up
-		test-tool pack-mtimes "$(basename $(cat cruft.new))" >actual.raw &&
-		sort actual.raw >actual.objects &&
-
-		test_cmp expect.objects actual.objects
-	)
-'
-
 test_expect_success 'setup --max-cruft-size with freshened objects' '
 	git init max-cruft-size-freshen &&
 	(