diff mbox series

[12/17] builtin/repack.c: support generating a cruft pack

Message ID a05675ab834ac5e8bc3ab72847b0621a563e0e1b.1638224692.git.me@ttaylorr.com (mailing list archive)
State New, archived
Headers show
Series cruft packs | expand

Commit Message

Taylor Blau Nov. 29, 2021, 10:25 p.m. UTC
Expose a way to split the contents of a repository into a main and cruft
pack when doing an all-into-one repack with `git repack --cruft -d`, and
a complementary configuration variable.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-repack.txt            |  11 ++
 Documentation/technical/cruft-packs.txt |   2 +-
 builtin/repack.c                        | 112 ++++++++++++++++-
 t/t5327-pack-objects-cruft.sh           | 153 ++++++++++++++++++++++++
 4 files changed, 272 insertions(+), 6 deletions(-)

Comments

Junio C Hamano Dec. 5, 2021, 8:46 p.m. UTC | #1
Various thoughts on just this part, as the hunk got my attention
while merging with other topics in 'seen'.

> +	if (pack_everything & PACK_CRUFT && delete_redundant) {
> +		if (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))
> +			die(_("--cruft and -A are incompatible"));
> +		if (keep_unreachable)
> +			die(_("--cruft and -k are incompatible"));
> +		if (!(pack_everything & ALL_INTO_ONE))
> +			die(_("--cruft must be combined with all-into-one"));
> +	}

The "reuse similar messages for i18n" topic will encourage us to
turn this part into:

	if (pack_everything & PACK_CRUFT && delete_redundant) {
		if (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))
			die(_("%s and %s are mutually exclusive"),
			    "--cruft", "-A");
		if (keep_unreachable)
			die(_("%s and %s are mutually exclusive"),
			    "--cruft", "-k");
		if (!(pack_everything & ALL_INTO_ONE))
			die(_("--cruft must be combined with all-into-one"));
	}

The conditionals are a bit unpleasant to read and maintain, but I
guess we cannot help it?

Saying ALL_INTO_ONE is a bit unfriendly to the end user, who would
probably not know that it is the name the code gave to the bit that
is turned on when given an option externally known under a different
name (is that "-a"?).

If "--cruft" must be used with "all into one", I wonder if it makes
sense to make it imply that?  Not in the sense that OPT_BIT()
initially flips the ALL_INTO_ONE bit on upon seeing "--cruft", but
after parse_options() returns, we check PACK_CRUFT and if it is on
turn ALL_INTO_ONE also on (so even if '-a' gains '--all-into-one'
option, the user won't break us by giving "--no-all-into-one" after
they gave us "--cruft")?  I didn't think about this part thoroughly
enough, though.

Thanks.
Derrick Stolee Dec. 7, 2021, 3:38 p.m. UTC | #2
On 11/29/2021 5:25 PM, Taylor Blau wrote:

> +static int write_cruft_pack(const struct pack_objects_args *args,
> +			    const char *pack_prefix,
> +			    struct string_list *names,
> +			    struct string_list *existing_packs,
> +			    struct string_list *existing_kept_packs)
> +{
> +	struct child_process cmd = CHILD_PROCESS_INIT;
> +	struct strbuf line = STRBUF_INIT;
> +	struct string_list_item *item;
> +	FILE *in, *out;
> +	int ret;
> +
> +	prepare_pack_objects(&cmd, args);
> +
> +	strvec_push(&cmd.args, "--cruft");
> +	if (cruft_expiration)
> +		strvec_pushf(&cmd.args, "--cruft-expiration=%s",
> +			     cruft_expiration);
> +
> +	strvec_push(&cmd.args, "--honor-pack-keep");
> +	strvec_push(&cmd.args, "--non-empty");
> +	strvec_push(&cmd.args, "--max-pack-size=0");

This --max-pack-size is meaningless, right? The config that would change
this is already ignored by 'git pack-objects'.

> +		OPT_BIT(0, "cruft", &pack_everything,
> +				N_("same as -a, pack unreachable cruft objects separately"),
> +				   PACK_CRUFT | ALL_INTO_ONE),

I can understand the use of OPT_BIT here. Keep in mind that --no-cruft would
remove the '-a' option, if it already existed. Perhaps we should just use
OPT_BOOL and update to add the ALL_INTO_ONE if PACK_CRUFT exists?

> +		OPT_STRING(0, "cruft-expiration", &cruft_expiration, N_("approxidate"),
> +				N_("with -C, expire objects older than this")),

Here, --no-cruft-expiration will set cruft_expiration to NULL and not overwrite
the --cruft option, as expected. Just pointing out that this is different than
the option in 'git pack-objects'.

> --- a/t/t5327-pack-objects-cruft.sh
> +++ b/t/t5327-pack-objects-cruft.sh
> @@ -358,4 +358,157 @@ test_expect_success 'expired objects are pruned' '
>  	)
>  '
>  
> +test_expect_success 'repack --cruft generates a cruft pack' '
> +	git init repo &&
> +	test_when_finished "rm -fr repo" &&
> +	(
> +		cd repo &&
> +
> +		test_commit reachable &&
> +		git branch -M main &&
> +		git checkout --orphan other &&

Here is a way to make objects unreachable!

> +		test_commit unreachable &&
> +
> +		git checkout main &&
> +		git branch -D other &&
> +		git tag -d unreachable &&

Thanks,
-Stolee
Taylor Blau Feb. 23, 2022, 11:37 p.m. UTC | #3
(Jumping forward a little bit while responding to your review to finish
my train of though before I log off for today...)

On Tue, Dec 07, 2021 at 10:38:05AM -0500, Derrick Stolee wrote:
> > --- a/t/t5327-pack-objects-cruft.sh
> > +++ b/t/t5327-pack-objects-cruft.sh
> > @@ -358,4 +358,157 @@ test_expect_success 'expired objects are pruned' '
> >  	)
> >  '
> >
> > +test_expect_success 'repack --cruft generates a cruft pack' '
> > +	git init repo &&
> > +	test_when_finished "rm -fr repo" &&
> > +	(
> > +		cd repo &&
> > +
> > +		test_commit reachable &&
> > +		git branch -M main &&
> > +		git checkout --orphan other &&
>
> Here is a way to make objects unreachable!

Yes, indeed. And this is the first spot where we *need* to care about
object reachability, because the set of packs that `git repack` passes
over stdin to `git pack-objects --cruft` depends on which objects are
and aren't reachable.

In the tests that exercise `pack-objects --cruft` directly, we can
pretend that certain packs contain only unreachable objects by marking
them as "discarded".

Thanks,
Taylor
Taylor Blau March 1, 2022, 2 a.m. UTC | #4
On Sun, Dec 05, 2021 at 12:46:19PM -0800, Junio C Hamano wrote:
> Various thoughts on just this part, as the hunk got my attention
> while merging with other topics in 'seen'.
>
> > +	if (pack_everything & PACK_CRUFT && delete_redundant) {
> > +		if (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))
> > +			die(_("--cruft and -A are incompatible"));
> > +		if (keep_unreachable)
> > +			die(_("--cruft and -k are incompatible"));
> > +		if (!(pack_everything & ALL_INTO_ONE))
> > +			die(_("--cruft must be combined with all-into-one"));
> > +	}
>
> The "reuse similar messages for i18n" topic will encourage us to
> turn this part into:
>
> 	if (pack_everything & PACK_CRUFT && delete_redundant) {
> 		if (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))
> 			die(_("%s and %s are mutually exclusive"),
> 			    "--cruft", "-A");
> 		if (keep_unreachable)
> 			die(_("%s and %s are mutually exclusive"),
> 			    "--cruft", "-k");
> 		if (!(pack_everything & ALL_INTO_ONE))
> 			die(_("--cruft must be combined with all-into-one"));
> 	}

Thanks, done.

> The conditionals are a bit unpleasant to read and maintain, but I
> guess we cannot help it?

I don't know that I find them unpleasant to read, but perhaps they are a
hassle to maintain (as we add new, mutually-exclusive options). But I
can't seem to think of a better alternative...

> Saying ALL_INTO_ONE is a bit unfriendly to the end user, who would
> probably not know that it is the name the code gave to the bit that
> is turned on when given an option externally known under a different
> name (is that "-a"?).
>
> If "--cruft" must be used with "all into one", I wonder if it makes
> sense to make it imply that?  Not in the sense that OPT_BIT()
> initially flips the ALL_INTO_ONE bit on upon seeing "--cruft", but
> after parse_options() returns, we check PACK_CRUFT and if it is on
> turn ALL_INTO_ONE also on (so even if '-a' gains '--all-into-one'
> option, the user won't break us by giving "--no-all-into-one" after
> they gave us "--cruft")?  I didn't think about this part thoroughly
> enough, though.

Yes, `--cruft` must be used with an option that sets ALL_INTO_ONE. Since
we don't have any automatic '--no-' versions of single character
options, I think that this conditional is currently redundant, but I
agree that this code would break if we (a) removed the conditional
you're talking about and (b) allowed passing something like
`--no-all-into-one` which unsets the ALL_INTO_ONE bit.

So setting ALL_INTO_ONE ourselves _after_ option parsing is done makes
sense to me, thanks.

Thanks,
Taylor
diff mbox series

Patch

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 7183fb498f..4f8f4b5a1f 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -63,6 +63,17 @@  to the new separate pack will be written.
 	Also run  'git prune-packed' to remove redundant
 	loose object files.
 
+--cruft::
+	Same as `-a`, unless `-d` is used. Then any unreachable objects
+	are packed into a separate cruft pack. Unreachable objects can
+	be pruned using the normal expiry rules with the next `git gc`
+	invocation (see linkgit:git-gc[1]). Incompatible with `-k`.
+
+--cruft-expiration=<approxidate>::
+	Expire unreachable objects older than `<approxidate>`
+	immediately instead of waiting for the next `git gc` invocation.
+	Only useful with `--cruft -d`.
+
 -l::
 	Pass the `--local` option to 'git pack-objects'. See
 	linkgit:git-pack-objects[1].
diff --git a/Documentation/technical/cruft-packs.txt b/Documentation/technical/cruft-packs.txt
index bb54cce1b1..b7daad2e3e 100644
--- a/Documentation/technical/cruft-packs.txt
+++ b/Documentation/technical/cruft-packs.txt
@@ -16,7 +16,7 @@  pruned according to normal expiry rules with the next 'git gc' invocation.
 
 Unreachable objects aren't removed immediately, since doing so could race with
 an incoming push which may reference an object which is about to be deleted.
-Instead, those unreachable objects are stored as loose object and stay that way
+Instead, those unreachable objects are stored as loose objects and stay that way
 until they are older than the expiration window, at which point they are removed
 by linkgit:git-prune[1].
 
diff --git a/builtin/repack.c b/builtin/repack.c
index acbb7b8c3b..68b4bdf06f 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -18,11 +18,17 @@ 
 #include "pack-bitmap.h"
 #include "refs.h"
 
+#define ALL_INTO_ONE 1
+#define LOOSEN_UNREACHABLE 2
+#define PACK_CRUFT 4
+
+static int pack_everything;
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
 static int write_bitmaps = -1;
 static int use_delta_islands;
 static char *packdir, *packtmp_name, *packtmp;
+static char *cruft_expiration;
 
 static const char *const git_repack_usage[] = {
 	N_("git repack [<options>]"),
@@ -54,6 +60,7 @@  static int repack_config(const char *var, const char *value, void *cb)
 		use_delta_islands = git_config_bool(var, value);
 		return 0;
 	}
+
 	return git_default_config(var, value, cb);
 }
 
@@ -298,9 +305,6 @@  static void repack_promisor_objects(const struct pack_objects_args *args,
 		die(_("could not finish pack-objects to repack promisor objects"));
 }
 
-#define ALL_INTO_ONE 1
-#define LOOSEN_UNREACHABLE 2
-
 struct pack_geometry {
 	struct packed_git **pack;
 	uint32_t pack_nr, pack_alloc;
@@ -337,6 +341,8 @@  static void init_pack_geometry(struct pack_geometry **geometry_p)
 	for (p = get_all_packs(the_repository); p; p = p->next) {
 		if (!pack_kept_objects && p->pack_keep)
 			continue;
+		if (p->is_cruft)
+			continue;
 
 		ALLOC_GROW(geometry->pack,
 			   geometry->pack_nr + 1,
@@ -598,6 +604,67 @@  static int write_midx_included_packs(struct string_list *include,
 	return finish_command(&cmd);
 }
 
+static int write_cruft_pack(const struct pack_objects_args *args,
+			    const char *pack_prefix,
+			    struct string_list *names,
+			    struct string_list *existing_packs,
+			    struct string_list *existing_kept_packs)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf line = STRBUF_INIT;
+	struct string_list_item *item;
+	FILE *in, *out;
+	int ret;
+
+	prepare_pack_objects(&cmd, args);
+
+	strvec_push(&cmd.args, "--cruft");
+	if (cruft_expiration)
+		strvec_pushf(&cmd.args, "--cruft-expiration=%s",
+			     cruft_expiration);
+
+	strvec_push(&cmd.args, "--honor-pack-keep");
+	strvec_push(&cmd.args, "--non-empty");
+	strvec_push(&cmd.args, "--max-pack-size=0");
+
+	cmd.in = -1;
+
+	ret = start_command(&cmd);
+	if (ret)
+		return ret;
+
+	/*
+	 * names has a confusing double use: it both provides the list
+	 * of just-written new packs, and accepts the name of the cruft
+	 * pack we are writing.
+	 *
+	 * By the time it is read here, it contains only the pack(s)
+	 * that were just written, which is exactly the set of packs we
+	 * want to consider kept.
+	 */
+	in = xfdopen(cmd.in, "w");
+	for_each_string_list_item(item, names)
+		fprintf(in, "%s-%s.pack\n", pack_prefix, item->string);
+	for_each_string_list_item(item, existing_packs)
+		fprintf(in, "-%s.pack\n", item->string);
+	for_each_string_list_item(item, existing_kept_packs)
+		fprintf(in, "%s.pack\n", item->string);
+	fclose(in);
+
+	out = xfdopen(cmd.out, "r");
+	while (strbuf_getline_lf(&line, out) != EOF) {
+		if (line.len != the_hash_algo->hexsz)
+			die(_("repack: Expecting full hex object ID lines only "
+			      "from pack-objects."));
+		string_list_append(names, line.buf);
+	}
+	fclose(out);
+
+	strbuf_release(&line);
+
+	return finish_command(&cmd);
+}
+
 int cmd_repack(int argc, const char **argv, const char *prefix)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -614,7 +681,6 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 	int show_progress = isatty(2);
 
 	/* variables to be filled by option parsing */
-	int pack_everything = 0;
 	int delete_redundant = 0;
 	const char *unpack_unreachable = NULL;
 	int keep_unreachable = 0;
@@ -630,6 +696,11 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 		OPT_BIT('A', NULL, &pack_everything,
 				N_("same as -a, and turn unreachable objects loose"),
 				   LOOSEN_UNREACHABLE | ALL_INTO_ONE),
+		OPT_BIT(0, "cruft", &pack_everything,
+				N_("same as -a, pack unreachable cruft objects separately"),
+				   PACK_CRUFT | ALL_INTO_ONE),
+		OPT_STRING(0, "cruft-expiration", &cruft_expiration, N_("approxidate"),
+				N_("with -C, expire objects older than this")),
 		OPT_BOOL('d', NULL, &delete_redundant,
 				N_("remove redundant packs, and run git-prune-packed")),
 		OPT_BOOL('f', NULL, &po_args.no_reuse_delta,
@@ -681,6 +752,14 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (keep_unreachable &&
 	    (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE)))
 		die(_("--keep-unreachable and -A are incompatible"));
+	if (pack_everything & PACK_CRUFT && delete_redundant) {
+		if (unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE))
+			die(_("--cruft and -A are incompatible"));
+		if (keep_unreachable)
+			die(_("--cruft and -k are incompatible"));
+		if (!(pack_everything & ALL_INTO_ONE))
+			die(_("--cruft must be combined with all-into-one"));
+	}
 
 	if (write_bitmaps < 0) {
 		if (!write_midx &&
@@ -763,7 +842,8 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (pack_everything & ALL_INTO_ONE) {
 		repack_promisor_objects(&po_args, &names);
 
-		if (existing_nonkept_packs.nr && delete_redundant) {
+		if (existing_nonkept_packs.nr && delete_redundant &&
+		    !(pack_everything & PACK_CRUFT)) {
 			for_each_string_list_item(item, &names) {
 				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
 					     packtmp_name, item->string);
@@ -798,6 +878,7 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 		return ret;
 
 	if (geometry) {
+		struct packed_git *p;
 		FILE *in = xfdopen(cmd.in, "w");
 		/*
 		 * The resulting pack should contain all objects in packs that
@@ -808,6 +889,12 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 			fprintf(in, "%s\n", pack_basename(geometry->pack[i]));
 		for (i = geometry->split; i < geometry->pack_nr; i++)
 			fprintf(in, "^%s\n", pack_basename(geometry->pack[i]));
+
+		for (p = get_all_packs(the_repository); p; p = p->next) {
+			if (!p->is_cruft)
+				continue;
+			fprintf(in, "^%s\n", pack_basename(p));
+		}
 		fclose(in);
 	}
 
@@ -825,6 +912,21 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (!names.nr && !po_args.quiet)
 		printf_ln(_("Nothing new to pack."));
 
+	if (pack_everything & PACK_CRUFT) {
+		const char *pack_prefix;
+		if (!skip_prefix(packtmp, packdir, &pack_prefix))
+			die(_("pack prefix %s does not begin with objdir %s"),
+			    packtmp, packdir);
+		if (*pack_prefix == '/')
+			pack_prefix++;
+
+		ret = write_cruft_pack(&po_args, pack_prefix, &names,
+				       &existing_nonkept_packs,
+				       &existing_kept_packs);
+		if (ret)
+			return ret;
+	}
+
 	for_each_string_list_item(item, &names) {
 		item->util = (void *)(uintptr_t)populate_pack_exts(item->string);
 	}
diff --git a/t/t5327-pack-objects-cruft.sh b/t/t5327-pack-objects-cruft.sh
index 31d4a561fe..ed1a113ab6 100755
--- a/t/t5327-pack-objects-cruft.sh
+++ b/t/t5327-pack-objects-cruft.sh
@@ -358,4 +358,157 @@  test_expect_success 'expired objects are pruned' '
 	)
 '
 
+test_expect_success 'repack --cruft generates a cruft pack' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git branch -M main &&
+		git checkout --orphan other &&
+		test_commit unreachable &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d unreachable &&
+		# objects are not cruft if they are contained in the reflogs
+		rm -fr .git/logs &&
+
+		git rev-list --objects --all --no-object-names >reachable.raw &&
+		git cat-file --batch-all-objects --batch-check="%(objectname)" >objects &&
+		sort <reachable.raw >reachable &&
+		comm -13 reachable objects >unreachable &&
+
+		git repack --cruft -d &&
+
+		cruft=$(basename $(ls $packdir/pack-*.mtimes) .mtimes) &&
+		pack=$(basename $(ls $packdir/pack-*.pack | grep -v $cruft) .pack) &&
+
+		git show-index <$packdir/$pack.idx >actual.raw &&
+		cut -f2 -d" " actual.raw | sort >actual &&
+		test_cmp reachable actual &&
+
+		git show-index <$packdir/$cruft.idx >actual.raw &&
+		cut -f2 -d" " actual.raw | sort >actual &&
+		test_cmp unreachable actual
+	)
+'
+
+test_expect_success 'loose objects mtimes upsert others' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git repack -Ad &&
+		git branch -M main &&
+
+		git checkout --orphan other &&
+		test_commit cruft &&
+		# incremental repack, leaving existing objects loose (so
+		# they can be "freshened")
+		git repack &&
+
+		tip="$(git rev-parse cruft)" &&
+		path="$objdir/$(test_oid_to_path "$(git rev-parse cruft)")" &&
+		test-tool chmtime --get +1000 "$path" >expect &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d cruft &&
+		rm -fr .git/logs &&
+
+		git repack --cruft -d &&
+
+		mtimes="$(basename $(ls $packdir/pack-*.mtimes))" &&
+		test-tool pack-mtimes "$mtimes" >actual.raw &&
+		grep "$tip" actual.raw | cut -d" " -f2 >actual &&
+		test_cmp expect actual
+	)
+'
+
+test_expect_success 'cruft packs are not included in geometric repack' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit reachable &&
+		git repack -Ad &&
+		git branch -M main &&
+
+		git checkout --orphan other &&
+		test_commit cruft &&
+		git repack -d &&
+
+		git checkout main &&
+		git branch -D other &&
+		git tag -d cruft &&
+		rm -fr .git/logs &&
+
+		git repack --cruft &&
+
+		find $packdir -type f | sort >before &&
+		git repack --geometric=2 -d &&
+		find $packdir -type f | sort >after &&
+
+		test_cmp before after
+	)
+'
+test_expect_success 'cruft repack with no reachable objects' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit base &&
+		git repack -ad &&
+
+		base="$(git rev-parse base)" &&
+
+		git for-each-ref --format="delete %(refname)" >in &&
+		git update-ref --stdin <in &&
+		rm -fr .git/logs &&
+		rm -fr .git/index &&
+
+		git repack --cruft -d &&
+
+		git cat-file -t $base
+	)
+'
+
+test_expect_success 'cruft repack ignores --max-pack-size' '
+	git init max-pack-size &&
+	(
+		cd max-pack-size &&
+		test_commit base &&
+		# two cruft objects which exceed the maximum pack size
+		test-tool genrandom foo 1048576 | git hash-object --stdin -w &&
+		test-tool genrandom bar 1048576 | git hash-object --stdin -w &&
+		git repack --cruft --max-pack-size=1M &&
+		find $packdir -name "*.mtimes" >cruft &&
+		test_line_count = 1 cruft &&
+		test-tool pack-mtimes "$(basename "$(cat cruft)")" >objects &&
+		test_line_count = 2 objects
+	)
+'
+
+test_expect_success 'cruft repack ignores pack.packSizeLimit' '
+	(
+		cd max-pack-size &&
+		# repack everything back together to remove the existing cruft
+		# pack (but to keep its objects)
+		git repack -adk &&
+		git -c pack.packSizeLimit=1M repack --cruft &&
+		# ensure the same post condition is met when --max-pack-size
+		# would otherwise be inferred from the configuration
+		find $packdir -name "*.mtimes" >cruft &&
+		test_line_count = 1 cruft &&
+		test-tool pack-mtimes "$(basename "$(cat cruft)")" >objects &&
+		test_line_count = 2 objects
+	)
+'
+
 test_done