diff mbox series

[8/8] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps

Message ID 6a1c52181e8c8c9fe2f0e2d7fbeb1057f68c1f3d.1631331139.git.me@ttaylorr.com (mailing list archive)
State Superseded
Headers show
Series repack: introduce `--write-midx` | expand

Commit Message

Taylor Blau Sept. 11, 2021, 3:32 a.m. UTC
To prevent the race described in an earlier patch, generate and pass a
reference snapshot to the multi-pack bitmap code, if we are writing one
from `git repack`.

This patch is mostly limited to creating a temporary file, and then
calling for_each_ref(). Except we try to minimize duplicates, since
doing so can drastically reduce the size in network-of-forks style
repositories. In the kernel's fork network (the repository containing
all objects from the kernel and all its forks), deduplicating the
references drops the snapshot size from 934 MB to just 12 MB.

But since we're handling duplicates in this way, we have to make sure
that we preferred references (those listed in pack.preferBitmapTips)
before non-preferred ones (to avoid recording an object which is pointed
at by a preferred tip as non-preferred).

We accomplish this by doing separate passes over the references: first
visiting each prefix in pack.preferBitmapTips, and then over the rest of
the references.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)

Comments

Ævar Arnfjörð Bjarmason Sept. 11, 2021, 10:27 a.m. UTC | #1
On Fri, Sep 10 2021, Taylor Blau wrote:

> +struct midx_snapshot_ref_data {
> +	struct tempfile *f;

Style/readability: Spare more than one byte for a variable name in a
struct, maybe file, or tmpfile?

> +	fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "",
> +		oid_to_hex(oid));

Just an idea: Maybe the file could be lines of "+\tOID\n" instead of
"+OID\n"? Lends itself more naturally to extension, use with the likes
of string_list_split() etc.

> +		for_each_string_list_item(item, preferred) {
> +			for_each_ref_in(item->string, midx_snapshot_ref_one, &data);
> +		}

Cheap style commenst I know, but throughout this series there's an odd
mixture of sometimes adding braces for one-lines, sometimes not, or in
the case of that "else ;" I commented on not for something *much* bigger
:)

> +		if (midx_snapshot_refs(refs_snapshot) < 0)
> +			die(_("could not take a snapshot of references"));

Seems like this die() should be moved into the midx_snapshot_refs()
function and that function should use die_errno(), not die(), also it
uses close_tempfile_gently(), shouldn't save the errno on failure there,
clean up the tempfile, and then die?
Ævar Arnfjörð Bjarmason Sept. 11, 2021, 11:19 a.m. UTC | #2
On Sat, Sep 11 2021, Ævar Arnfjörð Bjarmason wrote:

> On Fri, Sep 10 2021, Taylor Blau wrote:
>
>> +struct midx_snapshot_ref_data {
>> +	struct tempfile *f;
>
> Style/readability: Spare more than one byte for a variable name in a
> struct, maybe file, or tmpfile?
>
>> +	fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "",
>> +		oid_to_hex(oid));
>
> Just an idea: Maybe the file could be lines of "+\tOID\n" instead of
> "+OID\n"? Lends itself more naturally to extension, use with the likes
> of string_list_split() etc.

Actually, even better a format like:

    "OID[\t+]\n"

Or

    "OID[\tpreferred=1]\n"

Then you can just use the strbuf API to eat the lines, and only parse
any optional \t-delimited "flags" if the length of the line you get
doesn't match the hash size length.

I don't think it matters for speed or whatever, just ease of parsing &
working with this new format.

The "+" v.s. "preferred=1" suggestion is just to make it future-proof
for comma-delimited key-value options, which we already use in the
protocol etc., so it's easy to parse, but I don't know if we're ever
likely to add a new flag here (and even if we are, we can use the next
"\t" positional...
Taylor Blau Sept. 11, 2021, 4:49 p.m. UTC | #3
On Sat, Sep 11, 2021 at 12:27:39PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
> On Fri, Sep 10 2021, Taylor Blau wrote:
>
> > +struct midx_snapshot_ref_data {
> > +	struct tempfile *f;
>
> Style/readability: Spare more than one byte for a variable name in a
> struct, maybe file, or tmpfile?

Sure.

> > +	fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "",
> > +		oid_to_hex(oid));
>
> Just an idea: Maybe the file could be lines of "+\tOID\n" instead of
> "+OID\n"? Lends itself more naturally to extension, use with the likes
> of string_list_split() etc.
>
> > +		for_each_string_list_item(item, preferred) {
> > +			for_each_ref_in(item->string, midx_snapshot_ref_one, &data);
> > +		}
>
> Cheap style commenst I know, but throughout this series there's an odd
> mixture of sometimes adding braces for one-lines, sometimes not, or in
> the case of that "else ;" I commented on not for something *much* bigger
> :)

I've always been uncomfortable with brace-less for loops when the loop
is hidden behind a macro. I haven't bothered to check, but my gut
feeling is that it is pretty rare to see a for_each_something() macro
with a one-line body that doesn't have braces.

But my gut is wrong, because

    git grep 'for_each_.*($' **/*.c | wc -l

turns up 99 results! So I'm happy to be more consistent with the other
code (despite my general discomfort with it). An observation is that
even this file isn't consistent about braceless macro'd for loops, even
before this series.

So even cleaning it up will still leave this file in an inconsistent
state, but I don't think it's worth spending much time at all thinking
about cleaning it up.

>
> > +		if (midx_snapshot_refs(refs_snapshot) < 0)
> > +			die(_("could not take a snapshot of references"));
>
> Seems like this die() should be moved into the midx_snapshot_refs()
> function and that function should use die_errno(), not die(), also it
> uses close_tempfile_gently(), shouldn't save the errno on failure there,
> clean up the tempfile, and then die?

Yeah, makes sense.

Thanks,
Taylor
Taylor Blau Sept. 11, 2021, 4:51 p.m. UTC | #4
On Sat, Sep 11, 2021 at 01:19:27PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
> On Sat, Sep 11 2021, Ævar Arnfjörð Bjarmason wrote:
>
> > On Fri, Sep 10 2021, Taylor Blau wrote:
> >
> >> +struct midx_snapshot_ref_data {
> >> +	struct tempfile *f;
> >
> > Style/readability: Spare more than one byte for a variable name in a
> > struct, maybe file, or tmpfile?
> >
> >> +	fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "",
> >> +		oid_to_hex(oid));
> >
> > Just an idea: Maybe the file could be lines of "+\tOID\n" instead of
> > "+OID\n"? Lends itself more naturally to extension, use with the likes
> > of string_list_split() etc.
>
> Actually, even better a format like:
>
>     "OID[\t+]\n"
>
> Or
>
>     "OID[\tpreferred=1]\n"

Sure, but I admit that I'm a little torn on this suggestion. I don't
want to be naive and say that we're never going to change this format
and paint ourselves into a corner.

On the other hand, changing it does seem extremely unlikely to me, and
this tab-delimited thing feels like overkill compared to how simple the
'+' format is.

So, I don't know. It's certainly easy enough to change now before we
lock it in, so I guess we should.

Thanks,
Taylor
Jeff King Sept. 14, 2021, 6:55 p.m. UTC | #5
On Sat, Sep 11, 2021 at 12:51:24PM -0400, Taylor Blau wrote:

> > >> +	fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "",
> > >> +		oid_to_hex(oid));
> > >
> > > Just an idea: Maybe the file could be lines of "+\tOID\n" instead of
> > > "+OID\n"? Lends itself more naturally to extension, use with the likes
> > > of string_list_split() etc.
> >
> > Actually, even better a format like:
> >
> >     "OID[\t+]\n"
> >
> > Or
> >
> >     "OID[\tpreferred=1]\n"
> 
> Sure, but I admit that I'm a little torn on this suggestion. I don't
> want to be naive and say that we're never going to change this format
> and paint ourselves into a corner.
> 
> On the other hand, changing it does seem extremely unlikely to me, and
> this tab-delimited thing feels like overkill compared to how simple the
> '+' format is.
> 
> So, I don't know. It's certainly easy enough to change now before we
> lock it in, so I guess we should.

I'm not sure I really see the point of making this infinitely
extensible. This is mostly an internal interface between two Git
programs. Sure, it's exposed to the user in the sense that they can use
--refs-snapshot themselves. But if some writer wants to add a "foo"
flag, do they really want to be able to do it in a way that they're
_syntactically_ compatible with the older versions of Git, yet have no
clue if their option was actually recognized and accepted?

I.e., if we later add such a flag, then:

  +OID\tfoo
  +\tOID\tfoo
  OID\t+\tfoo

does not seem to make a big difference. We will likely add a new
command-line option to say "by the way, I am passing you some extra
information" at which point the format can be whatever we like.

And unless it is really complicated, I'd just as soon use a
single-character marker again, because it's dead-simple to generate and
parse (and again, this is an internal interface and not something we
expect users to have to understand).

Unless we think the "+" is too ugly or confusing, but I don't. The "-"
prefix for pack-objects read_object_list_from_stdin() is similar and
works just fine. For that matter, the --stdin-packs feature there works
similarly. Everything more feels a bit like over-engineering at this
point.

-Peff
Taylor Blau Sept. 14, 2021, 11:34 p.m. UTC | #6
On Tue, Sep 14, 2021 at 02:55:03PM -0400, Jeff King wrote:
> On Sat, Sep 11, 2021 at 12:51:24PM -0400, Taylor Blau wrote:
>
> > > >> +	fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "",
> > > >> +		oid_to_hex(oid));
> > > >
> > > > Just an idea: Maybe the file could be lines of "+\tOID\n" instead of
> > > > "+OID\n"? Lends itself more naturally to extension, use with the likes
> > > > of string_list_split() etc.
> > >
> > > Actually, even better a format like:
> > >
> > >     "OID[\t+]\n"
> > >
> > > Or
> > >
> > >     "OID[\tpreferred=1]\n"
> >
> > Sure, but I admit that I'm a little torn on this suggestion. I don't
> > want to be naive and say that we're never going to change this format
> > and paint ourselves into a corner.
> >
> > On the other hand, changing it does seem extremely unlikely to me, and
> > this tab-delimited thing feels like overkill compared to how simple the
> > '+' format is.
> >
> > So, I don't know. It's certainly easy enough to change now before we
> > lock it in, so I guess we should.
>
> I'm not sure I really see the point of making this infinitely
> extensible. This is mostly an internal interface between two Git
> programs. Sure, it's exposed to the user in the sense that they can use
> --refs-snapshot themselves. But if some writer wants to add a "foo"
> flag, do they really want to be able to do it in a way that they're
> _syntactically_ compatible with the older versions of Git, yet have no
> clue if their option was actually recognized and accepted?

I like this perspective, and tend to agree.

(I'm basically repeating what you're saying but) it seems to me that
trying to make the interface of `--refs-snapshot` compatible across
versions of Git is stretching what we consider to be our compatibility
guarantee given that:

  - the interface is basically expected to be private between `repack`
    and `multi-pack-index`, and

  - an infinitely-extensible version of `--refs-snapshot` would still
    have no way to tell the caller whether or not those new flags were
    accepted

So I tend to agree the existing format works fine and we shouldn't spend
a ton of time trying to over-engineer a solution.

(FWIW, the `+` vs `-` thing is intentional; `--stdin-packs` uses `-` to
_exclude_ packs, but `+` here means "this one is special" not "exclude
this ref").

Thanks,
Taylor
Ævar Arnfjörð Bjarmason Sept. 14, 2021, 11:56 p.m. UTC | #7
On Tue, Sep 14 2021, Taylor Blau wrote:

> On Tue, Sep 14, 2021 at 02:55:03PM -0400, Jeff King wrote:
>> On Sat, Sep 11, 2021 at 12:51:24PM -0400, Taylor Blau wrote:
>>
>> > > >> +	fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "",
>> > > >> +		oid_to_hex(oid));
>> > > >
>> > > > Just an idea: Maybe the file could be lines of "+\tOID\n" instead of
>> > > > "+OID\n"? Lends itself more naturally to extension, use with the likes
>> > > > of string_list_split() etc.
>> > >
>> > > Actually, even better a format like:
>> > >
>> > >     "OID[\t+]\n"
>> > >
>> > > Or
>> > >
>> > >     "OID[\tpreferred=1]\n"
>> >
>> > Sure, but I admit that I'm a little torn on this suggestion. I don't
>> > want to be naive and say that we're never going to change this format
>> > and paint ourselves into a corner.
>> >
>> > On the other hand, changing it does seem extremely unlikely to me, and
>> > this tab-delimited thing feels like overkill compared to how simple the
>> > '+' format is.
>> >
>> > So, I don't know. It's certainly easy enough to change now before we
>> > lock it in, so I guess we should.
>>
>> I'm not sure I really see the point of making this infinitely
>> extensible. This is mostly an internal interface between two Git
>> programs. Sure, it's exposed to the user in the sense that they can use
>> --refs-snapshot themselves. But if some writer wants to add a "foo"
>> flag, do they really want to be able to do it in a way that they're
>> _syntactically_ compatible with the older versions of Git, yet have no
>> clue if their option was actually recognized and accepted?
>
> I like this perspective, and tend to agree.
>
> (I'm basically repeating what you're saying but) it seems to me that
> trying to make the interface of `--refs-snapshot` compatible across
> versions of Git is stretching what we consider to be our compatibility
> guarantee given that:
>
>   - the interface is basically expected to be private between `repack`
>     and `multi-pack-index`, and
>
>   - an infinitely-extensible version of `--refs-snapshot` would still
>     have no way to tell the caller whether or not those new flags were
>     accepted
>
> So I tend to agree the existing format works fine and we shouldn't spend
> a ton of time trying to over-engineer a solution.
>
> (FWIW, the `+` vs `-` thing is intentional; `--stdin-packs` uses `-` to
> _exclude_ packs, but `+` here means "this one is special" not "exclude
> this ref").

The suggestion I had offhand was just meant as an offhand "interesting,
maybe easier like...", i.e. if you found it easier to split on \t, or
check the hash size in the loop or whatever.

I agree that this doesn't matter and should just be left to whatever
you've got a taste for, thanks. This thread became more of a digression
than I thought...
Taylor Blau Sept. 15, 2021, 4:31 a.m. UTC | #8
On Wed, Sep 15, 2021 at 01:56:00AM +0200, Ævar Arnfjörð Bjarmason wrote:
> The suggestion I had offhand was just meant as an offhand "interesting,
> maybe easier like...", i.e. if you found it easier to split on \t, or
> check the hash size in the loop or whatever.

:-). You did say "Just an idea", but I treated your suggestion as if you
were concerned about being able to extend the format in the future.

> I agree that this doesn't matter and should just be left to whatever
> you've got a taste for, thanks. This thread became more of a digression
> than I thought...

I think that all of the discussion was reasonable if I understood your
original opinion more clearly ;). It may help (me, at least) to say more
explicitly, "This may be easier like [...] but I don't have any concerns
about functionality" or similar.

Thanks,
Taylor
diff mbox series

Patch

diff --git a/builtin/repack.c b/builtin/repack.c
index e958caff8b..3fda837cb5 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -15,6 +15,8 @@ 
 #include "promisor-remote.h"
 #include "shallow.h"
 #include "pack.h"
+#include "pack-bitmap.h"
+#include "refs.h"
 
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
@@ -440,6 +442,61 @@  static void clear_pack_geometry(struct pack_geometry *geometry)
 	geometry->split = 0;
 }
 
+struct midx_snapshot_ref_data {
+	struct tempfile *f;
+	struct oidset seen;
+	int preferred;
+};
+
+static int midx_snapshot_ref_one(const char *refname,
+				 const struct object_id *oid,
+				 int flag, void *_data)
+{
+	struct midx_snapshot_ref_data *data = _data;
+	struct object_id peeled;
+
+	if (!peel_iterated_oid(oid, &peeled))
+		oid = &peeled;
+
+	if (oidset_insert(&data->seen, oid))
+		return 0; /* already seen */
+
+	if (oid_object_info(the_repository, oid, NULL) != OBJ_COMMIT)
+		return 0;
+
+	fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "",
+		oid_to_hex(oid));
+
+	return 0;
+}
+
+static int midx_snapshot_refs(struct tempfile *f)
+{
+	struct midx_snapshot_ref_data data;
+	const struct string_list *preferred = bitmap_preferred_tips(the_repository);
+
+	data.f = f;
+	oidset_init(&data.seen, 0);
+
+	if (!fdopen_tempfile(f, "w"))
+		 die(_("could not open tempfile %s for writing"),
+		     get_tempfile_path(f));
+
+	if (preferred) {
+		struct string_list_item *item;
+
+		data.preferred = 1;
+		for_each_string_list_item(item, preferred) {
+			for_each_ref_in(item->string, midx_snapshot_ref_one, &data);
+		}
+		data.preferred = 0;
+	}
+
+	for_each_ref(midx_snapshot_ref_one, &data);
+
+	return close_tempfile_gently(f);
+}
+
 static void midx_included_packs(struct string_list *include,
 				struct string_list *existing_packs,
 				struct string_list *existing_kept_packs,
@@ -477,6 +534,7 @@  static void midx_included_packs(struct string_list *include,
 
 static int write_midx_included_packs(struct string_list *include,
 				     struct pack_geometry *geometry,
+				     const char *refs_snapshot,
 				     int show_progress, int write_bitmaps)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -511,6 +569,9 @@  static int write_midx_included_packs(struct string_list *include,
 			;
 	}
 
+	if (refs_snapshot)
+		strvec_pushf(&cmd.args, "--refs-snapshot=%s", refs_snapshot);
+
 	ret = start_command(&cmd);
 	if (ret)
 		return ret;
@@ -534,6 +595,7 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list existing_kept_packs = STRING_LIST_INIT_DUP;
 	struct pack_geometry *geometry = NULL;
 	struct strbuf line = STRBUF_INIT;
+	struct tempfile *refs_snapshot = NULL;
 	int i, ext, ret;
 	FILE *out;
 	int show_progress = isatty(2);
@@ -622,6 +684,19 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (write_bitmaps && !(pack_everything & ALL_INTO_ONE) && !write_midx)
 		die(_(incremental_bitmap_conflict_error));
 
+	if (write_midx && write_bitmaps) {
+		struct strbuf path = STRBUF_INIT;
+
+		strbuf_addf(&path, "%s/%s_XXXXXX", get_object_directory(),
+			    "bitmap-ref-tips");
+
+		refs_snapshot = xmks_tempfile(path.buf);
+		if (midx_snapshot_refs(refs_snapshot) < 0)
+			die(_("could not take a snapshot of references"));
+
+		strbuf_release(&path);
+	}
+
 	if (geometric_factor) {
 		if (pack_everything)
 			die(_("--geometric is incompatible with -A, -a"));
@@ -798,6 +873,7 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 				    &existing_kept_packs, &names, geometry);
 
 		ret = write_midx_included_packs(&include, geometry,
+						refs_snapshot ? get_tempfile_path(refs_snapshot) : NULL,
 						show_progress, write_bitmaps > 0);
 
 		string_list_clear(&include, 0);