Message ID | 6a1c52181e8c8c9fe2f0e2d7fbeb1057f68c1f3d.1631331139.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | repack: introduce `--write-midx` | expand |
On Fri, Sep 10 2021, Taylor Blau wrote: > +struct midx_snapshot_ref_data { > + struct tempfile *f; Style/readability: Spare more than one byte for a variable name in a struct, maybe file, or tmpfile? > + fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "", > + oid_to_hex(oid)); Just an idea: Maybe the file could be lines of "+\tOID\n" instead of "+OID\n"? Lends itself more naturally to extension, use with the likes of string_list_split() etc. > + for_each_string_list_item(item, preferred) { > + for_each_ref_in(item->string, midx_snapshot_ref_one, &data); > + } Cheap style commenst I know, but throughout this series there's an odd mixture of sometimes adding braces for one-lines, sometimes not, or in the case of that "else ;" I commented on not for something *much* bigger :) > + if (midx_snapshot_refs(refs_snapshot) < 0) > + die(_("could not take a snapshot of references")); Seems like this die() should be moved into the midx_snapshot_refs() function and that function should use die_errno(), not die(), also it uses close_tempfile_gently(), shouldn't save the errno on failure there, clean up the tempfile, and then die?
On Sat, Sep 11 2021, Ævar Arnfjörð Bjarmason wrote: > On Fri, Sep 10 2021, Taylor Blau wrote: > >> +struct midx_snapshot_ref_data { >> + struct tempfile *f; > > Style/readability: Spare more than one byte for a variable name in a > struct, maybe file, or tmpfile? > >> + fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "", >> + oid_to_hex(oid)); > > Just an idea: Maybe the file could be lines of "+\tOID\n" instead of > "+OID\n"? Lends itself more naturally to extension, use with the likes > of string_list_split() etc. Actually, even better a format like: "OID[\t+]\n" Or "OID[\tpreferred=1]\n" Then you can just use the strbuf API to eat the lines, and only parse any optional \t-delimited "flags" if the length of the line you get doesn't match the hash size length. I don't think it matters for speed or whatever, just ease of parsing & working with this new format. The "+" v.s. "preferred=1" suggestion is just to make it future-proof for comma-delimited key-value options, which we already use in the protocol etc., so it's easy to parse, but I don't know if we're ever likely to add a new flag here (and even if we are, we can use the next "\t" positional...
On Sat, Sep 11, 2021 at 12:27:39PM +0200, Ævar Arnfjörð Bjarmason wrote: > > On Fri, Sep 10 2021, Taylor Blau wrote: > > > +struct midx_snapshot_ref_data { > > + struct tempfile *f; > > Style/readability: Spare more than one byte for a variable name in a > struct, maybe file, or tmpfile? Sure. > > + fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "", > > + oid_to_hex(oid)); > > Just an idea: Maybe the file could be lines of "+\tOID\n" instead of > "+OID\n"? Lends itself more naturally to extension, use with the likes > of string_list_split() etc. > > > + for_each_string_list_item(item, preferred) { > > + for_each_ref_in(item->string, midx_snapshot_ref_one, &data); > > + } > > Cheap style commenst I know, but throughout this series there's an odd > mixture of sometimes adding braces for one-lines, sometimes not, or in > the case of that "else ;" I commented on not for something *much* bigger > :) I've always been uncomfortable with brace-less for loops when the loop is hidden behind a macro. I haven't bothered to check, but my gut feeling is that it is pretty rare to see a for_each_something() macro with a one-line body that doesn't have braces. But my gut is wrong, because git grep 'for_each_.*($' **/*.c | wc -l turns up 99 results! So I'm happy to be more consistent with the other code (despite my general discomfort with it). An observation is that even this file isn't consistent about braceless macro'd for loops, even before this series. So even cleaning it up will still leave this file in an inconsistent state, but I don't think it's worth spending much time at all thinking about cleaning it up. > > > + if (midx_snapshot_refs(refs_snapshot) < 0) > > + die(_("could not take a snapshot of references")); > > Seems like this die() should be moved into the midx_snapshot_refs() > function and that function should use die_errno(), not die(), also it > uses close_tempfile_gently(), shouldn't save the errno on failure there, > clean up the tempfile, and then die? Yeah, makes sense. Thanks, Taylor
On Sat, Sep 11, 2021 at 01:19:27PM +0200, Ævar Arnfjörð Bjarmason wrote: > > On Sat, Sep 11 2021, Ævar Arnfjörð Bjarmason wrote: > > > On Fri, Sep 10 2021, Taylor Blau wrote: > > > >> +struct midx_snapshot_ref_data { > >> + struct tempfile *f; > > > > Style/readability: Spare more than one byte for a variable name in a > > struct, maybe file, or tmpfile? > > > >> + fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "", > >> + oid_to_hex(oid)); > > > > Just an idea: Maybe the file could be lines of "+\tOID\n" instead of > > "+OID\n"? Lends itself more naturally to extension, use with the likes > > of string_list_split() etc. > > Actually, even better a format like: > > "OID[\t+]\n" > > Or > > "OID[\tpreferred=1]\n" Sure, but I admit that I'm a little torn on this suggestion. I don't want to be naive and say that we're never going to change this format and paint ourselves into a corner. On the other hand, changing it does seem extremely unlikely to me, and this tab-delimited thing feels like overkill compared to how simple the '+' format is. So, I don't know. It's certainly easy enough to change now before we lock it in, so I guess we should. Thanks, Taylor
On Sat, Sep 11, 2021 at 12:51:24PM -0400, Taylor Blau wrote: > > >> + fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "", > > >> + oid_to_hex(oid)); > > > > > > Just an idea: Maybe the file could be lines of "+\tOID\n" instead of > > > "+OID\n"? Lends itself more naturally to extension, use with the likes > > > of string_list_split() etc. > > > > Actually, even better a format like: > > > > "OID[\t+]\n" > > > > Or > > > > "OID[\tpreferred=1]\n" > > Sure, but I admit that I'm a little torn on this suggestion. I don't > want to be naive and say that we're never going to change this format > and paint ourselves into a corner. > > On the other hand, changing it does seem extremely unlikely to me, and > this tab-delimited thing feels like overkill compared to how simple the > '+' format is. > > So, I don't know. It's certainly easy enough to change now before we > lock it in, so I guess we should. I'm not sure I really see the point of making this infinitely extensible. This is mostly an internal interface between two Git programs. Sure, it's exposed to the user in the sense that they can use --refs-snapshot themselves. But if some writer wants to add a "foo" flag, do they really want to be able to do it in a way that they're _syntactically_ compatible with the older versions of Git, yet have no clue if their option was actually recognized and accepted? I.e., if we later add such a flag, then: +OID\tfoo +\tOID\tfoo OID\t+\tfoo does not seem to make a big difference. We will likely add a new command-line option to say "by the way, I am passing you some extra information" at which point the format can be whatever we like. And unless it is really complicated, I'd just as soon use a single-character marker again, because it's dead-simple to generate and parse (and again, this is an internal interface and not something we expect users to have to understand). Unless we think the "+" is too ugly or confusing, but I don't. The "-" prefix for pack-objects read_object_list_from_stdin() is similar and works just fine. For that matter, the --stdin-packs feature there works similarly. Everything more feels a bit like over-engineering at this point. -Peff
On Tue, Sep 14, 2021 at 02:55:03PM -0400, Jeff King wrote: > On Sat, Sep 11, 2021 at 12:51:24PM -0400, Taylor Blau wrote: > > > > >> + fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "", > > > >> + oid_to_hex(oid)); > > > > > > > > Just an idea: Maybe the file could be lines of "+\tOID\n" instead of > > > > "+OID\n"? Lends itself more naturally to extension, use with the likes > > > > of string_list_split() etc. > > > > > > Actually, even better a format like: > > > > > > "OID[\t+]\n" > > > > > > Or > > > > > > "OID[\tpreferred=1]\n" > > > > Sure, but I admit that I'm a little torn on this suggestion. I don't > > want to be naive and say that we're never going to change this format > > and paint ourselves into a corner. > > > > On the other hand, changing it does seem extremely unlikely to me, and > > this tab-delimited thing feels like overkill compared to how simple the > > '+' format is. > > > > So, I don't know. It's certainly easy enough to change now before we > > lock it in, so I guess we should. > > I'm not sure I really see the point of making this infinitely > extensible. This is mostly an internal interface between two Git > programs. Sure, it's exposed to the user in the sense that they can use > --refs-snapshot themselves. But if some writer wants to add a "foo" > flag, do they really want to be able to do it in a way that they're > _syntactically_ compatible with the older versions of Git, yet have no > clue if their option was actually recognized and accepted? I like this perspective, and tend to agree. (I'm basically repeating what you're saying but) it seems to me that trying to make the interface of `--refs-snapshot` compatible across versions of Git is stretching what we consider to be our compatibility guarantee given that: - the interface is basically expected to be private between `repack` and `multi-pack-index`, and - an infinitely-extensible version of `--refs-snapshot` would still have no way to tell the caller whether or not those new flags were accepted So I tend to agree the existing format works fine and we shouldn't spend a ton of time trying to over-engineer a solution. (FWIW, the `+` vs `-` thing is intentional; `--stdin-packs` uses `-` to _exclude_ packs, but `+` here means "this one is special" not "exclude this ref"). Thanks, Taylor
On Tue, Sep 14 2021, Taylor Blau wrote: > On Tue, Sep 14, 2021 at 02:55:03PM -0400, Jeff King wrote: >> On Sat, Sep 11, 2021 at 12:51:24PM -0400, Taylor Blau wrote: >> >> > > >> + fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "", >> > > >> + oid_to_hex(oid)); >> > > > >> > > > Just an idea: Maybe the file could be lines of "+\tOID\n" instead of >> > > > "+OID\n"? Lends itself more naturally to extension, use with the likes >> > > > of string_list_split() etc. >> > > >> > > Actually, even better a format like: >> > > >> > > "OID[\t+]\n" >> > > >> > > Or >> > > >> > > "OID[\tpreferred=1]\n" >> > >> > Sure, but I admit that I'm a little torn on this suggestion. I don't >> > want to be naive and say that we're never going to change this format >> > and paint ourselves into a corner. >> > >> > On the other hand, changing it does seem extremely unlikely to me, and >> > this tab-delimited thing feels like overkill compared to how simple the >> > '+' format is. >> > >> > So, I don't know. It's certainly easy enough to change now before we >> > lock it in, so I guess we should. >> >> I'm not sure I really see the point of making this infinitely >> extensible. This is mostly an internal interface between two Git >> programs. Sure, it's exposed to the user in the sense that they can use >> --refs-snapshot themselves. But if some writer wants to add a "foo" >> flag, do they really want to be able to do it in a way that they're >> _syntactically_ compatible with the older versions of Git, yet have no >> clue if their option was actually recognized and accepted? > > I like this perspective, and tend to agree. > > (I'm basically repeating what you're saying but) it seems to me that > trying to make the interface of `--refs-snapshot` compatible across > versions of Git is stretching what we consider to be our compatibility > guarantee given that: > > - the interface is basically expected to be private between `repack` > and `multi-pack-index`, and > > - an infinitely-extensible version of `--refs-snapshot` would still > have no way to tell the caller whether or not those new flags were > accepted > > So I tend to agree the existing format works fine and we shouldn't spend > a ton of time trying to over-engineer a solution. > > (FWIW, the `+` vs `-` thing is intentional; `--stdin-packs` uses `-` to > _exclude_ packs, but `+` here means "this one is special" not "exclude > this ref"). The suggestion I had offhand was just meant as an offhand "interesting, maybe easier like...", i.e. if you found it easier to split on \t, or check the hash size in the loop or whatever. I agree that this doesn't matter and should just be left to whatever you've got a taste for, thanks. This thread became more of a digression than I thought...
On Wed, Sep 15, 2021 at 01:56:00AM +0200, Ævar Arnfjörð Bjarmason wrote: > The suggestion I had offhand was just meant as an offhand "interesting, > maybe easier like...", i.e. if you found it easier to split on \t, or > check the hash size in the loop or whatever. :-). You did say "Just an idea", but I treated your suggestion as if you were concerned about being able to extend the format in the future. > I agree that this doesn't matter and should just be left to whatever > you've got a taste for, thanks. This thread became more of a digression > than I thought... I think that all of the discussion was reasonable if I understood your original opinion more clearly ;). It may help (me, at least) to say more explicitly, "This may be easier like [...] but I don't have any concerns about functionality" or similar. Thanks, Taylor
diff --git a/builtin/repack.c b/builtin/repack.c index e958caff8b..3fda837cb5 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -15,6 +15,8 @@ #include "promisor-remote.h" #include "shallow.h" #include "pack.h" +#include "pack-bitmap.h" +#include "refs.h" static int delta_base_offset = 1; static int pack_kept_objects = -1; @@ -440,6 +442,61 @@ static void clear_pack_geometry(struct pack_geometry *geometry) geometry->split = 0; } +struct midx_snapshot_ref_data { + struct tempfile *f; + struct oidset seen; + int preferred; +}; + +static int midx_snapshot_ref_one(const char *refname, + const struct object_id *oid, + int flag, void *_data) +{ + struct midx_snapshot_ref_data *data = _data; + struct object_id peeled; + + if (!peel_iterated_oid(oid, &peeled)) + oid = &peeled; + + if (oidset_insert(&data->seen, oid)) + return 0; /* already seen */ + + if (oid_object_info(the_repository, oid, NULL) != OBJ_COMMIT) + return 0; + + fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "", + oid_to_hex(oid)); + + return 0; +} + +static int midx_snapshot_refs(struct tempfile *f) +{ + struct midx_snapshot_ref_data data; + const struct string_list *preferred = bitmap_preferred_tips(the_repository); + + data.f = f; + oidset_init(&data.seen, 0); + + if (!fdopen_tempfile(f, "w")) + die(_("could not open tempfile %s for writing"), + get_tempfile_path(f)); + + if (preferred) { + struct string_list_item *item; + + data.preferred = 1; + for_each_string_list_item(item, preferred) { + for_each_ref_in(item->string, midx_snapshot_ref_one, &data); + } + data.preferred = 0; + } + + for_each_ref(midx_snapshot_ref_one, &data); + + return close_tempfile_gently(f); +} + static void midx_included_packs(struct string_list *include, struct string_list *existing_packs, struct string_list *existing_kept_packs, @@ -477,6 +534,7 @@ static void midx_included_packs(struct string_list *include, static int write_midx_included_packs(struct string_list *include, struct pack_geometry *geometry, + const char *refs_snapshot, int show_progress, int write_bitmaps) { struct child_process cmd = CHILD_PROCESS_INIT; @@ -511,6 +569,9 @@ static int write_midx_included_packs(struct string_list *include, ; } + if (refs_snapshot) + strvec_pushf(&cmd.args, "--refs-snapshot=%s", refs_snapshot); + ret = start_command(&cmd); if (ret) return ret; @@ -534,6 +595,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix) struct string_list existing_kept_packs = STRING_LIST_INIT_DUP; struct pack_geometry *geometry = NULL; struct strbuf line = STRBUF_INIT; + struct tempfile *refs_snapshot = NULL; int i, ext, ret; FILE *out; int show_progress = isatty(2); @@ -622,6 +684,19 @@ int cmd_repack(int argc, const char **argv, const char *prefix) if (write_bitmaps && !(pack_everything & ALL_INTO_ONE) && !write_midx) die(_(incremental_bitmap_conflict_error)); + if (write_midx && write_bitmaps) { + struct strbuf path = STRBUF_INIT; + + strbuf_addf(&path, "%s/%s_XXXXXX", get_object_directory(), + "bitmap-ref-tips"); + + refs_snapshot = xmks_tempfile(path.buf); + if (midx_snapshot_refs(refs_snapshot) < 0) + die(_("could not take a snapshot of references")); + + strbuf_release(&path); + } + if (geometric_factor) { if (pack_everything) die(_("--geometric is incompatible with -A, -a")); @@ -798,6 +873,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix) &existing_kept_packs, &names, geometry); ret = write_midx_included_packs(&include, geometry, + refs_snapshot ? get_tempfile_path(refs_snapshot) : NULL, show_progress, write_bitmaps > 0); string_list_clear(&include, 0);
To prevent the race described in an earlier patch, generate and pass a reference snapshot to the multi-pack bitmap code, if we are writing one from `git repack`. This patch is mostly limited to creating a temporary file, and then calling for_each_ref(). Except we try to minimize duplicates, since doing so can drastically reduce the size in network-of-forks style repositories. In the kernel's fork network (the repository containing all objects from the kernel and all its forks), deduplicating the references drops the snapshot size from 934 MB to just 12 MB. But since we're handling duplicates in this way, we have to make sure that we preferred references (those listed in pack.preferBitmapTips) before non-preferred ones (to avoid recording an object which is pointed at by a preferred tip as non-preferred). We accomplish this by doing separate passes over the references: first visiting each prefix in pack.preferBitmapTips, and then over the rest of the references. Signed-off-by: Taylor Blau <me@ttaylorr.com> --- builtin/repack.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 76 insertions(+)