Message ID | 3e637d9ec83435540ad32b8325b0dce87f61bae0.1624314293.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | multi-pack reachability bitmaps | expand |
On Mon, Jun 21 2021, Taylor Blau wrote: > -static uint32_t find_object_pos(const struct object_id *oid) > +static uint32_t find_object_pos(const struct object_id *oid, int *found) > { > struct object_entry *entry = packlist_find(writer.to_pack, oid); > > if (!entry) { > - die("Failed to write bitmap index. Packfile doesn't have full closure " > + if (found) > + *found = 0; > + warning("Failed to write bitmap index. Packfile doesn't have full closure " > "(object %s is missing)", oid_to_hex(oid)); > + return 0; > } > > + if (found) > + *found = 1; > return oe_in_pack_pos(writer.to_pack, entry); > } > > @@ -331,9 +336,10 @@ static void bitmap_builder_clear(struct bitmap_builder *bb) > bb->commits_nr = bb->commits_alloc = 0; > } > > -static void fill_bitmap_tree(struct bitmap *bitmap, > - struct tree *tree) > +static int fill_bitmap_tree(struct bitmap *bitmap, > + struct tree *tree) > { > + int found; > uint32_t pos; > struct tree_desc desc; > struct name_entry entry; > @@ -342,9 +348,11 @@ static void fill_bitmap_tree(struct bitmap *bitmap, > * If our bit is already set, then there is nothing to do. Both this > * tree and all of its children will be set. > */ > - pos = find_object_pos(&tree->object.oid); > + pos = find_object_pos(&tree->object.oid, &found); > + if (!found) > + return -1; So, a function that returns an unsigned 32 bit int won't (presumably) have enough space for an "is bad", but before it died so it didn't matter. Now it warns, so it needs a "is bad", so we add another "int" to pass that information around. So if we're already paying for that extra space (which, on some platforms would already be a 64 bit int, and on some so would the uint32_t, it's just "at least 32 bits"). Wouldn't it be more idiomatic to just have find_object_pos() return int64_t now, if it's -1 it's an error, otherwise the "pos" is cast to uint32_t: diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c index 88d9e696a54..d71fa6f607a 100644 --- a/pack-bitmap-write.c +++ b/pack-bitmap-write.c @@ -125,14 +125,12 @@ static inline void push_bitmapped_commit(struct commit *commit) writer.selected_nr++; } -static uint32_t find_object_pos(const struct object_id *oid) +static int64_t find_object_pos(const struct object_id *oid) { struct object_entry *entry = packlist_find(writer.to_pack, oid); - if (!entry) { - die("Failed to write bitmap index. Packfile doesn't have full closure " - "(object %s is missing)", oid_to_hex(oid)); - } + if (!entry) + return -1; return oe_in_pack_pos(writer.to_pack, entry); } @@ -334,7 +332,7 @@ static void bitmap_builder_clear(struct bitmap_builder *bb) static void fill_bitmap_tree(struct bitmap *bitmap, struct tree *tree) { - uint32_t pos; + int64_t pos; struct tree_desc desc; struct name_entry entry; @@ -343,6 +341,9 @@ static void fill_bitmap_tree(struct bitmap *bitmap, * tree and all of its children will be set. */ pos = find_object_pos(&tree->object.oid); + if (pos < 0) + die("unhappy: %s", oid_to_hex(&tree->object.oid)); + if (bitmap_get(bitmap, pos)) return; bitmap_set(bitmap, pos); I mean, you don't want the die() part of that, but to me the rest looks better. > [...] > diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh > index 584a039b85..1667450917 100755 > --- a/t/t0410-partial-clone.sh > +++ b/t/t0410-partial-clone.sh > @@ -536,7 +536,13 @@ test_expect_success 'gc does not repack promisor objects if there are none' ' > repack_and_check () { > rm -rf repo2 && > cp -r repo repo2 && > - git -C repo2 repack $1 -d && > + if test x"$1" = "x--must-fail" > + then > + shift > + test_must_fail git -C repo2 repack $1 -d > + else > + git -C repo2 repack $1 -d > + fi && > git -C repo2 fsck && This sent me down the rabbit hole of https://lore.kernel.org/git/60c67bdf5b77e_f569220858@natae.notmuch/ again.
On Fri, Jun 25, 2021 at 01:23:40AM +0200, Ævar Arnfjörð Bjarmason wrote: > > On Mon, Jun 21 2021, Taylor Blau wrote: > > > -static uint32_t find_object_pos(const struct object_id *oid) > > +static uint32_t find_object_pos(const struct object_id *oid, int *found) > > { > > struct object_entry *entry = packlist_find(writer.to_pack, oid); > > > > if (!entry) { > > - die("Failed to write bitmap index. Packfile doesn't have full closure " > > + if (found) > > + *found = 0; > > + warning("Failed to write bitmap index. Packfile doesn't have full closure " > > "(object %s is missing)", oid_to_hex(oid)); > > + return 0; > > } > > > > + if (found) > > + *found = 1; > > return oe_in_pack_pos(writer.to_pack, entry); > > } > > So, a function that returns an unsigned 32 bit int won't (presumably) > have enough space for an "is bad", but before it died so it didn't > matter. > > Now it warns, so it needs a "is bad", so we add another "int" to pass > that information around. Right. You could imagine using the most-significant bit to indicate "bad" (which in this case is "I couldn't find this object that I'm supposed to be able to reach"), but of course it cuts our maximum number of objects in a bitmap in half. > So if we're already paying for that extra space (which, on some > platforms would already be a 64 bit int, and on some so would the > uint32_t, it's just "at least 32 bits"). > > Wouldn't it be more idiomatic to just have find_object_pos() return > int64_t now, if it's -1 it's an error, otherwise the "pos" is cast to > uint32_t: I'm not sure. It does save the extra argument, which is arguably more convenient for callers, but the cost for doing so is a cast from a signed integer type to an unsigned one (and a narrower destination type, at that). That seems easier to get wrong to me than passing a pointer to a pure "int" and keeping the return type a uint32_t. So, I'm probably more content to leave it as-is rather than change it. I don't feel too strongly about it, though, so if you do I'd be happy to hear more. Thanks, Taylor
On Wed, Jul 14 2021, Taylor Blau wrote: > On Fri, Jun 25, 2021 at 01:23:40AM +0200, Ævar Arnfjörð Bjarmason wrote: >> >> On Mon, Jun 21 2021, Taylor Blau wrote: >> >> > -static uint32_t find_object_pos(const struct object_id *oid) >> > +static uint32_t find_object_pos(const struct object_id *oid, int *found) >> > { >> > struct object_entry *entry = packlist_find(writer.to_pack, oid); >> > >> > if (!entry) { >> > - die("Failed to write bitmap index. Packfile doesn't have full closure " >> > + if (found) >> > + *found = 0; >> > + warning("Failed to write bitmap index. Packfile doesn't have full closure " >> > "(object %s is missing)", oid_to_hex(oid)); >> > + return 0; >> > } >> > >> > + if (found) >> > + *found = 1; >> > return oe_in_pack_pos(writer.to_pack, entry); >> > } >> >> So, a function that returns an unsigned 32 bit int won't (presumably) >> have enough space for an "is bad", but before it died so it didn't >> matter. >> >> Now it warns, so it needs a "is bad", so we add another "int" to pass >> that information around. > > Right. You could imagine using the most-significant bit to indicate > "bad" (which in this case is "I couldn't find this object that I'm > supposed to be able to reach"), but of course it cuts our maximum number > of objects in a bitmap in half. > >> So if we're already paying for that extra space (which, on some >> platforms would already be a 64 bit int, and on some so would the >> uint32_t, it's just "at least 32 bits"). >> >> Wouldn't it be more idiomatic to just have find_object_pos() return >> int64_t now, if it's -1 it's an error, otherwise the "pos" is cast to >> uint32_t: > > I'm not sure. It does save the extra argument, which is arguably more > convenient for callers, but the cost for doing so is a cast from a > signed integer type to an unsigned one (and a narrower destination type, > at that). > > That seems easier to get wrong to me than passing a pointer to a pure > "int" and keeping the return type a uint32_t. So, I'm probably more > content to leave it as-is rather than change it. > > I don't feel too strongly about it, though, so if you do I'd be happy to > hear more. I don't really care, it just looked a bit weird at first, and I wondered why it couldn't return -1. Aside from this case do you mean that such a cast would be too expensive in general, or fears abou going past the 32 bits? I assumed that there would be checks here for that already (and if not, we'd have wrap-around now...).
On Mon, Jun 21, 2021 at 06:25:01PM -0400, Taylor Blau wrote: > The set of objects covered by a bitmap must be closed under > reachability, since it must be the case that there is a valid bit > position assigned for every possible reachable object (otherwise the > bitmaps would be incomplete). > > Pack bitmaps are never written from 'git repack' unless repacking > all-into-one, and so we never write non-closed bitmaps (except in the > case of partial clones where we aren't guaranteed to have all objects). > > But multi-pack bitmaps change this, since it isn't known whether the > set of objects in the MIDX is closed under reachability until walking > them. Plumb through a bit that is set when a reachable object isn't > found. > > As soon as a reachable object isn't found in the set of objects to > include in the bitmap, bitmap_writer_build() knows that the set is not > closed, and so it now fails gracefully. Leaving aside your intended use here, I think it's nice to get rid of a deep-buried die() like this in general. The amount of error-plumbing you had to do is a little unpleasant, but I think is unavoidable. The only non-obvious part was this hunk: > @@ -463,8 +488,11 @@ void bitmap_writer_build(struct packing_data *to_pack) > struct commit *child; > int reused = 0; > > - fill_bitmap_commit(ent, commit, &queue, &tree_queue, > - old_bitmap, mapping); > + if (fill_bitmap_commit(ent, commit, &queue, &tree_queue, > + old_bitmap, mapping) < 0) { > + closed = 0; > + break; > + } > > if (ent->selected) { > store_selected(ent, commit); This is the right thing to do because we still want to free memory, stop progress, etc. I gave a look over what will run after breaking out of the loop, and compute_xor_offsets(), which you already handled, is the only thing we'd want to avoid running. Good. -Peff
On Wed, Jul 14, 2021 at 01:32:21PM -0400, Taylor Blau wrote: > > So if we're already paying for that extra space (which, on some > > platforms would already be a 64 bit int, and on some so would the > > uint32_t, it's just "at least 32 bits"). > > > > Wouldn't it be more idiomatic to just have find_object_pos() return > > int64_t now, if it's -1 it's an error, otherwise the "pos" is cast to > > uint32_t: > > I'm not sure. It does save the extra argument, which is arguably more > convenient for callers, but the cost for doing so is a cast from a > signed integer type to an unsigned one (and a narrower destination type, > at that). > > That seems easier to get wrong to me than passing a pointer to a pure > "int" and keeping the return type a uint32_t. So, I'm probably more > content to leave it as-is rather than change it. I agree that the separate "found" value makes things more obvious. Casting to a smaller size means explaining why that is OK, whereas it is quite clear that things are correct with two separate variables. And having an int64_t makes one wonder whether a value like 2^35 is possible (it isn't; this is a uint32_t because that is the limit of objects in the pack format). -Peff
On Wed, Jul 21, 2021 at 05:50:43AM -0400, Jeff King wrote: > The amount of error-plumbing you had to do is a little unpleasant, but I > think is unavoidable. The only non-obvious part was this hunk: Agreed, at least on the amount of plumbing required to get this to work ;). > > @@ -463,8 +488,11 @@ void bitmap_writer_build(struct packing_data *to_pack) > > struct commit *child; > > int reused = 0; > > > > - fill_bitmap_commit(ent, commit, &queue, &tree_queue, > > - old_bitmap, mapping); > > + if (fill_bitmap_commit(ent, commit, &queue, &tree_queue, > > + old_bitmap, mapping) < 0) { > > + closed = 0; > > + break; > > + } > > > > if (ent->selected) { > > store_selected(ent, commit); > > This is the right thing to do because we still want to free memory, stop > progress, etc. I gave a look over what will run after breaking out of > the loop, and compute_xor_offsets(), which you already handled, is the > only thing we'd want to avoid running. Good. Right. The key is that we return "closed ? 0 : -1" (of course, being careful to invert "closed" where "1" OK into a suitable return value for bitmap_writer_build, where "0" means OK, and a negative number means "error"). While I'm thinking about that inversion, we *could* call this variable "open" and set it to "0" until proven otherwise. Then the conditional becomes "if (!open)", but the return value is still "return open ? -1 : 0" (since I assume we'd want to use 0/1 values for "open" instead of -1, meaning we'd have to do some translation). Anyway, this is definitely an annoying detail that doesn't really matter (and just rambling on my part) ;). Thanks, Taylor
On Wed, Jul 21, 2021 at 01:20:47PM -0400, Taylor Blau wrote: > > > @@ -463,8 +488,11 @@ void bitmap_writer_build(struct packing_data *to_pack) > > > struct commit *child; > > > int reused = 0; > > > > > > - fill_bitmap_commit(ent, commit, &queue, &tree_queue, > > > - old_bitmap, mapping); > > > + if (fill_bitmap_commit(ent, commit, &queue, &tree_queue, > > > + old_bitmap, mapping) < 0) { > > > + closed = 0; > > > + break; > > > + } > > > > > > if (ent->selected) { > > > store_selected(ent, commit); > > > > This is the right thing to do because we still want to free memory, stop > > progress, etc. I gave a look over what will run after breaking out of > > the loop, and compute_xor_offsets(), which you already handled, is the > > only thing we'd want to avoid running. Good. > > Right. The key is that we return "closed ? 0 : -1" (of course, being > careful to invert "closed" where "1" OK into a suitable return value for > bitmap_writer_build, where "0" means OK, and a negative number means > "error"). > > While I'm thinking about that inversion, we *could* call this variable > "open" and set it to "0" until proven otherwise. Then the conditional > becomes "if (!open)", but the return value is still "return open ? -1 : > 0" (since I assume we'd want to use 0/1 values for "open" instead of -1, > meaning we'd have to do some translation). I thought about suggesting that it be called "err" or "ret" or something. And then we do not have to care that fill_bitmap_commit() only returns an error in the non-closed state. We are simply propagating its error-return back up the stack. And then you can just write: ret = fill_bitmap_commit(...); if (ret < 0) break; ... return ret; without an extra conversion. I don't care that much either way, though (but if you like it and are re-rolling anyway... :) ). -Peff
On Fri, Jul 23, 2021 at 03:37:52AM -0400, Jeff King wrote: > I thought about suggesting that it be called "err" or "ret" or > something. And then we do not have to care that fill_bitmap_commit() > only returns an error in the non-closed state. We are simply propagating > its error-return back up the stack. Hmm. For whatever the inconvience costs us, I do like that the variable can be named specifically like "open" or "closed" as opposed to the more generic "err" or "ret". So I'll probably keep it is unless you feel strongly (which I suspect you do not). Thanks, Taylor
On Mon, Jul 26, 2021 at 02:48:19PM -0400, Taylor Blau wrote: > On Fri, Jul 23, 2021 at 03:37:52AM -0400, Jeff King wrote: > > I thought about suggesting that it be called "err" or "ret" or > > something. And then we do not have to care that fill_bitmap_commit() > > only returns an error in the non-closed state. We are simply propagating > > its error-return back up the stack. > > Hmm. For whatever the inconvience costs us, I do like that the variable > can be named specifically like "open" or "closed" as opposed to the more > generic "err" or "ret". > > So I'll probably keep it is unless you feel strongly (which I suspect > you do not). I don't. -Peff
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index de00adbb9e..8a523624a1 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -1256,7 +1256,8 @@ static void write_pack_file(void) bitmap_writer_show_progress(progress); bitmap_writer_select_commits(indexed_commits, indexed_commits_nr, -1); - bitmap_writer_build(&to_pack); + if (bitmap_writer_build(&to_pack) < 0) + die(_("failed to write bitmap index")); bitmap_writer_finish(written_list, nr_written, tmpname.buf, write_bitmap_options); write_bitmap_index = 0; diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c index 88d9e696a5..d374f7884b 100644 --- a/pack-bitmap-write.c +++ b/pack-bitmap-write.c @@ -125,15 +125,20 @@ static inline void push_bitmapped_commit(struct commit *commit) writer.selected_nr++; } -static uint32_t find_object_pos(const struct object_id *oid) +static uint32_t find_object_pos(const struct object_id *oid, int *found) { struct object_entry *entry = packlist_find(writer.to_pack, oid); if (!entry) { - die("Failed to write bitmap index. Packfile doesn't have full closure " + if (found) + *found = 0; + warning("Failed to write bitmap index. Packfile doesn't have full closure " "(object %s is missing)", oid_to_hex(oid)); + return 0; } + if (found) + *found = 1; return oe_in_pack_pos(writer.to_pack, entry); } @@ -331,9 +336,10 @@ static void bitmap_builder_clear(struct bitmap_builder *bb) bb->commits_nr = bb->commits_alloc = 0; } -static void fill_bitmap_tree(struct bitmap *bitmap, - struct tree *tree) +static int fill_bitmap_tree(struct bitmap *bitmap, + struct tree *tree) { + int found; uint32_t pos; struct tree_desc desc; struct name_entry entry; @@ -342,9 +348,11 @@ static void fill_bitmap_tree(struct bitmap *bitmap, * If our bit is already set, then there is nothing to do. Both this * tree and all of its children will be set. */ - pos = find_object_pos(&tree->object.oid); + pos = find_object_pos(&tree->object.oid, &found); + if (!found) + return -1; if (bitmap_get(bitmap, pos)) - return; + return 0; bitmap_set(bitmap, pos); if (parse_tree(tree) < 0) @@ -355,11 +363,15 @@ static void fill_bitmap_tree(struct bitmap *bitmap, while (tree_entry(&desc, &entry)) { switch (object_type(entry.mode)) { case OBJ_TREE: - fill_bitmap_tree(bitmap, - lookup_tree(the_repository, &entry.oid)); + if (fill_bitmap_tree(bitmap, + lookup_tree(the_repository, &entry.oid)) < 0) + return -1; break; case OBJ_BLOB: - bitmap_set(bitmap, find_object_pos(&entry.oid)); + pos = find_object_pos(&entry.oid, &found); + if (!found) + return -1; + bitmap_set(bitmap, pos); break; default: /* Gitlink, etc; not reachable */ @@ -368,15 +380,18 @@ static void fill_bitmap_tree(struct bitmap *bitmap, } free_tree_buffer(tree); + return 0; } -static void fill_bitmap_commit(struct bb_commit *ent, - struct commit *commit, - struct prio_queue *queue, - struct prio_queue *tree_queue, - struct bitmap_index *old_bitmap, - const uint32_t *mapping) +static int fill_bitmap_commit(struct bb_commit *ent, + struct commit *commit, + struct prio_queue *queue, + struct prio_queue *tree_queue, + struct bitmap_index *old_bitmap, + const uint32_t *mapping) { + int found; + uint32_t pos; if (!ent->bitmap) ent->bitmap = bitmap_new(); @@ -401,11 +416,16 @@ static void fill_bitmap_commit(struct bb_commit *ent, * Mark ourselves and queue our tree. The commit * walk ensures we cover all parents. */ - bitmap_set(ent->bitmap, find_object_pos(&c->object.oid)); + pos = find_object_pos(&c->object.oid, &found); + if (!found) + return -1; + bitmap_set(ent->bitmap, pos); prio_queue_put(tree_queue, get_commit_tree(c)); for (p = c->parents; p; p = p->next) { - int pos = find_object_pos(&p->item->object.oid); + pos = find_object_pos(&p->item->object.oid, &found); + if (!found) + return -1; if (!bitmap_get(ent->bitmap, pos)) { bitmap_set(ent->bitmap, pos); prio_queue_put(queue, p->item); @@ -413,8 +433,12 @@ static void fill_bitmap_commit(struct bb_commit *ent, } } - while (tree_queue->nr) - fill_bitmap_tree(ent->bitmap, prio_queue_get(tree_queue)); + while (tree_queue->nr) { + if (fill_bitmap_tree(ent->bitmap, + prio_queue_get(tree_queue)) < 0) + return -1; + } + return 0; } static void store_selected(struct bb_commit *ent, struct commit *commit) @@ -432,7 +456,7 @@ static void store_selected(struct bb_commit *ent, struct commit *commit) kh_value(writer.bitmaps, hash_pos) = stored; } -void bitmap_writer_build(struct packing_data *to_pack) +int bitmap_writer_build(struct packing_data *to_pack) { struct bitmap_builder bb; size_t i; @@ -441,6 +465,7 @@ void bitmap_writer_build(struct packing_data *to_pack) struct prio_queue tree_queue = { NULL }; struct bitmap_index *old_bitmap; uint32_t *mapping; + int closed = 1; /* until proven otherwise */ writer.bitmaps = kh_init_oid_map(); writer.to_pack = to_pack; @@ -463,8 +488,11 @@ void bitmap_writer_build(struct packing_data *to_pack) struct commit *child; int reused = 0; - fill_bitmap_commit(ent, commit, &queue, &tree_queue, - old_bitmap, mapping); + if (fill_bitmap_commit(ent, commit, &queue, &tree_queue, + old_bitmap, mapping) < 0) { + closed = 0; + break; + } if (ent->selected) { store_selected(ent, commit); @@ -499,7 +527,9 @@ void bitmap_writer_build(struct packing_data *to_pack) stop_progress(&writer.progress); - compute_xor_offsets(); + if (closed) + compute_xor_offsets(); + return closed ? 0 : -1; } /** diff --git a/pack-bitmap.h b/pack-bitmap.h index 99d733eb26..020cd8d868 100644 --- a/pack-bitmap.h +++ b/pack-bitmap.h @@ -87,7 +87,7 @@ struct ewah_bitmap *bitmap_for_commit(struct bitmap_index *bitmap_git, struct commit *commit); void bitmap_writer_select_commits(struct commit **indexed_commits, unsigned int indexed_commits_nr, int max_bitmaps); -void bitmap_writer_build(struct packing_data *to_pack); +int bitmap_writer_build(struct packing_data *to_pack); void bitmap_writer_finish(struct pack_idx_entry **index, uint32_t index_nr, const char *filename, diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh index 584a039b85..1667450917 100755 --- a/t/t0410-partial-clone.sh +++ b/t/t0410-partial-clone.sh @@ -536,7 +536,13 @@ test_expect_success 'gc does not repack promisor objects if there are none' ' repack_and_check () { rm -rf repo2 && cp -r repo repo2 && - git -C repo2 repack $1 -d && + if test x"$1" = "x--must-fail" + then + shift + test_must_fail git -C repo2 repack $1 -d + else + git -C repo2 repack $1 -d + fi && git -C repo2 fsck && git -C repo2 cat-file -e $2 && @@ -561,6 +567,7 @@ test_expect_success 'repack -d does not irreversibly delete promisor objects' ' printf "$THREE\n" | pack_as_from_promisor && delete_object repo "$ONE" && + repack_and_check --must-fail -ab "$TWO" "$THREE" && repack_and_check -a "$TWO" "$THREE" && repack_and_check -A "$TWO" "$THREE" && repack_and_check -l "$TWO" "$THREE"
The set of objects covered by a bitmap must be closed under reachability, since it must be the case that there is a valid bit position assigned for every possible reachable object (otherwise the bitmaps would be incomplete). Pack bitmaps are never written from 'git repack' unless repacking all-into-one, and so we never write non-closed bitmaps (except in the case of partial clones where we aren't guaranteed to have all objects). But multi-pack bitmaps change this, since it isn't known whether the set of objects in the MIDX is closed under reachability until walking them. Plumb through a bit that is set when a reachable object isn't found. As soon as a reachable object isn't found in the set of objects to include in the bitmap, bitmap_writer_build() knows that the set is not closed, and so it now fails gracefully. A test is added in t0410 to trigger a bitmap write without full reachability closure by removing local copies of some reachable objects from a promisor remote. Signed-off-by: Taylor Blau <me@ttaylorr.com> --- builtin/pack-objects.c | 3 +- pack-bitmap-write.c | 76 ++++++++++++++++++++++++++++------------ pack-bitmap.h | 2 +- t/t0410-partial-clone.sh | 9 ++++- 4 files changed, 64 insertions(+), 26 deletions(-)