Message ID | ef9186a8df0d712c2ecccbe62cb43a7abadb9c96.1598320716.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | builtin/repack.c: invalidate MIDX only when necessary | expand |
On Mon, Aug 24, 2020 at 10:01:04PM -0400, Taylor Blau wrote: > In 525e18c04b (midx: clear midx on repack, 2018-07-12), 'git repack' > learned to remove a multi-pack-index file if it added or removed a pack > from the object store. > > This mechanism is a little over-eager, since it is only necessary to > drop a MIDX if 'git repack' removes a pack that the MIDX references. > Adding a pack outside of the MIDX does not require invalidating the > MIDX, and likewise for removing a pack the MIDX does not know about. Does "git repack" ever remove just one pack? Obviously "git repack -ad" or "git repack -Ad" is going to pack everything and delete the old packs. So I think we'd want to remove a midx there. And "git repack -d" I think of as deleting only loose objects that we just packed. But I guess it could also remove a pack that has now been made redundant? That seems like a rare case in practice, but I suppose is possible. Not exactly related to your fix, but kind of the flip side of it: would we ever need to retain a midx that mentions some packs that still exist? E.g., imagine we have a midx that points to packs A and B, and git-repack deletes B. By your logic above, we need to remove the midx because now it points to objects in B which aren't accessible. But by deleting it, could we be deleting the only thing that mentions the objects in A? I _think_ the answer is "no", because we never went all-in on midx and allowed deleting the matching .idx files for contained packs. So we'd still have that A.idx, and we could just use the pack as normal. But it's an interesting corner case if we ever do go in that direction. If you'll let me muse a bit more on midx-lifetime issues (which I've never really thought about before just now): I'm also a little curious how bad it is to have a midx whose pack has gone away. I guess we'd answer queries for "yes, we have this object" even if we don't, which is bad. Though in practice we'd only delete those packs if we have their objects elsewhere. And the pack code is pretty good about retrying other copies of objects that can't be accessed. Alternatively, I wonder if the midx-loading code ought to check that all of the constituent packs are available. In that line of thinking, do we even need to delete midx files if one of their packs goes away? The reading side probably ought to be able to handle that gracefully. And the more interesting case is when you repack everything with "-ad" or similar, at which point you shouldn't even need to look up what's in the midx to see if you deleted its packs. The point of your operation is to put it all-into-one, so you know the old midx should be discarded. > Teach 'git repack' to check for this by loading the MIDX, and checking > whether the to-be-removed pack is known to the MIDX. This requires a > slightly odd alternation to a test in t5319, which is explained with a > comment. My above musings aside, this seems like an obvious improvement. > diff --git a/builtin/repack.c b/builtin/repack.c > index 04c5ceaf7e..98fac03946 100644 > --- a/builtin/repack.c > +++ b/builtin/repack.c > @@ -133,7 +133,11 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list, > static void remove_redundant_pack(const char *dir_name, const char *base_name) > { > struct strbuf buf = STRBUF_INIT; > - strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name); > + struct multi_pack_index *m = get_multi_pack_index(the_repository); > + strbuf_addf(&buf, "%s.pack", base_name); > + if (m && midx_contains_pack(m, buf.buf)) > + clear_midx_file(the_repository); > + strbuf_insertf(&buf, 0, "%s/", dir_name); Makes sense. midx_contains_pack() is a binary search, so we'll spend O(n log n) effort deleting the packs (I wondered if this might be accidentally quadratic over the number of packs). And after we clear, "m" will be NULL, so we'll do it at most once. Which is why you can get rid of the manual "midx_cleared" flag from the preimage. So the patch looks good to me. -Peff
On Mon, Aug 24, 2020 at 10:26:14PM -0400, Jeff King wrote: > On Mon, Aug 24, 2020 at 10:01:04PM -0400, Taylor Blau wrote: > > > In 525e18c04b (midx: clear midx on repack, 2018-07-12), 'git repack' > > learned to remove a multi-pack-index file if it added or removed a pack > > from the object store. > > > > This mechanism is a little over-eager, since it is only necessary to > > drop a MIDX if 'git repack' removes a pack that the MIDX references. > > Adding a pack outside of the MIDX does not require invalidating the > > MIDX, and likewise for removing a pack the MIDX does not know about. > > Does "git repack" ever remove just one pack? Obviously "git repack -ad" > or "git repack -Ad" is going to pack everything and delete the old > packs. So I think we'd want to remove a midx there. > > And "git repack -d" I think of as deleting only loose objects that we > just packed. But I guess it could also remove a pack that has now been > made redundant? That seems like a rare case in practice, but I suppose > is possible. Yeah, the patch message makes this sound more likely than it actually is, which I agree is very rare. I often write 'git repack' instead of 'git pack-objects' to slurp up everything loose into a new pack without having to list loose objects by name. That's the case that I really care about here: purely adding a new pack should not invalidate the existing MIDX. > Not exactly related to your fix, but kind of the flip side of it: would > we ever need to retain a midx that mentions some packs that still exist? > > E.g., imagine we have a midx that points to packs A and B, and > git-repack deletes B. By your logic above, we need to remove the midx > because now it points to objects in B which aren't accessible. But by > deleting it, could we be deleting the only thing that mentions the > objects in A? > > I _think_ the answer is "no", because we never went all-in on midx and > allowed deleting the matching .idx files for contained packs. So we'd > still have that A.idx, and we could just use the pack as normal. But > it's an interesting corner case if we ever do go in that direction. Agreed. Maybe a (admittedly somewhat large) #leftoverbits. > If you'll let me muse a bit more on midx-lifetime issues (which I've > never really thought about before just now): > > I'm also a little curious how bad it is to have a midx whose pack has > gone away. I guess we'd answer queries for "yes, we have this object" > even if we don't, which is bad. Though in practice we'd only delete > those packs if we have their objects elsewhere. And the pack code is > pretty good about retrying other copies of objects that can't be > accessed. Alternatively, I wonder if the midx-loading code ought to > check that all of the constituent packs are available. > > In that line of thinking, do we even need to delete midx files if one of > their packs goes away? The reading side probably ought to be able to > handle that gracefully. I think that this is probably the right direction, although I've only spend time in the MIDX code over the past couple of weeks, so I can't say with authority. It seems like it would be pretty annoying, though. For example, code that cares about listing all objects in a MIDX would have to check first whether the pack they're in still exists before emitting them. On top of that, there are more corner cases when object X exists in more than one pack, but some strict subset of those packs containing X have gone away. I don't think that it couldn't be done, though. > And the more interesting case is when you repack everything with "-ad" > or similar, at which point you shouldn't even need to look up what's in > the midx to see if you deleted its packs. The point of your operation is > to put it all-into-one, so you know the old midx should be discarded. > > > Teach 'git repack' to check for this by loading the MIDX, and checking > > whether the to-be-removed pack is known to the MIDX. This requires a > > slightly odd alternation to a test in t5319, which is explained with a > > comment. > > My above musings aside, this seems like an obvious improvement. > > > diff --git a/builtin/repack.c b/builtin/repack.c > > index 04c5ceaf7e..98fac03946 100644 > > --- a/builtin/repack.c > > +++ b/builtin/repack.c > > @@ -133,7 +133,11 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list, > > static void remove_redundant_pack(const char *dir_name, const char *base_name) > > { > > struct strbuf buf = STRBUF_INIT; > > - strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name); > > + struct multi_pack_index *m = get_multi_pack_index(the_repository); > > + strbuf_addf(&buf, "%s.pack", base_name); > > + if (m && midx_contains_pack(m, buf.buf)) > > + clear_midx_file(the_repository); > > + strbuf_insertf(&buf, 0, "%s/", dir_name); > > Makes sense. midx_contains_pack() is a binary search, so we'll spend > O(n log n) effort deleting the packs (I wondered if this might be > accidentally quadratic over the number of packs). Right. The MIDX stores packs in lexographic order, so checking them is O(log n), which we do at most 'n' times. > And after we clear, "m" will be NULL, so we'll do it at most once. Which > is why you can get rid of the manual "midx_cleared" flag from the > preimage. Yep. I thought briefly about passing 'm' as a parameter, but then you have to worry about a dangling reference to 'the_repository->objects->multi_pack_index' after calling 'clear_midx_file()', so it's easier to look it up each time. > So the patch looks good to me. Thanks. > -Peff Thanks, Taylor
Hi Taylor, Thanks for working on this. > On Aug 25, 2020, at 04:01, Taylor Blau <me@ttaylorr.com> wrote: > > In 525e18c04b (midx: clear midx on repack, 2018-07-12), 'git repack' > learned to remove a multi-pack-index file if it added or removed a pack > from the object store. > > This mechanism is a little over-eager, since it is only necessary to > drop a MIDX if 'git repack' removes a pack that the MIDX references. > Adding a pack outside of the MIDX does not require invalidating the > MIDX, and likewise for removing a pack the MIDX does not know about. I wonder if its worth to trigger write_midx_file() to update the midx instead of just removing MIDX? That is already the direction we are taking in the 'maintenance' patch series whenever the multi-pack-index file was deemed invalid. Or perhaps, we can check for 'core.multiPackIndex' value (which recently is 'true' by default) and determine whether we should remove the MIDX or rewrite it? > > Teach 'git repack' to check for this by loading the MIDX, and checking > whether the to-be-removed pack is known to the MIDX. This requires a > slightly odd alternation to a test in t5319, which is explained with a > comment. > > Signed-off-by: Taylor Blau <me@ttaylorr.com> Cheers, Son Luong.
On 8/25/2020 3:55 AM, Son Luong Ngoc wrote: > Hi Taylor, > > Thanks for working on this. > >> On Aug 25, 2020, at 04:01, Taylor Blau <me@ttaylorr.com> wrote: >> >> In 525e18c04b (midx: clear midx on repack, 2018-07-12), 'git repack' >> learned to remove a multi-pack-index file if it added or removed a pack >> from the object store. >> >> This mechanism is a little over-eager, since it is only necessary to >> drop a MIDX if 'git repack' removes a pack that the MIDX references. >> Adding a pack outside of the MIDX does not require invalidating the >> MIDX, and likewise for removing a pack the MIDX does not know about. > > I wonder if its worth to trigger write_midx_file() to update the midx instead of > just removing MIDX? It would be reasonable to have 'git repack' run write_midx_file() after updating the pack-files, but it still needs to delete the existing multi-pack-index after deleting a referenced pack, as the current multi-pack-index will be incorrect, leading to possibly invalid data during the write. 'git multi-pack-index expire' carefully handles deleting pack-files that have become redundant. It may be possible to change the behavior of write_midx_file() to check for the state of each referenced pack-file, but it would then need to re-scan the pack-indexes to rebuild the set of objects and which pack-files contain them. > That is already the direction we are taking in the 'maintenance' patch series > whenever the multi-pack-index file was deemed invalid. The 'maintenance' builtin definitely adds new MIDX-aware maintenance tasks. By performing a repack using the multi-pack-index as the starting point, we can get around these issues. One thing I've noticed by talking with Taylor about this change is that 'git repack' is a bit like 'git gc' in that it does a LOT of things and it can be hard to understand exactly when certain sub-tasks are run or not. There are likely some interesting operations that could be separated into maintenance tasks. For example, 'git repack -d' is very similar to 'git maintenance run --task=loose-objects'. > Or perhaps, we can check for 'core.multiPackIndex' value (which recently is > 'true' by default) and determine whether we should remove the MIDX or rewrite > it? We currently rewrite it during 'git repack' when GIT_TEST_MULTI_PACK_INDEX=1 to maximize how often we have a multi-pack-index in the test suite. It would be simple to update that to also check a config option. I don't think adding that behavior to 'core.multiPackIndex' is a good direction, though. Thanks, -Stolee
On 8/24/2020 10:37 PM, Taylor Blau wrote: > On Mon, Aug 24, 2020 at 10:26:14PM -0400, Jeff King wrote: >> On Mon, Aug 24, 2020 at 10:01:04PM -0400, Taylor Blau wrote: >> >>> In 525e18c04b (midx: clear midx on repack, 2018-07-12), 'git repack' >>> learned to remove a multi-pack-index file if it added or removed a pack >>> from the object store. >>> >>> This mechanism is a little over-eager, since it is only necessary to >>> drop a MIDX if 'git repack' removes a pack that the MIDX references. >>> Adding a pack outside of the MIDX does not require invalidating the >>> MIDX, and likewise for removing a pack the MIDX does not know about. >> >> Does "git repack" ever remove just one pack? Obviously "git repack -ad" >> or "git repack -Ad" is going to pack everything and delete the old >> packs. So I think we'd want to remove a midx there. >> >> And "git repack -d" I think of as deleting only loose objects that we >> just packed. But I guess it could also remove a pack that has now been >> made redundant? That seems like a rare case in practice, but I suppose >> is possible. > > Yeah, the patch message makes this sound more likely than it actually > is, which I agree is very rare. I often write 'git repack' instead of > 'git pack-objects' to slurp up everything loose into a new pack without > having to list loose objects by name. > > That's the case that I really care about here: purely adding a new pack > should not invalidate the existing MIDX. > >> Not exactly related to your fix, but kind of the flip side of it: would >> we ever need to retain a midx that mentions some packs that still exist? >> >> E.g., imagine we have a midx that points to packs A and B, and >> git-repack deletes B. By your logic above, we need to remove the midx >> because now it points to objects in B which aren't accessible. But by >> deleting it, could we be deleting the only thing that mentions the >> objects in A? >> >> I _think_ the answer is "no", because we never went all-in on midx and >> allowed deleting the matching .idx files for contained packs. So we'd >> still have that A.idx, and we could just use the pack as normal. But >> it's an interesting corner case if we ever do go in that direction. > > Agreed. Maybe a (admittedly somewhat large) #leftoverbits. > >> If you'll let me muse a bit more on midx-lifetime issues (which I've >> never really thought about before just now): >> >> I'm also a little curious how bad it is to have a midx whose pack has >> gone away. I guess we'd answer queries for "yes, we have this object" >> even if we don't, which is bad. Though in practice we'd only delete >> those packs if we have their objects elsewhere. And the pack code is >> pretty good about retrying other copies of objects that can't be >> accessed. Alternatively, I wonder if the midx-loading code ought to >> check that all of the constituent packs are available. >> >> In that line of thinking, do we even need to delete midx files if one of >> their packs goes away? The reading side probably ought to be able to >> handle that gracefully. > > I think that this is probably the right direction, although I've only > spend time in the MIDX code over the past couple of weeks, so I can't > say with authority. It seems like it would be pretty annoying, though. > For example, code that cares about listing all objects in a MIDX would > have to check first whether the pack they're in still exists before > emitting them. On top of that, there are more corner cases when object X > exists in more than one pack, but some strict subset of those packs > containing X have gone away. > > I don't think that it couldn't be done, though. > >> And the more interesting case is when you repack everything with "-ad" >> or similar, at which point you shouldn't even need to look up what's in >> the midx to see if you deleted its packs. The point of your operation is >> to put it all-into-one, so you know the old midx should be discarded. >> >>> Teach 'git repack' to check for this by loading the MIDX, and checking >>> whether the to-be-removed pack is known to the MIDX. This requires a >>> slightly odd alternation to a test in t5319, which is explained with a >>> comment. >> >> My above musings aside, this seems like an obvious improvement. >> >>> diff --git a/builtin/repack.c b/builtin/repack.c >>> index 04c5ceaf7e..98fac03946 100644 >>> --- a/builtin/repack.c >>> +++ b/builtin/repack.c >>> @@ -133,7 +133,11 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list, >>> static void remove_redundant_pack(const char *dir_name, const char *base_name) >>> { >>> struct strbuf buf = STRBUF_INIT; >>> - strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name); >>> + struct multi_pack_index *m = get_multi_pack_index(the_repository); >>> + strbuf_addf(&buf, "%s.pack", base_name); >>> + if (m && midx_contains_pack(m, buf.buf)) >>> + clear_midx_file(the_repository); >>> + strbuf_insertf(&buf, 0, "%s/", dir_name); >> >> Makes sense. midx_contains_pack() is a binary search, so we'll spend >> O(n log n) effort deleting the packs (I wondered if this might be >> accidentally quadratic over the number of packs). > > Right. The MIDX stores packs in lexographic order, so checking them is > O(log n), which we do at most 'n' times. > >> And after we clear, "m" will be NULL, so we'll do it at most once. Which >> is why you can get rid of the manual "midx_cleared" flag from the >> preimage. > > Yep. I thought briefly about passing 'm' as a parameter, but then you > have to worry about a dangling reference to > 'the_repository->objects->multi_pack_index' after calling > 'clear_midx_file()', so it's easier to look it up each time. The discussion in this thread matches my understanding of the situation. >> So the patch looks good to me. The code in builtin/repack.c looks good for sure. I have a quick question about this new test: +test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' ' + git multi-pack-index write && + cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak && + test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" && + + # Write a new pack that is unknown to the multi-pack-index. + git hash-object -w </dev/null >blob && + git pack-objects $objdir/pack/pack <blob && + + GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d && + test_cmp_bin $objdir/pack/multi-pack-index \ + $objdir/pack/multi-pack-index.bak +' + You create an arbitrary blob, and then add it to a pack-file. Do we know that 'git repack' is definitely creating a new pack-file that makes our manually-created pack-file redundant? My suggestion is to have the test check itself: +test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' ' + git multi-pack-index write && + cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak && + test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" && + + # Write a new pack that is unknown to the multi-pack-index. + git hash-object -w </dev/null >blob && + HASH=$(git pack-objects $objdir/pack/pack <blob) && + + GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d && + test_cmp_bin $objdir/pack/multi-pack-index \ + $objdir/pack/multi-pack-index.bak && + test_path_is_missing $objdir/pack/pack-$HASH.pack +' + This test fails for me, on the 'test_path_is_missing'. Likely, the blob is seen as already in a pack-file so is just pruned by 'git repack' instead. I thought that perhaps we need to add a new pack ourselves that overrides the small pack. Here is my attempt: test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' ' git multi-pack-index write && cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak && test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" && # Write a new pack that is unknown to the multi-pack-index. BLOB1=$(echo blob1 | git hash-object -w --stdin) && BLOB2=$(echo blob2 | git hash-object -w --stdin) && cat >blobs <<-EOF && $BLOB1 $BLOB2 EOF HASH1=$(echo $BLOB1 | git pack-objects $objdir/pack/pack) && HASH2=$(git pack-objects $objdir/pack/pack <blobs) && GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d && test_cmp_bin $objdir/pack/multi-pack-index \ $objdir/pack/multi-pack-index.bak && test_path_is_file $objdir/pack/pack-$HASH2.pack && test_path_is_missing $objdir/pack/pack-$HASH1.pack ' However, this _still_ fails on the "test_path_is_missing" line, so I'm not sure how to make sure your logic is tested. I saw that 'git repack' was writing "nothing new to pack" in the output, so I also tested adding a few commits and trying to force it to repack reachable data, but I cannot seem to trigger it to create a new pack that overrides only one pack that is not in the MIDX. Likely, I just don't know how 'git rebase' works well enough to trigger this behavior. But the test as-is is not testing what you want it to test. Thanks, -Stolee
On Tue, Aug 25, 2020 at 09:14:19AM -0400, Derrick Stolee wrote: > On 8/24/2020 10:37 PM, Taylor Blau wrote: > > On Mon, Aug 24, 2020 at 10:26:14PM -0400, Jeff King wrote: > >> On Mon, Aug 24, 2020 at 10:01:04PM -0400, Taylor Blau wrote: > >> > >>> In 525e18c04b (midx: clear midx on repack, 2018-07-12), 'git repack' > >>> learned to remove a multi-pack-index file if it added or removed a pack > >>> from the object store. > >>> > >>> This mechanism is a little over-eager, since it is only necessary to > >>> drop a MIDX if 'git repack' removes a pack that the MIDX references. > >>> Adding a pack outside of the MIDX does not require invalidating the > >>> MIDX, and likewise for removing a pack the MIDX does not know about. > >> > >> Does "git repack" ever remove just one pack? Obviously "git repack -ad" > >> or "git repack -Ad" is going to pack everything and delete the old > >> packs. So I think we'd want to remove a midx there. > >> > >> And "git repack -d" I think of as deleting only loose objects that we > >> just packed. But I guess it could also remove a pack that has now been > >> made redundant? That seems like a rare case in practice, but I suppose > >> is possible. > > > > Yeah, the patch message makes this sound more likely than it actually > > is, which I agree is very rare. I often write 'git repack' instead of > > 'git pack-objects' to slurp up everything loose into a new pack without > > having to list loose objects by name. > > > > That's the case that I really care about here: purely adding a new pack > > should not invalidate the existing MIDX. > > > >> Not exactly related to your fix, but kind of the flip side of it: would > >> we ever need to retain a midx that mentions some packs that still exist? > >> > >> E.g., imagine we have a midx that points to packs A and B, and > >> git-repack deletes B. By your logic above, we need to remove the midx > >> because now it points to objects in B which aren't accessible. But by > >> deleting it, could we be deleting the only thing that mentions the > >> objects in A? > >> > >> I _think_ the answer is "no", because we never went all-in on midx and > >> allowed deleting the matching .idx files for contained packs. So we'd > >> still have that A.idx, and we could just use the pack as normal. But > >> it's an interesting corner case if we ever do go in that direction. > > > > Agreed. Maybe a (admittedly somewhat large) #leftoverbits. > > > >> If you'll let me muse a bit more on midx-lifetime issues (which I've > >> never really thought about before just now): > >> > >> I'm also a little curious how bad it is to have a midx whose pack has > >> gone away. I guess we'd answer queries for "yes, we have this object" > >> even if we don't, which is bad. Though in practice we'd only delete > >> those packs if we have their objects elsewhere. And the pack code is > >> pretty good about retrying other copies of objects that can't be > >> accessed. Alternatively, I wonder if the midx-loading code ought to > >> check that all of the constituent packs are available. > >> > >> In that line of thinking, do we even need to delete midx files if one of > >> their packs goes away? The reading side probably ought to be able to > >> handle that gracefully. > > > > I think that this is probably the right direction, although I've only > > spend time in the MIDX code over the past couple of weeks, so I can't > > say with authority. It seems like it would be pretty annoying, though. > > For example, code that cares about listing all objects in a MIDX would > > have to check first whether the pack they're in still exists before > > emitting them. On top of that, there are more corner cases when object X > > exists in more than one pack, but some strict subset of those packs > > containing X have gone away. > > > > I don't think that it couldn't be done, though. > > > >> And the more interesting case is when you repack everything with "-ad" > >> or similar, at which point you shouldn't even need to look up what's in > >> the midx to see if you deleted its packs. The point of your operation is > >> to put it all-into-one, so you know the old midx should be discarded. > >> > >>> Teach 'git repack' to check for this by loading the MIDX, and checking > >>> whether the to-be-removed pack is known to the MIDX. This requires a > >>> slightly odd alternation to a test in t5319, which is explained with a > >>> comment. > >> > >> My above musings aside, this seems like an obvious improvement. > >> > >>> diff --git a/builtin/repack.c b/builtin/repack.c > >>> index 04c5ceaf7e..98fac03946 100644 > >>> --- a/builtin/repack.c > >>> +++ b/builtin/repack.c > >>> @@ -133,7 +133,11 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list, > >>> static void remove_redundant_pack(const char *dir_name, const char *base_name) > >>> { > >>> struct strbuf buf = STRBUF_INIT; > >>> - strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name); > >>> + struct multi_pack_index *m = get_multi_pack_index(the_repository); > >>> + strbuf_addf(&buf, "%s.pack", base_name); > >>> + if (m && midx_contains_pack(m, buf.buf)) > >>> + clear_midx_file(the_repository); > >>> + strbuf_insertf(&buf, 0, "%s/", dir_name); > >> > >> Makes sense. midx_contains_pack() is a binary search, so we'll spend > >> O(n log n) effort deleting the packs (I wondered if this might be > >> accidentally quadratic over the number of packs). > > > > Right. The MIDX stores packs in lexographic order, so checking them is > > O(log n), which we do at most 'n' times. > > > >> And after we clear, "m" will be NULL, so we'll do it at most once. Which > >> is why you can get rid of the manual "midx_cleared" flag from the > >> preimage. > > > > Yep. I thought briefly about passing 'm' as a parameter, but then you > > have to worry about a dangling reference to > > 'the_repository->objects->multi_pack_index' after calling > > 'clear_midx_file()', so it's easier to look it up each time. > > The discussion in this thread matches my understanding of the > situation. > > >> So the patch looks good to me. > > The code in builtin/repack.c looks good for sure. I have a quick question > about this new test: > > +test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' ' > + git multi-pack-index write && > + cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak && > + test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" && > + > + # Write a new pack that is unknown to the multi-pack-index. > + git hash-object -w </dev/null >blob && > + git pack-objects $objdir/pack/pack <blob && > + > + GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d && > + test_cmp_bin $objdir/pack/multi-pack-index \ > + $objdir/pack/multi-pack-index.bak > +' > + > > You create an arbitrary blob, and then add it to a pack-file. Do we > know that 'git repack' is definitely creating a new pack-file that makes > our manually-created pack-file redundant? > > My suggestion is to have the test check itself: > > +test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' ' > + git multi-pack-index write && > + cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak && > + test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" && > + > + # Write a new pack that is unknown to the multi-pack-index. > + git hash-object -w </dev/null >blob && > + HASH=$(git pack-objects $objdir/pack/pack <blob) && > + > + GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d && > + test_cmp_bin $objdir/pack/multi-pack-index \ > + $objdir/pack/multi-pack-index.bak && > + test_path_is_missing $objdir/pack/pack-$HASH.pack > +' > + > > This test fails for me, on the 'test_path_is_missing'. Likely, the > blob is seen as already in a pack-file so is just pruned by 'git repack' > instead. I thought that perhaps we need to add a new pack ourselves that > overrides the small pack. Here is my attempt: > > test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' ' > git multi-pack-index write && > cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak && > test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" && > > # Write a new pack that is unknown to the multi-pack-index. > BLOB1=$(echo blob1 | git hash-object -w --stdin) && > BLOB2=$(echo blob2 | git hash-object -w --stdin) && > cat >blobs <<-EOF && > $BLOB1 > $BLOB2 > EOF > HASH1=$(echo $BLOB1 | git pack-objects $objdir/pack/pack) && > HASH2=$(git pack-objects $objdir/pack/pack <blobs) && > GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d && > test_cmp_bin $objdir/pack/multi-pack-index \ > $objdir/pack/multi-pack-index.bak && > test_path_is_file $objdir/pack/pack-$HASH2.pack && > test_path_is_missing $objdir/pack/pack-$HASH1.pack > ' > > However, this _still_ fails on the "test_path_is_missing" line, so I'm not sure > how to make sure your logic is tested. I saw that 'git repack' was writing > "nothing new to pack" in the output, so I also tested adding a few commits and > trying to force it to repack reachable data, but I cannot seem to trigger it > to create a new pack that overrides only one pack that is not in the MIDX. > > Likely, I just don't know how 'git rebase' works well enough to trigger this > behavior. But the test as-is is not testing what you want it to test. I think this case might actually be impossible to tickle in a test. I thought that 'git repack -d' looked for existing packs whose objects are a subset of some new pack generated. But, it's much simpler than that: '-d' by itself just looks for packs that were already on disk with the same SHA-1 as a new pack, and it removes the old one. Note that 'git repack' uses 'git pack-objects' internally to find objects and generate a packfile. When calling 'git pack-objects', 'git repack -d' passes '--all' and '--unpacked', which means that there is no way we'd generate a new pack with the same SHA-1 as an existing pack ordinarily. So, I think this case is impossible, or at least astronomically unlikely. What is more interesting (and untested) is that adding a _new_ pack doesn't cause us to invalidate the MIDX. Here's a patch that does that: diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index 16a1ad040e..620f2058d6 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -391,18 +391,27 @@ test_expect_success 'repack removes multi-pack-index when deleting packs' ' test_path_is_missing $objdir/pack/multi-pack-index ' -test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' ' - git multi-pack-index write && - cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak && - test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" && - - # Write a new pack that is unknown to the multi-pack-index. - git hash-object -w </dev/null >blob && - git pack-objects $objdir/pack/pack <blob && - - GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d && - test_cmp_bin $objdir/pack/multi-pack-index \ - $objdir/pack/multi-pack-index.bak +test_expect_success 'repack preserves multi-pack-index when creating packs' ' + git init preserve && + test_when_finished "rm -fr preserve" && + ( + cd preserve && + midx=.git/objects/pack/multi-pack-index && + + test_commit "initial" && + git repack -ad && + git multi-pack-index write && + ls .git/objects/pack | grep "\.pack$" >before && + + cp $midx $midx.bak && + + test_commit "another" && + GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d && + ls .git/objects/pack | grep "\.pack$" >after && + + test_cmp_bin $midx.bak $midx && + ! test_cmp before after + ) ' compare_results_with_midx "after repack" What do you think about applying this on top and then calling it a day? > Thanks, > -Stolee Thanks, Taylor
On Tue, Aug 25, 2020 at 09:55:22AM +0200, Son Luong Ngoc wrote: > Hi Taylor, > > Thanks for working on this. > > > On Aug 25, 2020, at 04:01, Taylor Blau <me@ttaylorr.com> wrote: > > > > In 525e18c04b (midx: clear midx on repack, 2018-07-12), 'git repack' > > learned to remove a multi-pack-index file if it added or removed a pack > > from the object store. > > > > This mechanism is a little over-eager, since it is only necessary to > > drop a MIDX if 'git repack' removes a pack that the MIDX references. > > Adding a pack outside of the MIDX does not require invalidating the > > MIDX, and likewise for removing a pack the MIDX does not know about. > > I wonder if its worth to trigger write_midx_file() to update the midx instead of > just removing MIDX? There's no reason that we couldn't do this, but I don't think that it's a very good idea, especially if the new 'git maintenance' command will be able to do something like this by itself. I'm hesitant to add yet another option to 'git repack', which I have always thought as a plumbing tool. That's important because callers (like 'git maintenance' or user scripts) can 'git multi-pack-index write ...' after their 'git repack' to generate a new MIDX if they want one. This becomes even trickier if 'git multi-pack-index write' were to gain new options, at which point 'git repack' would *also* have to learn those options to plumb them through. Maybe there's a legitimate case that I'm overlooking, but I don't think so. If there is, I'd be happy to come back to this, but this patch is really just focused on fixing a bug. > That is already the direction we are taking in the 'maintenance' patch series > whenever the multi-pack-index file was deemed invalid. > > Or perhaps, we can check for 'core.multiPackIndex' value (which recently is > 'true' by default) and determine whether we should remove the MIDX or rewrite > it? > > > > > Teach 'git repack' to check for this by loading the MIDX, and checking > > whether the to-be-removed pack is known to the MIDX. This requires a > > slightly odd alternation to a test in t5319, which is explained with a > > comment. > > > > Signed-off-by: Taylor Blau <me@ttaylorr.com> Thanks for your feedback. > Cheers, > Son Luong. Thanks, Taylor
On 8/25/2020 10:41 AM, Taylor Blau wrote: > On Tue, Aug 25, 2020 at 09:14:19AM -0400, Derrick Stolee wrote: >> The code in builtin/repack.c looks good for sure. I have a quick question >> about this new test: >> >> +test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' ' >> + git multi-pack-index write && >> + cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak && >> + test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" && >> + >> + # Write a new pack that is unknown to the multi-pack-index. >> + git hash-object -w </dev/null >blob && >> + git pack-objects $objdir/pack/pack <blob && >> + >> + GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d && >> + test_cmp_bin $objdir/pack/multi-pack-index \ >> + $objdir/pack/multi-pack-index.bak >> +' >> + >> >> You create an arbitrary blob, and then add it to a pack-file. Do we >> know that 'git repack' is definitely creating a new pack-file that makes >> our manually-created pack-file redundant? >> >> My suggestion is to have the test check itself: >> >> +test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' ' >> + git multi-pack-index write && >> + cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak && >> + test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" && >> + >> + # Write a new pack that is unknown to the multi-pack-index. >> + git hash-object -w </dev/null >blob && >> + HASH=$(git pack-objects $objdir/pack/pack <blob) && >> + >> + GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d && >> + test_cmp_bin $objdir/pack/multi-pack-index \ >> + $objdir/pack/multi-pack-index.bak && >> + test_path_is_missing $objdir/pack/pack-$HASH.pack >> +' >> + >> >> This test fails for me, on the 'test_path_is_missing'. Likely, the >> blob is seen as already in a pack-file so is just pruned by 'git repack' >> instead. I thought that perhaps we need to add a new pack ourselves that >> overrides the small pack. Here is my attempt: >> >> test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' ' >> git multi-pack-index write && >> cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak && >> test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" && >> >> # Write a new pack that is unknown to the multi-pack-index. >> BLOB1=$(echo blob1 | git hash-object -w --stdin) && >> BLOB2=$(echo blob2 | git hash-object -w --stdin) && >> cat >blobs <<-EOF && >> $BLOB1 >> $BLOB2 >> EOF >> HASH1=$(echo $BLOB1 | git pack-objects $objdir/pack/pack) && >> HASH2=$(git pack-objects $objdir/pack/pack <blobs) && >> GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d && >> test_cmp_bin $objdir/pack/multi-pack-index \ >> $objdir/pack/multi-pack-index.bak && >> test_path_is_file $objdir/pack/pack-$HASH2.pack && >> test_path_is_missing $objdir/pack/pack-$HASH1.pack >> ' >> >> However, this _still_ fails on the "test_path_is_missing" line, so I'm not sure >> how to make sure your logic is tested. I saw that 'git repack' was writing >> "nothing new to pack" in the output, so I also tested adding a few commits and >> trying to force it to repack reachable data, but I cannot seem to trigger it >> to create a new pack that overrides only one pack that is not in the MIDX. >> >> Likely, I just don't know how 'git rebase' works well enough to trigger this >> behavior. But the test as-is is not testing what you want it to test. > > I think this case might actually be impossible to tickle in a test. I > thought that 'git repack -d' looked for existing packs whose objects are > a subset of some new pack generated. But, it's much simpler than that: > '-d' by itself just looks for packs that were already on disk with the > same SHA-1 as a new pack, and it removes the old one. If 'git repack' never calls remove_redundant_pack() unless we are doing a "full repack", then we could simplify this logic: static void remove_redundant_pack(const char *dir_name, const char *base_name) { struct strbuf buf = STRBUF_INIT; - strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name); + struct multi_pack_index *m = get_multi_pack_index(the_repository); + strbuf_addf(&buf, "%s.pack", base_name); + if (m && midx_contains_pack(m, buf.buf)) + clear_midx_file(the_repository); + strbuf_insertf(&buf, 0, "%s/", dir_name); unlink_pack_path(buf.buf, 1); strbuf_release(&buf); } to static void remove_redundant_pack(const char *dir_name, const char *base_name) { struct strbuf buf = STRBUF_INIT; strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name); + clear_midx_file(the_repository); unlink_pack_path(buf.buf, 1); strbuf_release(&buf); } and get the same results as we are showing in these tests. This does move us incrementally to a better situation: don't delete the MIDX if we aren't deleting pack files. But, I think we can get around it. > Note that 'git repack' uses 'git pack-objects' internally to find > objects and generate a packfile. When calling 'git pack-objects', 'git > repack -d' passes '--all' and '--unpacked', which means that there is no > way we'd generate a new pack with the same SHA-1 as an existing pack > ordinarily. > > So, I think this case is impossible, or at least astronomically > unlikely. What is more interesting (and untested) is that adding a _new_ > pack doesn't cause us to invalidate the MIDX. Here's a patch that does > that: > > diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh > index 16a1ad040e..620f2058d6 100755 > --- a/t/t5319-multi-pack-index.sh > +++ b/t/t5319-multi-pack-index.sh > @@ -391,18 +391,27 @@ test_expect_success 'repack removes multi-pack-index when deleting packs' ' > test_path_is_missing $objdir/pack/multi-pack-index > ' > > -test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' ' > - git multi-pack-index write && > - cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak && > - test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" && > - > - # Write a new pack that is unknown to the multi-pack-index. > - git hash-object -w </dev/null >blob && > - git pack-objects $objdir/pack/pack <blob && > - > - GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d && > - test_cmp_bin $objdir/pack/multi-pack-index \ > - $objdir/pack/multi-pack-index.bak > +test_expect_success 'repack preserves multi-pack-index when creating packs' ' > + git init preserve && > + test_when_finished "rm -fr preserve" && > + ( > + cd preserve && > + midx=.git/objects/pack/multi-pack-index && > + > + test_commit "initial" && > + git repack -ad && > + git multi-pack-index write && > + ls .git/objects/pack | grep "\.pack$" >before && > + > + cp $midx $midx.bak && > + > + test_commit "another" && > + GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d && > + ls .git/objects/pack | grep "\.pack$" >after && > + > + test_cmp_bin $midx.bak $midx && > + ! test_cmp before after > + ) > ' After looking at the callers to remote_redundant_pack() I noticed that it is only called after inspecting the "names" struct, which contains the names of the packs to group into a new pack-file. We can use a .keep file to preserve the pack-file(s) in the MIDX but also to ensure multiple pack-files outside of the MIDX are repacked and deleted. While this is very unlikely in the wild, it is definitely possible. test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' ' git init preserve && test_when_finished "rm -fr preserve" && ( cd preserve && midx=.git/objects/pack/multi-pack-index && test_commit 1 && HASH1=$(git pack-objects --all .git/objects/pack/pack) && touch .git/objects/pack/pack-$HASH1.keep && cat >pack-input <<-\EOF && HEAD ^HEAD~1 EOF test_commit 2 && HASH2=$(git pack-objects --revs .git/objects/pack/pack <pack-input) && touch .git/objects/pack/pack-$HASH2.keep && git multi-pack-index write && cp $midx $midx.bak && test_commit 3 && HASH3=$(git pack-objects --revs .git/objects/pack/pack <pack-input) && test_commit 4 && HASH4=$(git pack-objects --revs .git/objects/pack/pack <pack-input) && GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -ad && test_path_is_file .git/objects/pack/pack-$HASH1.pack && test_path_is_file .git/objects/pack/pack-$HASH2.pack && test_path_is_missing .git/objects/pack/pack-$HASH3.pack && test_path_is_missing .git/objects/pack/pack-$HASH4.pack ) ' I believe this checks your condition properly enough. Thanks, -Stolee
On Tue, Aug 25, 2020 at 11:14:41AM -0400, Derrick Stolee wrote: > On 8/25/2020 10:41 AM, Taylor Blau wrote: > > On Tue, Aug 25, 2020 at 09:14:19AM -0400, Derrick Stolee wrote: > >> The code in builtin/repack.c looks good for sure. I have a quick question > >> about this new test: > >> > >> +test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' ' > >> + git multi-pack-index write && > >> + cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak && > >> + test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" && > >> + > >> + # Write a new pack that is unknown to the multi-pack-index. > >> + git hash-object -w </dev/null >blob && > >> + git pack-objects $objdir/pack/pack <blob && > >> + > >> + GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d && > >> + test_cmp_bin $objdir/pack/multi-pack-index \ > >> + $objdir/pack/multi-pack-index.bak > >> +' > >> + > >> > >> You create an arbitrary blob, and then add it to a pack-file. Do we > >> know that 'git repack' is definitely creating a new pack-file that makes > >> our manually-created pack-file redundant? > >> > >> My suggestion is to have the test check itself: > >> > >> +test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' ' > >> + git multi-pack-index write && > >> + cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak && > >> + test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" && > >> + > >> + # Write a new pack that is unknown to the multi-pack-index. > >> + git hash-object -w </dev/null >blob && > >> + HASH=$(git pack-objects $objdir/pack/pack <blob) && > >> + > >> + GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d && > >> + test_cmp_bin $objdir/pack/multi-pack-index \ > >> + $objdir/pack/multi-pack-index.bak && > >> + test_path_is_missing $objdir/pack/pack-$HASH.pack > >> +' > >> + > >> > >> This test fails for me, on the 'test_path_is_missing'. Likely, the > >> blob is seen as already in a pack-file so is just pruned by 'git repack' > >> instead. I thought that perhaps we need to add a new pack ourselves that > >> overrides the small pack. Here is my attempt: > >> > >> test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' ' > >> git multi-pack-index write && > >> cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak && > >> test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" && > >> > >> # Write a new pack that is unknown to the multi-pack-index. > >> BLOB1=$(echo blob1 | git hash-object -w --stdin) && > >> BLOB2=$(echo blob2 | git hash-object -w --stdin) && > >> cat >blobs <<-EOF && > >> $BLOB1 > >> $BLOB2 > >> EOF > >> HASH1=$(echo $BLOB1 | git pack-objects $objdir/pack/pack) && > >> HASH2=$(git pack-objects $objdir/pack/pack <blobs) && > >> GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d && > >> test_cmp_bin $objdir/pack/multi-pack-index \ > >> $objdir/pack/multi-pack-index.bak && > >> test_path_is_file $objdir/pack/pack-$HASH2.pack && > >> test_path_is_missing $objdir/pack/pack-$HASH1.pack > >> ' > >> > >> However, this _still_ fails on the "test_path_is_missing" line, so I'm not sure > >> how to make sure your logic is tested. I saw that 'git repack' was writing > >> "nothing new to pack" in the output, so I also tested adding a few commits and > >> trying to force it to repack reachable data, but I cannot seem to trigger it > >> to create a new pack that overrides only one pack that is not in the MIDX. > >> > >> Likely, I just don't know how 'git rebase' works well enough to trigger this > >> behavior. But the test as-is is not testing what you want it to test. > > > > I think this case might actually be impossible to tickle in a test. I > > thought that 'git repack -d' looked for existing packs whose objects are > > a subset of some new pack generated. But, it's much simpler than that: > > '-d' by itself just looks for packs that were already on disk with the > > same SHA-1 as a new pack, and it removes the old one. > > If 'git repack' never calls remove_redundant_pack() unless we are doing > a "full repack", then we could simplify this logic: > > static void remove_redundant_pack(const char *dir_name, const char *base_name) > { > struct strbuf buf = STRBUF_INIT; > - strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name); > + struct multi_pack_index *m = get_multi_pack_index(the_repository); > + strbuf_addf(&buf, "%s.pack", base_name); > + if (m && midx_contains_pack(m, buf.buf)) > + clear_midx_file(the_repository); > + strbuf_insertf(&buf, 0, "%s/", dir_name); > unlink_pack_path(buf.buf, 1); > strbuf_release(&buf); > } > > to > > static void remove_redundant_pack(const char *dir_name, const char *base_name) > { > struct strbuf buf = STRBUF_INIT; > strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name); > + clear_midx_file(the_repository); > unlink_pack_path(buf.buf, 1); > strbuf_release(&buf); > } > > and get the same results as we are showing in these tests. This does > move us incrementally to a better situation: don't delete the MIDX > if we aren't deleting pack files. But, I think we can get around it. Makes sense, but reading your whole email we are better off leaving this as-is and changing the test to exercise it more often. > > Note that 'git repack' uses 'git pack-objects' internally to find > > objects and generate a packfile. When calling 'git pack-objects', 'git > > repack -d' passes '--all' and '--unpacked', which means that there is no > > way we'd generate a new pack with the same SHA-1 as an existing pack > > ordinarily. > > > > So, I think this case is impossible, or at least astronomically > > unlikely. What is more interesting (and untested) is that adding a _new_ > > pack doesn't cause us to invalidate the MIDX. Here's a patch that does > > that: > > > > diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh > > index 16a1ad040e..620f2058d6 100755 > > --- a/t/t5319-multi-pack-index.sh > > +++ b/t/t5319-multi-pack-index.sh > > @@ -391,18 +391,27 @@ test_expect_success 'repack removes multi-pack-index when deleting packs' ' > > test_path_is_missing $objdir/pack/multi-pack-index > > ' > > > > -test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' ' > > - git multi-pack-index write && > > - cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak && > > - test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" && > > - > > - # Write a new pack that is unknown to the multi-pack-index. > > - git hash-object -w </dev/null >blob && > > - git pack-objects $objdir/pack/pack <blob && > > - > > - GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d && > > - test_cmp_bin $objdir/pack/multi-pack-index \ > > - $objdir/pack/multi-pack-index.bak > > +test_expect_success 'repack preserves multi-pack-index when creating packs' ' > > + git init preserve && > > + test_when_finished "rm -fr preserve" && > > + ( > > + cd preserve && > > + midx=.git/objects/pack/multi-pack-index && > > + > > + test_commit "initial" && > > + git repack -ad && > > + git multi-pack-index write && > > + ls .git/objects/pack | grep "\.pack$" >before && > > + > > + cp $midx $midx.bak && > > + > > + test_commit "another" && > > + GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d && > > + ls .git/objects/pack | grep "\.pack$" >after && > > + > > + test_cmp_bin $midx.bak $midx && > > + ! test_cmp before after > > + ) > > ' > > After looking at the callers to remote_redundant_pack() I noticed that it is only > called after inspecting the "names" struct, which contains the names of the packs > to group into a new pack-file. We can use a .keep file to preserve the pack-file(s) in > the MIDX but also to ensure multiple pack-files outside of the MIDX are repacked and > deleted. While this is very unlikely in the wild, it is definitely possible. Great idea. > test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' ' > git init preserve && > test_when_finished "rm -fr preserve" && > ( > cd preserve && > midx=.git/objects/pack/multi-pack-index && > > test_commit 1 && > HASH1=$(git pack-objects --all .git/objects/pack/pack) && > touch .git/objects/pack/pack-$HASH1.keep && > > cat >pack-input <<-\EOF && Escaping the heredoc shouldn't be necessary, so this can be written instead as '<<-EOF'. > HEAD > ^HEAD~1 > EOF > > test_commit 2 && > HASH2=$(git pack-objects --revs .git/objects/pack/pack <pack-input) && > touch .git/objects/pack/pack-$HASH2.keep && > > git multi-pack-index write && > cp $midx $midx.bak && > > test_commit 3 && > HASH3=$(git pack-objects --revs .git/objects/pack/pack <pack-input) && > > test_commit 4 && > HASH4=$(git pack-objects --revs .git/objects/pack/pack <pack-input) && > > GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -ad && > test_path_is_file .git/objects/pack/pack-$HASH1.pack && > test_path_is_file .git/objects/pack/pack-$HASH2.pack && > test_path_is_missing .git/objects/pack/pack-$HASH3.pack && > test_path_is_missing .git/objects/pack/pack-$HASH4.pack ...and we should check that 'test_cmp $midx.bak $midx' is clean, i.e., that we didn't touch the MIDX. > ) > ' > > I believe this checks your condition properly enough. Otherwise, I think that this test looks great. Thanks for suggesting it. I'll send a new patch now... > Thanks, > -Stolee Thanks, Taylor
Jeff King <peff@peff.net> writes: > Does "git repack" ever remove just one pack? Obviously "git repack -ad" > or "git repack -Ad" is going to pack everything and delete the old > packs. So I think we'd want to remove a midx there. AFAIK, the pack-redundant subcommand is used by nobody in-tree, so nobody is doing "there are three packfiles, but all the objects in one of them are contained in the other two, so let's remove that redundant one". > And "git repack -d" I think of as deleting only loose objects that we > just packed. But I guess it could also remove a pack that has now been > made redundant? That seems like a rare case in practice, but I suppose > is possible. Meaning it can become reality? Yes. Or it already can happen? I doubt it. > E.g., imagine we have a midx that points to packs A and B, and > git-repack deletes B. By your logic above, we need to remove the midx > because now it points to objects in B which aren't accessible. But by > deleting it, could we be deleting the only thing that mentions the > objects in A? > > I _think_ the answer is "no", because we never went all-in on midx and > allowed deleting the matching .idx files for contained packs. Yeah, it has been my assumption that that part of the design would never change. > I'm also a little curious how bad it is to have a midx whose pack has > gone away. I guess we'd answer queries for "yes, we have this object" > even if we don't, which is bad. Though in practice we'd only delete > those packs if we have their objects elsewhere. Hmph, object O used to be in pack A and pack B, midx points at the copy in pack B but we deleted it because the pack is redundant. Now, midx says O exists, but midx cannot be used to retrieve O. We need to ask A.idx about O to locate it. That sounds brittle. I am not sure our error fallback is all that patient. > And the pack code is > pretty good about retrying other copies of objects that can't be > accessed. Alternatively, I wonder if the midx-loading code ought to > check that all of the constituent packs are available. > > In that line of thinking, do we even need to delete midx files if one of > their packs goes away? The reading side probably ought to be able to > handle that gracefully. But at that point, is there even a point to have that midx file that knows about objects (so it can answer has_object()? correctly and quickly) but does not know the correct location of half of the objects? Instead of going directly to A.idx to locate O, we need to go to midx to learn the location of O in B (which no longer exists), and then fall back to it, that is a pure overhead. > And the more interesting case is when you repack everything with "-ad" > or similar, at which point you shouldn't even need to look up what's in > the midx to see if you deleted its packs. The point of your operation is > to put it all-into-one, so you know the old midx should be discarded. Old one, yes. Do we need to have the new one in that case? >> Teach 'git repack' to check for this by loading the MIDX, and checking >> whether the to-be-removed pack is known to the MIDX. This requires a >> slightly odd alternation to a test in t5319, which is explained with a >> comment. > > My above musings aside, this seems like an obvious improvement. Yes. >> diff --git a/builtin/repack.c b/builtin/repack.c >> index 04c5ceaf7e..98fac03946 100644 >> --- a/builtin/repack.c >> +++ b/builtin/repack.c >> @@ -133,7 +133,11 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list, >> static void remove_redundant_pack(const char *dir_name, const char *base_name) >> { >> struct strbuf buf = STRBUF_INIT; >> - strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name); >> + struct multi_pack_index *m = get_multi_pack_index(the_repository); >> + strbuf_addf(&buf, "%s.pack", base_name); >> + if (m && midx_contains_pack(m, buf.buf)) >> + clear_midx_file(the_repository); >> + strbuf_insertf(&buf, 0, "%s/", dir_name); > > Makes sense. midx_contains_pack() is a binary search, so we'll spend > O(n log n) effort deleting the packs (I wondered if this might be > accidentally quadratic over the number of packs). > > And after we clear, "m" will be NULL, so we'll do it at most once. Which > is why you can get rid of the manual "midx_cleared" flag from the > preimage. > > So the patch looks good to me. Thanks.
On Tue, Aug 25, 2020 at 08:58:46AM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > Does "git repack" ever remove just one pack? Obviously "git repack -ad" > > or "git repack -Ad" is going to pack everything and delete the old > > packs. So I think we'd want to remove a midx there. > > AFAIK, the pack-redundant subcommand is used by nobody in-tree, so > nobody is doing "there are three packfiles, but all the objects in > one of them are contained in the other two, so let's remove that > redundant one". Not AFAICT either. > > And "git repack -d" I think of as deleting only loose objects that we > > just packed. But I guess it could also remove a pack that has now been > > made redundant? That seems like a rare case in practice, but I suppose > > is possible. > > Meaning it can become reality? Yes. Or it already can happen? I > doubt it. > > > E.g., imagine we have a midx that points to packs A and B, and > > git-repack deletes B. By your logic above, we need to remove the midx > > because now it points to objects in B which aren't accessible. But by > > deleting it, could we be deleting the only thing that mentions the > > objects in A? > > > > I _think_ the answer is "no", because we never went all-in on midx and > > allowed deleting the matching .idx files for contained packs. > > Yeah, it has been my assumption that that part of the design would > never change. > > > I'm also a little curious how bad it is to have a midx whose pack has > > gone away. I guess we'd answer queries for "yes, we have this object" > > even if we don't, which is bad. Though in practice we'd only delete > > those packs if we have their objects elsewhere. > > Hmph, object O used to be in pack A and pack B, midx points at the > copy in pack B but we deleted it because the pack is redundant. > Now, midx says O exists, but midx cannot be used to retrieve O. We > need to ask A.idx about O to locate it. > > That sounds brittle. I am not sure our error fallback is all that > patient. Me either. Like I said, I think that all of this is possible at least in theory, but in practice it may be somewhat annoying in cases like these. > > > And the pack code is > > pretty good about retrying other copies of objects that can't be > > accessed. Alternatively, I wonder if the midx-loading code ought to > > check that all of the constituent packs are available. > > > > In that line of thinking, do we even need to delete midx files if one of > > their packs goes away? The reading side probably ought to be able to > > handle that gracefully. > > But at that point, is there even a point to have that midx file that > knows about objects (so it can answer has_object()? correctly and > quickly) but does not know the correct location of half of the objects? > Instead of going directly to A.idx to locate O, we need to go to midx > to learn the location of O in B (which no longer exists), and then > fall back to it, that is a pure overhead. Well put. > > And the more interesting case is when you repack everything with "-ad" > > or similar, at which point you shouldn't even need to look up what's in > > the midx to see if you deleted its packs. The point of your operation is > > to put it all-into-one, so you know the old midx should be discarded. > > Old one, yes. Do we need to have the new one in that case? > > >> Teach 'git repack' to check for this by loading the MIDX, and checking > >> whether the to-be-removed pack is known to the MIDX. This requires a > >> slightly odd alternation to a test in t5319, which is explained with a > >> comment. > > > > My above musings aside, this seems like an obvious improvement. > > Yes. > > >> diff --git a/builtin/repack.c b/builtin/repack.c > >> index 04c5ceaf7e..98fac03946 100644 > >> --- a/builtin/repack.c > >> +++ b/builtin/repack.c > >> @@ -133,7 +133,11 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list, > >> static void remove_redundant_pack(const char *dir_name, const char *base_name) > >> { > >> struct strbuf buf = STRBUF_INIT; > >> - strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name); > >> + struct multi_pack_index *m = get_multi_pack_index(the_repository); > >> + strbuf_addf(&buf, "%s.pack", base_name); > >> + if (m && midx_contains_pack(m, buf.buf)) > >> + clear_midx_file(the_repository); > >> + strbuf_insertf(&buf, 0, "%s/", dir_name); > > > > Makes sense. midx_contains_pack() is a binary search, so we'll spend > > O(n log n) effort deleting the packs (I wondered if this might be > > accidentally quadratic over the number of packs). > > > > And after we clear, "m" will be NULL, so we'll do it at most once. Which > > is why you can get rid of the manual "midx_cleared" flag from the > > preimage. > > > > So the patch looks good to me. > > Thanks. Thanks, both. If you're ready, let's use the version in [1] for queueing. Thanks, Taylor [1]: https://lore.kernel.org/git/87a3b7a5a2f091e2a23c163a7d86effbbbedfa3a.1598371475.git.me@ttaylorr.com/
On 8/25/2020 11:58 AM, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > >> Does "git repack" ever remove just one pack? Obviously "git repack -ad" >> or "git repack -Ad" is going to pack everything and delete the old >> packs. So I think we'd want to remove a midx there. > > AFAIK, the pack-redundant subcommand is used by nobody in-tree, so > nobody is doing "there are three packfiles, but all the objects in > one of them are contained in the other two, so let's remove that > redundant one". > >> And "git repack -d" I think of as deleting only loose objects that we >> just packed. But I guess it could also remove a pack that has now been >> made redundant? That seems like a rare case in practice, but I suppose >> is possible. > > Meaning it can become reality? Yes. Or it already can happen? I > doubt it. > >> E.g., imagine we have a midx that points to packs A and B, and >> git-repack deletes B. By your logic above, we need to remove the midx >> because now it points to objects in B which aren't accessible. But by >> deleting it, could we be deleting the only thing that mentions the >> objects in A? >> >> I _think_ the answer is "no", because we never went all-in on midx and >> allowed deleting the matching .idx files for contained packs. > > Yeah, it has been my assumption that that part of the design would > never change. Yes, we should intend to keep the .idx files around forever even when using the multi-pack-index. The duplicated data overhead is not typically a real problem. The one caveat I would consider is if we wanted to let a multi-pack-index cover thin packs, but that would be a substantial change to the object database that I do not plan to tackle any time soon. >> I'm also a little curious how bad it is to have a midx whose pack has >> gone away. I guess we'd answer queries for "yes, we have this object" >> even if we don't, which is bad. Though in practice we'd only delete >> those packs if we have their objects elsewhere. > > Hmph, object O used to be in pack A and pack B, midx points at the > copy in pack B but we deleted it because the pack is redundant. > Now, midx says O exists, but midx cannot be used to retrieve O. We > need to ask A.idx about O to locate it. > > That sounds brittle. I am not sure our error fallback is all that > patient. The best I can see is that prepare_midx_pack() will return 1 if the pack no longer exists, and this would cause a die("error preparing packfile from multi-pack-index") in nth_midxed_pack_entry(), killing the following stack trace: + find_pack_entry():packfile.c + fill_midx_entry():midx.c + nth_midxed_pack_entry():midx.c Perhaps that die() is a bit over-zealous since we could return 1 through this stack and find_pack_entry() could continue searching for the object in the packed_git list. However, it could start returning false _negatives_ if there were duplicates of the object in the multi-pack-index but only the latest copy was deleted (and the object does not appear in a pack-file outside of the multi- pack-index). Thanks, -Stolee
On Tue, Aug 25, 2020 at 10:45:15AM -0400, Taylor Blau wrote: > > I wonder if its worth to trigger write_midx_file() to update the midx instead of > > just removing MIDX? > > There's no reason that we couldn't do this, but I don't think that it's > a very good idea, especially if the new 'git maintenance' command will > be able to do something like this by itself. > > I'm hesitant to add yet another option to 'git repack', which I have > always thought as a plumbing tool. That's important because callers > (like 'git maintenance' or user scripts) can 'git multi-pack-index write > ...' after their 'git repack' to generate a new MIDX if they want one. It may be worth thinking a bit about atomicity here, though. Rather than separate delete and write steps, would somebody want a sequence like: - we have a midx with packs A, B, C - we find out that pack C is redundant and want to drop it - we create a new midx with A and B in a tempfile - we atomically rename the new midx over the old - we delete pack C A simultaneous reader always sees a consistent midx. Whereas in a delete-then-rewrite scenario there is a moment where there is no midx, and they'd fall back to reading the individual idx files. It may not matter all that much, though, for two reasons: - reading individual idx files should still give a correct answer. It just may be a bit slower. - even with an atomic replacement, I think caching on the reader side may cause interesting effects. For instance, if a reader process opens the old midx, it will generally cache that knowledge in memory (including mmap the content). So even after the replacement and deletion of C, it may still try to use the old midx that references C. If we do care, though, that implies some cooperation between the deletion process and the midx code. Perhaps it even argues that repack should refuse to delete such a single pack at all, since it _isn't_ redundant. It's part of a midx, and the caller should rewrite the midx first itself, and _then_ look for redundant packs. -Peff
On Tue, Aug 25, 2020 at 11:42:24AM -0400, Taylor Blau wrote: > > test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' ' > > git init preserve && > > test_when_finished "rm -fr preserve" && > > ( > > cd preserve && > > midx=.git/objects/pack/multi-pack-index && > > > > test_commit 1 && > > HASH1=$(git pack-objects --all .git/objects/pack/pack) && > > touch .git/objects/pack/pack-$HASH1.keep && > > > > cat >pack-input <<-\EOF && > > Escaping the heredoc shouldn't be necessary, so this can be written > instead as '<<-EOF'. We usually go the opposite way: if a here-doc doesn't need interpolation, then we prefer to mark it as such to avoid surprising somebody who adds lines later (and might need quoting). Certainly you can argue that least-surprise would be in the other direction (i.e., people expect to interpolate by default, and you surprise anybody adding a variable reference), but for Git's test suite, the convention usually runs the other way. -Peff
On 8/25/2020 12:47 PM, Jeff King wrote: > It may be worth thinking a bit about atomicity here, though. Rather than > separate delete and write steps, would somebody want a sequence like: ... > If we do care, though, that implies some cooperation between the > deletion process and the midx code. Perhaps it even argues that repack > should refuse to delete such a single pack at all, since it _isn't_ > redundant. It's part of a midx, and the caller should rewrite the midx > first itself, and _then_ look for redundant packs. It is worth noting that we _do_ have a way to integrate the delete and write code using 'git multi-pack-index expire'. One way to resolve this atomicity would be to do the following inside the repack command: 1. Create and index the new pack. 2. git multi-pack-index write 3. git multi-pack-index expire While this _mostly_ works, we still have issues around a concurrent process opening a multi-pack-index before step (2) and trying to read from the pack-file deleted in step (3). For this reason, the 'incremental-repack' task in the maintenance builtin series [1] runs 'expire' then 'repack' (and not the opposite order). To be completely safe, we'd want to make sure that no Git command that started before the 'repack' command is still running when we run 'expire'. In practice, it suffices to run the 'incremental-repack' step at most once a day. [1] https://lore.kernel.org/git/a8d956dad6b3a81d0f1b63cbd48f36a5e1195ae8.1597760730.git.gitgitgadget@gmail.com/ This discussion points out a big reason why the multi-pack-index wasn't fully integrated with 'git repack': it can be tricky. Using a repacking strategy that is focused on the multi-pack-index is less tricky. I still think that this patch is moving us in a better direction. I'm making note of these possible improvements: 1. Have 'git repack' integrate with 'git multi-pack-index expire' when deleting pack-files after creating a new one (if a multi-pack-index exists). 2. Handle "missing packs" referenced by the multi-pack-index more gracefully, likely by disabling the multi-pack-index and calling reprepare_packed_git(). Thanks, -Stolee
On Tue, Aug 25, 2020 at 08:58:46AM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > Does "git repack" ever remove just one pack? Obviously "git repack -ad" > > or "git repack -Ad" is going to pack everything and delete the old > > packs. So I think we'd want to remove a midx there. > > AFAIK, the pack-redundant subcommand is used by nobody in-tree, so > nobody is doing "there are three packfiles, but all the objects in > one of them are contained in the other two, so let's remove that > redundant one". OK, that's the part I was missing. The discussion here and the statement from git-repack(1): -d After packing, if the newly created packs make some existing packs redundant, remove the redundant packs. Also run git prune-packed to remove redundant loose object files. made me think that it was running pack-redundant. But it doesn't seem to. It looks like we stopped doing so in 6ed64058e1 (git-repack: do not do complex redundancy check., 2005-11-19). As an aside, we tried using pack-redundant at GitHub several years ago for dropping packs that were replicated in alternates storage. It performs very poorly (quadratically, perhaps?) to the point that we found it unusable, and I implemented a "local-redundant" command to do the same thing more quickly. We ended up dropping that as we now migrate packs wholesale (rather than via fetch), so I never upstreamed it. I mention that here to warn people away from pack-redundant (I wondered at the time why nobody had noticed its terrible performance, but now I think the answer is just that nobody ever runs it). I wonder if we should even consider deprecating and removing it. I'm also happy to share local-redundant if anybody is interested (though as I've mentioned elsewhere, I think the larger scheme it was part of it also performed poorly and isn't worth pursuing). > > And "git repack -d" I think of as deleting only loose objects that we > > just packed. But I guess it could also remove a pack that has now been > > made redundant? That seems like a rare case in practice, but I suppose > > is possible. > > Meaning it can become reality? Yes. Or it already can happen? I > doubt it. I thought the latter, but after this thread I agree that it can't. > > I'm also a little curious how bad it is to have a midx whose pack has > > gone away. I guess we'd answer queries for "yes, we have this object" > > even if we don't, which is bad. Though in practice we'd only delete > > those packs if we have their objects elsewhere. > > Hmph, object O used to be in pack A and pack B, midx points at the > copy in pack B but we deleted it because the pack is redundant. > Now, midx says O exists, but midx cannot be used to retrieve O. We > need to ask A.idx about O to locate it. > > That sounds brittle. I am not sure our error fallback is all that > patient. It has to be that patient even without a midx, I think. We might open A.idx and mmap its contents, without opening the matching A.pack (or we might open it and later close it due to memory or descriptor constraints). And we have two layers there: - when object_info_extended sees a bad return from packed_object_info(), it will mark the entry as bad and recurse, giving us an opportunity to find another one - when we can't find the object at all (e.g., because we marked that particular pack's version as inaccessible), we'll hit the reprepare_packed_git() path and look for a new idx Those same things should be kicking in for midx lookups, too (I didn't test them, but from a cursory look at the organization of the code, I think they will). > > In that line of thinking, do we even need to delete midx files if one of > > their packs goes away? The reading side probably ought to be able to > > handle that gracefully. > > But at that point, is there even a point to have that midx file that > knows about objects (so it can answer has_object()? correctly and > quickly) but does not know the correct location of half of the objects? > Instead of going directly to A.idx to locate O, we need to go to midx > to learn the location of O in B (which no longer exists), and then > fall back to it, that is a pure overhead. It is overhead for sure. But my thinking was that you're trading one efficiency for another: - the midx may incur an extra lookup for objects in the redundant pack, but that pack may be a small portion of what the midx indexes (and this is likely in the usual case that you have one big cumulative pack and many recently-created smaller ones; the big one will never be made redundant and dominates the object count of the midx) - by keeping the midx, you're improving lookup speed for all of the other objects, which don't have to hunt through separate pack idx files So my guess is that it would usually be a win. Though really the correct answer is: if you are mucking with packs you should just generate a new midx (or you should pack all-into-one, which generates a single idx accomplishing the same thing). > > And the more interesting case is when you repack everything with "-ad" > > or similar, at which point you shouldn't even need to look up what's in > > the midx to see if you deleted its packs. The point of your operation is > > to put it all-into-one, so you know the old midx should be discarded. > > Old one, yes. Do we need to have the new one in that case? You shouldn't need one since you'd only have a single pack now. -Peff
On Tue, Aug 25, 2020 at 01:10:13PM -0400, Derrick Stolee wrote: > > If we do care, though, that implies some cooperation between the > > deletion process and the midx code. Perhaps it even argues that repack > > should refuse to delete such a single pack at all, since it _isn't_ > > redundant. It's part of a midx, and the caller should rewrite the midx > > first itself, and _then_ look for redundant packs. > > It is worth noting that we _do_ have a way to integrate the delete and > write code using 'git multi-pack-index expire'. One way to resolve this > atomicity would be to do the following inside the repack command: > > 1. Create and index the new pack. > 2. git multi-pack-index write > 3. git multi-pack-index expire Given that discussion elsewhere points to git-repack only really deleting packs in all-in-one mode (and not ever a single pack), it seems like we can really be much simpler here. If we're not deleting packs via all-in-one, there's no need to touch the midx at all. If we are, then it's reasonable to delete the midx immediately (after having written our new pack but before deleting) since our new single pack idx is as good or better. I.e., drop step 2 above, and make step 3 just clear_midx_file(). Which is roughly what the code does now, isn't it? Or is there some reason that "expire" is more interesting than just clearing? And if anybody does want to drop single packs, etc, they can do so by generating a sensible midx separately from the repack operation (and probably doing so before dropping the packs for atomicity). -Peff
On Tue, Aug 25, 2020 at 01:29:01PM -0400, Jeff King wrote: > On Tue, Aug 25, 2020 at 01:10:13PM -0400, Derrick Stolee wrote: > > > > If we do care, though, that implies some cooperation between the > > > deletion process and the midx code. Perhaps it even argues that repack > > > should refuse to delete such a single pack at all, since it _isn't_ > > > redundant. It's part of a midx, and the caller should rewrite the midx > > > first itself, and _then_ look for redundant packs. > > > > It is worth noting that we _do_ have a way to integrate the delete and > > write code using 'git multi-pack-index expire'. One way to resolve this > > atomicity would be to do the following inside the repack command: > > > > 1. Create and index the new pack. > > 2. git multi-pack-index write > > 3. git multi-pack-index expire > > Given that discussion elsewhere points to git-repack only really > deleting packs in all-in-one mode (and not ever a single pack), it seems > like we can really be much simpler here. If we're not deleting packs via > all-in-one, there's no need to touch the midx at all. If we are, then > it's reasonable to delete the midx immediately (after having written our > new pack but before deleting) since our new single pack idx is as good > or better. > > I.e., drop step 2 above, and make step 3 just clear_midx_file(). Which > is roughly what the code does now, isn't it? Or is there some reason > that "expire" is more interesting than just clearing? It's not clear to me whether you're talking about my patch, or what a more full integration with 'git repack' looks like. If you are talking about my patch, I disagree: checking that the MIDX doesn't know about a pack we're dropping *is* useful even without all-in-one, because of '.keep' packs (as demonstrated by the new test in my patch). To me, this patch seems like an incremental step in the direction that we ultimately want to be going, but it's hard to untangle whether the ensuing discussion is targeted at my patch, or the ultimate goal. > And if anybody does want to drop single packs, etc, they can do so by > generating a sensible midx separately from the repack operation (and > probably doing so before dropping the packs for atomicity). > > -Peff Thanks, Taylor
On Tue, Aug 25, 2020 at 12:18:42PM -0400, Derrick Stolee wrote: > The best I can see is that prepare_midx_pack() will return 1 if the > pack no longer exists, and this would cause a die("error preparing > packfile from multi-pack-index") in nth_midxed_pack_entry(), killing > the following stack trace: > > + find_pack_entry():packfile.c > + fill_midx_entry():midx.c > + nth_midxed_pack_entry():midx.c > > Perhaps that die() is a bit over-zealous since we could return 1 > through this stack and find_pack_entry() could continue searching > for the object in the packed_git list. However, it could start > returning false _negatives_ if there were duplicates of the object > in the multi-pack-index but only the latest copy was deleted (and > the object does not appear in a pack-file outside of the multi- > pack-index). Hmm, yeah. I thought this code is already doing the right thing, because I'd expect the is_pack_valid() call later in nth_midxed_pack_entry() to be where we notice the problem. But add_packed_git() does stat the packfile and return an error. So that die() really ought to be just "return 0". The caller already has to (and does) handle similar errors (including that the pack went away after we added it to the packed_git list, or that it exists but has bogus data, etc). -Peff
On Tue, Aug 25, 2020 at 01:34:25PM -0400, Taylor Blau wrote: > > I.e., drop step 2 above, and make step 3 just clear_midx_file(). Which > > is roughly what the code does now, isn't it? Or is there some reason > > that "expire" is more interesting than just clearing? > > It's not clear to me whether you're talking about my patch, or what a > more full integration with 'git repack' looks like. > > If you are talking about my patch, I disagree: checking that the MIDX > doesn't know about a pack we're dropping *is* useful even without > all-in-one, because of '.keep' packs (as demonstrated by the new test in > my patch). I hadn't considered .keep. So all-in-one may still involve selectively deleting some packs. It makes sense, then, for repack to consider whether the midx is actually redundant or not rather than just always clearing it (i.e., what your patch does). In general I consider .keep packs kind of an awful feature that introduces a lot of confusion and works against other features like bitmaps. But I guess that they're the only thing that allows you to have a gigantic Windows-like repo where you never fully pack it, but just keep acquiring a string of big packs. Which is the exact case that the midx is most useful for. So they're definitely worth considering and supporting here. > To me, this patch seems like an incremental step in the direction that > we ultimately want to be going, but it's hard to untangle whether the > ensuing discussion is targeted at my patch, or the ultimate goal. I wasn't sure of the answer to that until we untangled more. I.e., it wasn't clear to me if your incremental step was even in the right direction if we weren't in fact ever selectively deleting packs in git-repack. (And now it sounds like we aren't via "git repack -d", which was my original question, but we are via .keep). -Peff
Jeff King <peff@peff.net> writes: > OK, that's the part I was missing. The discussion here and the statement > from git-repack(1): > > -d > After packing, if the newly created packs make some existing packs > redundant, remove the redundant packs. Also run git prune-packed > to remove redundant loose object files. > > made me think that it was running pack-redundant. But it doesn't seem > to. It looks like we stopped doing so in 6ed64058e1 (git-repack: do not > do complex redundancy check., 2005-11-19). Thanks for digging. A good opportunity for a #leftoverbits documentation update from new people is here. > As an aside, we tried using pack-redundant at GitHub several years ago > for dropping packs that were replicated in alternates storage. It > performs very poorly (quadratically, perhaps?) to the point that we > found it unusable,... Yes, I originally wrote "the pack-redundant subcommand" in the message you are responding to with a bit more colourful adjectives, but rewrote it ;-) My recollection from the last time I looked at it is that it is quadratic or even worse---that was long time ago, but on the other hand I think the subcommand had no significant improvement over the course of its life. Perhaps it is time to drop it. -- >8 -- Subject: [RFC] pack-redundant: gauge the usage before proposing its removal The subcommand is unusably slow and the reason why nobody reports it as a performance bug is suspected to be the absense of users. Let's disable the normal use of command by making it error out with a big message that asks the user to tell us that they still care about the command, with an escape hatch to override it with a command line option. Signed-off-by: Junio C Hamano <gitster@pobox.com> --- builtin/pack-redundant.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/builtin/pack-redundant.c b/builtin/pack-redundant.c index 178e3409b7..97cf3df79b 100644 --- a/builtin/pack-redundant.c +++ b/builtin/pack-redundant.c @@ -554,6 +554,7 @@ static void load_all(void) int cmd_pack_redundant(int argc, const char **argv, const char *prefix) { int i; + int i_still_use_this = 0; struct pack_list *min = NULL, *red, *pl; struct llist *ignore; struct object_id *oid; @@ -580,12 +581,25 @@ int cmd_pack_redundant(int argc, const char **argv, const char *prefix) alt_odb = 1; continue; } + if (!strcmp(arg, "--i-still-use-this")) { + i_still_use_this = 1; + continue; + } if (*arg == '-') usage(pack_redundant_usage); else break; } + if (!i_still_use_this) { + puts(_("'git pack-redundant' is nominated for removal.\n" + "If you still use this command, please add an extra\n" + "option, '--i-still-use-this', on the command line\n" + "and let us know you still use it by sending an e-mail\n" + "to <git@vger.kernel.org>. Thanks\n")); + exit(1); + } + if (load_all_packs) load_all(); else
On Tue, Aug 25, 2020 at 11:05:09AM -0700, Junio C Hamano wrote: > Subject: [RFC] pack-redundant: gauge the usage before proposing its removal > > The subcommand is unusably slow and the reason why nobody reports it > as a performance bug is suspected to be the absense of users. Let's > disable the normal use of command by making it error out with a big > message that asks the user to tell us that they still care about the > command, with an escape hatch to override it with a command line > option. I like this direction. And I'm tempted to say it's OK to go straight here given how generally useless the command is. But it feels like a big jump for the initial change. While we give the user an easy escape hatch, it might still break their scripts. A more gentle transition would I guess be: 1. Mention deprecation in release notes (hah, as if anybody reads them). 2. Issue a warning but continue to behave as normal. That might break scripts that care a lot about stderr, but otherwise is harmless. No clue if anybody would actually see the message or not. 3. Flip warning to error, as your patch does. 4. Removal. I'd guess we could do 1+2 in one release, then 3 a release or two later, and then finally 4. I dunno. That's more tedious and I'll be surprised if we get any bites, but maybe we ought to do it out of principle. -Peff
diff --git a/builtin/repack.c b/builtin/repack.c index 04c5ceaf7e..98fac03946 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -133,7 +133,11 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list, static void remove_redundant_pack(const char *dir_name, const char *base_name) { struct strbuf buf = STRBUF_INIT; - strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name); + struct multi_pack_index *m = get_multi_pack_index(the_repository); + strbuf_addf(&buf, "%s.pack", base_name); + if (m && midx_contains_pack(m, buf.buf)) + clear_midx_file(the_repository); + strbuf_insertf(&buf, 0, "%s/", dir_name); unlink_pack_path(buf.buf, 1); strbuf_release(&buf); } @@ -286,7 +290,6 @@ int cmd_repack(int argc, const char **argv, const char *prefix) int keep_unreachable = 0; struct string_list keep_pack_list = STRING_LIST_INIT_NODUP; int no_update_server_info = 0; - int midx_cleared = 0; struct pack_objects_args po_args = {NULL}; struct option builtin_repack_options[] = { @@ -439,11 +442,6 @@ int cmd_repack(int argc, const char **argv, const char *prefix) for (ext = 0; ext < ARRAY_SIZE(exts); ext++) { char *fname, *fname_old; - if (!midx_cleared) { - clear_midx_file(the_repository); - midx_cleared = 1; - } - fname = mkpathdup("%s/pack-%s%s", packdir, item->string, exts[ext].name); if (!file_exists(fname)) { diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index 43b1b5b2af..16a1ad040e 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -382,12 +382,29 @@ test_expect_success 'repack with the --no-progress option' ' test_line_count = 0 err ' -test_expect_success 'repack removes multi-pack-index' ' +test_expect_success 'repack removes multi-pack-index when deleting packs' ' test_path_is_file $objdir/pack/multi-pack-index && - GIT_TEST_MULTI_PACK_INDEX=0 git repack -adf && + # Set GIT_TEST_MULTI_PACK_INDEX to 0 to avoid writing a new + # multi-pack-index after repacking, but set "core.multiPackIndex" to + # true so that "git repack" can read the existing MIDX. + GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -adf && test_path_is_missing $objdir/pack/multi-pack-index ' +test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' ' + git multi-pack-index write && + cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak && + test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" && + + # Write a new pack that is unknown to the multi-pack-index. + git hash-object -w </dev/null >blob && + git pack-objects $objdir/pack/pack <blob && + + GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d && + test_cmp_bin $objdir/pack/multi-pack-index \ + $objdir/pack/multi-pack-index.bak +' + compare_results_with_midx "after repack" test_expect_success 'multi-pack-index and pack-bitmap' '
In 525e18c04b (midx: clear midx on repack, 2018-07-12), 'git repack' learned to remove a multi-pack-index file if it added or removed a pack from the object store. This mechanism is a little over-eager, since it is only necessary to drop a MIDX if 'git repack' removes a pack that the MIDX references. Adding a pack outside of the MIDX does not require invalidating the MIDX, and likewise for removing a pack the MIDX does not know about. Teach 'git repack' to check for this by loading the MIDX, and checking whether the to-be-removed pack is known to the MIDX. This requires a slightly odd alternation to a test in t5319, which is explained with a comment. Signed-off-by: Taylor Blau <me@ttaylorr.com> --- Something I noticed while working on a MIDX-related topic. Not critical that we take this now since it's only dropping the MIDX too aggressively, but not otherwise doing anything destructive. But, this will make my somewhat large other series a little smaller. builtin/repack.c | 12 +++++------- t/t5319-multi-pack-index.sh | 21 +++++++++++++++++++-- 2 files changed, 24 insertions(+), 9 deletions(-) -- 2.28.0.rc1.13.ge78abce653