diff mbox series

builtin/repack.c: invalidate MIDX only when necessary

Message ID ef9186a8df0d712c2ecccbe62cb43a7abadb9c96.1598320716.git.me@ttaylorr.com (mailing list archive)
State Superseded
Headers show
Series builtin/repack.c: invalidate MIDX only when necessary | expand

Commit Message

Taylor Blau Aug. 25, 2020, 2:01 a.m. UTC
In 525e18c04b (midx: clear midx on repack, 2018-07-12), 'git repack'
learned to remove a multi-pack-index file if it added or removed a pack
from the object store.

This mechanism is a little over-eager, since it is only necessary to
drop a MIDX if 'git repack' removes a pack that the MIDX references.
Adding a pack outside of the MIDX does not require invalidating the
MIDX, and likewise for removing a pack the MIDX does not know about.

Teach 'git repack' to check for this by loading the MIDX, and checking
whether the to-be-removed pack is known to the MIDX. This requires a
slightly odd alternation to a test in t5319, which is explained with a
comment.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
Something I noticed while working on a MIDX-related topic. Not critical
that we take this now since it's only dropping the MIDX too
aggressively, but not otherwise doing anything destructive. But, this
will make my somewhat large other series a little smaller.

 builtin/repack.c            | 12 +++++-------
 t/t5319-multi-pack-index.sh | 21 +++++++++++++++++++--
 2 files changed, 24 insertions(+), 9 deletions(-)

--
2.28.0.rc1.13.ge78abce653

Comments

Jeff King Aug. 25, 2020, 2:26 a.m. UTC | #1
On Mon, Aug 24, 2020 at 10:01:04PM -0400, Taylor Blau wrote:

> In 525e18c04b (midx: clear midx on repack, 2018-07-12), 'git repack'
> learned to remove a multi-pack-index file if it added or removed a pack
> from the object store.
> 
> This mechanism is a little over-eager, since it is only necessary to
> drop a MIDX if 'git repack' removes a pack that the MIDX references.
> Adding a pack outside of the MIDX does not require invalidating the
> MIDX, and likewise for removing a pack the MIDX does not know about.

Does "git repack" ever remove just one pack? Obviously "git repack -ad"
or "git repack -Ad" is going to pack everything and delete the old
packs. So I think we'd want to remove a midx there.

And "git repack -d" I think of as deleting only loose objects that we
just packed. But I guess it could also remove a pack that has now been
made redundant? That seems like a rare case in practice, but I suppose
is possible.

Not exactly related to your fix, but kind of the flip side of it: would
we ever need to retain a midx that mentions some packs that still exist?

E.g., imagine we have a midx that points to packs A and B, and
git-repack deletes B. By your logic above, we need to remove the midx
because now it points to objects in B which aren't accessible. But by
deleting it, could we be deleting the only thing that mentions the
objects in A?

I _think_ the answer is "no", because we never went all-in on midx and
allowed deleting the matching .idx files for contained packs. So we'd
still have that A.idx, and we could just use the pack as normal. But
it's an interesting corner case if we ever do go in that direction.

If you'll let me muse a bit more on midx-lifetime issues (which I've
never really thought about before just now):

I'm also a little curious how bad it is to have a midx whose pack has
gone away. I guess we'd answer queries for "yes, we have this object"
even if we don't, which is bad. Though in practice we'd only delete
those packs if we have their objects elsewhere. And the pack code is
pretty good about retrying other copies of objects that can't be
accessed. Alternatively, I wonder if the midx-loading code ought to
check that all of the constituent packs are available.

In that line of thinking, do we even need to delete midx files if one of
their packs goes away? The reading side probably ought to be able to
handle that gracefully.

And the more interesting case is when you repack everything with "-ad"
or similar, at which point you shouldn't even need to look up what's in
the midx to see if you deleted its packs. The point of your operation is
to put it all-into-one, so you know the old midx should be discarded.

> Teach 'git repack' to check for this by loading the MIDX, and checking
> whether the to-be-removed pack is known to the MIDX. This requires a
> slightly odd alternation to a test in t5319, which is explained with a
> comment.

My above musings aside, this seems like an obvious improvement.

> diff --git a/builtin/repack.c b/builtin/repack.c
> index 04c5ceaf7e..98fac03946 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -133,7 +133,11 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list,
>  static void remove_redundant_pack(const char *dir_name, const char *base_name)
>  {
>  	struct strbuf buf = STRBUF_INIT;
> -	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
> +	struct multi_pack_index *m = get_multi_pack_index(the_repository);
> +	strbuf_addf(&buf, "%s.pack", base_name);
> +	if (m && midx_contains_pack(m, buf.buf))
> +		clear_midx_file(the_repository);
> +	strbuf_insertf(&buf, 0, "%s/", dir_name);

Makes sense. midx_contains_pack() is a binary search, so we'll spend
O(n log n) effort deleting the packs (I wondered if this might be
accidentally quadratic over the number of packs).

And after we clear, "m" will be NULL, so we'll do it at most once. Which
is why you can get rid of the manual "midx_cleared" flag from the
preimage.

So the patch looks good to me.

-Peff
Taylor Blau Aug. 25, 2020, 2:37 a.m. UTC | #2
On Mon, Aug 24, 2020 at 10:26:14PM -0400, Jeff King wrote:
> On Mon, Aug 24, 2020 at 10:01:04PM -0400, Taylor Blau wrote:
>
> > In 525e18c04b (midx: clear midx on repack, 2018-07-12), 'git repack'
> > learned to remove a multi-pack-index file if it added or removed a pack
> > from the object store.
> >
> > This mechanism is a little over-eager, since it is only necessary to
> > drop a MIDX if 'git repack' removes a pack that the MIDX references.
> > Adding a pack outside of the MIDX does not require invalidating the
> > MIDX, and likewise for removing a pack the MIDX does not know about.
>
> Does "git repack" ever remove just one pack? Obviously "git repack -ad"
> or "git repack -Ad" is going to pack everything and delete the old
> packs. So I think we'd want to remove a midx there.
>
> And "git repack -d" I think of as deleting only loose objects that we
> just packed. But I guess it could also remove a pack that has now been
> made redundant? That seems like a rare case in practice, but I suppose
> is possible.

Yeah, the patch message makes this sound more likely than it actually
is, which I agree is very rare. I often write 'git repack' instead of
'git pack-objects' to slurp up everything loose into a new pack without
having to list loose objects by name.

That's the case that I really care about here: purely adding a new pack
should not invalidate the existing MIDX.

> Not exactly related to your fix, but kind of the flip side of it: would
> we ever need to retain a midx that mentions some packs that still exist?
>
> E.g., imagine we have a midx that points to packs A and B, and
> git-repack deletes B. By your logic above, we need to remove the midx
> because now it points to objects in B which aren't accessible. But by
> deleting it, could we be deleting the only thing that mentions the
> objects in A?
>
> I _think_ the answer is "no", because we never went all-in on midx and
> allowed deleting the matching .idx files for contained packs. So we'd
> still have that A.idx, and we could just use the pack as normal. But
> it's an interesting corner case if we ever do go in that direction.

Agreed. Maybe a (admittedly somewhat large) #leftoverbits.

> If you'll let me muse a bit more on midx-lifetime issues (which I've
> never really thought about before just now):
>
> I'm also a little curious how bad it is to have a midx whose pack has
> gone away. I guess we'd answer queries for "yes, we have this object"
> even if we don't, which is bad. Though in practice we'd only delete
> those packs if we have their objects elsewhere. And the pack code is
> pretty good about retrying other copies of objects that can't be
> accessed. Alternatively, I wonder if the midx-loading code ought to
> check that all of the constituent packs are available.
>
> In that line of thinking, do we even need to delete midx files if one of
> their packs goes away? The reading side probably ought to be able to
> handle that gracefully.

I think that this is probably the right direction, although I've only
spend time in the MIDX code over the past couple of weeks, so I can't
say with authority. It seems like it would be pretty annoying, though.
For example, code that cares about listing all objects in a MIDX would
have to check first whether the pack they're in still exists before
emitting them. On top of that, there are more corner cases when object X
exists in more than one pack, but some strict subset of those packs
containing X have gone away.

I don't think that it couldn't be done, though.

> And the more interesting case is when you repack everything with "-ad"
> or similar, at which point you shouldn't even need to look up what's in
> the midx to see if you deleted its packs. The point of your operation is
> to put it all-into-one, so you know the old midx should be discarded.
>
> > Teach 'git repack' to check for this by loading the MIDX, and checking
> > whether the to-be-removed pack is known to the MIDX. This requires a
> > slightly odd alternation to a test in t5319, which is explained with a
> > comment.
>
> My above musings aside, this seems like an obvious improvement.
>
> > diff --git a/builtin/repack.c b/builtin/repack.c
> > index 04c5ceaf7e..98fac03946 100644
> > --- a/builtin/repack.c
> > +++ b/builtin/repack.c
> > @@ -133,7 +133,11 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list,
> >  static void remove_redundant_pack(const char *dir_name, const char *base_name)
> >  {
> >  	struct strbuf buf = STRBUF_INIT;
> > -	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
> > +	struct multi_pack_index *m = get_multi_pack_index(the_repository);
> > +	strbuf_addf(&buf, "%s.pack", base_name);
> > +	if (m && midx_contains_pack(m, buf.buf))
> > +		clear_midx_file(the_repository);
> > +	strbuf_insertf(&buf, 0, "%s/", dir_name);
>
> Makes sense. midx_contains_pack() is a binary search, so we'll spend
> O(n log n) effort deleting the packs (I wondered if this might be
> accidentally quadratic over the number of packs).

Right. The MIDX stores packs in lexographic order, so checking them is
O(log n), which we do at most 'n' times.

> And after we clear, "m" will be NULL, so we'll do it at most once. Which
> is why you can get rid of the manual "midx_cleared" flag from the
> preimage.

Yep. I thought briefly about passing 'm' as a parameter, but then you
have to worry about a dangling reference to
'the_repository->objects->multi_pack_index' after calling
'clear_midx_file()', so it's easier to look it up each time.

> So the patch looks good to me.

Thanks.

> -Peff

Thanks,
Taylor
Son Luong Ngoc Aug. 25, 2020, 7:55 a.m. UTC | #3
Hi Taylor,

Thanks for working on this.

> On Aug 25, 2020, at 04:01, Taylor Blau <me@ttaylorr.com> wrote:
> 
> In 525e18c04b (midx: clear midx on repack, 2018-07-12), 'git repack'
> learned to remove a multi-pack-index file if it added or removed a pack
> from the object store.
> 
> This mechanism is a little over-eager, since it is only necessary to
> drop a MIDX if 'git repack' removes a pack that the MIDX references.
> Adding a pack outside of the MIDX does not require invalidating the
> MIDX, and likewise for removing a pack the MIDX does not know about.

I wonder if its worth to trigger write_midx_file() to update the midx instead of
just removing MIDX?

That is already the direction we are taking in the 'maintenance' patch series
whenever the multi-pack-index file was deemed invalid.

Or perhaps, we can check for 'core.multiPackIndex' value (which recently is 
'true' by default) and determine whether we should remove the MIDX or rewrite
it?

> 
> Teach 'git repack' to check for this by loading the MIDX, and checking
> whether the to-be-removed pack is known to the MIDX. This requires a
> slightly odd alternation to a test in t5319, which is explained with a
> comment.
> 
> Signed-off-by: Taylor Blau <me@ttaylorr.com>

Cheers,
Son Luong.
Derrick Stolee Aug. 25, 2020, 12:45 p.m. UTC | #4
On 8/25/2020 3:55 AM, Son Luong Ngoc wrote:
> Hi Taylor,
> 
> Thanks for working on this.
> 
>> On Aug 25, 2020, at 04:01, Taylor Blau <me@ttaylorr.com> wrote:
>>
>> In 525e18c04b (midx: clear midx on repack, 2018-07-12), 'git repack'
>> learned to remove a multi-pack-index file if it added or removed a pack
>> from the object store.
>>
>> This mechanism is a little over-eager, since it is only necessary to
>> drop a MIDX if 'git repack' removes a pack that the MIDX references.
>> Adding a pack outside of the MIDX does not require invalidating the
>> MIDX, and likewise for removing a pack the MIDX does not know about.
> 
> I wonder if its worth to trigger write_midx_file() to update the midx instead of
> just removing MIDX?

It would be reasonable to have 'git repack' run write_midx_file() after updating
the pack-files, but it still needs to delete the existing multi-pack-index after
deleting a referenced pack, as the current multi-pack-index will be incorrect,
leading to possibly invalid data during the write. 'git multi-pack-index expire'
carefully handles deleting pack-files that have become redundant.

It may be possible to change the behavior of write_midx_file() to check for the
state of each referenced pack-file, but it would then need to re-scan the
pack-indexes to rebuild the set of objects and which pack-files contain them.

> That is already the direction we are taking in the 'maintenance' patch series
> whenever the multi-pack-index file was deemed invalid.

The 'maintenance' builtin definitely adds new MIDX-aware maintenance tasks.
By performing a repack using the multi-pack-index as the starting point, we
can get around these issues.

One thing I've noticed by talking with Taylor about this change is that
'git repack' is a bit like 'git gc' in that it does a LOT of things and it
can be hard to understand exactly when certain sub-tasks are run or not.
There are likely some interesting operations that could be separated into
maintenance tasks. For example, 'git repack -d' is very similar to
'git maintenance run --task=loose-objects'.

> Or perhaps, we can check for 'core.multiPackIndex' value (which recently is 
> 'true' by default) and determine whether we should remove the MIDX or rewrite
> it?

We currently rewrite it during 'git repack' when GIT_TEST_MULTI_PACK_INDEX=1
to maximize how often we have a multi-pack-index in the test suite. It would
be simple to update that to also check a config option. I don't think adding
that behavior to 'core.multiPackIndex' is a good direction, though.

Thanks,
-Stolee
Derrick Stolee Aug. 25, 2020, 1:14 p.m. UTC | #5
On 8/24/2020 10:37 PM, Taylor Blau wrote:
> On Mon, Aug 24, 2020 at 10:26:14PM -0400, Jeff King wrote:
>> On Mon, Aug 24, 2020 at 10:01:04PM -0400, Taylor Blau wrote:
>>
>>> In 525e18c04b (midx: clear midx on repack, 2018-07-12), 'git repack'
>>> learned to remove a multi-pack-index file if it added or removed a pack
>>> from the object store.
>>>
>>> This mechanism is a little over-eager, since it is only necessary to
>>> drop a MIDX if 'git repack' removes a pack that the MIDX references.
>>> Adding a pack outside of the MIDX does not require invalidating the
>>> MIDX, and likewise for removing a pack the MIDX does not know about.
>>
>> Does "git repack" ever remove just one pack? Obviously "git repack -ad"
>> or "git repack -Ad" is going to pack everything and delete the old
>> packs. So I think we'd want to remove a midx there.
>>
>> And "git repack -d" I think of as deleting only loose objects that we
>> just packed. But I guess it could also remove a pack that has now been
>> made redundant? That seems like a rare case in practice, but I suppose
>> is possible.
> 
> Yeah, the patch message makes this sound more likely than it actually
> is, which I agree is very rare. I often write 'git repack' instead of
> 'git pack-objects' to slurp up everything loose into a new pack without
> having to list loose objects by name.
> 
> That's the case that I really care about here: purely adding a new pack
> should not invalidate the existing MIDX.
> 
>> Not exactly related to your fix, but kind of the flip side of it: would
>> we ever need to retain a midx that mentions some packs that still exist?
>>
>> E.g., imagine we have a midx that points to packs A and B, and
>> git-repack deletes B. By your logic above, we need to remove the midx
>> because now it points to objects in B which aren't accessible. But by
>> deleting it, could we be deleting the only thing that mentions the
>> objects in A?
>>
>> I _think_ the answer is "no", because we never went all-in on midx and
>> allowed deleting the matching .idx files for contained packs. So we'd
>> still have that A.idx, and we could just use the pack as normal. But
>> it's an interesting corner case if we ever do go in that direction.
> 
> Agreed. Maybe a (admittedly somewhat large) #leftoverbits.
> 
>> If you'll let me muse a bit more on midx-lifetime issues (which I've
>> never really thought about before just now):
>>
>> I'm also a little curious how bad it is to have a midx whose pack has
>> gone away. I guess we'd answer queries for "yes, we have this object"
>> even if we don't, which is bad. Though in practice we'd only delete
>> those packs if we have their objects elsewhere. And the pack code is
>> pretty good about retrying other copies of objects that can't be
>> accessed. Alternatively, I wonder if the midx-loading code ought to
>> check that all of the constituent packs are available.
>>
>> In that line of thinking, do we even need to delete midx files if one of
>> their packs goes away? The reading side probably ought to be able to
>> handle that gracefully.
> 
> I think that this is probably the right direction, although I've only
> spend time in the MIDX code over the past couple of weeks, so I can't
> say with authority. It seems like it would be pretty annoying, though.
> For example, code that cares about listing all objects in a MIDX would
> have to check first whether the pack they're in still exists before
> emitting them. On top of that, there are more corner cases when object X
> exists in more than one pack, but some strict subset of those packs
> containing X have gone away.
> 
> I don't think that it couldn't be done, though.
> 
>> And the more interesting case is when you repack everything with "-ad"
>> or similar, at which point you shouldn't even need to look up what's in
>> the midx to see if you deleted its packs. The point of your operation is
>> to put it all-into-one, so you know the old midx should be discarded.
>>
>>> Teach 'git repack' to check for this by loading the MIDX, and checking
>>> whether the to-be-removed pack is known to the MIDX. This requires a
>>> slightly odd alternation to a test in t5319, which is explained with a
>>> comment.
>>
>> My above musings aside, this seems like an obvious improvement.
>>
>>> diff --git a/builtin/repack.c b/builtin/repack.c
>>> index 04c5ceaf7e..98fac03946 100644
>>> --- a/builtin/repack.c
>>> +++ b/builtin/repack.c
>>> @@ -133,7 +133,11 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list,
>>>  static void remove_redundant_pack(const char *dir_name, const char *base_name)
>>>  {
>>>  	struct strbuf buf = STRBUF_INIT;
>>> -	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
>>> +	struct multi_pack_index *m = get_multi_pack_index(the_repository);
>>> +	strbuf_addf(&buf, "%s.pack", base_name);
>>> +	if (m && midx_contains_pack(m, buf.buf))
>>> +		clear_midx_file(the_repository);
>>> +	strbuf_insertf(&buf, 0, "%s/", dir_name);
>>
>> Makes sense. midx_contains_pack() is a binary search, so we'll spend
>> O(n log n) effort deleting the packs (I wondered if this might be
>> accidentally quadratic over the number of packs).
> 
> Right. The MIDX stores packs in lexographic order, so checking them is
> O(log n), which we do at most 'n' times.
> 
>> And after we clear, "m" will be NULL, so we'll do it at most once. Which
>> is why you can get rid of the manual "midx_cleared" flag from the
>> preimage.
> 
> Yep. I thought briefly about passing 'm' as a parameter, but then you
> have to worry about a dangling reference to
> 'the_repository->objects->multi_pack_index' after calling
> 'clear_midx_file()', so it's easier to look it up each time.

The discussion in this thread matches my understanding of the
situation.

>> So the patch looks good to me.

The code in builtin/repack.c looks good for sure. I have a quick question
about this new test:

+test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
+	git multi-pack-index write &&
+	cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
+	test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
+
+	# Write a new pack that is unknown to the multi-pack-index.
+	git hash-object -w </dev/null >blob &&
+	git pack-objects $objdir/pack/pack <blob &&
+
+	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
+	test_cmp_bin $objdir/pack/multi-pack-index \
+		$objdir/pack/multi-pack-index.bak
+'
+

You create an arbitrary blob, and then add it to a pack-file. Do we
know that 'git repack' is definitely creating a new pack-file that makes
our manually-created pack-file redundant?

My suggestion is to have the test check itself:

+test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
+	git multi-pack-index write &&
+	cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
+	test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
+
+	# Write a new pack that is unknown to the multi-pack-index.
+	git hash-object -w </dev/null >blob &&
+	HASH=$(git pack-objects $objdir/pack/pack <blob) &&
+
+	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
+	test_cmp_bin $objdir/pack/multi-pack-index \
+		$objdir/pack/multi-pack-index.bak &&
+	test_path_is_missing $objdir/pack/pack-$HASH.pack
+'
+

This test fails for me, on the 'test_path_is_missing'. Likely, the
blob is seen as already in a pack-file so is just pruned by 'git repack'
instead. I thought that perhaps we need to add a new pack ourselves that
overrides the small pack. Here is my attempt:

test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
	git multi-pack-index write &&
	cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
	test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&

	# Write a new pack that is unknown to the multi-pack-index.
	BLOB1=$(echo blob1 | git hash-object -w --stdin) &&
	BLOB2=$(echo blob2 | git hash-object -w --stdin) &&
	cat >blobs <<-EOF &&
	$BLOB1
	$BLOB2
	EOF
	HASH1=$(echo $BLOB1 | git pack-objects $objdir/pack/pack) &&
	HASH2=$(git pack-objects $objdir/pack/pack <blobs) &&
	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
	test_cmp_bin $objdir/pack/multi-pack-index \
		$objdir/pack/multi-pack-index.bak &&
	test_path_is_file $objdir/pack/pack-$HASH2.pack &&
	test_path_is_missing $objdir/pack/pack-$HASH1.pack
'

However, this _still_ fails on the "test_path_is_missing" line, so I'm not sure
how to make sure your logic is tested. I saw that 'git repack' was writing
"nothing new to pack" in the output, so I also tested adding a few commits and
trying to force it to repack reachable data, but I cannot seem to trigger it
to create a new pack that overrides only one pack that is not in the MIDX.

Likely, I just don't know how 'git rebase' works well enough to trigger this
behavior. But the test as-is is not testing what you want it to test.

Thanks,
-Stolee
Taylor Blau Aug. 25, 2020, 2:41 p.m. UTC | #6
On Tue, Aug 25, 2020 at 09:14:19AM -0400, Derrick Stolee wrote:
> On 8/24/2020 10:37 PM, Taylor Blau wrote:
> > On Mon, Aug 24, 2020 at 10:26:14PM -0400, Jeff King wrote:
> >> On Mon, Aug 24, 2020 at 10:01:04PM -0400, Taylor Blau wrote:
> >>
> >>> In 525e18c04b (midx: clear midx on repack, 2018-07-12), 'git repack'
> >>> learned to remove a multi-pack-index file if it added or removed a pack
> >>> from the object store.
> >>>
> >>> This mechanism is a little over-eager, since it is only necessary to
> >>> drop a MIDX if 'git repack' removes a pack that the MIDX references.
> >>> Adding a pack outside of the MIDX does not require invalidating the
> >>> MIDX, and likewise for removing a pack the MIDX does not know about.
> >>
> >> Does "git repack" ever remove just one pack? Obviously "git repack -ad"
> >> or "git repack -Ad" is going to pack everything and delete the old
> >> packs. So I think we'd want to remove a midx there.
> >>
> >> And "git repack -d" I think of as deleting only loose objects that we
> >> just packed. But I guess it could also remove a pack that has now been
> >> made redundant? That seems like a rare case in practice, but I suppose
> >> is possible.
> >
> > Yeah, the patch message makes this sound more likely than it actually
> > is, which I agree is very rare. I often write 'git repack' instead of
> > 'git pack-objects' to slurp up everything loose into a new pack without
> > having to list loose objects by name.
> >
> > That's the case that I really care about here: purely adding a new pack
> > should not invalidate the existing MIDX.
> >
> >> Not exactly related to your fix, but kind of the flip side of it: would
> >> we ever need to retain a midx that mentions some packs that still exist?
> >>
> >> E.g., imagine we have a midx that points to packs A and B, and
> >> git-repack deletes B. By your logic above, we need to remove the midx
> >> because now it points to objects in B which aren't accessible. But by
> >> deleting it, could we be deleting the only thing that mentions the
> >> objects in A?
> >>
> >> I _think_ the answer is "no", because we never went all-in on midx and
> >> allowed deleting the matching .idx files for contained packs. So we'd
> >> still have that A.idx, and we could just use the pack as normal. But
> >> it's an interesting corner case if we ever do go in that direction.
> >
> > Agreed. Maybe a (admittedly somewhat large) #leftoverbits.
> >
> >> If you'll let me muse a bit more on midx-lifetime issues (which I've
> >> never really thought about before just now):
> >>
> >> I'm also a little curious how bad it is to have a midx whose pack has
> >> gone away. I guess we'd answer queries for "yes, we have this object"
> >> even if we don't, which is bad. Though in practice we'd only delete
> >> those packs if we have their objects elsewhere. And the pack code is
> >> pretty good about retrying other copies of objects that can't be
> >> accessed. Alternatively, I wonder if the midx-loading code ought to
> >> check that all of the constituent packs are available.
> >>
> >> In that line of thinking, do we even need to delete midx files if one of
> >> their packs goes away? The reading side probably ought to be able to
> >> handle that gracefully.
> >
> > I think that this is probably the right direction, although I've only
> > spend time in the MIDX code over the past couple of weeks, so I can't
> > say with authority. It seems like it would be pretty annoying, though.
> > For example, code that cares about listing all objects in a MIDX would
> > have to check first whether the pack they're in still exists before
> > emitting them. On top of that, there are more corner cases when object X
> > exists in more than one pack, but some strict subset of those packs
> > containing X have gone away.
> >
> > I don't think that it couldn't be done, though.
> >
> >> And the more interesting case is when you repack everything with "-ad"
> >> or similar, at which point you shouldn't even need to look up what's in
> >> the midx to see if you deleted its packs. The point of your operation is
> >> to put it all-into-one, so you know the old midx should be discarded.
> >>
> >>> Teach 'git repack' to check for this by loading the MIDX, and checking
> >>> whether the to-be-removed pack is known to the MIDX. This requires a
> >>> slightly odd alternation to a test in t5319, which is explained with a
> >>> comment.
> >>
> >> My above musings aside, this seems like an obvious improvement.
> >>
> >>> diff --git a/builtin/repack.c b/builtin/repack.c
> >>> index 04c5ceaf7e..98fac03946 100644
> >>> --- a/builtin/repack.c
> >>> +++ b/builtin/repack.c
> >>> @@ -133,7 +133,11 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list,
> >>>  static void remove_redundant_pack(const char *dir_name, const char *base_name)
> >>>  {
> >>>  	struct strbuf buf = STRBUF_INIT;
> >>> -	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
> >>> +	struct multi_pack_index *m = get_multi_pack_index(the_repository);
> >>> +	strbuf_addf(&buf, "%s.pack", base_name);
> >>> +	if (m && midx_contains_pack(m, buf.buf))
> >>> +		clear_midx_file(the_repository);
> >>> +	strbuf_insertf(&buf, 0, "%s/", dir_name);
> >>
> >> Makes sense. midx_contains_pack() is a binary search, so we'll spend
> >> O(n log n) effort deleting the packs (I wondered if this might be
> >> accidentally quadratic over the number of packs).
> >
> > Right. The MIDX stores packs in lexographic order, so checking them is
> > O(log n), which we do at most 'n' times.
> >
> >> And after we clear, "m" will be NULL, so we'll do it at most once. Which
> >> is why you can get rid of the manual "midx_cleared" flag from the
> >> preimage.
> >
> > Yep. I thought briefly about passing 'm' as a parameter, but then you
> > have to worry about a dangling reference to
> > 'the_repository->objects->multi_pack_index' after calling
> > 'clear_midx_file()', so it's easier to look it up each time.
>
> The discussion in this thread matches my understanding of the
> situation.
>
> >> So the patch looks good to me.
>
> The code in builtin/repack.c looks good for sure. I have a quick question
> about this new test:
>
> +test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
> +	git multi-pack-index write &&
> +	cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
> +	test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
> +
> +	# Write a new pack that is unknown to the multi-pack-index.
> +	git hash-object -w </dev/null >blob &&
> +	git pack-objects $objdir/pack/pack <blob &&
> +
> +	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
> +	test_cmp_bin $objdir/pack/multi-pack-index \
> +		$objdir/pack/multi-pack-index.bak
> +'
> +
>
> You create an arbitrary blob, and then add it to a pack-file. Do we
> know that 'git repack' is definitely creating a new pack-file that makes
> our manually-created pack-file redundant?
>
> My suggestion is to have the test check itself:
>
> +test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
> +	git multi-pack-index write &&
> +	cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
> +	test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
> +
> +	# Write a new pack that is unknown to the multi-pack-index.
> +	git hash-object -w </dev/null >blob &&
> +	HASH=$(git pack-objects $objdir/pack/pack <blob) &&
> +
> +	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
> +	test_cmp_bin $objdir/pack/multi-pack-index \
> +		$objdir/pack/multi-pack-index.bak &&
> +	test_path_is_missing $objdir/pack/pack-$HASH.pack
> +'
> +
>
> This test fails for me, on the 'test_path_is_missing'. Likely, the
> blob is seen as already in a pack-file so is just pruned by 'git repack'
> instead. I thought that perhaps we need to add a new pack ourselves that
> overrides the small pack. Here is my attempt:
>
> test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
> 	git multi-pack-index write &&
> 	cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
> 	test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
>
> 	# Write a new pack that is unknown to the multi-pack-index.
> 	BLOB1=$(echo blob1 | git hash-object -w --stdin) &&
> 	BLOB2=$(echo blob2 | git hash-object -w --stdin) &&
> 	cat >blobs <<-EOF &&
> 	$BLOB1
> 	$BLOB2
> 	EOF
> 	HASH1=$(echo $BLOB1 | git pack-objects $objdir/pack/pack) &&
> 	HASH2=$(git pack-objects $objdir/pack/pack <blobs) &&
> 	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
> 	test_cmp_bin $objdir/pack/multi-pack-index \
> 		$objdir/pack/multi-pack-index.bak &&
> 	test_path_is_file $objdir/pack/pack-$HASH2.pack &&
> 	test_path_is_missing $objdir/pack/pack-$HASH1.pack
> '
>
> However, this _still_ fails on the "test_path_is_missing" line, so I'm not sure
> how to make sure your logic is tested. I saw that 'git repack' was writing
> "nothing new to pack" in the output, so I also tested adding a few commits and
> trying to force it to repack reachable data, but I cannot seem to trigger it
> to create a new pack that overrides only one pack that is not in the MIDX.
>
> Likely, I just don't know how 'git rebase' works well enough to trigger this
> behavior. But the test as-is is not testing what you want it to test.

I think this case might actually be impossible to tickle in a test. I
thought that 'git repack -d' looked for existing packs whose objects are
a subset of some new pack generated. But, it's much simpler than that:
'-d' by itself just looks for packs that were already on disk with the
same SHA-1 as a new pack, and it removes the old one.

Note that 'git repack' uses 'git pack-objects' internally to find
objects and generate a packfile. When calling 'git pack-objects', 'git
repack -d' passes '--all' and '--unpacked', which means that there is no
way we'd generate a new pack with the same SHA-1 as an existing pack
ordinarily.

So, I think this case is impossible, or at least astronomically
unlikely. What is more interesting (and untested) is that adding a _new_
pack doesn't cause us to invalidate the MIDX. Here's a patch that does
that:

  diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
  index 16a1ad040e..620f2058d6 100755
  --- a/t/t5319-multi-pack-index.sh
  +++ b/t/t5319-multi-pack-index.sh
  @@ -391,18 +391,27 @@ test_expect_success 'repack removes multi-pack-index when deleting packs' '
          test_path_is_missing $objdir/pack/multi-pack-index
   '

  -test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
  -       git multi-pack-index write &&
  -       cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
  -       test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
  -
  -       # Write a new pack that is unknown to the multi-pack-index.
  -       git hash-object -w </dev/null >blob &&
  -       git pack-objects $objdir/pack/pack <blob &&
  -
  -       GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
  -       test_cmp_bin $objdir/pack/multi-pack-index \
  -               $objdir/pack/multi-pack-index.bak
  +test_expect_success 'repack preserves multi-pack-index when creating packs' '
  +       git init preserve &&
  +       test_when_finished "rm -fr preserve" &&
  +       (
  +               cd preserve &&
  +               midx=.git/objects/pack/multi-pack-index &&
  +
  +               test_commit "initial" &&
  +               git repack -ad &&
  +               git multi-pack-index write &&
  +               ls .git/objects/pack | grep "\.pack$" >before &&
  +
  +               cp $midx $midx.bak &&
  +
  +               test_commit "another" &&
  +               GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
  +               ls .git/objects/pack | grep "\.pack$" >after &&
  +
  +               test_cmp_bin $midx.bak $midx &&
  +               ! test_cmp before after
  +       )
   '

   compare_results_with_midx "after repack"

What do you think about applying this on top and then calling it a day?

> Thanks,
> -Stolee

Thanks,
Taylor
Taylor Blau Aug. 25, 2020, 2:45 p.m. UTC | #7
On Tue, Aug 25, 2020 at 09:55:22AM +0200, Son Luong Ngoc wrote:
> Hi Taylor,
>
> Thanks for working on this.
>
> > On Aug 25, 2020, at 04:01, Taylor Blau <me@ttaylorr.com> wrote:
> >
> > In 525e18c04b (midx: clear midx on repack, 2018-07-12), 'git repack'
> > learned to remove a multi-pack-index file if it added or removed a pack
> > from the object store.
> >
> > This mechanism is a little over-eager, since it is only necessary to
> > drop a MIDX if 'git repack' removes a pack that the MIDX references.
> > Adding a pack outside of the MIDX does not require invalidating the
> > MIDX, and likewise for removing a pack the MIDX does not know about.
>
> I wonder if its worth to trigger write_midx_file() to update the midx instead of
> just removing MIDX?

There's no reason that we couldn't do this, but I don't think that it's
a very good idea, especially if the new 'git maintenance' command will
be able to do something like this by itself.

I'm hesitant to add yet another option to 'git repack', which I have
always thought as a plumbing tool. That's important because callers
(like 'git maintenance' or user scripts) can 'git multi-pack-index write
...' after their 'git repack' to generate a new MIDX if they want one.

This becomes even trickier if 'git multi-pack-index write' were to gain
new options, at which point 'git repack' would *also* have to learn
those options to plumb them through.

Maybe there's a legitimate case that I'm overlooking, but I don't think
so. If there is, I'd be happy to come back to this, but this patch is
really just focused on fixing a bug.

> That is already the direction we are taking in the 'maintenance' patch series
> whenever the multi-pack-index file was deemed invalid.
>
> Or perhaps, we can check for 'core.multiPackIndex' value (which recently is
> 'true' by default) and determine whether we should remove the MIDX or rewrite
> it?
>
> >
> > Teach 'git repack' to check for this by loading the MIDX, and checking
> > whether the to-be-removed pack is known to the MIDX. This requires a
> > slightly odd alternation to a test in t5319, which is explained with a
> > comment.
> >
> > Signed-off-by: Taylor Blau <me@ttaylorr.com>

Thanks for your feedback.

> Cheers,
> Son Luong.

Thanks,
Taylor
Derrick Stolee Aug. 25, 2020, 3:14 p.m. UTC | #8
On 8/25/2020 10:41 AM, Taylor Blau wrote:
> On Tue, Aug 25, 2020 at 09:14:19AM -0400, Derrick Stolee wrote:
>> The code in builtin/repack.c looks good for sure. I have a quick question
>> about this new test:
>>
>> +test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
>> +	git multi-pack-index write &&
>> +	cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
>> +	test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
>> +
>> +	# Write a new pack that is unknown to the multi-pack-index.
>> +	git hash-object -w </dev/null >blob &&
>> +	git pack-objects $objdir/pack/pack <blob &&
>> +
>> +	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
>> +	test_cmp_bin $objdir/pack/multi-pack-index \
>> +		$objdir/pack/multi-pack-index.bak
>> +'
>> +
>>
>> You create an arbitrary blob, and then add it to a pack-file. Do we
>> know that 'git repack' is definitely creating a new pack-file that makes
>> our manually-created pack-file redundant?
>>
>> My suggestion is to have the test check itself:
>>
>> +test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
>> +	git multi-pack-index write &&
>> +	cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
>> +	test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
>> +
>> +	# Write a new pack that is unknown to the multi-pack-index.
>> +	git hash-object -w </dev/null >blob &&
>> +	HASH=$(git pack-objects $objdir/pack/pack <blob) &&
>> +
>> +	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
>> +	test_cmp_bin $objdir/pack/multi-pack-index \
>> +		$objdir/pack/multi-pack-index.bak &&
>> +	test_path_is_missing $objdir/pack/pack-$HASH.pack
>> +'
>> +
>>
>> This test fails for me, on the 'test_path_is_missing'. Likely, the
>> blob is seen as already in a pack-file so is just pruned by 'git repack'
>> instead. I thought that perhaps we need to add a new pack ourselves that
>> overrides the small pack. Here is my attempt:
>>
>> test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
>> 	git multi-pack-index write &&
>> 	cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
>> 	test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
>>
>> 	# Write a new pack that is unknown to the multi-pack-index.
>> 	BLOB1=$(echo blob1 | git hash-object -w --stdin) &&
>> 	BLOB2=$(echo blob2 | git hash-object -w --stdin) &&
>> 	cat >blobs <<-EOF &&
>> 	$BLOB1
>> 	$BLOB2
>> 	EOF
>> 	HASH1=$(echo $BLOB1 | git pack-objects $objdir/pack/pack) &&
>> 	HASH2=$(git pack-objects $objdir/pack/pack <blobs) &&
>> 	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
>> 	test_cmp_bin $objdir/pack/multi-pack-index \
>> 		$objdir/pack/multi-pack-index.bak &&
>> 	test_path_is_file $objdir/pack/pack-$HASH2.pack &&
>> 	test_path_is_missing $objdir/pack/pack-$HASH1.pack
>> '
>>
>> However, this _still_ fails on the "test_path_is_missing" line, so I'm not sure
>> how to make sure your logic is tested. I saw that 'git repack' was writing
>> "nothing new to pack" in the output, so I also tested adding a few commits and
>> trying to force it to repack reachable data, but I cannot seem to trigger it
>> to create a new pack that overrides only one pack that is not in the MIDX.
>>
>> Likely, I just don't know how 'git rebase' works well enough to trigger this
>> behavior. But the test as-is is not testing what you want it to test.
> 
> I think this case might actually be impossible to tickle in a test. I
> thought that 'git repack -d' looked for existing packs whose objects are
> a subset of some new pack generated. But, it's much simpler than that:
> '-d' by itself just looks for packs that were already on disk with the
> same SHA-1 as a new pack, and it removes the old one.

If 'git repack' never calls remove_redundant_pack() unless we are doing
a "full repack", then we could simplify this logic:

 static void remove_redundant_pack(const char *dir_name, const char *base_name)
 {
 	struct strbuf buf = STRBUF_INIT;
-	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
+	struct multi_pack_index *m = get_multi_pack_index(the_repository);
+	strbuf_addf(&buf, "%s.pack", base_name);
+	if (m && midx_contains_pack(m, buf.buf))
+		clear_midx_file(the_repository);
+	strbuf_insertf(&buf, 0, "%s/", dir_name);
 	unlink_pack_path(buf.buf, 1);
 	strbuf_release(&buf);
 }

to

 static void remove_redundant_pack(const char *dir_name, const char *base_name)
 {
 	struct strbuf buf = STRBUF_INIT;
	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
+	clear_midx_file(the_repository);
 	unlink_pack_path(buf.buf, 1);
 	strbuf_release(&buf);
 }

and get the same results as we are showing in these tests. This does
move us incrementally to a better situation: don't delete the MIDX
if we aren't deleting pack files. But, I think we can get around it.

> Note that 'git repack' uses 'git pack-objects' internally to find
> objects and generate a packfile. When calling 'git pack-objects', 'git
> repack -d' passes '--all' and '--unpacked', which means that there is no
> way we'd generate a new pack with the same SHA-1 as an existing pack
> ordinarily.
> 
> So, I think this case is impossible, or at least astronomically
> unlikely. What is more interesting (and untested) is that adding a _new_
> pack doesn't cause us to invalidate the MIDX. Here's a patch that does
> that:
> 
>   diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
>   index 16a1ad040e..620f2058d6 100755
>   --- a/t/t5319-multi-pack-index.sh
>   +++ b/t/t5319-multi-pack-index.sh
>   @@ -391,18 +391,27 @@ test_expect_success 'repack removes multi-pack-index when deleting packs' '
>           test_path_is_missing $objdir/pack/multi-pack-index
>    '
> 
>   -test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
>   -       git multi-pack-index write &&
>   -       cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
>   -       test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
>   -
>   -       # Write a new pack that is unknown to the multi-pack-index.
>   -       git hash-object -w </dev/null >blob &&
>   -       git pack-objects $objdir/pack/pack <blob &&
>   -
>   -       GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
>   -       test_cmp_bin $objdir/pack/multi-pack-index \
>   -               $objdir/pack/multi-pack-index.bak
>   +test_expect_success 'repack preserves multi-pack-index when creating packs' '
>   +       git init preserve &&
>   +       test_when_finished "rm -fr preserve" &&
>   +       (
>   +               cd preserve &&
>   +               midx=.git/objects/pack/multi-pack-index &&
>   +
>   +               test_commit "initial" &&
>   +               git repack -ad &&
>   +               git multi-pack-index write &&
>   +               ls .git/objects/pack | grep "\.pack$" >before &&
>   +
>   +               cp $midx $midx.bak &&
>   +
>   +               test_commit "another" &&
>   +               GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
>   +               ls .git/objects/pack | grep "\.pack$" >after &&
>   +
>   +               test_cmp_bin $midx.bak $midx &&
>   +               ! test_cmp before after
>   +       )
>    '

After looking at the callers to remote_redundant_pack() I noticed that it is only
called after inspecting the "names" struct, which contains the names of the packs
to group into a new pack-file. We can use a .keep file to preserve the pack-file(s) in
the MIDX but also to ensure multiple pack-files outside of the MIDX are repacked and
deleted. While this is very unlikely in the wild, it is definitely possible.

test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
	git init preserve &&
	test_when_finished "rm -fr preserve" &&
	(
		cd preserve &&
		midx=.git/objects/pack/multi-pack-index &&

		test_commit 1 &&
		HASH1=$(git pack-objects --all .git/objects/pack/pack) &&
		touch .git/objects/pack/pack-$HASH1.keep &&

		cat >pack-input <<-\EOF &&
		HEAD
		^HEAD~1
		EOF

		test_commit 2 &&
		HASH2=$(git pack-objects --revs .git/objects/pack/pack <pack-input) &&
		touch .git/objects/pack/pack-$HASH2.keep &&

		git multi-pack-index write &&
		cp $midx $midx.bak &&

		test_commit 3 &&
		HASH3=$(git pack-objects --revs .git/objects/pack/pack <pack-input) &&

		test_commit 4 &&
		HASH4=$(git pack-objects --revs .git/objects/pack/pack <pack-input) &&

		GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -ad &&
		test_path_is_file .git/objects/pack/pack-$HASH1.pack &&
		test_path_is_file .git/objects/pack/pack-$HASH2.pack &&
		test_path_is_missing .git/objects/pack/pack-$HASH3.pack &&
		test_path_is_missing .git/objects/pack/pack-$HASH4.pack
       )
'

I believe this checks your condition properly enough.

Thanks,
-Stolee
Taylor Blau Aug. 25, 2020, 3:42 p.m. UTC | #9
On Tue, Aug 25, 2020 at 11:14:41AM -0400, Derrick Stolee wrote:
> On 8/25/2020 10:41 AM, Taylor Blau wrote:
> > On Tue, Aug 25, 2020 at 09:14:19AM -0400, Derrick Stolee wrote:
> >> The code in builtin/repack.c looks good for sure. I have a quick question
> >> about this new test:
> >>
> >> +test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
> >> +	git multi-pack-index write &&
> >> +	cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
> >> +	test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
> >> +
> >> +	# Write a new pack that is unknown to the multi-pack-index.
> >> +	git hash-object -w </dev/null >blob &&
> >> +	git pack-objects $objdir/pack/pack <blob &&
> >> +
> >> +	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
> >> +	test_cmp_bin $objdir/pack/multi-pack-index \
> >> +		$objdir/pack/multi-pack-index.bak
> >> +'
> >> +
> >>
> >> You create an arbitrary blob, and then add it to a pack-file. Do we
> >> know that 'git repack' is definitely creating a new pack-file that makes
> >> our manually-created pack-file redundant?
> >>
> >> My suggestion is to have the test check itself:
> >>
> >> +test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
> >> +	git multi-pack-index write &&
> >> +	cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
> >> +	test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
> >> +
> >> +	# Write a new pack that is unknown to the multi-pack-index.
> >> +	git hash-object -w </dev/null >blob &&
> >> +	HASH=$(git pack-objects $objdir/pack/pack <blob) &&
> >> +
> >> +	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
> >> +	test_cmp_bin $objdir/pack/multi-pack-index \
> >> +		$objdir/pack/multi-pack-index.bak &&
> >> +	test_path_is_missing $objdir/pack/pack-$HASH.pack
> >> +'
> >> +
> >>
> >> This test fails for me, on the 'test_path_is_missing'. Likely, the
> >> blob is seen as already in a pack-file so is just pruned by 'git repack'
> >> instead. I thought that perhaps we need to add a new pack ourselves that
> >> overrides the small pack. Here is my attempt:
> >>
> >> test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
> >> 	git multi-pack-index write &&
> >> 	cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
> >> 	test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
> >>
> >> 	# Write a new pack that is unknown to the multi-pack-index.
> >> 	BLOB1=$(echo blob1 | git hash-object -w --stdin) &&
> >> 	BLOB2=$(echo blob2 | git hash-object -w --stdin) &&
> >> 	cat >blobs <<-EOF &&
> >> 	$BLOB1
> >> 	$BLOB2
> >> 	EOF
> >> 	HASH1=$(echo $BLOB1 | git pack-objects $objdir/pack/pack) &&
> >> 	HASH2=$(git pack-objects $objdir/pack/pack <blobs) &&
> >> 	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
> >> 	test_cmp_bin $objdir/pack/multi-pack-index \
> >> 		$objdir/pack/multi-pack-index.bak &&
> >> 	test_path_is_file $objdir/pack/pack-$HASH2.pack &&
> >> 	test_path_is_missing $objdir/pack/pack-$HASH1.pack
> >> '
> >>
> >> However, this _still_ fails on the "test_path_is_missing" line, so I'm not sure
> >> how to make sure your logic is tested. I saw that 'git repack' was writing
> >> "nothing new to pack" in the output, so I also tested adding a few commits and
> >> trying to force it to repack reachable data, but I cannot seem to trigger it
> >> to create a new pack that overrides only one pack that is not in the MIDX.
> >>
> >> Likely, I just don't know how 'git rebase' works well enough to trigger this
> >> behavior. But the test as-is is not testing what you want it to test.
> >
> > I think this case might actually be impossible to tickle in a test. I
> > thought that 'git repack -d' looked for existing packs whose objects are
> > a subset of some new pack generated. But, it's much simpler than that:
> > '-d' by itself just looks for packs that were already on disk with the
> > same SHA-1 as a new pack, and it removes the old one.
>
> If 'git repack' never calls remove_redundant_pack() unless we are doing
> a "full repack", then we could simplify this logic:
>
>  static void remove_redundant_pack(const char *dir_name, const char *base_name)
>  {
>  	struct strbuf buf = STRBUF_INIT;
> -	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
> +	struct multi_pack_index *m = get_multi_pack_index(the_repository);
> +	strbuf_addf(&buf, "%s.pack", base_name);
> +	if (m && midx_contains_pack(m, buf.buf))
> +		clear_midx_file(the_repository);
> +	strbuf_insertf(&buf, 0, "%s/", dir_name);
>  	unlink_pack_path(buf.buf, 1);
>  	strbuf_release(&buf);
>  }
>
> to
>
>  static void remove_redundant_pack(const char *dir_name, const char *base_name)
>  {
>  	struct strbuf buf = STRBUF_INIT;
> 	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
> +	clear_midx_file(the_repository);
>  	unlink_pack_path(buf.buf, 1);
>  	strbuf_release(&buf);
>  }
>
> and get the same results as we are showing in these tests. This does
> move us incrementally to a better situation: don't delete the MIDX
> if we aren't deleting pack files. But, I think we can get around it.

Makes sense, but reading your whole email we are better off leaving this
as-is and changing the test to exercise it more often.

> > Note that 'git repack' uses 'git pack-objects' internally to find
> > objects and generate a packfile. When calling 'git pack-objects', 'git
> > repack -d' passes '--all' and '--unpacked', which means that there is no
> > way we'd generate a new pack with the same SHA-1 as an existing pack
> > ordinarily.
> >
> > So, I think this case is impossible, or at least astronomically
> > unlikely. What is more interesting (and untested) is that adding a _new_
> > pack doesn't cause us to invalidate the MIDX. Here's a patch that does
> > that:
> >
> >   diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> >   index 16a1ad040e..620f2058d6 100755
> >   --- a/t/t5319-multi-pack-index.sh
> >   +++ b/t/t5319-multi-pack-index.sh
> >   @@ -391,18 +391,27 @@ test_expect_success 'repack removes multi-pack-index when deleting packs' '
> >           test_path_is_missing $objdir/pack/multi-pack-index
> >    '
> >
> >   -test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
> >   -       git multi-pack-index write &&
> >   -       cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
> >   -       test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
> >   -
> >   -       # Write a new pack that is unknown to the multi-pack-index.
> >   -       git hash-object -w </dev/null >blob &&
> >   -       git pack-objects $objdir/pack/pack <blob &&
> >   -
> >   -       GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
> >   -       test_cmp_bin $objdir/pack/multi-pack-index \
> >   -               $objdir/pack/multi-pack-index.bak
> >   +test_expect_success 'repack preserves multi-pack-index when creating packs' '
> >   +       git init preserve &&
> >   +       test_when_finished "rm -fr preserve" &&
> >   +       (
> >   +               cd preserve &&
> >   +               midx=.git/objects/pack/multi-pack-index &&
> >   +
> >   +               test_commit "initial" &&
> >   +               git repack -ad &&
> >   +               git multi-pack-index write &&
> >   +               ls .git/objects/pack | grep "\.pack$" >before &&
> >   +
> >   +               cp $midx $midx.bak &&
> >   +
> >   +               test_commit "another" &&
> >   +               GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
> >   +               ls .git/objects/pack | grep "\.pack$" >after &&
> >   +
> >   +               test_cmp_bin $midx.bak $midx &&
> >   +               ! test_cmp before after
> >   +       )
> >    '
>
> After looking at the callers to remote_redundant_pack() I noticed that it is only
> called after inspecting the "names" struct, which contains the names of the packs
> to group into a new pack-file. We can use a .keep file to preserve the pack-file(s) in
> the MIDX but also to ensure multiple pack-files outside of the MIDX are repacked and
> deleted. While this is very unlikely in the wild, it is definitely possible.

Great idea.

> test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
> 	git init preserve &&
> 	test_when_finished "rm -fr preserve" &&
> 	(
> 		cd preserve &&
> 		midx=.git/objects/pack/multi-pack-index &&
>
> 		test_commit 1 &&
> 		HASH1=$(git pack-objects --all .git/objects/pack/pack) &&
> 		touch .git/objects/pack/pack-$HASH1.keep &&
>
> 		cat >pack-input <<-\EOF &&

Escaping the heredoc shouldn't be necessary, so this can be written
instead as '<<-EOF'.

> 		HEAD
> 		^HEAD~1
> 		EOF
>
> 		test_commit 2 &&
> 		HASH2=$(git pack-objects --revs .git/objects/pack/pack <pack-input) &&
> 		touch .git/objects/pack/pack-$HASH2.keep &&
>
> 		git multi-pack-index write &&
> 		cp $midx $midx.bak &&
>
> 		test_commit 3 &&
> 		HASH3=$(git pack-objects --revs .git/objects/pack/pack <pack-input) &&
>
> 		test_commit 4 &&
> 		HASH4=$(git pack-objects --revs .git/objects/pack/pack <pack-input) &&
>
> 		GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -ad &&
> 		test_path_is_file .git/objects/pack/pack-$HASH1.pack &&
> 		test_path_is_file .git/objects/pack/pack-$HASH2.pack &&
> 		test_path_is_missing .git/objects/pack/pack-$HASH3.pack &&
> 		test_path_is_missing .git/objects/pack/pack-$HASH4.pack

...and we should check that 'test_cmp $midx.bak $midx' is clean, i.e.,
that we didn't touch the MIDX.

>        )
> '
>
> I believe this checks your condition properly enough.

Otherwise, I think that this test looks great. Thanks for suggesting
it. I'll send a new patch now...

> Thanks,
> -Stolee

Thanks,
Taylor
Junio C Hamano Aug. 25, 2020, 3:58 p.m. UTC | #10
Jeff King <peff@peff.net> writes:

> Does "git repack" ever remove just one pack? Obviously "git repack -ad"
> or "git repack -Ad" is going to pack everything and delete the old
> packs. So I think we'd want to remove a midx there.

AFAIK, the pack-redundant subcommand is used by nobody in-tree, so
nobody is doing "there are three packfiles, but all the objects in
one of them are contained in the other two, so let's remove that
redundant one".

> And "git repack -d" I think of as deleting only loose objects that we
> just packed. But I guess it could also remove a pack that has now been
> made redundant? That seems like a rare case in practice, but I suppose
> is possible.

Meaning it can become reality?  Yes.  Or it already can happen?  I
doubt it.

> E.g., imagine we have a midx that points to packs A and B, and
> git-repack deletes B. By your logic above, we need to remove the midx
> because now it points to objects in B which aren't accessible. But by
> deleting it, could we be deleting the only thing that mentions the
> objects in A?
>
> I _think_ the answer is "no", because we never went all-in on midx and
> allowed deleting the matching .idx files for contained packs.

Yeah, it has been my assumption that that part of the design would
never change.

> I'm also a little curious how bad it is to have a midx whose pack has
> gone away. I guess we'd answer queries for "yes, we have this object"
> even if we don't, which is bad. Though in practice we'd only delete
> those packs if we have their objects elsewhere.

Hmph, object O used to be in pack A and pack B, midx points at the
copy in pack B but we deleted it because the pack is redundant.
Now, midx says O exists, but midx cannot be used to retrieve O.  We
need to ask A.idx about O to locate it.

That sounds brittle.  I am not sure our error fallback is all that
patient.


> And the pack code is
> pretty good about retrying other copies of objects that can't be
> accessed. Alternatively, I wonder if the midx-loading code ought to
> check that all of the constituent packs are available.
>
> In that line of thinking, do we even need to delete midx files if one of
> their packs goes away? The reading side probably ought to be able to
> handle that gracefully.

But at that point, is there even a point to have that midx file that
knows about objects (so it can answer has_object()? correctly and
quickly) but does not know the correct location of half of the objects?
Instead of going directly to A.idx to locate O, we need to go to midx
to learn the location of O in B (which no longer exists), and then
fall back to it, that is a pure overhead.

> And the more interesting case is when you repack everything with "-ad"
> or similar, at which point you shouldn't even need to look up what's in
> the midx to see if you deleted its packs. The point of your operation is
> to put it all-into-one, so you know the old midx should be discarded.

Old one, yes.  Do we need to have the new one in that case?

>> Teach 'git repack' to check for this by loading the MIDX, and checking
>> whether the to-be-removed pack is known to the MIDX. This requires a
>> slightly odd alternation to a test in t5319, which is explained with a
>> comment.
>
> My above musings aside, this seems like an obvious improvement.

Yes.

>> diff --git a/builtin/repack.c b/builtin/repack.c
>> index 04c5ceaf7e..98fac03946 100644
>> --- a/builtin/repack.c
>> +++ b/builtin/repack.c
>> @@ -133,7 +133,11 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list,
>>  static void remove_redundant_pack(const char *dir_name, const char *base_name)
>>  {
>>  	struct strbuf buf = STRBUF_INIT;
>> -	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
>> +	struct multi_pack_index *m = get_multi_pack_index(the_repository);
>> +	strbuf_addf(&buf, "%s.pack", base_name);
>> +	if (m && midx_contains_pack(m, buf.buf))
>> +		clear_midx_file(the_repository);
>> +	strbuf_insertf(&buf, 0, "%s/", dir_name);
>
> Makes sense. midx_contains_pack() is a binary search, so we'll spend
> O(n log n) effort deleting the packs (I wondered if this might be
> accidentally quadratic over the number of packs).
>
> And after we clear, "m" will be NULL, so we'll do it at most once. Which
> is why you can get rid of the manual "midx_cleared" flag from the
> preimage.
>
> So the patch looks good to me.

Thanks.
Taylor Blau Aug. 25, 2020, 4:08 p.m. UTC | #11
On Tue, Aug 25, 2020 at 08:58:46AM -0700, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
>
> > Does "git repack" ever remove just one pack? Obviously "git repack -ad"
> > or "git repack -Ad" is going to pack everything and delete the old
> > packs. So I think we'd want to remove a midx there.
>
> AFAIK, the pack-redundant subcommand is used by nobody in-tree, so
> nobody is doing "there are three packfiles, but all the objects in
> one of them are contained in the other two, so let's remove that
> redundant one".

Not AFAICT either.

> > And "git repack -d" I think of as deleting only loose objects that we
> > just packed. But I guess it could also remove a pack that has now been
> > made redundant? That seems like a rare case in practice, but I suppose
> > is possible.
>
> Meaning it can become reality?  Yes.  Or it already can happen?  I
> doubt it.
>
> > E.g., imagine we have a midx that points to packs A and B, and
> > git-repack deletes B. By your logic above, we need to remove the midx
> > because now it points to objects in B which aren't accessible. But by
> > deleting it, could we be deleting the only thing that mentions the
> > objects in A?
> >
> > I _think_ the answer is "no", because we never went all-in on midx and
> > allowed deleting the matching .idx files for contained packs.
>
> Yeah, it has been my assumption that that part of the design would
> never change.
>
> > I'm also a little curious how bad it is to have a midx whose pack has
> > gone away. I guess we'd answer queries for "yes, we have this object"
> > even if we don't, which is bad. Though in practice we'd only delete
> > those packs if we have their objects elsewhere.
>
> Hmph, object O used to be in pack A and pack B, midx points at the
> copy in pack B but we deleted it because the pack is redundant.
> Now, midx says O exists, but midx cannot be used to retrieve O.  We
> need to ask A.idx about O to locate it.
>
> That sounds brittle.  I am not sure our error fallback is all that
> patient.

Me either. Like I said, I think that all of this is possible at least in
theory, but in practice it may be somewhat annoying in cases like these.

>
> > And the pack code is
> > pretty good about retrying other copies of objects that can't be
> > accessed. Alternatively, I wonder if the midx-loading code ought to
> > check that all of the constituent packs are available.
> >
> > In that line of thinking, do we even need to delete midx files if one of
> > their packs goes away? The reading side probably ought to be able to
> > handle that gracefully.
>
> But at that point, is there even a point to have that midx file that
> knows about objects (so it can answer has_object()? correctly and
> quickly) but does not know the correct location of half of the objects?
> Instead of going directly to A.idx to locate O, we need to go to midx
> to learn the location of O in B (which no longer exists), and then
> fall back to it, that is a pure overhead.

Well put.

> > And the more interesting case is when you repack everything with "-ad"
> > or similar, at which point you shouldn't even need to look up what's in
> > the midx to see if you deleted its packs. The point of your operation is
> > to put it all-into-one, so you know the old midx should be discarded.
>
> Old one, yes.  Do we need to have the new one in that case?
>
> >> Teach 'git repack' to check for this by loading the MIDX, and checking
> >> whether the to-be-removed pack is known to the MIDX. This requires a
> >> slightly odd alternation to a test in t5319, which is explained with a
> >> comment.
> >
> > My above musings aside, this seems like an obvious improvement.
>
> Yes.
>
> >> diff --git a/builtin/repack.c b/builtin/repack.c
> >> index 04c5ceaf7e..98fac03946 100644
> >> --- a/builtin/repack.c
> >> +++ b/builtin/repack.c
> >> @@ -133,7 +133,11 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list,
> >>  static void remove_redundant_pack(const char *dir_name, const char *base_name)
> >>  {
> >>  	struct strbuf buf = STRBUF_INIT;
> >> -	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
> >> +	struct multi_pack_index *m = get_multi_pack_index(the_repository);
> >> +	strbuf_addf(&buf, "%s.pack", base_name);
> >> +	if (m && midx_contains_pack(m, buf.buf))
> >> +		clear_midx_file(the_repository);
> >> +	strbuf_insertf(&buf, 0, "%s/", dir_name);
> >
> > Makes sense. midx_contains_pack() is a binary search, so we'll spend
> > O(n log n) effort deleting the packs (I wondered if this might be
> > accidentally quadratic over the number of packs).
> >
> > And after we clear, "m" will be NULL, so we'll do it at most once. Which
> > is why you can get rid of the manual "midx_cleared" flag from the
> > preimage.
> >
> > So the patch looks good to me.
>
> Thanks.

Thanks, both. If you're ready, let's use the version in [1] for
queueing.

Thanks,
Taylor

[1]: https://lore.kernel.org/git/87a3b7a5a2f091e2a23c163a7d86effbbbedfa3a.1598371475.git.me@ttaylorr.com/
Derrick Stolee Aug. 25, 2020, 4:18 p.m. UTC | #12
On 8/25/2020 11:58 AM, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
> 
>> Does "git repack" ever remove just one pack? Obviously "git repack -ad"
>> or "git repack -Ad" is going to pack everything and delete the old
>> packs. So I think we'd want to remove a midx there.
> 
> AFAIK, the pack-redundant subcommand is used by nobody in-tree, so
> nobody is doing "there are three packfiles, but all the objects in
> one of them are contained in the other two, so let's remove that
> redundant one".
> 
>> And "git repack -d" I think of as deleting only loose objects that we
>> just packed. But I guess it could also remove a pack that has now been
>> made redundant? That seems like a rare case in practice, but I suppose
>> is possible.
> 
> Meaning it can become reality?  Yes.  Or it already can happen?  I
> doubt it.
> 
>> E.g., imagine we have a midx that points to packs A and B, and
>> git-repack deletes B. By your logic above, we need to remove the midx
>> because now it points to objects in B which aren't accessible. But by
>> deleting it, could we be deleting the only thing that mentions the
>> objects in A?
>>
>> I _think_ the answer is "no", because we never went all-in on midx and
>> allowed deleting the matching .idx files for contained packs.
> 
> Yeah, it has been my assumption that that part of the design would
> never change.

Yes, we should intend to keep the .idx files around forever even when
using the multi-pack-index. The duplicated data overhead is not typically
a real problem.

The one caveat I would consider is if we wanted to let a multi-pack-index
cover thin packs, but that would be a substantial change to the object
database that I do not plan to tackle any time soon.

>> I'm also a little curious how bad it is to have a midx whose pack has
>> gone away. I guess we'd answer queries for "yes, we have this object"
>> even if we don't, which is bad. Though in practice we'd only delete
>> those packs if we have their objects elsewhere.
> 
> Hmph, object O used to be in pack A and pack B, midx points at the
> copy in pack B but we deleted it because the pack is redundant.
> Now, midx says O exists, but midx cannot be used to retrieve O.  We
> need to ask A.idx about O to locate it.
> 
> That sounds brittle.  I am not sure our error fallback is all that
> patient.

The best I can see is that prepare_midx_pack() will return 1 if the
pack no longer exists, and this would cause a die("error preparing
packfile from multi-pack-index") in nth_midxed_pack_entry(), killing
the following stack trace:

+ find_pack_entry():packfile.c
 + fill_midx_entry():midx.c
  + nth_midxed_pack_entry():midx.c

Perhaps that die() is a bit over-zealous since we could return 1
through this stack and find_pack_entry() could continue searching
for the object in the packed_git list. However, it could start
returning false _negatives_ if there were duplicates of the object
in the multi-pack-index but only the latest copy was deleted (and
the object does not appear in a pack-file outside of the multi-
pack-index).

Thanks,
-Stolee
Jeff King Aug. 25, 2020, 4:47 p.m. UTC | #13
On Tue, Aug 25, 2020 at 10:45:15AM -0400, Taylor Blau wrote:

> > I wonder if its worth to trigger write_midx_file() to update the midx instead of
> > just removing MIDX?
> 
> There's no reason that we couldn't do this, but I don't think that it's
> a very good idea, especially if the new 'git maintenance' command will
> be able to do something like this by itself.
> 
> I'm hesitant to add yet another option to 'git repack', which I have
> always thought as a plumbing tool. That's important because callers
> (like 'git maintenance' or user scripts) can 'git multi-pack-index write
> ...' after their 'git repack' to generate a new MIDX if they want one.

It may be worth thinking a bit about atomicity here, though. Rather than
separate delete and write steps, would somebody want a sequence like:

  - we have a midx with packs A, B, C

  - we find out that pack C is redundant and want to drop it

  - we create a new midx with A and B in a tempfile

  - we atomically rename the new midx over the old

  - we delete pack C

A simultaneous reader always sees a consistent midx. Whereas in a
delete-then-rewrite scenario there is a moment where there is no midx,
and they'd fall back to reading the individual idx files.

It may not matter all that much, though, for two reasons:

  - reading individual idx files should still give a correct answer.
    It just may be a bit slower.

  - even with an atomic replacement, I think caching on the reader side
    may cause interesting effects. For instance, if a reader process
    opens the old midx, it will generally cache that knowledge in memory
    (including mmap the content). So even after the replacement and
    deletion of C, it may still try to use the old midx that references
    C.

If we do care, though, that implies some cooperation between the
deletion process and the midx code. Perhaps it even argues that repack
should refuse to delete such a single pack at all, since it _isn't_
redundant. It's part of a midx, and the caller should rewrite the midx
first itself, and _then_ look for redundant packs.

-Peff
Jeff King Aug. 25, 2020, 4:56 p.m. UTC | #14
On Tue, Aug 25, 2020 at 11:42:24AM -0400, Taylor Blau wrote:

> > test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
> > 	git init preserve &&
> > 	test_when_finished "rm -fr preserve" &&
> > 	(
> > 		cd preserve &&
> > 		midx=.git/objects/pack/multi-pack-index &&
> >
> > 		test_commit 1 &&
> > 		HASH1=$(git pack-objects --all .git/objects/pack/pack) &&
> > 		touch .git/objects/pack/pack-$HASH1.keep &&
> >
> > 		cat >pack-input <<-\EOF &&
> 
> Escaping the heredoc shouldn't be necessary, so this can be written
> instead as '<<-EOF'.

We usually go the opposite way: if a here-doc doesn't need
interpolation, then we prefer to mark it as such to avoid surprising
somebody who adds lines later (and might need quoting).

Certainly you can argue that least-surprise would be in the other
direction (i.e., people expect to interpolate by default, and you
surprise anybody adding a variable reference), but for Git's test suite,
the convention usually runs the other way.

-Peff
Derrick Stolee Aug. 25, 2020, 5:10 p.m. UTC | #15
On 8/25/2020 12:47 PM, Jeff King wrote:
> It may be worth thinking a bit about atomicity here, though. Rather than
> separate delete and write steps, would somebody want a sequence like:

...
> If we do care, though, that implies some cooperation between the
> deletion process and the midx code. Perhaps it even argues that repack
> should refuse to delete such a single pack at all, since it _isn't_
> redundant. It's part of a midx, and the caller should rewrite the midx
> first itself, and _then_ look for redundant packs.

It is worth noting that we _do_ have a way to integrate the delete and
write code using 'git multi-pack-index expire'. One way to resolve this
atomicity would be to do the following inside the repack command:

 1. Create and index the new pack.
 2. git multi-pack-index write
 3. git multi-pack-index expire

While this _mostly_ works, we still have issues around a concurrent
process opening a multi-pack-index before step (2) and trying to
read from the pack-file deleted in step (3). For this reason, the
'incremental-repack' task in the maintenance builtin series [1] runs
'expire' then 'repack' (and not the opposite order). To be completely
safe, we'd want to make sure that no Git command that started before
the 'repack' command is still running when we run 'expire'. In practice,
it suffices to run the 'incremental-repack' step at most once a day.

[1] https://lore.kernel.org/git/a8d956dad6b3a81d0f1b63cbd48f36a5e1195ae8.1597760730.git.gitgitgadget@gmail.com/

This discussion points out a big reason why the multi-pack-index
wasn't fully integrated with 'git repack': it can be tricky. Using
a repacking strategy that is focused on the multi-pack-index is
less tricky.

I still think that this patch is moving us in a better direction.

I'm making note of these possible improvements:

1. Have 'git repack' integrate with 'git multi-pack-index expire'
   when deleting pack-files after creating a new one (if a
   multi-pack-index exists).

2. Handle "missing packs" referenced by the multi-pack-index more
   gracefully, likely by disabling the multi-pack-index and
   calling reprepare_packed_git().

Thanks,
-Stolee
Jeff King Aug. 25, 2020, 5:22 p.m. UTC | #16
On Tue, Aug 25, 2020 at 08:58:46AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Does "git repack" ever remove just one pack? Obviously "git repack -ad"
> > or "git repack -Ad" is going to pack everything and delete the old
> > packs. So I think we'd want to remove a midx there.
> 
> AFAIK, the pack-redundant subcommand is used by nobody in-tree, so
> nobody is doing "there are three packfiles, but all the objects in
> one of them are contained in the other two, so let's remove that
> redundant one".

OK, that's the part I was missing. The discussion here and the statement
from git-repack(1):

  -d
      After packing, if the newly created packs make some existing packs
      redundant, remove the redundant packs. Also run git prune-packed
      to remove redundant loose object files.

made me think that it was running pack-redundant. But it doesn't seem
to. It looks like we stopped doing so in 6ed64058e1 (git-repack: do not
do complex redundancy check., 2005-11-19).

As an aside, we tried using pack-redundant at GitHub several years ago
for dropping packs that were replicated in alternates storage. It
performs very poorly (quadratically, perhaps?) to the point that we
found it unusable, and I implemented a "local-redundant" command to do
the same thing more quickly. We ended up dropping that as we now migrate
packs wholesale (rather than via fetch), so I never upstreamed it.

I mention that here to warn people away from pack-redundant (I wondered
at the time why nobody had noticed its terrible performance, but now I
think the answer is just that nobody ever runs it). I wonder if we
should even consider deprecating and removing it.

I'm also happy to share local-redundant if anybody is interested (though
as I've mentioned elsewhere, I think the larger scheme it was part of it
also performed poorly and isn't worth pursuing).

> > And "git repack -d" I think of as deleting only loose objects that we
> > just packed. But I guess it could also remove a pack that has now been
> > made redundant? That seems like a rare case in practice, but I suppose
> > is possible.
> 
> Meaning it can become reality?  Yes.  Or it already can happen?  I
> doubt it.

I thought the latter, but after this thread I agree that it can't.

> > I'm also a little curious how bad it is to have a midx whose pack has
> > gone away. I guess we'd answer queries for "yes, we have this object"
> > even if we don't, which is bad. Though in practice we'd only delete
> > those packs if we have their objects elsewhere.
> 
> Hmph, object O used to be in pack A and pack B, midx points at the
> copy in pack B but we deleted it because the pack is redundant.
> Now, midx says O exists, but midx cannot be used to retrieve O.  We
> need to ask A.idx about O to locate it.
> 
> That sounds brittle.  I am not sure our error fallback is all that
> patient.

It has to be that patient even without a midx, I think. We might open
A.idx and mmap its contents, without opening the matching A.pack (or we
might open it and later close it due to memory or descriptor
constraints). And we have two layers there:

  - when object_info_extended sees a bad return from
    packed_object_info(), it will mark the entry as bad and recurse,
    giving us an opportunity to find another one

  - when we can't find the object at all (e.g., because we marked that
    particular pack's version as inaccessible), we'll hit the
    reprepare_packed_git() path and look for a new idx

Those same things should be kicking in for midx lookups, too (I didn't
test them, but from a cursory look at the organization of the code, I
think they will).

> > In that line of thinking, do we even need to delete midx files if one of
> > their packs goes away? The reading side probably ought to be able to
> > handle that gracefully.
> 
> But at that point, is there even a point to have that midx file that
> knows about objects (so it can answer has_object()? correctly and
> quickly) but does not know the correct location of half of the objects?
> Instead of going directly to A.idx to locate O, we need to go to midx
> to learn the location of O in B (which no longer exists), and then
> fall back to it, that is a pure overhead.

It is overhead for sure. But my thinking was that you're trading one
efficiency for another:

  - the midx may incur an extra lookup for objects in the redundant
    pack, but that pack may be a small portion of what the midx indexes
    (and this is likely in the usual case that you have one big
    cumulative pack and many recently-created smaller ones; the big one
    will never be made redundant and dominates the object count of the
    midx)

  - by keeping the midx, you're improving lookup speed for all of the
    other objects, which don't have to hunt through separate pack idx
    files

So my guess is that it would usually be a win. Though really the correct
answer is: if you are mucking with packs you should just generate a new
midx (or you should pack all-into-one, which generates a single idx
accomplishing the same thing).

> > And the more interesting case is when you repack everything with "-ad"
> > or similar, at which point you shouldn't even need to look up what's in
> > the midx to see if you deleted its packs. The point of your operation is
> > to put it all-into-one, so you know the old midx should be discarded.
> 
> Old one, yes.  Do we need to have the new one in that case?

You shouldn't need one since you'd only have a single pack now.

-Peff
Jeff King Aug. 25, 2020, 5:29 p.m. UTC | #17
On Tue, Aug 25, 2020 at 01:10:13PM -0400, Derrick Stolee wrote:

> > If we do care, though, that implies some cooperation between the
> > deletion process and the midx code. Perhaps it even argues that repack
> > should refuse to delete such a single pack at all, since it _isn't_
> > redundant. It's part of a midx, and the caller should rewrite the midx
> > first itself, and _then_ look for redundant packs.
> 
> It is worth noting that we _do_ have a way to integrate the delete and
> write code using 'git multi-pack-index expire'. One way to resolve this
> atomicity would be to do the following inside the repack command:
> 
>  1. Create and index the new pack.
>  2. git multi-pack-index write
>  3. git multi-pack-index expire

Given that discussion elsewhere points to git-repack only really
deleting packs in all-in-one mode (and not ever a single pack), it seems
like we can really be much simpler here. If we're not deleting packs via
all-in-one, there's no need to touch the midx at all. If we are, then
it's reasonable to delete the midx immediately (after having written our
new pack but before deleting) since our new single pack idx is as good
or better.

I.e., drop step 2 above, and make step 3 just clear_midx_file(). Which
is roughly what the code does now, isn't it? Or is there some reason
that "expire" is more interesting than just clearing?

And if anybody does want to drop single packs, etc, they can do so by
generating a sensible midx separately from the repack operation (and
probably doing so before dropping the packs for atomicity).

-Peff
Taylor Blau Aug. 25, 2020, 5:34 p.m. UTC | #18
On Tue, Aug 25, 2020 at 01:29:01PM -0400, Jeff King wrote:
> On Tue, Aug 25, 2020 at 01:10:13PM -0400, Derrick Stolee wrote:
>
> > > If we do care, though, that implies some cooperation between the
> > > deletion process and the midx code. Perhaps it even argues that repack
> > > should refuse to delete such a single pack at all, since it _isn't_
> > > redundant. It's part of a midx, and the caller should rewrite the midx
> > > first itself, and _then_ look for redundant packs.
> >
> > It is worth noting that we _do_ have a way to integrate the delete and
> > write code using 'git multi-pack-index expire'. One way to resolve this
> > atomicity would be to do the following inside the repack command:
> >
> >  1. Create and index the new pack.
> >  2. git multi-pack-index write
> >  3. git multi-pack-index expire
>
> Given that discussion elsewhere points to git-repack only really
> deleting packs in all-in-one mode (and not ever a single pack), it seems
> like we can really be much simpler here. If we're not deleting packs via
> all-in-one, there's no need to touch the midx at all. If we are, then
> it's reasonable to delete the midx immediately (after having written our
> new pack but before deleting) since our new single pack idx is as good
> or better.
>
> I.e., drop step 2 above, and make step 3 just clear_midx_file(). Which
> is roughly what the code does now, isn't it? Or is there some reason
> that "expire" is more interesting than just clearing?

It's not clear to me whether you're talking about my patch, or what a
more full integration with 'git repack' looks like.

If you are talking about my patch, I disagree: checking that the MIDX
doesn't know about a pack we're dropping *is* useful even without
all-in-one, because of '.keep' packs (as demonstrated by the new test in
my patch).

To me, this patch seems like an incremental step in the direction that
we ultimately want to be going, but it's hard to untangle whether the
ensuing discussion is targeted at my patch, or the ultimate goal.

> And if anybody does want to drop single packs, etc, they can do so by
> generating a sensible midx separately from the repack operation (and
> probably doing so before dropping the packs for atomicity).
>
> -Peff

Thanks,
Taylor
Jeff King Aug. 25, 2020, 5:34 p.m. UTC | #19
On Tue, Aug 25, 2020 at 12:18:42PM -0400, Derrick Stolee wrote:

> The best I can see is that prepare_midx_pack() will return 1 if the
> pack no longer exists, and this would cause a die("error preparing
> packfile from multi-pack-index") in nth_midxed_pack_entry(), killing
> the following stack trace:
> 
> + find_pack_entry():packfile.c
>  + fill_midx_entry():midx.c
>   + nth_midxed_pack_entry():midx.c
> 
> Perhaps that die() is a bit over-zealous since we could return 1
> through this stack and find_pack_entry() could continue searching
> for the object in the packed_git list. However, it could start
> returning false _negatives_ if there were duplicates of the object
> in the multi-pack-index but only the latest copy was deleted (and
> the object does not appear in a pack-file outside of the multi-
> pack-index).

Hmm, yeah.

I thought this code is already doing the right thing, because I'd expect
the is_pack_valid() call later in nth_midxed_pack_entry() to be where we
notice the problem. But add_packed_git() does stat the packfile and
return an error.

So that die() really ought to be just "return 0". The caller already has
to (and does) handle similar errors (including that the pack went away
after we added it to the packed_git list, or that it exists but has
bogus data, etc).

-Peff
Jeff King Aug. 25, 2020, 5:42 p.m. UTC | #20
On Tue, Aug 25, 2020 at 01:34:25PM -0400, Taylor Blau wrote:

> > I.e., drop step 2 above, and make step 3 just clear_midx_file(). Which
> > is roughly what the code does now, isn't it? Or is there some reason
> > that "expire" is more interesting than just clearing?
> 
> It's not clear to me whether you're talking about my patch, or what a
> more full integration with 'git repack' looks like.
> 
> If you are talking about my patch, I disagree: checking that the MIDX
> doesn't know about a pack we're dropping *is* useful even without
> all-in-one, because of '.keep' packs (as demonstrated by the new test in
> my patch).

I hadn't considered .keep. So all-in-one may still involve selectively
deleting some packs. It makes sense, then, for repack to consider
whether the midx is actually redundant or not rather than just always
clearing it (i.e., what your patch does).

In general I consider .keep packs kind of an awful feature that
introduces a lot of confusion and works against other features like
bitmaps. But I guess that they're the only thing that allows you to have
a gigantic Windows-like repo where you never fully pack it, but just
keep acquiring a string of big packs. Which is the exact case that the
midx is most useful for. So they're definitely worth considering and
supporting here.

> To me, this patch seems like an incremental step in the direction that
> we ultimately want to be going, but it's hard to untangle whether the
> ensuing discussion is targeted at my patch, or the ultimate goal.

I wasn't sure of the answer to that until we untangled more. I.e., it
wasn't clear to me if your incremental step was even in the right
direction if we weren't in fact ever selectively deleting packs in
git-repack. (And now it sounds like we aren't via "git repack -d", which
was my original question, but we are via .keep).

-Peff
Junio C Hamano Aug. 25, 2020, 6:05 p.m. UTC | #21
Jeff King <peff@peff.net> writes:

> OK, that's the part I was missing. The discussion here and the statement
> from git-repack(1):
>
>   -d
>       After packing, if the newly created packs make some existing packs
>       redundant, remove the redundant packs. Also run git prune-packed
>       to remove redundant loose object files.
>
> made me think that it was running pack-redundant. But it doesn't seem
> to. It looks like we stopped doing so in 6ed64058e1 (git-repack: do not
> do complex redundancy check., 2005-11-19).

Thanks for digging.  A good opportunity for a #leftoverbits
documentation update from new people is here.

> As an aside, we tried using pack-redundant at GitHub several years ago
> for dropping packs that were replicated in alternates storage. It
> performs very poorly (quadratically, perhaps?) to the point that we
> found it unusable,...

Yes, I originally wrote "the pack-redundant subcommand" in the
message you are responding to with a bit more colourful adjectives,
but rewrote it ;-)  My recollection from the last time I looked at
it is that it is quadratic or even worse---that was long time ago,
but on the other hand I think the subcommand had no significant
improvement over the course of its life.

Perhaps it is time to drop it.

-- >8 --
Subject: [RFC] pack-redundant: gauge the usage before proposing its removal

The subcommand is unusably slow and the reason why nobody reports it
as a performance bug is suspected to be the absense of users.  Let's
disable the normal use of command by making it error out with a big
message that asks the user to tell us that they still care about the
command, with an escape hatch to override it with a command line
option.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/pack-redundant.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/builtin/pack-redundant.c b/builtin/pack-redundant.c
index 178e3409b7..97cf3df79b 100644
--- a/builtin/pack-redundant.c
+++ b/builtin/pack-redundant.c
@@ -554,6 +554,7 @@ static void load_all(void)
 int cmd_pack_redundant(int argc, const char **argv, const char *prefix)
 {
 	int i;
+	int i_still_use_this = 0;
 	struct pack_list *min = NULL, *red, *pl;
 	struct llist *ignore;
 	struct object_id *oid;
@@ -580,12 +581,25 @@ int cmd_pack_redundant(int argc, const char **argv, const char *prefix)
 			alt_odb = 1;
 			continue;
 		}
+		if (!strcmp(arg, "--i-still-use-this")) {
+			i_still_use_this = 1;
+			continue;
+		}
 		if (*arg == '-')
 			usage(pack_redundant_usage);
 		else
 			break;
 	}
 
+	if (!i_still_use_this) {
+		puts(_("'git pack-redundant' is nominated for removal.\n"
+		       "If you still use this command, please add an extra\n"
+		       "option, '--i-still-use-this', on the command line\n"
+		       "and let us know you still use it by sending an e-mail\n"
+		       "to <git@vger.kernel.org>.  Thanks\n"));
+		exit(1);
+	}
+
 	if (load_all_packs)
 		load_all();
 	else
Jeff King Aug. 25, 2020, 6:27 p.m. UTC | #22
On Tue, Aug 25, 2020 at 11:05:09AM -0700, Junio C Hamano wrote:

> Subject: [RFC] pack-redundant: gauge the usage before proposing its removal
> 
> The subcommand is unusably slow and the reason why nobody reports it
> as a performance bug is suspected to be the absense of users.  Let's
> disable the normal use of command by making it error out with a big
> message that asks the user to tell us that they still care about the
> command, with an escape hatch to override it with a command line
> option.

I like this direction. And I'm tempted to say it's OK to go straight
here given how generally useless the command is. But it feels like a big
jump for the initial change. While we give the user an easy escape
hatch, it might still break their scripts.

A more gentle transition would I guess be:

  1. Mention deprecation in release notes (hah, as if anybody reads
     them).

  2. Issue a warning but continue to behave as normal. That might break
     scripts that care a lot about stderr, but otherwise is harmless. No
     clue if anybody would actually see the message or not.

  3. Flip warning to error, as your patch does.

  4. Removal.

I'd guess we could do 1+2 in one release, then 3 a release or two later,
and then finally 4. I dunno. That's more tedious and I'll be surprised
if we get any bites, but maybe we ought to do it out of principle.

-Peff
diff mbox series

Patch

diff --git a/builtin/repack.c b/builtin/repack.c
index 04c5ceaf7e..98fac03946 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -133,7 +133,11 @@  static void get_non_kept_pack_filenames(struct string_list *fname_list,
 static void remove_redundant_pack(const char *dir_name, const char *base_name)
 {
 	struct strbuf buf = STRBUF_INIT;
-	strbuf_addf(&buf, "%s/%s.pack", dir_name, base_name);
+	struct multi_pack_index *m = get_multi_pack_index(the_repository);
+	strbuf_addf(&buf, "%s.pack", base_name);
+	if (m && midx_contains_pack(m, buf.buf))
+		clear_midx_file(the_repository);
+	strbuf_insertf(&buf, 0, "%s/", dir_name);
 	unlink_pack_path(buf.buf, 1);
 	strbuf_release(&buf);
 }
@@ -286,7 +290,6 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 	int keep_unreachable = 0;
 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
 	int no_update_server_info = 0;
-	int midx_cleared = 0;
 	struct pack_objects_args po_args = {NULL};

 	struct option builtin_repack_options[] = {
@@ -439,11 +442,6 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 		for (ext = 0; ext < ARRAY_SIZE(exts); ext++) {
 			char *fname, *fname_old;

-			if (!midx_cleared) {
-				clear_midx_file(the_repository);
-				midx_cleared = 1;
-			}
-
 			fname = mkpathdup("%s/pack-%s%s", packdir,
 						item->string, exts[ext].name);
 			if (!file_exists(fname)) {
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 43b1b5b2af..16a1ad040e 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -382,12 +382,29 @@  test_expect_success 'repack with the --no-progress option' '
 	test_line_count = 0 err
 '

-test_expect_success 'repack removes multi-pack-index' '
+test_expect_success 'repack removes multi-pack-index when deleting packs' '
 	test_path_is_file $objdir/pack/multi-pack-index &&
-	GIT_TEST_MULTI_PACK_INDEX=0 git repack -adf &&
+	# Set GIT_TEST_MULTI_PACK_INDEX to 0 to avoid writing a new
+	# multi-pack-index after repacking, but set "core.multiPackIndex" to
+	# true so that "git repack" can read the existing MIDX.
+	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -adf &&
 	test_path_is_missing $objdir/pack/multi-pack-index
 '

+test_expect_success 'repack preserves multi-pack-index when deleting unknown packs' '
+	git multi-pack-index write &&
+	cp $objdir/pack/multi-pack-index $objdir/pack/multi-pack-index.bak &&
+	test_when_finished "rm -f $objdir/pack/multi-pack-index.bak" &&
+
+	# Write a new pack that is unknown to the multi-pack-index.
+	git hash-object -w </dev/null >blob &&
+	git pack-objects $objdir/pack/pack <blob &&
+
+	GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -d &&
+	test_cmp_bin $objdir/pack/multi-pack-index \
+		$objdir/pack/multi-pack-index.bak
+'
+
 compare_results_with_midx "after repack"

 test_expect_success 'multi-pack-index and pack-bitmap' '