t1006: add tests for %(objectsize:disk)

Message ID 20231221094722.GA570888@coredump.intra.peff.net
State Superseded
Series t1006: add tests for %(objectsize:disk)

Commit Message

Jeff King Dec. 21, 2023, 9:47 a.m. UTC
On Tue, Dec 19, 2023 at 05:42:39PM +0100, René Scharfe wrote:

> > This adds a packed-object function, but I doubt anybody actually calls
> > it. If we're going to do that, it's probably worth adding some tests for
> > "cat-file --batch-check" or similar.
> 
> Yes, and I was assuming that someone else would be eager to add such
> tests. *ahem*

:P OK, here it is. This can be its own topic, or go on top of the
rs/t6300-compressed-size-fix branch.

> > At which point I wonder if rather than having a function for a single
> > object, we are better off just testing the result of:
> >
> >   git cat-file --batch-all-objects --unordered --batch-check='%(objectsize:disk)'
> >
> > against a single post-processed "show-index" invocation.
> 
> Sure, we might want to optimize for bulk-processing and possibly end up
> only checking the size of single objects in t6300, making new library
> functions unnecessary.

So yeah, I think the approach here makes library functions unnecessary
(and I see you already asked Junio to drop your patch 2).

> When dumping size information of multiple objects, it's probably a good
> idea to include "%(objectname)" in the format as well.

Yep, definitely.

-- >8 --
Subject: [PATCH] t1006: add tests for %(objectsize:disk)

Back when we added this placeholder in a4ac106178 (cat-file: add
%(objectsize:disk) format atom, 2013-07-10), there were no tests,
claiming "[...]the exact numbers returned are volatile and subject to
zlib and packing decisions".

But we can use a little shell hackery to get the expected numbers
ourselves. To a certain degree this is just re-implementing what Git is
doing under the hood, but it is still worth doing. It makes sure we
exercise the %(objectsize:disk) code at all, and having the two
implementations agree gives us more confidence.

Note that our shell code assumes that no object appears twice (either in
two packs, or as both loose and packed), as then the results really are
undefined. That's OK for our purposes, and the test will notice if that
assumption is violated (the shell version would produce duplicate lines
that Git's output does not have).

Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Jeff King <peff@peff.net>
---
I stole a bit of your awk. You can tell because I'd have written it in
perl. ;)

 t/t1006-cat-file.sh | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

Comments

René Scharfe Dec. 21, 2023, 12:19 p.m. UTC | #1
Am 21.12.23 um 10:47 schrieb Jeff King:
> On Tue, Dec 19, 2023 at 05:42:39PM +0100, René Scharfe wrote:
>
>>> This adds a packed-object function, but I doubt anybody actually calls
>>> it. If we're going to do that, it's probably worth adding some tests for
>>> "cat-file --batch-check" or similar.
>>
>> Yes, and I was assuming that someone else would be eager to add such
>> tests. *ahem*
>
> :P OK, here it is. This can be its own topic, or go on top of the
> rs/t6300-compressed-size-fix branch.

Great, thank you!

> -- >8 --
> Subject: [PATCH] t1006: add tests for %(objectsize:disk)
>
> Back when we added this placeholder in a4ac106178 (cat-file: add
> %(objectsize:disk) format atom, 2013-07-10), there were no tests,
> claiming "[...]the exact numbers returned are volatile and subject to
> zlib and packing decisions".
>
> But we can use a little shell hackery to get the expected numbers
> ourselves. To a certain degree this is just re-implementing what Git is
> doing under the hood, but it is still worth doing. It makes sure we
> exercise the %(objectsize:disk) code at all, and having the two
> implementations agree gives us more confidence.
>
> Note that our shell code assumes that no object appears twice (either in
> two packs, or as both loose and packed), as then the results really are
> undefined. That's OK for our purposes, and the test will notice if that
> assumption is violated (the shell version would produce duplicate lines
> that Git's output does not have).
>
> Helped-by: René Scharfe <l.s.r@web.de>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> I stole a bit of your awk. You can tell because I'd have written it in
> perl. ;)

I think we can do it even in shell, especially if...

>
>  t/t1006-cat-file.sh | 34 ++++++++++++++++++++++++++++++++++
>  1 file changed, 34 insertions(+)
>
> diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
> index 271c5e4fd3..21915be308 100755
> --- a/t/t1006-cat-file.sh
> +++ b/t/t1006-cat-file.sh
> @@ -1100,6 +1100,40 @@ test_expect_success 'cat-file --batch="batman" with --batch-all-objects will wor
>  	cmp expect actual
>  '
>
> +test_expect_success 'cat-file %(objectsize:disk) with --batch-all-objects' '
> +	# our state has both loose and packed objects,
> +	# so find both for our expected output
> +	{
> +		find .git/objects/?? -type f |
> +		awk -F/ "{ print \$0, \$3\$4 }" |
> +		while read path oid
> +		do
> +			size=$(test_file_size "$path") &&
> +			echo "$oid $size" ||
> +			return 1
> +		done &&
> +		rawsz=$(test_oid rawsz) &&
> +		find .git/objects/pack -name "*.idx" |
> +		while read idx
> +		do
> +			git show-index <"$idx" >idx.raw &&
> +			sort -n <idx.raw >idx.sorted &&
> +			packsz=$(test_file_size "${idx%.idx}.pack") &&
> +			end=$((packsz - rawsz)) &&
> +			awk -v end="$end" "
> +			  NR > 1 { print oid, \$1 - start }
> +			  { start = \$1; oid = \$2 }
> +			  END { print oid, end - start }
> +			" idx.sorted ||

... we stop slicing the data against the grain.  Let's reverse the order
(sort -r), then we don't need to carry the oid forward:

			sort -nr <idx.raw >idx.sorted &&
			packsz=$(test_file_size "${idx%.idx}.pack") &&
			end=$((packsz - rawsz)) &&
			awk -v end="$end" "
			  { print \$2, end - \$1; end = \$1 }
			" idx.sorted ||

And at that point it should be easy to use a shell loop instead of awk:

			while read start oid rest
			do
				size=$((end - start)) &&
				end=$start &&
				echo "$oid $size" ||
				return 1
			done <idx.sorted
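
(For context: "git show-index" prints one line per object, which for a
v2 index looks like "offset oid crc"; that is why the loop reads a
trailing "rest" field. And a pack file ends in a rawsz-byte checksum
trailer, which is why "end" starts out at packsz - rawsz. A made-up
illustration of the sorted input, with a shortened pack name:)

	$ git show-index <pack-xyz.idx | sort -nr
	1234 2ef7bde608ce5404e97d5f042f95f89f1c232871 (8c3f9a2b)
	153 d670460b4b4aece5915caf5c68d12f560a9fe3e4 (97b2c4f1)
	12 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (5f8a1c3d)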

> +			return 1
> +		done
> +	} >expect.raw &&
> +	sort <expect.raw >expect &&

The reversal above becomes irrelevant with that line, so the result in
expect stays the same.

Should we deduplicate here, like cat-file does (i.e. use "sort -u")?
Having the same object in multiple places for whatever reason would not
be a cause for reporting an error in this test, I would think.

> +	git cat-file --batch-all-objects \
> +		--batch-check="%(objectname) %(objectsize:disk)" >actual &&
> +	test_cmp expect actual
> +'
> +
>  test_expect_success 'set up replacement object' '
>  	orig=$(git rev-parse HEAD) &&
>  	git cat-file commit $orig >orig &&

One more thing: We can do the work of the first awk invocation in the
already existing loop as well:

> +test_expect_success 'cat-file %(objectsize:disk) with --batch-all-objects' '
> +	# our state has both loose and packed objects,
> +	# so find both for our expected output
> +	{
> +		find .git/objects/?? -type f |
> +		awk -F/ "{ print \$0, \$3\$4 }" |
> +		while read path oid
> +		do
> +			size=$(test_file_size "$path") &&
> +			echo "$oid $size" ||
> +			return 1
> +		done &&

... but the substitutions are a bit awkward:

		find .git/objects/?? -type f |
		while read path
		do
			basename=${path##*/} &&
			dirname=${path%/$basename} &&
			oid="${dirname#.git/objects/}${basename}" &&
			size=$(test_file_size "$path") &&
			echo "$oid $size" ||
			return 1
		done &&

The avoided awk invocation might be worth the trouble, though.
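
(Spelled out for a hypothetical loose object path, the expansions
resolve like this:)

	path=.git/objects/ab/cdef0123		# as produced by find
	basename=${path##*/}			# cdef0123 (strip up to the last "/")
	dirname=${path%/$basename}		# .git/objects/ab
	oid=${dirname#.git/objects/}$basename	# abcdef0123 (fan-out dir + rest)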

René
Jeff King Dec. 21, 2023, 9:30 p.m. UTC | #2
On Thu, Dec 21, 2023 at 01:19:53PM +0100, René Scharfe wrote:

> I think we can do it even in shell, especially if...
> [...]

Yeah, your conversion looks accurate. I do wonder if it is worth golfing
further, though. If it were a process invocation per object, I'd
definitely say the efficiency gain is worth it. But dropping one process
from the whole test isn't that exciting either way.

> (sort -r), then we don't need to carry the oid forward:
> 
> 			sort -nr <idx.raw >idx.sorted &&
> 			packsz=$(test_file_size "${idx%.idx}.pack") &&
> 			end=$((packsz - rawsz)) &&
> 			awk -v end="$end" "
> 			  { print \$2, end - \$1; end = \$1 }
> 			" idx.sorted ||
> 
> And at that point it should be easy to use a shell loop instead of awk:
> 
> 			while read start oid rest
> 			do
> 				size=$((end - start)) &&
> 				end=$start &&
> 				echo "$oid $size" ||
> 				return 1
> 			done <idx.sorted

The one thing I do like is that we don't have to escape anything inside
an awk program that is forced to use double-quotes. ;)

> Should we deduplicate here, like cat-file does (i.e. use "sort -u")?
> Having the same object in multiple places for whatever reason would not
> be a cause for reporting an error in this test, I would think.

No, for the reasons I said in the commit message: if an object exists in
multiple places the test is already potentially invalid, as Git does not
promise which version it will use. So it might work racily, or it might
work for now but be fragile. By not de-duplicating, we make sure the
test's assumption holds.

> One more thing: We can do the work of the first awk invocation in the
> already existing loop as well:
> [...]
> ... but the substitutions are a bit awkward:
> 
> 		find .git/objects/?? -type f |
> 		while read path
> 		do
> 			basename=${path##*/} &&
> 			dirname=${path%/$basename} &&
> 			oid="${dirname#.git/objects/}${basename}" &&
> 			size=$(test_file_size "$path") &&
> 			echo "$oid $size" ||
> 			return 1
> 		done &&
> 
> The avoided awk invocation might be worth the trouble, though.

Yeah, I briefly considered whether it would be possible in pure shell,
but didn't get very far before assuming it was going to be ugly. Thank
you for confirming. ;)

Again, if we were doing one awk per object, I'd try to avoid it. But
since we can cover all objects in a single pass, I think it's OK.

-Peff
René Scharfe Dec. 21, 2023, 11:13 p.m. UTC | #3
Am 21.12.23 um 22:30 schrieb Jeff King:
> On Thu, Dec 21, 2023 at 01:19:53PM +0100, René Scharfe wrote:
>
>> I think we can do it even in shell, especially if...
>> [...]
>
> Yeah, your conversion looks accurate. I do wonder if it is worth golfing
> further, though. If it were a process invocation per object, I'd
> definitely say the efficiency gain is worth it. But dropping one process
> from the whole test isn't that exciting either way.

Fair enough.

>
>> (sort -r), then we don't need to carry the oid forward:
>>
>> 			sort -nr <idx.raw >idx.sorted &&
>> 			packsz=$(test_file_size "${idx%.idx}.pack") &&
>> 			end=$((packsz - rawsz)) &&
>> 			awk -v end="$end" "
>> 			  { print \$2, end - \$1; end = \$1 }
>> 			" idx.sorted ||
>>
>> And at that point it should be easy to use a shell loop instead of awk:
>>
>> 			while read start oid rest
>> 			do
>> 				size=$((end - start)) &&
>> 				end=$start &&
>> 				echo "$oid $size" ||
>> 				return 1
>> 			done <idx.sorted
>
> The one thing I do like is that we don't have to escape anything inside
> an awk program that is forced to use double-quotes. ;)

For me it's processing the data in the "correct" order (descending, i.e.
starting at the end, which we have to calculate first anyway based on the
size).

>> Should we deduplicate here, like cat-file does (i.e. use "sort -u")?
>> Having the same object in multiple places for whatever reason would not
>> be a cause for reporting an error in this test, I would think.
>
> No, for the reasons I said in the commit message: if an object exists in
> multiple places the test is already potentially invalid, as Git does not
> promise which version it will use. So it might work racily, or it might
> work for now but be fragile. By not de-duplicating, we make sure the
> test's assumption holds.

Oh, skipped that paragraph.  Still I don't see how a duplicate object
would necessarily invalidate t1006.  The comment for the test "cat-file
--batch-all-objects shows all objects" a few lines above indicates that
it's picky about the provenance of objects, but it uses a separate
repository.  I can't infer the same requirement for the root repo, but
we already established that I can't read.

Anyway, if someone finds a use for git repack without -d or
git unpack-objects or whatever else causes duplicates in the root
repository of t1006, then they can try to reverse your ban with concrete
arguments.
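
(For the record, a minimal sketch of how such a duplicate would arise;
the repo name is made up:)

	git init dup-demo && cd dup-demo &&
	echo content >file && git add file && git commit -m base &&
	git repack	# without -d, the loose copies stay around

	# now each of the three objects (blob, tree, commit) exists both
	# loose and packed; the shell side of the new test would emit six
	# lines, cat-file only three, and test_cmp would fail
	find .git/objects/?? -type f | wc -l
	git show-index <.git/objects/pack/*.idx | wc -l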

René
Jeff King Dec. 23, 2023, 10:18 a.m. UTC | #4
On Fri, Dec 22, 2023 at 12:13:10AM +0100, René Scharfe wrote:

> >> Should we deduplicate here, like cat-file does (i.e. use "sort -u")?
> >> Having the same object in multiple places for whatever reason would not
> >> be a cause for reporting an error in this test, I would think.
> >
> > No, for the reasons I said in the commit message: if an object exists in
> > multiple places the test is already potentially invalid, as Git does not
> > promise which version it will use. So it might work racily, or it might
> > work for now but be fragile. By not de-duplicating, we make sure the
> > test's assumption holds.
> 
> Oh, skipped that paragraph.  Still I don't see how a duplicate object
> would necessarily invalidate t1006.  The comment for the test "cat-file
> --batch-all-objects shows all objects" a few lines above indicates that
> it's picky about the provenance of objects, but it uses a separate
> repository.  I can't infer the same requirement for the root repo, but
> we already established that I can't read.

The cat-file documentation explicitly calls this situation out:

  Note also that multiple copies of an object may be present in the
  object database; in this case, it is undefined which copy’s size or
  delta base will be reported.

So if t1006 were to grow such a duplicate object, what will happen? If
we de-dup in the new test, then we might end up mentioning the same copy
(and the test passes), or we might not (and the test fails). But much
worse, the results might be racy (depending on how cat-file happens to
decide which one to use). By not de-duping, the test will reliably
fail and the author can decide how to handle it then.

IOW it is about failing immediately and predictably rather than letting
a future change sneak a race or other accident-waiting-to-happen into
t1006.

> Anyway, if someone finds a use for git repack without -d or
> git unpack-objects or whatever else causes duplicates in the root
> repository of t1006 then they can try to reverse your ban with concrete
> arguments.

In the real world, the most common way to get a duplicate is to fetch or
push into a repository, such that:

  1. There are enough objects to retain the pack (100 by default)

  2. There's a thin delta in the on-the-wire pack (i.e., a delta against
     a base that the sender knows the receiver has, but which isn't
     itself sent).

Then "index-pack --fix-thin" will complete the on-disk pack by storing a
copy of the base object in it. And now we have it in two packs (and if
it's a delta or loose in the original, the size will be different).

-Peff
René Scharfe Dec. 24, 2023, 9:30 a.m. UTC | #5
Am 23.12.23 um 11:18 schrieb Jeff King:
> On Fri, Dec 22, 2023 at 12:13:10AM +0100, René Scharfe wrote:
>
>>>> Should we deduplicate here, like cat-file does (i.e. use "sort -u")?
>>>> Having the same object in multiple places for whatever reason would not
>>>> be a cause for reporting an error in this test, I would think.
>>>
>>> No, for the reasons I said in the commit message: if an object exists in
>>> multiple places the test is already potentially invalid, as Git does not
>>> promise which version it will use. So it might work racily, or it might
>>> work for now but be fragile. By not de-duplicating, we make sure the
>>> test's assumption holds.
>>
>> Oh, skipped that paragraph.  Still I don't see how a duplicate object
>> would necessarily invalidate t1006.  The comment for the test "cat-file
>> --batch-all-objects shows all objects" a few lines above indicates that
>> it's picky about the provenance of objects, but it uses a separate
>> repository.  I can't infer the same requirement for the root repo, but
>> we already established that I can't read.
>
> The cat-file documentation explicitly calls this situation out:
>
>   Note also that multiple copies of an object may be present in the
>   object database; in this case, it is undefined which copy’s size or
>   delta base will be reported.
>
> So if t1006 were to grow such a duplicate object, what will happen? If
> we de-dup in the new test, then we might end up mentioning the same copy
> (and the test passes), or we might not (and the test fails). But much
> worse, the results might be racy (depending on how cat-file happens to
> decide which one to use). By not de-duping, the test will reliably
> fail and the author can decide how to handle it then.
>
> IOW it is about failing immediately and predictably rather than letting
> a future change sneak a race or other accident-waiting-to-happen into
> t1006.
>
>> Anyway, if someone finds a use for git repack without -d or
>> git unpack-objects or whatever else causes duplicates in the root
>> repository of t1006 then they can try to reverse your ban with concrete
>> arguments.
>
> In the real world, the most common way to get a duplicate is to fetch or
> push into a repository, such that:
>
>   1. There are enough objects to retain the pack (100 by default)
>
>   2. There's a thin delta in the on-the-wire pack (i.e., a delta against
>      a base that the sender knows the receiver has, but which isn't
>      itself sent).
>
> Then "index-pack --fix-thin" will complete the on-disk pack by storing a
> copy of the base object in it. And now we have it in two packs (and if
> it's a delta or loose in the original, the size will be different).

I think I get it now.  The size possibly being different is crucial.
cat-file deduplicates based on object ID alone.  sort -u in t1006 would
deduplicate based on object ID and size, meaning that it would only
remove duplicates of the same size.  Emulating the deduplication of
cat-file is also possible, but would introduce the race you mentioned.

However, even removing only same-size duplicates is unreliable because
there is no guarantee that the same object has the same size in
different packs.  Adding a new object that is a better delta base would
change the size.

So, deduplicating based on object ID and size is sound for any
particular run, but sizes are not stable and thus we need to know if
the tests do something that adds duplicates of any size.
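
(Concretely, with made-up oids and sizes, the expect file could
legitimately contain both of these lines for one object, and "sort -u"
would not collapse them:)

	abcd1234 200	# stored whole in pack A
	abcd1234 63	# stored as a delta in pack B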

René

Patch

diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 271c5e4fd3..21915be308 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -1100,6 +1100,40 @@ test_expect_success 'cat-file --batch="batman" with --batch-all-objects will wor
 	cmp expect actual
 '
 
+test_expect_success 'cat-file %(objectsize:disk) with --batch-all-objects' '
+	# our state has both loose and packed objects,
+	# so find both for our expected output
+	{
+		find .git/objects/?? -type f |
+		awk -F/ "{ print \$0, \$3\$4 }" |
+		while read path oid
+		do
+			size=$(test_file_size "$path") &&
+			echo "$oid $size" ||
+			return 1
+		done &&
+		rawsz=$(test_oid rawsz) &&
+		find .git/objects/pack -name "*.idx" |
+		while read idx
+		do
+			git show-index <"$idx" >idx.raw &&
+			sort -n <idx.raw >idx.sorted &&
+			packsz=$(test_file_size "${idx%.idx}.pack") &&
+			end=$((packsz - rawsz)) &&
+			awk -v end="$end" "
+			  NR > 1 { print oid, \$1 - start }
+			  { start = \$1; oid = \$2 }
+			  END { print oid, end - start }
+			" idx.sorted ||
+			return 1
+		done
+	} >expect.raw &&
+	sort <expect.raw >expect &&
+	git cat-file --batch-all-objects \
+		--batch-check="%(objectname) %(objectsize:disk)" >actual &&
+	test_cmp expect actual
+'
+
 test_expect_success 'set up replacement object' '
 	orig=$(git rev-parse HEAD) &&
 	git cat-file commit $orig >orig &&