Message ID | 20231221094722.GA570888@coredump.intra.peff.net (mailing list archive)
---|---
State | Superseded
Series | t1006: add tests for %(objectsize:disk)
Am 21.12.23 um 10:47 schrieb Jeff King:
> On Tue, Dec 19, 2023 at 05:42:39PM +0100, René Scharfe wrote:
>
>>> This adds a packed-object function, but I doubt anybody actually calls
>>> it. If we're going to do that, it's probably worth adding some tests for
>>> "cat-file --batch-check" or similar.
>>
>> Yes, and I was assuming that someone else would be eager to add such
>> tests. *ahem*
>
> :P OK, here it is. This can be its own topic, or go on top of the
> rs/t6300-compressed-size-fix branch.

Great, thank you!

> -- >8 --
> Subject: [PATCH] t1006: add tests for %(objectsize:disk)
>
> Back when we added this placeholder in a4ac106178 (cat-file: add
> %(objectsize:disk) format atom, 2013-07-10), there were no tests,
> claiming "[...]the exact numbers returned are volatile and subject to
> zlib and packing decisions".
>
> But we can use a little shell hackery to get the expected numbers
> ourselves. To a certain degree this is just re-implementing what Git is
> doing under the hood, but it is still worth doing. It makes sure we
> exercise the %(objectsize:disk) code at all, and having the two
> implementations agree gives us more confidence.
>
> Note that our shell code assumes that no object appears twice (either in
> two packs, or as both loose and packed), as then the results really are
> undefined. That's OK for our purposes, and the test will notice if that
> assumption is violated (the shell version would produce duplicate lines
> that Git's output does not have).
>
> Helped-by: René Scharfe <l.s.r@web.de>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> I stole a bit of your awk. You can tell because I'd have written it in
> perl. ;)

I think we can do it even in shell, especially if...
>
>  t/t1006-cat-file.sh | 34 ++++++++++++++++++++++++++++++++++
>  1 file changed, 34 insertions(+)
>
> diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
> index 271c5e4fd3..21915be308 100755
> --- a/t/t1006-cat-file.sh
> +++ b/t/t1006-cat-file.sh
> @@ -1100,6 +1100,40 @@ test_expect_success 'cat-file --batch="batman" with --batch-all-objects will wor
>  	cmp expect actual
>  '
>
> +test_expect_success 'cat-file %(objectsize:disk) with --batch-all-objects' '
> +	# our state has both loose and packed objects,
> +	# so find both for our expected output
> +	{
> +		find .git/objects/?? -type f |
> +		awk -F/ "{ print \$0, \$3\$4 }" |
> +		while read path oid
> +		do
> +			size=$(test_file_size "$path") &&
> +			echo "$oid $size" ||
> +			return 1
> +		done &&
> +		rawsz=$(test_oid rawsz) &&
> +		find .git/objects/pack -name "*.idx" |
> +		while read idx
> +		do
> +			git show-index <"$idx" >idx.raw &&
> +			sort -n <idx.raw >idx.sorted &&
> +			packsz=$(test_file_size "${idx%.idx}.pack") &&
> +			end=$((packsz - rawsz)) &&
> +			awk -v end="$end" "
> +				NR > 1 { print oid, \$1 - start }
> +				{ start = \$1; oid = \$2 }
> +				END { print oid, end - start }
> +			" idx.sorted ||

... we stop slicing the data against the grain. Let's reverse the order
(sort -r), then we don't need to carry the oid forward:

		sort -nr <idx.raw >idx.sorted &&
		packsz=$(test_file_size "${idx%.idx}.pack") &&
		end=$((packsz - rawsz)) &&
		awk -v end="$end" "
			{ print \$2, end - \$1; end = \$1 }
		" idx.sorted ||

And at that point it should be easy to use a shell loop instead of awk:

		while read start oid rest
		do
			size=$((end - start)) &&
			end=$start &&
			echo "$oid $size" ||
			return 1
		done <idx.sorted

> +			return 1
> +		done
> +	} >expect.raw &&
> +	sort <expect.raw >expect &&

The reversal above becomes irrelevant with that line, so the result in
expect stays the same.

Should we deduplicate here, like cat-file does (i.e. use "sort -u")?
Having the same object in multiple places for whatever reason would not
be a cause for reporting an error in this test, I would think.

> +	git cat-file --batch-all-objects \
> +		--batch-check="%(objectname) %(objectsize:disk)" >actual &&
> +	test_cmp expect actual
> +'
> +
>  test_expect_success 'set up replacement object' '
>  	orig=$(git rev-parse HEAD) &&
>  	git cat-file commit $orig >orig &&

One more thing: We can do the work of the first awk invocation in the
already existing loop as well:

> +test_expect_success 'cat-file %(objectsize:disk) with --batch-all-objects' '
> +	# our state has both loose and packed objects,
> +	# so find both for our expected output
> +	{
> +		find .git/objects/?? -type f |
> +		awk -F/ "{ print \$0, \$3\$4 }" |
> +		while read path oid
> +		do
> +			size=$(test_file_size "$path") &&
> +			echo "$oid $size" ||
> +			return 1
> +		done &&

... but the substitutions are a bit awkward:

	find .git/objects/?? -type f |
	while read path
	do
		basename=${path##*/} &&
		dirname=${path%/$basename} &&
		oid="${dirname#.git/objects/}${basename}" &&
		size=$(test_file_size "$path") &&
		echo "$oid $size" ||
		return 1
	done &&

The avoided awk invocation might be worth the trouble, though.

René
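[Editor's note: René's reverse-sort loop can be tried on its own, outside of any repository. The sketch below feeds it made-up "offset oid" pairs in the format "git show-index" prints, with an invented end offset standing in for $((packsz - rawsz)); none of these values come from a real pack.]

```shell
# Toy run of the reverse-sort size loop from the thread.
# idx.raw mimics "git show-index" output; offsets, object names
# and the end offset (480) are all made up for illustration.
cat >idx.raw <<\EOF
12 1111111111111111111111111111111111111111
200 2222222222222222222222222222222222222222
350 3333333333333333333333333333333333333333
EOF
sort -nr <idx.raw >idx.sorted
end=480	# stands in for $((packsz - rawsz))
while read start oid rest
do
	# each object's size is the gap to the previous (higher) offset
	size=$((end - start))
	end=$start
	echo "$oid $size"
done <idx.sorted >sizes.actual
cat sizes.actual
```

Walking backwards from offset 480 this yields 130, 150 and 188 bytes for the three fake objects; a final `sort` of the output would restore ascending object-name order, as noted in the thread.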
On Thu, Dec 21, 2023 at 01:19:53PM +0100, René Scharfe wrote:

> I think we can do it even in shell, especially if...
> [...]

Yeah, your conversion looks accurate. I do wonder if it is worth golfing
further, though. If it were a process invocation per object, I'd
definitely say the efficiency gain is worth it. But dropping one process
from the whole test isn't that exciting either way.

> (sort -r), then we don't need to carry the oid forward:
>
> 		sort -nr <idx.raw >idx.sorted &&
> 		packsz=$(test_file_size "${idx%.idx}.pack") &&
> 		end=$((packsz - rawsz)) &&
> 		awk -v end="$end" "
> 			{ print \$2, end - \$1; end = \$1 }
> 		" idx.sorted ||
>
> And at that point it should be easy to use a shell loop instead of awk:
>
> 		while read start oid rest
> 		do
> 			size=$((end - start)) &&
> 			end=$start &&
> 			echo "$oid $size" ||
> 			return 1
> 		done <idx.sorted

The one thing I do like is that we don't have to escape anything inside
an awk program that is forced to use double-quotes. ;)

> Should we deduplicate here, like cat-file does (i.e. use "sort -u")?
> Having the same object in multiple places for whatever reason would not
> be a cause for reporting an error in this test, I would think.

No, for the reasons I said in the commit message: if an object exists in
multiple places the test is already potentially invalid, as Git does not
promise which version it will use. So it might work racily, or it might
work for now but be fragile. By not de-duplicating, we make sure the
test's assumption holds.

> One more thing: We can do the work of the first awk invocation in the
> already existing loop as well:
> [...]
> ... but the substitutions are a bit awkward:
>
> 	find .git/objects/?? -type f |
> 	while read path
> 	do
> 		basename=${path##*/} &&
> 		dirname=${path%/$basename} &&
> 		oid="${dirname#.git/objects/}${basename}" &&
> 		size=$(test_file_size "$path") &&
> 		echo "$oid $size" ||
> 		return 1
> 	done &&
>
> The avoided awk invocation might be worth the trouble, though.

Yeah, I briefly considered whether it would be possible in pure shell,
but didn't get very far before assuming it was going to be ugly. Thank
you for confirming. ;)

Again, if we were doing one awk per object, I'd try to avoid it. But
since we can cover all objects in a single pass, I think it's OK.

-Peff
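[Editor's note: the parameter expansions René sketched can be checked on a single path. The object name below is invented; a real loose-object path would come from the test's "find .git/objects/?? -type f".]

```shell
# Derive the object name from a hypothetical loose-object path
# using only POSIX parameter expansion, no awk.
path=.git/objects/ab/cdef0123456789abcdef0123456789abcdef01
basename=${path##*/}                      # file name after the last slash
dirname=${path%/$basename}                # directory part: .git/objects/ab
oid="${dirname#.git/objects/}${basename}" # fan-out dir + file name = oid
echo "$oid"
```

This prints the reassembled 40-character object name, matching what the awk invocation `awk -F/ "{ print \$0, \$3\$4 }"` produces from the same path.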
Am 21.12.23 um 22:30 schrieb Jeff King:
> On Thu, Dec 21, 2023 at 01:19:53PM +0100, René Scharfe wrote:
>
>> I think we can do it even in shell, especially if...
>> [...]
>
> Yeah, your conversion looks accurate. I do wonder if it is worth golfing
> further, though. If it were a process invocation per object, I'd
> definitely say the efficiency gain is worth it. But dropping one process
> from the whole test isn't that exciting either way.

Fair enough.

>> (sort -r), then we don't need to carry the oid forward:
>>
>> 		sort -nr <idx.raw >idx.sorted &&
>> 		packsz=$(test_file_size "${idx%.idx}.pack") &&
>> 		end=$((packsz - rawsz)) &&
>> 		awk -v end="$end" "
>> 			{ print \$2, end - \$1; end = \$1 }
>> 		" idx.sorted ||
>>
>> And at that point it should be easy to use a shell loop instead of awk:
>>
>> 		while read start oid rest
>> 		do
>> 			size=$((end - start)) &&
>> 			end=$start &&
>> 			echo "$oid $size" ||
>> 			return 1
>> 		done <idx.sorted
>
> The one thing I do like is that we don't have to escape anything inside
> an awk program that is forced to use double-quotes. ;)

For me it's processing the data in the "correct" order (descending, i.e.
starting at the end, which we have to calculate first anyway based on
the size).

>> Should we deduplicate here, like cat-file does (i.e. use "sort -u")?
>> Having the same object in multiple places for whatever reason would not
>> be a cause for reporting an error in this test, I would think.
>
> No, for the reasons I said in the commit message: if an object exists in
> multiple places the test is already potentially invalid, as Git does not
> promise which version it will use. So it might work racily, or it might
> work for now but be fragile. By not de-duplicating, we make sure the
> test's assumption holds.

Oh, skipped that paragraph. Still I don't see how a duplicate object
would necessarily invalidate t1006. The comment for the test "cat-file
--batch-all-objects shows all objects" a few lines above indicates that
it's picky about the provenance of objects, but it uses a separate
repository. I can't infer the same requirement for the root repo, but
we already established that I can't read.

Anyway, if someone finds a use for git repack without -d or
git unpack-objects or whatever else causes duplicates in the root
repository of t1006 then they can try to reverse your ban with concrete
arguments.

René
On Fri, Dec 22, 2023 at 12:13:10AM +0100, René Scharfe wrote:

>>> Should we deduplicate here, like cat-file does (i.e. use "sort -u")?
>>> Having the same object in multiple places for whatever reason would not
>>> be a cause for reporting an error in this test, I would think.
>>
>> No, for the reasons I said in the commit message: if an object exists in
>> multiple places the test is already potentially invalid, as Git does not
>> promise which version it will use. So it might work racily, or it might
>> work for now but be fragile. By not de-duplicating, we make sure the
>> test's assumption holds.
>
> Oh, skipped that paragraph. Still I don't see how a duplicate object
> would necessarily invalidate t1006. The comment for the test "cat-file
> --batch-all-objects shows all objects" a few lines above indicates that
> it's picky about the provenance of objects, but it uses a separate
> repository. I can't infer the same requirement for the root repo, but
> we already established that I can't read.

The cat-file documentation explicitly calls this situation out:

  Note also that multiple copies of an object may be present in the
  object database; in this case, it is undefined which copy's size or
  delta base will be reported.

So if t1006 were to grow such a duplicate object, what will happen? If
we de-dup in the new test, then we might end up mentioning the same copy
(and the test passes), or we might not (and the test fails). But much
worse, the results might be racy (depending on how cat-file happens to
decide which one to use). By not de-duping, the test will reliably fail
and the author can decide how to handle it then.

IOW it is about failing immediately and predictably rather than letting
a future change sneak a race or other accident-waiting-to-happen into
t1006.

> Anyway, if someone finds a use for git repack without -d or
> git unpack-objects or whatever else causes duplicates in the root
> repository of t1006 then they can try to reverse your ban with concrete
> arguments.

In the real world, the most common way to get a duplicate is to fetch or
push into a repository, such that:

  1. There are enough objects to retain the pack (100 by default)

  2. There's a thin delta in the on-the-wire pack (i.e., a delta against
     a base that the sender knows the receiver has, but which isn't
     itself sent).

Then "index-pack --fix-thin" will complete the on-disk pack by storing a
copy of the base object in it. And now we have it in two packs (and if
it's a delta or loose in the original, the size will be different).

-Peff
Am 23.12.23 um 11:18 schrieb Jeff King:
> On Fri, Dec 22, 2023 at 12:13:10AM +0100, René Scharfe wrote:
>
>>>> Should we deduplicate here, like cat-file does (i.e. use "sort -u")?
>>>> Having the same object in multiple places for whatever reason would not
>>>> be a cause for reporting an error in this test, I would think.
>>>
>>> No, for the reasons I said in the commit message: if an object exists in
>>> multiple places the test is already potentially invalid, as Git does not
>>> promise which version it will use. So it might work racily, or it might
>>> work for now but be fragile. By not de-duplicating, we make sure the
>>> test's assumption holds.
>>
>> Oh, skipped that paragraph. Still I don't see how a duplicate object
>> would necessarily invalidate t1006. The comment for the test "cat-file
>> --batch-all-objects shows all objects" a few lines above indicates that
>> it's picky about the provenance of objects, but it uses a separate
>> repository. I can't infer the same requirement for the root repo, but
>> we already established that I can't read.
>
> The cat-file documentation explicitly calls this situation out:
>
>   Note also that multiple copies of an object may be present in the
>   object database; in this case, it is undefined which copy's size or
>   delta base will be reported.
>
> So if t1006 were to grow such a duplicate object, what will happen? If
> we de-dup in the new test, then we might end up mentioning the same copy
> (and the test passes), or we might not (and the test fails). But much
> worse, the results might be racy (depending on how cat-file happens to
> decide which one to use). By not de-duping, the test will reliably fail
> and the author can decide how to handle it then.
>
> IOW it is about failing immediately and predictably rather than letting
> a future change sneak a race or other accident-waiting-to-happen into
> t1006.

>> Anyway, if someone finds a use for git repack without -d or
>> git unpack-objects or whatever else causes duplicates in the root
>> repository of t1006 then they can try to reverse your ban with concrete
>> arguments.
>
> In the real world, the most common way to get a duplicate is to fetch or
> push into a repository, such that:
>
>   1. There are enough objects to retain the pack (100 by default)
>
>   2. There's a thin delta in the on-the-wire pack (i.e., a delta against
>      a base that the sender knows the receiver has, but which isn't
>      itself sent).
>
> Then "index-pack --fix-thin" will complete the on-disk pack by storing a
> copy of the base object in it. And now we have it in two packs (and if
> it's a delta or loose in the original, the size will be different).

I think I get it now. The size possibly being different is crucial.

cat-file deduplicates based on object ID alone. sort -u in t1006 would
deduplicate based on object ID and size, meaning that it would only
remove duplicates of the same size. Emulating the deduplication of
cat-file is also possible, but would introduce the race you mentioned.

However, even removing only same-size duplicates is unreliable because
there is no guarantee that the same object has the same size in
different packs. Adding a new object that is a better delta base would
change the size.

So, deduplicating based on object ID and size is sound for any
particular run, but sizes are not stable and thus we need to know if the
tests do something that adds duplicates of any size.

René
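[Editor's note: the distinction René draws can be seen directly. "sort -u" collapses only lines that are identical in full, so a duplicate object ID with a differing size survives. The object IDs and sizes below are made up.]

```shell
# Two copies of "aaaa" with the same size collapse to one line,
# but the copy listed with a different size is kept: sort -u
# deduplicates by the whole line (object ID *and* size).
printf '%s\n' \
	"aaaa 100" \
	"aaaa 100" \
	"aaaa 120" \
	"bbbb 50" |
sort -u >dedup.actual
cat dedup.actual
```

Three lines remain: "aaaa 100", "aaaa 120" and "bbbb 50". This is why a same-object, different-size duplicate would still make the shell-generated `expect` diverge from cat-file's deduplicated output, even with "sort -u".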
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 271c5e4fd3..21915be308 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -1100,6 +1100,40 @@ test_expect_success 'cat-file --batch="batman" with --batch-all-objects will wor
 	cmp expect actual
 '
 
+test_expect_success 'cat-file %(objectsize:disk) with --batch-all-objects' '
+	# our state has both loose and packed objects,
+	# so find both for our expected output
+	{
+		find .git/objects/?? -type f |
+		awk -F/ "{ print \$0, \$3\$4 }" |
+		while read path oid
+		do
+			size=$(test_file_size "$path") &&
+			echo "$oid $size" ||
+			return 1
+		done &&
+		rawsz=$(test_oid rawsz) &&
+		find .git/objects/pack -name "*.idx" |
+		while read idx
+		do
+			git show-index <"$idx" >idx.raw &&
+			sort -n <idx.raw >idx.sorted &&
+			packsz=$(test_file_size "${idx%.idx}.pack") &&
+			end=$((packsz - rawsz)) &&
+			awk -v end="$end" "
+				NR > 1 { print oid, \$1 - start }
+				{ start = \$1; oid = \$2 }
+				END { print oid, end - start }
+			" idx.sorted ||
+			return 1
+		done
+	} >expect.raw &&
+	sort <expect.raw >expect &&
+	git cat-file --batch-all-objects \
+		--batch-check="%(objectname) %(objectsize:disk)" >actual &&
+	test_cmp expect actual
+'
+
 test_expect_success 'set up replacement object' '
 	orig=$(git rev-parse HEAD) &&
 	git cat-file commit $orig >orig &&