mbox series

[v3,0/4] Changed path filter hash fix and version bump

Message ID cover.1686251688.git.jonathantanmy@google.com (mailing list archive)
Headers show
Series Changed path filter hash fix and version bump | expand

Message

Jonathan Tan June 8, 2023, 7:21 p.m. UTC
Here's an updated version with changes only in the tests:
 - resilience against unsigned-by-default (a "skip" will be shown)
 - for systems that have different quoting behavior, the tests
   will be skipped (see patch 2 for more information)

Some of the test changes may seem a bit of a hack, so if you have
a better way of doing things, please let me know.

I've also included a change to the file format specification. To
reviewers, if you generally agree that we need a version 2 but are still
unsure about how we should migrate, consider saying so and then perhaps
we can merge the first patch while the rest remain under review. This
will give other projects, like JGit, more certainty as to the direction
that the Git project wants to take.

Jonathan Tan (4):
  gitformat-commit-graph: describe version 2 of BDAT
  t4216: test changed path filters with high bit paths
  repo-settings: introduce commitgraph.changedPathsVersion
  commit-graph: new filter ver. that fixes murmur3

 Documentation/config/commitgraph.txt     | 16 ++++-
 Documentation/gitformat-commit-graph.txt |  9 ++-
 bloom.c                                  | 65 +++++++++++++++++-
 bloom.h                                  |  8 ++-
 commit-graph.c                           | 29 ++++++--
 oss-fuzz/fuzz-commit-graph.c             |  2 +-
 repo-settings.c                          |  6 +-
 repository.h                             |  2 +-
 t/helper/test-bloom.c                    |  9 ++-
 t/t0095-bloom.sh                         |  8 +++
 t/t4216-log-bloom.sh                     | 86 ++++++++++++++++++++++++
 11 files changed, 219 insertions(+), 21 deletions(-)

Range-diff against v2:
-:  ---------- > 1:  d4b63945f6 gitformat-commit-graph: describe version 2 of BDAT
1:  c587eb3470 ! 2:  aa4535776e t4216: test changed path filters with high bit paths
    @@ t/t4216-log-bloom.sh: test_expect_success 'Bloom generation backfills empty comm
     +# chosen to be the same under all Unicode normalization forms
     +CENT=$(printf "\xc2\xa2")
     +
    -+test_expect_success 'set up repo with high bit path, version 1 changed-path' '
    ++# Some systems (in particular, Linux on the CI running on GitHub at the time of
    ++# writing) store into CENT a literal backslash, then "x", and so on (instead of
    ++# the high-bit characters needed). In these systems, do not run the following
    ++# tests.
    ++if test "$(printf $CENT | perl -0777 -ne 'no utf8; print ord($_)')" = "194"
    ++then
    ++	test_set_prereq HIGH_BIT
    ++fi
    ++
    ++test_expect_success HIGH_BIT 'set up repo with high bit path, version 1 changed-path' '
     +	git init highbit1 &&
     +	test_commit -C highbit1 c1 "$CENT" &&
     +	git -C highbit1 commit-graph write --reachable --changed-paths
     +'
     +
    -+test_expect_success 'check value of version 1 changed-path' '
    ++test_expect_success HIGH_BIT 'setup check value of version 1 changed-path' '
     +	(cd highbit1 &&
     +		printf "52a9" >expect &&
    -+		get_first_changed_path_filter >actual &&
    -+		test_cmp expect actual)
    ++		get_first_changed_path_filter >actual)
    ++'
    ++
    ++# expect will not match actual if int is unsigned by default. Write the test
    ++# in this way, so that a user running this test script can still see if the two
    ++# files match. (It will appear as an ordinary success if they match, and a skip
    ++# if not.)
    ++if test_cmp highbit1/expect highbit1/actual
    ++then
    ++	test_set_prereq SIGNED_INT_BY_DEFAULT
    ++fi
    ++test_expect_success SIGNED_INT_BY_DEFAULT 'check value of version 1 changed-path' '
    ++	# Only the prereq matters for this test.
    ++	true
     +'
     +
    -+test_expect_success 'version 1 changed-path used when version 1 requested' '
    ++test_expect_success HIGH_BIT 'version 1 changed-path used when version 1 requested' '
     +	(cd highbit1 &&
     +		test_bloom_filters_used "-- $CENT")
     +'
2:  d0e5dd20dc = 3:  d6982268a4 repo-settings: introduce commitgraph.changedPathsVersion
3:  eb19f8a35b ! 4:  e879483c42 commit-graph: new filter ver. that fixes murmur3
    @@ t/t0095-bloom.sh: test_expect_success 'compute unseeded murmur3 hash for test st
      	Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004|
     
      ## t/t4216-log-bloom.sh ##
    -@@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when version 1 requested' '
    +@@ t/t4216-log-bloom.sh: test_expect_success HIGH_BIT 'version 1 changed-path used when version 1 request
      		test_bloom_filters_used "-- $CENT")
      '
      
    -+test_expect_success 'version 1 changed-path not used when version 2 requested' '
    ++test_expect_success HIGH_BIT 'version 1 changed-path not used when version 2 requested' '
     +	(cd highbit1 &&
     +		git config --add commitgraph.changedPathsVersion 2 &&
     +		test_bloom_filters_not_used "-- $CENT")
     +'
     +
    -+test_expect_success 'set up repo with high bit path, version 2 changed-path' '
    ++test_expect_success HIGH_BIT 'set up repo with high bit path, version 2 changed-path' '
     +	git init highbit2 &&
     +	git -C highbit2 config --add commitgraph.changedPathsVersion 2 &&
     +	test_commit -C highbit2 c2 "$CENT" &&
     +	git -C highbit2 commit-graph write --reachable --changed-paths
     +'
     +
    -+test_expect_success 'check value of version 2 changed-path' '
    ++test_expect_success HIGH_BIT 'check value of version 2 changed-path' '
     +	(cd highbit2 &&
     +		printf "c01f" >expect &&
     +		get_first_changed_path_filter >actual &&
     +		test_cmp expect actual)
     +'
     +
    -+test_expect_success 'version 2 changed-path used when version 2 requested' '
    ++test_expect_success HIGH_BIT 'version 2 changed-path used when version 2 requested' '
     +	(cd highbit2 &&
     +		test_bloom_filters_used "-- $CENT")
     +'
     +
    -+test_expect_success 'version 2 changed-path not used when version 1 requested' '
    ++test_expect_success HIGH_BIT 'version 2 changed-path not used when version 1 requested' '
     +	(cd highbit2 &&
     +		git config --add commitgraph.changedPathsVersion 1 &&
     +		test_bloom_filters_not_used "-- $CENT")

Comments

Ramsay Jones June 8, 2023, 7:50 p.m. UTC | #1
On 08/06/2023 20:21, Jonathan Tan wrote:
> Here's an updated version with changes only in the tests:
>  - resilience against unsigned-by-default (a "skip" will be shown)
>  - for systems that have different quoting behavior, the tests
>    will be skipped (see patch 2 for more information)
> 
> Some of the test changes may seem a bit of a hack, so if you have
> a better way of doing things, please let me know.
> 
> I've also included a change to the file format specification. To
> reviewers, if you generally agree that we need a version 2 but are still
> unsure about how we should migrate, consider saying so and then perhaps
> we can merge the first patch while the rest remain under review. This
> will give other projects, like JGit, more certainty as to the direction
> that the Git project wants to take.
> 
> Jonathan Tan (4):
>   gitformat-commit-graph: describe version 2 of BDAT
>   t4216: test changed path filters with high bit paths
>   repo-settings: introduce commitgraph.changedPathsVersion
>   commit-graph: new filter ver. that fixes murmur3
> 
>  Documentation/config/commitgraph.txt     | 16 ++++-
>  Documentation/gitformat-commit-graph.txt |  9 ++-
>  bloom.c                                  | 65 +++++++++++++++++-
>  bloom.h                                  |  8 ++-
>  commit-graph.c                           | 29 ++++++--
>  oss-fuzz/fuzz-commit-graph.c             |  2 +-
>  repo-settings.c                          |  6 +-
>  repository.h                             |  2 +-
>  t/helper/test-bloom.c                    |  9 ++-
>  t/t0095-bloom.sh                         |  8 +++
>  t/t4216-log-bloom.sh                     | 86 ++++++++++++++++++++++++
>  11 files changed, 219 insertions(+), 21 deletions(-)
> 
> Range-diff against v2:
> -:  ---------- > 1:  d4b63945f6 gitformat-commit-graph: describe version 2 of BDAT
> 1:  c587eb3470 ! 2:  aa4535776e t4216: test changed path filters with high bit paths
>     @@ t/t4216-log-bloom.sh: test_expect_success 'Bloom generation backfills empty comm
>      +# chosen to be the same under all Unicode normalization forms
>      +CENT=$(printf "\xc2\xa2")

I think the only change you need to make here (because /usr/bin/sh
on Ubuntu is usually 'dash' not 'bash') is to use octal rather than
hexadecimal. ie: CENT=$(printf "\302\242")

HTH

ATB,
Ramsay Jones
Jonathan Tan June 9, 2023, 12:08 a.m. UTC | #2
Ramsay Jones <ramsay@ramsayjones.plus.com> writes:
> >      +CENT=$(printf "\xc2\xa2")
> 
> I think the only change you need to make here (because /usr/bin/sh
> on Ubuntu is usually 'dash' not 'bash') is to use octal rather than
> hexadecimal. ie: CENT=$(printf "\302\242")
> 
> HTH
> 
> ATB,
> Ramsay Jones

Thanks! Yes, it works. I've checked that it passes CI [1] and will wait
for other reviews before sending out a new version to the list.

[1] https://github.com/jonathantanmy/git/actions/runs/5216513514
Junio C Hamano June 12, 2023, 9:31 p.m. UTC | #3
Ramsay Jones <ramsay@ramsayjones.plus.com> writes:

> I think the only change you need to make here (because /usr/bin/sh
> on Ubuntu is usually 'dash' not 'bash') is to use octal rather than
> hexadecimal. ie: CENT=$(printf "\302\242")

Perhaps an addition to Documentation/CodingGuidelines is in order?

 Documentation/CodingGuidelines | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git c/Documentation/CodingGuidelines w/Documentation/CodingGuidelines
index 9d5c27807a..78bc60665d 100644
--- c/Documentation/CodingGuidelines
+++ w/Documentation/CodingGuidelines
@@ -188,6 +188,15 @@ For shell scripts specifically (not exhaustive):
    hopefully nobody starts using "local" before they are reimplemented
    in C ;-)
 
+ - In 'printf' format string, do not use hexadecimals, as they are not
+   portable.  Write 
+
+     CENT=$(printf "\302\242")
+
+   not
+
+     CENT=$(printf "\xc2\xa2")
+
 
 For C programs:
Jonathan Tan June 13, 2023, 5:16 p.m. UTC | #4
Junio C Hamano <gitster@pobox.com> writes:
> Perhaps an addition to Documentation/CodingGuidelines is in order?
> 
>  Documentation/CodingGuidelines | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git c/Documentation/CodingGuidelines w/Documentation/CodingGuidelines
> index 9d5c27807a..78bc60665d 100644
> --- c/Documentation/CodingGuidelines
> +++ w/Documentation/CodingGuidelines
> @@ -188,6 +188,15 @@ For shell scripts specifically (not exhaustive):
>     hopefully nobody starts using "local" before they are reimplemented
>     in C ;-)
>  
> + - In 'printf' format string, do not use hexadecimals, as they are not
> +   portable.  Write 
> +
> +     CENT=$(printf "\302\242")
> +
> +   not
> +
> +     CENT=$(printf "\xc2\xa2")

I've checked with "dash" and this applies to any quoted string, not just
when passed to printf. I'll prepare a patch describing this.
Junio C Hamano June 13, 2023, 7:16 p.m. UTC | #5
Jonathan Tan <jonathantanmy@google.com> writes:

>> + - In 'printf' format string, do not use hexadecimals, as they are not
>> +   portable.  Write 
>> +
>> +     CENT=$(printf "\302\242")
>> +
>> +   not
>> +
>> +     CENT=$(printf "\xc2\xa2")
>
> I've checked with "dash" and this applies to any quoted string, not just
> when passed to printf. I'll prepare a patch describing this.

What do you mean by "any quoted string"?

I think built-in 'echo' of dash takes "\302\242" and emits the
cent-sign (but /bin/echo may not), but I do not think it is "any
quoted string".  To wit:

    dash$ echo '\302\242'
    ยข

The echo built into dash shows the cent-sign.  The example does not
let us tell if is the shell (i.e. the 'echo' command sees the
cent-sign in its argv[1]) or if it is the command (i.e. the 'echo'
sees 8-byte string "bs three zero two bs two four two" in its
argv[1], and shows that as the cent-sign), though.

    dash$ /bin/echo '\302\242'
    \302\242

This shows that shell is not doing anything fancy.  /bin/echo gets
the 8-byte string in its argv[1] and emits that as-is.

    dash$ /bin/echo "\302\242"
    \302\242

Again this shows the same; I added this example to demonstrate that
it is _not_ like the shell interprets the backslashed octal strings
depending on how they are quoted.

    dash$ printf "%s\n" '\302\242'
    \302\242
    dash$ printf "%s\n" "\302\242"
    \302\242

And these demonstrates that argv[2] given to printf in these cases
are 8-byte string "bs three zero two bs two four two" and the shell
is not doing anything fancy.

So, I would suggest not saying "any quoted string".  In addition,
even though dash's built-in echo that recognizes "\0num" seems to be
conformant to what POSIX specifies (cf. [*1*]), GNU requires "-e" in
order to do so in both standalone 'echo' binary and one built into
'bash', so we cannot rely on this POSIX behaviour when writing a
portable script.  Hence, I would recommend us to focus on giving a
piece of advice on use of printf in this part of the documentation.


[References]

*1* https://pubs.opengroup.org/onlinepubs/9699919799/utilities/echo.html
*2* https://www.gnu.org/software/coreutils/manual/html_node/echo-invocation.html