diff mbox series

t2080: fix cp invocation to copy symlinks instead of following them

Message ID 5386ee606567248464d39c313ee6177862a1f337.1622071475.git.matheus.bernardino@usp.br (mailing list archive)
State Accepted
Commit ea08db7473318798527361d4b85a67d9a1325eb4
Headers show
Series t2080: fix cp invocation to copy symlinks instead of following them | expand

Commit Message

Matheus Tavares May 26, 2021, 11:58 p.m. UTC
t2080 makes a few copies of a test repository and later performs a
branch switch on each one of the copies to verify that parallel checkout
and sequential checkout produce the same results. However, the
repository is copied with `cp -R` which, on some systems, defaults to
following symlinks on the directory hierarchy and copying their target
files instead of copying the symlinks themselves. AIX is one example of
system where this happens. Because the symlinks are not preserved, the
copied repositories have paths that do not match what is in the index,
causing git to abort the checkout operation that we want to test. This
makes the test fail on these systems.

Fix this by copying the repository with the POSIX flag '-P', which
forces cp to copy the symlinks instead of following them. Note that we
already use this flag for other cp invocations in our test suite (see
t7001). With this change, t2080 now passes on AIX.

Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 t/t2080-parallel-checkout-basics.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Christian Couder May 27, 2021, 7:25 a.m. UTC | #1
Just a few nits that you can take into account if you reroll the patch
for another reason, but I am not sure they are worth rerolling by
themselves.

On Thu, May 27, 2021 at 3:27 AM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
>
> t2080 makes a few copies of a test repository and later performs a
> branch switch on each one of the copies to verify that parallel checkout
> and sequential checkout produce the same results. However, the
> repository is copied with `cp -R` which, on some systems, defaults to
> following symlinks on the directory hierarchy and copying their target
> files instead of copying the symlinks themselves. AIX is one example of
> system where this happens. Because the symlinks are not preserved, the
> copied repositories have paths that do not match what is in the index,
> causing git to abort the checkout operation that we want to test. This

s/git/Git/

> makes the test fail on these systems.
>
> Fix this by copying the repository with the POSIX flag '-P', which
> forces cp to copy the symlinks instead of following them. Note that we
> already use this flag for other cp invocations in our test suite (see
> t7001).

Maybe you could mention 00764ca10e (test: fix t7001 cp to use POSIX
options, 2014-04-11) that also fixed t7001 in a similar way.

> With this change, t2080 now passes on AIX.

Thanks!
Ævar Arnfjörð Bjarmason May 27, 2021, 12:51 p.m. UTC | #2
On Wed, May 26 2021, Matheus Tavares wrote:

> t2080 makes a few copies of a test repository and later performs a
> branch switch on each one of the copies to verify that parallel checkout
> and sequential checkout produce the same results. However, the
> repository is copied with `cp -R` which, on some systems, defaults to
> following symlinks on the directory hierarchy and copying their target
> files instead of copying the symlinks themselves. AIX is one example of
> system where this happens. Because the symlinks are not preserved, the
> copied repositories have paths that do not match what is in the index,
> causing git to abort the checkout operation that we want to test. This
> makes the test fail on these systems.
>
> Fix this by copying the repository with the POSIX flag '-P', which
> forces cp to copy the symlinks instead of following them. Note that we
> already use this flag for other cp invocations in our test suite (see
> t7001). With this change, t2080 now passes on AIX.
>
> Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> ---
>  t/t2080-parallel-checkout-basics.sh | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/t/t2080-parallel-checkout-basics.sh b/t/t2080-parallel-checkout-basics.sh
> index 7087818550..3e0f8c675f 100755
> --- a/t/t2080-parallel-checkout-basics.sh
> +++ b/t/t2080-parallel-checkout-basics.sh
> @@ -114,7 +114,7 @@ do
>  
>  	test_expect_success "$mode checkout" '
>  		repo=various_$mode &&
> -		cp -R various $repo &&
> +		cp -R -P various $repo &&
>  
>  		# The just copied files have more recent timestamps than their
>  		# associated index entries. So refresh the cached timestamps

Thanks for the quick fix, I can confirm that this makes the test pass on
AIX 7.2.
Ævar Arnfjörð Bjarmason May 31, 2021, 2:01 p.m. UTC | #3
On Thu, May 27 2021, Ævar Arnfjörð Bjarmason wrote:

> On Wed, May 26 2021, Matheus Tavares wrote:
>
>> t2080 makes a few copies of a test repository and later performs a
>> branch switch on each one of the copies to verify that parallel checkout
>> and sequential checkout produce the same results. However, the
>> repository is copied with `cp -R` which, on some systems, defaults to
>> following symlinks on the directory hierarchy and copying their target
>> files instead of copying the symlinks themselves. AIX is one example of
>> system where this happens. Because the symlinks are not preserved, the
>> copied repositories have paths that do not match what is in the index,
>> causing git to abort the checkout operation that we want to test. This
>> makes the test fail on these systems.
>>
>> Fix this by copying the repository with the POSIX flag '-P', which
>> forces cp to copy the symlinks instead of following them. Note that we
>> already use this flag for other cp invocations in our test suite (see
>> t7001). With this change, t2080 now passes on AIX.
>>
>> Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
>> ---
>>  t/t2080-parallel-checkout-basics.sh | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/t/t2080-parallel-checkout-basics.sh b/t/t2080-parallel-checkout-basics.sh
>> index 7087818550..3e0f8c675f 100755
>> --- a/t/t2080-parallel-checkout-basics.sh
>> +++ b/t/t2080-parallel-checkout-basics.sh
>> @@ -114,7 +114,7 @@ do
>>  
>>  	test_expect_success "$mode checkout" '
>>  		repo=various_$mode &&
>> -		cp -R various $repo &&
>> +		cp -R -P various $repo &&
>>  
>>  		# The just copied files have more recent timestamps than their
>>  		# associated index entries. So refresh the cached timestamps
>
> Thanks for the quick fix, I can confirm that this makes the test pass on
> AIX 7.2.

There's still a failure[1] in t2082-parallel-checkout-attributes.sh
though, which is new in 2.32.0-rc*. The difference is in an unexpected
BOM:
    
    avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/A.internal 
    efbbbf74657874
    avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/utf8-text  
    74657874

I.e. the A.internal starts with 0xefbbbf. The 2nd test of t0028*.sh also
fails similarly[2], so perhaps it's some old/iconv/whatever issue not
per-se related to any change of yours.

I tried compiling with both NO_ICONV=Y and ICONV_OMITS_BOM=Y, both have
the same failure.

In any case, I think AIX test failures (or other obscure platforms I
test) are rather uninteresting per-se, I mainly test on these to smoke
out bad assumptions elsewhere. E.g. non-POSIX code, perhaps in this case
we're relying on some unportable iconv assumption, or maybe not...
    

1.
Initialized empty Git repository in /scratch/avar/git/t/trash directory.t2082-parallel-checkout-attributes/encoding/.git/
+ cd encoding
+ echo text
+ 1> utf8-text
+ write_utf16
+ 0< utf8-text 1> utf16-text
checking prerequisite: NO_UTF16_BOM

mkdir -p "$TRASH_DIRECTORY/prereq-test-dir-NO_UTF16_BOM" &&
(
        cd "$TRASH_DIRECTORY/prereq-test-dir-NO_UTF16_BOM" &&
        test $(printf abc | iconv -f UTF-8 -t UTF-16 | wc -c) = 6

)
+ mkdir -p /scratch/avar/git/t/trash directory.t2082-parallel-checkout-attributes/prereq-test-dir-NO_UTF16_BOM
+ cd /scratch/avar/git/t/trash directory.t2082-parallel-checkout-attributes/prereq-test-dir-NO_UTF16_BOM
+ printf abc
+ iconv -f UTF-8 -t UTF-16
+ wc -c
+ test 6 = 6
prerequisite NO_UTF16_BOM ok
+ echo A working-tree-encoding=UTF-16
+ 1> .gitattributes
+ cp utf16-text A
+ cp utf8-text B
+ git add A B .gitattributes
+ git commit -m encoding
[master (root-commit) 65fa7e8] encoding
 Author: A U Thor <author@example.com>
 3 files changed, 3 insertions(+)
 create mode 100644 .gitattributes
 create mode 100644 A
 create mode 100644 B
+ git cat-file -p :A
+ 1> A.internal
+ test_cmp_bin utf8-text A.internal
utf8-text A.internal differ: char 1, line 1
error: last command exited with $?=1
not ok 2 - parallel-checkout with re-encoding
#
#               set_checkout_config 2 0 &&
#               git init encoding &&
#               (
#                       cd encoding &&
#                       echo text >utf8-text &&
#                       write_utf16 <utf8-text >utf16-text &&
#
#                       echo "A working-tree-encoding=UTF-16" >.gitattributes &&
#                       cp utf16-text A &&
#                       cp utf8-text B &&
#                       git add A B .gitattributes &&
#                       git commit -m encoding &&
#
#                       # Check that A is stored in UTF-8
#                       git cat-file -p :A >A.internal &&
#                       test_cmp_bin utf8-text A.internal &&
#
#                       rm A B &&
#                       test_checkout_workers 2 git checkout A B &&
#
#                       # Check that A (and only A) is re-encoded during checkout
#                       test_cmp_bin utf16-text A &&
#                       test_cmp_bin utf8-text B
#               )
#

2. 

avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t0028-working-tree-encoding/test.utf8.raw  
68616c6c6f20746865726521
63616e20796f752072656164206d653f
avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t0028-working-tree-encoding/test.utf16.git    
efbbbf68616c6c6f20746865726521
63616e20796f752072656164206d653f
Matheus Tavares May 31, 2021, 4:09 p.m. UTC | #4
On Mon, May 31, 2021 at 11:16 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
> There's still a failure[1] in t2082-parallel-checkout-attributes.sh
> though, which is new in 2.32.0-rc*. The difference is in an unexpected
> BOM:
>
>    avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/A.internal
>    efbbbf74657874
>    avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/utf8-text 
>    74657874
>
> I.e. the A.internal starts with 0xefbbbf. The 2nd test of t0028*.sh also
> fails similarly[2], so perhaps it's some old/iconv/whatever issue not
> per-se related to any change of yours.

I ran t2080 on the same machine (gcc119) to try debugging it but I could
not reproduce the test failure [1]. t0028 also passed here. Could it be
that we are building git differently? I did `gmake CC=gcc NO_CURL=YesPlease`.

The encoding inspection also seems OK here:

matheustavares@gcc119:[/home/matheustavares/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/A.internal
74657874
matheustavares@gcc119:[/home/matheustavares/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/utf8-text
74657874

[1]:
Initialized empty Git repository in /home/matheustavares/git/t/trash directory.t2082-parallel-checkout-attributes/encoding/.git/
+ cd encoding
+ echo text
+ 1> utf8-text
+ write_utf16
+ 0< utf8-text 1> utf16-text
checking prerequisite: NO_UTF16_BOM

mkdir -p "$TRASH_DIRECTORY/prereq-test-dir-NO_UTF16_BOM" &&
(
        cd "$TRASH_DIRECTORY/prereq-test-dir-NO_UTF16_BOM" &&
        test $(printf abc | iconv -f UTF-8 -t UTF-16 | wc -c) = 6

)
+ mkdir -p /home/matheustavares/git/t/trash directory.t2082-parallel-checkout-attributes/prereq-test-dir-NO_UTF16_BOM
+ cd /home/matheustavares/git/t/trash directory.t2082-parallel-checkout-attributes/prereq-test-dir-NO_UTF16_BOM
+ wc -c
+ iconv -f UTF-8 -t UTF-16
+ printf abc
+ test 6 = 6
prerequisite NO_UTF16_BOM ok
+ echo A working-tree-encoding=UTF-16
+ 1> .gitattributes
+ cp utf16-text A
+ cp utf8-text B
+ git add A B .gitattributes
+ git commit -m encoding
[master (root-commit) eb6a843] encoding
 Author: A U Thor <author@example.com>
 3 files changed, 3 insertions(+)
 create mode 100644 .gitattributes
 create mode 100644 A
 create mode 100644 B
+ git cat-file -p :A
+ 1> A.internal
+ test_cmp_bin utf8-text A.internal
+ rm A B
+ test_checkout_workers 2 git checkout A B
Updated 2 paths from the index
+ test_cmp_bin utf16-text A
+ test_cmp_bin utf8-text B
+ test_unconfig --global checkout.thresholdForParallelism
+ exit 0
+ eval_ret=0
+ test_unconfig --global checkout.workers
+ exit 0
+ eval_ret=0
+ :
ok 2 - parallel-checkout with re-encoding
Ævar Arnfjörð Bjarmason May 31, 2021, 8:41 p.m. UTC | #5
On Mon, May 31 2021, Matheus Tavares wrote:

> On Mon, May 31, 2021 at 11:16 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>>
>> There's still a failure[1] in t2082-parallel-checkout-attributes.sh
>> though, which is new in 2.32.0-rc*. The difference is in an unexpected
>> BOM:
>>
>>    avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/A.internal
>>    efbbbf74657874
>>    avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/utf8-text 
>>    74657874
>>
>> I.e. the A.internal starts with 0xefbbbf. The 2nd test of t0028*.sh also
>> fails similarly[2], so perhaps it's some old/iconv/whatever issue not
>> per-se related to any change of yours.
>
> I ran t2080 on the same machine (gcc119) to try debugging it but I could
> not reproduce the test failure [1]. t0028 also passed here. Could it be
> that we are building git differently? I did `gmake CC=gcc NO_CURL=YesPlease`.

t2080 is fine, it's t2082 that's broken. But yes, we are using different
parameters. This works for me:

    gmake CC=gcc -j3 CFLAGS="-g -O2" NO_CURL=UnfortunatelyNot

This doesn't:

    gmake CC=xlc -j3 CFLAGS="-g -O2 -qmaxmem=524288" NO_CURL=UnfortunatelyNot

The reason I test on AIX / Solaris is to get from under the GNU-isms of
various libraries & away from the gcc/clang mono(bio?)culture.

I.e. you're testing with GNU iconv, but IBM also has its own
implementation:
https://www.ibm.com/docs/en/aix/7.1?topic=programming-understanding-libiconv
of that POSIX interface:
https://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv.html

Perhaps we've had some GNU-ism slip in somewhere...
Đoàn Trần Công Danh June 2, 2021, 1:36 a.m. UTC | #6
On 2021-05-31 16:01:01+0200, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> 
> On Thu, May 27 2021, Ævar Arnfjörð Bjarmason wrote:
> 
> > On Wed, May 26 2021, Matheus Tavares wrote:
> >
> >> t2080 makes a few copies of a test repository and later performs a
> >> branch switch on each one of the copies to verify that parallel checkout
> >> and sequential checkout produce the same results. However, the
> >> repository is copied with `cp -R` which, on some systems, defaults to
> >> following symlinks on the directory hierarchy and copying their target
> >> files instead of copying the symlinks themselves. AIX is one example of
> >> system where this happens. Because the symlinks are not preserved, the
> >> copied repositories have paths that do not match what is in the index,
> >> causing git to abort the checkout operation that we want to test. This
> >> makes the test fail on these systems.
> >>
> >> Fix this by copying the repository with the POSIX flag '-P', which
> >> forces cp to copy the symlinks instead of following them. Note that we
> >> already use this flag for other cp invocations in our test suite (see
> >> t7001). With this change, t2080 now passes on AIX.
> >>
> >> Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> >> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> >> ---
> >>  t/t2080-parallel-checkout-basics.sh | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/t/t2080-parallel-checkout-basics.sh b/t/t2080-parallel-checkout-basics.sh
> >> index 7087818550..3e0f8c675f 100755
> >> --- a/t/t2080-parallel-checkout-basics.sh
> >> +++ b/t/t2080-parallel-checkout-basics.sh
> >> @@ -114,7 +114,7 @@ do
> >>  
> >>  	test_expect_success "$mode checkout" '
> >>  		repo=various_$mode &&
> >> -		cp -R various $repo &&
> >> +		cp -R -P various $repo &&
> >>  
> >>  		# The just copied files have more recent timestamps than their
> >>  		# associated index entries. So refresh the cached timestamps
> >
> > Thanks for the quick fix, I can confirm that this makes the test pass on
> > AIX 7.2.
> 
> There's still a failure[1] in t2082-parallel-checkout-attributes.sh
> though, which is new in 2.32.0-rc*. The difference is in an unexpected
> BOM:
>     
>     avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/A.internal 
>     efbbbf74657874
>     avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/utf8-text  
>     74657874
> 
> I.e. the A.internal starts with 0xefbbbf. The 2nd test of t0028*.sh also
> fails similarly[2], so perhaps it's some old/iconv/whatever issue not
> per-se related to any change of yours.

The 0xefbbbf looks interesting, it's BOM for utf-8.

> I tried compiling with both NO_ICONV=Y and ICONV_OMITS_BOM=Y, both have
> the same failure.

I didn't check the code-path for NO_ICONV=Y but ICONV_OMITS_BOM=Y only
affects output of converting *to* utf-16 and utf-32.

So, I think AIX iconv implementation automatically add BOM to utf-8?

Perhap we need to call skip_utf8_bom somewhere?

-- Danh

> In any case, I think AIX test failures (or other obscure platforms I
> test) are rather uninteresting per-se, I mainly test on these to smoke
> out bad assumptions elsewhere. E.g. non-POSIX code, perhaps in this case
> we're relying on some unportable iconv assumption, or maybe not...
>     
> 
> 1.
> Initialized empty Git repository in /scratch/avar/git/t/trash directory.t2082-parallel-checkout-attributes/encoding/.git/
> + cd encoding
> + echo text
> + 1> utf8-text
> + write_utf16
> + 0< utf8-text 1> utf16-text
> checking prerequisite: NO_UTF16_BOM
> 
> mkdir -p "$TRASH_DIRECTORY/prereq-test-dir-NO_UTF16_BOM" &&
> (
>         cd "$TRASH_DIRECTORY/prereq-test-dir-NO_UTF16_BOM" &&
>         test $(printf abc | iconv -f UTF-8 -t UTF-16 | wc -c) = 6
> 
> )
> + mkdir -p /scratch/avar/git/t/trash directory.t2082-parallel-checkout-attributes/prereq-test-dir-NO_UTF16_BOM
> + cd /scratch/avar/git/t/trash directory.t2082-parallel-checkout-attributes/prereq-test-dir-NO_UTF16_BOM
> + printf abc
> + iconv -f UTF-8 -t UTF-16
> + wc -c
> + test 6 = 6
> prerequisite NO_UTF16_BOM ok
> + echo A working-tree-encoding=UTF-16
> + 1> .gitattributes
> + cp utf16-text A
> + cp utf8-text B
> + git add A B .gitattributes
> + git commit -m encoding
> [master (root-commit) 65fa7e8] encoding
>  Author: A U Thor <author@example.com>
>  3 files changed, 3 insertions(+)
>  create mode 100644 .gitattributes
>  create mode 100644 A
>  create mode 100644 B
> + git cat-file -p :A
> + 1> A.internal
> + test_cmp_bin utf8-text A.internal
> utf8-text A.internal differ: char 1, line 1
> error: last command exited with $?=1
> not ok 2 - parallel-checkout with re-encoding
> #
> #               set_checkout_config 2 0 &&
> #               git init encoding &&
> #               (
> #                       cd encoding &&
> #                       echo text >utf8-text &&
> #                       write_utf16 <utf8-text >utf16-text &&
> #
> #                       echo "A working-tree-encoding=UTF-16" >.gitattributes &&
> #                       cp utf16-text A &&
> #                       cp utf8-text B &&
> #                       git add A B .gitattributes &&
> #                       git commit -m encoding &&
> #
> #                       # Check that A is stored in UTF-8
> #                       git cat-file -p :A >A.internal &&
> #                       test_cmp_bin utf8-text A.internal &&
> #
> #                       rm A B &&
> #                       test_checkout_workers 2 git checkout A B &&
> #
> #                       # Check that A (and only A) is re-encoded during checkout
> #                       test_cmp_bin utf16-text A &&
> #                       test_cmp_bin utf8-text B
> #               )
> #
> 
> 2. 
> 
> avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t0028-working-tree-encoding/test.utf8.raw  
> 68616c6c6f20746865726521
> 63616e20796f752072656164206d653f
> avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t0028-working-tree-encoding/test.utf16.git    
> efbbbf68616c6c6f20746865726521
> 63616e20796f752072656164206d653f
Ævar Arnfjörð Bjarmason June 2, 2021, 10:50 a.m. UTC | #7
On Wed, Jun 02 2021, Đoàn Trần Công Danh wrote:

> On 2021-05-31 16:01:01+0200, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>> 
>> On Thu, May 27 2021, Ævar Arnfjörð Bjarmason wrote:
>> 
>> > On Wed, May 26 2021, Matheus Tavares wrote:
>> >
>> >> t2080 makes a few copies of a test repository and later performs a
>> >> branch switch on each one of the copies to verify that parallel checkout
>> >> and sequential checkout produce the same results. However, the
>> >> repository is copied with `cp -R` which, on some systems, defaults to
>> >> following symlinks on the directory hierarchy and copying their target
>> >> files instead of copying the symlinks themselves. AIX is one example of
>> >> system where this happens. Because the symlinks are not preserved, the
>> >> copied repositories have paths that do not match what is in the index,
>> >> causing git to abort the checkout operation that we want to test. This
>> >> makes the test fail on these systems.
>> >>
>> >> Fix this by copying the repository with the POSIX flag '-P', which
>> >> forces cp to copy the symlinks instead of following them. Note that we
>> >> already use this flag for other cp invocations in our test suite (see
>> >> t7001). With this change, t2080 now passes on AIX.
>> >>
>> >> Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>> >> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
>> >> ---
>> >>  t/t2080-parallel-checkout-basics.sh | 2 +-
>> >>  1 file changed, 1 insertion(+), 1 deletion(-)
>> >>
>> >> diff --git a/t/t2080-parallel-checkout-basics.sh b/t/t2080-parallel-checkout-basics.sh
>> >> index 7087818550..3e0f8c675f 100755
>> >> --- a/t/t2080-parallel-checkout-basics.sh
>> >> +++ b/t/t2080-parallel-checkout-basics.sh
>> >> @@ -114,7 +114,7 @@ do
>> >>  
>> >>  	test_expect_success "$mode checkout" '
>> >>  		repo=various_$mode &&
>> >> -		cp -R various $repo &&
>> >> +		cp -R -P various $repo &&
>> >>  
>> >>  		# The just copied files have more recent timestamps than their
>> >>  		# associated index entries. So refresh the cached timestamps
>> >
>> > Thanks for the quick fix, I can confirm that this makes the test pass on
>> > AIX 7.2.
>> 
>> There's still a failure[1] in t2082-parallel-checkout-attributes.sh
>> though, which is new in 2.32.0-rc*. The difference is in an unexpected
>> BOM:
>>     
>>     avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/A.internal 
>>     efbbbf74657874
>>     avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/utf8-text  
>>     74657874
>> 
>> I.e. the A.internal starts with 0xefbbbf. The 2nd test of t0028*.sh also
>> fails similarly[2], so perhaps it's some old/iconv/whatever issue not
>> per-se related to any change of yours.
>
> The 0xefbbbf looks interesting, it's BOM for utf-8.
>
>> I tried compiling with both NO_ICONV=Y and ICONV_OMITS_BOM=Y, both have
>> the same failure.
>
> I didn't check the code-path for NO_ICONV=Y but ICONV_OMITS_BOM=Y only
> affects output of converting *to* utf-16 and utf-32.
>
> So, I think AIX iconv implementation automatically add BOM to utf-8?
>
> Perhap we need to call skip_utf8_bom somewhere?

I debugged this a bit more, it's probably *also* an issue in our use of
libiconv, but it goes wrong just with our test setup with
iconv(1). I.e. on my boring linux box:
    
    echo x | iconv -f UTF-8 -t UTF-16 | perl -0777 -MData::Dumper -ne 'my @a = map { sprintf "0x%x", $_ } unpack "C*"; print Dumper \@a'
    $VAR1 = [
              '0xff',
              '0xfe',
              '0x78',
              '0x0',
              '0xa',
              '0x0'
            ];


On the AIX box to get the same I need to do that as:

    (printf '\376\377'; echo x | iconv -f UTF-8 -t UTF-16LE) | [...]

I.e. we omit the BOM *and* AIX's idea of our UTF-16 is little-endian
UTF-16, a plain UTF-16 gives you the big-endian version. To make things
worse the same is true of UTF-32, except "iconv -l" lists no UTF-32LE
version. So it seems we can't get the same result at all for that one.

So from the outset the code added around 79444c92943 (utf8: handle
systems that don't write BOM for UTF-16, 2019-02-12) needs to be more
careful (although this looked broken before), i.e. we should test exact
known-good bytes and see if UTF-16 is really what we think it is,
etc. This is likely broken on any big-endian non-GNUish iconv
implementation.
Bagas Sanjaya June 2, 2021, 11:14 a.m. UTC | #8
On 02/06/21 17.50, Ævar Arnfjörð Bjarmason wrote:
> So from the outset the code added around 79444c92943 (utf8: handle
> systems that don't write BOM for UTF-16, 2019-02-12) needs to be more
> careful (although this looked broken before), i.e. we should test exact
> known-good bytes and see if UTF-16 is really what we think it is,
> etc. This is likely broken on any big-endian non-GNUish iconv
> implementation.
> 
So we're still fixing for utf-16le case, right?

And what about known-good bytes on utf-16be?
Đoàn Trần Công Danh June 2, 2021, 11:22 a.m. UTC | #9
On 2021-06-02 12:50:53+0200, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> 
> On Wed, Jun 02 2021, Đoàn Trần Công Danh wrote:
> 
> > On 2021-05-31 16:01:01+0200, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> >> 
> >> On Thu, May 27 2021, Ævar Arnfjörð Bjarmason wrote:
> >> 
> >> > On Wed, May 26 2021, Matheus Tavares wrote:
> >> >
> >> >> t2080 makes a few copies of a test repository and later performs a
> >> >> branch switch on each one of the copies to verify that parallel checkout
> >> >> and sequential checkout produce the same results. However, the
> >> >> repository is copied with `cp -R` which, on some systems, defaults to
> >> >> following symlinks on the directory hierarchy and copying their target
> >> >> files instead of copying the symlinks themselves. AIX is one example of
> >> >> system where this happens. Because the symlinks are not preserved, the
> >> >> copied repositories have paths that do not match what is in the index,
> >> >> causing git to abort the checkout operation that we want to test. This
> >> >> makes the test fail on these systems.
> >> >>
> >> >> Fix this by copying the repository with the POSIX flag '-P', which
> >> >> forces cp to copy the symlinks instead of following them. Note that we
> >> >> already use this flag for other cp invocations in our test suite (see
> >> >> t7001). With this change, t2080 now passes on AIX.
> >> >>
> >> >> Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> >> >> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> >> >> ---
> >> >>  t/t2080-parallel-checkout-basics.sh | 2 +-
> >> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >> >>
> >> >> diff --git a/t/t2080-parallel-checkout-basics.sh b/t/t2080-parallel-checkout-basics.sh
> >> >> index 7087818550..3e0f8c675f 100755
> >> >> --- a/t/t2080-parallel-checkout-basics.sh
> >> >> +++ b/t/t2080-parallel-checkout-basics.sh
> >> >> @@ -114,7 +114,7 @@ do
> >> >>  
> >> >>  	test_expect_success "$mode checkout" '
> >> >>  		repo=various_$mode &&
> >> >> -		cp -R various $repo &&
> >> >> +		cp -R -P various $repo &&
> >> >>  
> >> >>  		# The just copied files have more recent timestamps than their
> >> >>  		# associated index entries. So refresh the cached timestamps
> >> >
> >> > Thanks for the quick fix, I can confirm that this makes the test pass on
> >> > AIX 7.2.
> >> 
> >> There's still a failure[1] in t2082-parallel-checkout-attributes.sh
> >> though, which is new in 2.32.0-rc*. The difference is in an unexpected
> >> BOM:
> >>     
> >>     avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/A.internal 
> >>     efbbbf74657874
> >>     avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/utf8-text  
> >>     74657874
> >> 
> >> I.e. the A.internal starts with 0xefbbbf. The 2nd test of t0028*.sh also
> >> fails similarly[2], so perhaps it's some old/iconv/whatever issue not
> >> per-se related to any change of yours.
> >
> > The 0xefbbbf looks interesting, it's BOM for utf-8.
> >
> >> I tried compiling with both NO_ICONV=Y and ICONV_OMITS_BOM=Y, both have
> >> the same failure.
> >
> > I didn't check the code-path for NO_ICONV=Y but ICONV_OMITS_BOM=Y only
> > affects output of converting *to* utf-16 and utf-32.
> >
> > So, I think AIX iconv implementation automatically add BOM to utf-8?
> >
> > Perhap we need to call skip_utf8_bom somewhere?
> 
> I debugged this a bit more, it's probably *also* an issue in our use of
> libiconv, but it goes wrong just with our test setup with
> iconv(1). I.e. on my boring linux box:
>     
>     echo x | iconv -f UTF-8 -t UTF-16 | perl -0777 -MData::Dumper -ne 'my @a = map { sprintf "0x%x", $_ } unpack "C*"; print Dumper \@a'
>     $VAR1 = [
>               '0xff',
>               '0xfe',
>               '0x78',
>               '0x0',
>               '0xa',
>               '0x0'
>             ];
> 
> 
> On the AIX box to get the same I need to do that as:
> 
>     (printf '\376\377'; echo x | iconv -f UTF-8 -t UTF-16LE) | [...]

FWIW, my Linux with musl-libc also need to be done like this.

> I.e. we omit the BOM *and* AIX's idea of our UTF-16 is little-endian
> UTF-16, a plain UTF-16 gives you the big-endian version.

Per spec, plain UTF-16 *is* big-endian. [1]

	In the table <BOM> indicates that the byte order is determined
	by a byte order mark, if present at the beginning of the data
	stream, otherwise it is big-endian.

> To make things
> worse the same is true of UTF-32, except "iconv -l" lists no UTF-32LE
> version. So it seems we can't get the same result at all for that one.

Ditto for UTF-32

> So from the outset the code added around 79444c92943 (utf8: handle
> systems that don't write BOM for UTF-16, 2019-02-12) needs to be more
> careful (although this looked broken before), i.e. we should test exact
> known-good bytes and see if UTF-16 is really what we think it is,
> etc. This is likely broken on any big-endian non-GNUish iconv
> implementation.

Linux with musl-libc on little endian also thinks UTF-16 without BOM is UTF-16-BE

I still think we should strip UTF-8 BOM after reencode_string_len
I.e. something like this, I can't test this, though, since I don't have any AIX box.
And my Linux with musl-libc doesn't output BOM for utf-8
It doesn't write BOM for utf-16be and utf-32be, anyway.

-----8<----
diff --git a/utf8.c b/utf8.c
index de4ce5c0e6..73631632bd 100644
--- a/utf8.c
+++ b/utf8.c
@@ -8,6 +8,7 @@ static const char utf16_be_bom[] = {'\xFE', '\xFF'};
 static const char utf16_le_bom[] = {'\xFF', '\xFE'};
 static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
 static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};
+const char utf8_bom[] = "\357\273\277";
 
 struct interval {
 	ucs_char_t first;
@@ -28,6 +29,12 @@ size_t display_mode_esc_sequence_len(const char *s)
 	return p - s;
 }
 
+static int has_utf8_bom(const char *text, size_t len)
+{
+	return len >= strlen(utf8_bom) &&
+		memcmp(text, utf8_bom, strlen(utf8_bom)) == 0;
+}
+
 /* auxiliary function for binary search in interval table */
 static int bisearch(ucs_char_t ucs, const struct interval *table, int max)
 {
@@ -539,12 +546,13 @@ static const char *fallback_encoding(const char *name)
 
 char *reencode_string_len(const char *in, size_t insz,
 			  const char *out_encoding, const char *in_encoding,
-			  size_t *outsz)
+			  size_t *outsz_p)
 {
 	iconv_t conv;
 	char *out;
 	const char *bom_str = NULL;
 	size_t bom_len = 0;
+	size_t outsz = 0;
 
 	if (!in_encoding)
 		return NULL;
@@ -590,10 +598,16 @@ char *reencode_string_len(const char *in, size_t insz,
 		if (conv == (iconv_t) -1)
 			return NULL;
 	}
-	out = reencode_string_iconv(in, insz, conv, bom_len, outsz);
+	out = reencode_string_iconv(in, insz, conv, bom_len, &outsz);
 	iconv_close(conv);
 	if (out && bom_str && bom_len)
 		memcpy(out, bom_str, bom_len);
+	if (is_encoding_utf8(out_encoding) && has_utf8_bom(out, outsz)) {
+		outsz -= strlen(utf8_bom);
+		memmove(out, out + strlen(utf8_bom), outsz + 1);
+	}
+	if (outsz_p)
+		*outsz_p = outsz;
 	return out;
 }
 #endif
@@ -782,12 +796,9 @@ int is_hfs_dotmailmap(const char *path)
 	return is_hfs_dot_str(path, "mailmap");
 }
 
-const char utf8_bom[] = "\357\273\277";
-
 int skip_utf8_bom(char **text, size_t len)
 {
-	if (len < strlen(utf8_bom) ||
-	    memcmp(*text, utf8_bom, strlen(utf8_bom)))
+	if (!has_utf8_bom(*text, len))
 		return 0;
 	*text += strlen(utf8_bom);
 	return 1;
---->8------

1: https://unicode.org/faq/utf_bom.html
Ævar Arnfjörð Bjarmason June 2, 2021, 1:36 p.m. UTC | #10
On Wed, Jun 02 2021, Đoàn Trần Công Danh wrote:

> On 2021-06-02 12:50:53+0200, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>> 
>> On Wed, Jun 02 2021, Đoàn Trần Công Danh wrote:
>> 
>> > On 2021-05-31 16:01:01+0200, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>> >> 
>> >> On Thu, May 27 2021, Ævar Arnfjörð Bjarmason wrote:
>> >> 
>> >> > On Wed, May 26 2021, Matheus Tavares wrote:
>> >> >
>> >> >> t2080 makes a few copies of a test repository and later performs a
>> >> >> branch switch on each one of the copies to verify that parallel checkout
>> >> >> and sequential checkout produce the same results. However, the
>> >> >> repository is copied with `cp -R` which, on some systems, defaults to
>> >> >> following symlinks on the directory hierarchy and copying their target
>> >> >> files instead of copying the symlinks themselves. AIX is one example of
>> >> >> system where this happens. Because the symlinks are not preserved, the
>> >> >> copied repositories have paths that do not match what is in the index,
>> >> >> causing git to abort the checkout operation that we want to test. This
>> >> >> makes the test fail on these systems.
>> >> >>
>> >> >> Fix this by copying the repository with the POSIX flag '-P', which
>> >> >> forces cp to copy the symlinks instead of following them. Note that we
>> >> >> already use this flag for other cp invocations in our test suite (see
>> >> >> t7001). With this change, t2080 now passes on AIX.
>> >> >>
>> >> >> Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>> >> >> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
>> >> >> ---
>> >> >>  t/t2080-parallel-checkout-basics.sh | 2 +-
>> >> >>  1 file changed, 1 insertion(+), 1 deletion(-)
>> >> >>
>> >> >> diff --git a/t/t2080-parallel-checkout-basics.sh b/t/t2080-parallel-checkout-basics.sh
>> >> >> index 7087818550..3e0f8c675f 100755
>> >> >> --- a/t/t2080-parallel-checkout-basics.sh
>> >> >> +++ b/t/t2080-parallel-checkout-basics.sh
>> >> >> @@ -114,7 +114,7 @@ do
>> >> >>  
>> >> >>  	test_expect_success "$mode checkout" '
>> >> >>  		repo=various_$mode &&
>> >> >> -		cp -R various $repo &&
>> >> >> +		cp -R -P various $repo &&
>> >> >>  
>> >> >>  		# The just copied files have more recent timestamps than their
>> >> >>  		# associated index entries. So refresh the cached timestamps
>> >> >
>> >> > Thanks for the quick fix, I can confirm that this makes the test pass on
>> >> > AIX 7.2.
>> >> 
>> >> There's still a failure[1] in t2082-parallel-checkout-attributes.sh
>> >> though, which is new in 2.32.0-rc*. The difference is in an unexpected
>> >> BOM:
>> >>     
>> >>     avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/A.internal 
>> >>     efbbbf74657874
>> >>     avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/utf8-text  
>> >>     74657874
>> >> 
>> >> I.e. the A.internal starts with 0xefbbbf. The 2nd test of t0028*.sh also
>> >> fails similarly[2], so perhaps it's some old/iconv/whatever issue not
>> >> per-se related to any change of yours.
>> >
>> > The 0xefbbbf looks interesting, it's BOM for utf-8.
>> >
>> >> I tried compiling with both NO_ICONV=Y and ICONV_OMITS_BOM=Y, both have
>> >> the same failure.
>> >
>> > I didn't check the code-path for NO_ICONV=Y but ICONV_OMITS_BOM=Y only
>> > affects output of converting *to* utf-16 and utf-32.
>> >
>> > So, I think AIX iconv implementation automatically add BOM to utf-8?
>> >
>> > Perhap we need to call skip_utf8_bom somewhere?
>> 
>> I debugged this a bit more, it's probably *also* an issue in our use of
>> libiconv, but it goes wrong just with our test setup with
>> iconv(1). I.e. on my boring linux box:
>>     
>>     echo x | iconv -f UTF-8 -t UTF-16 | perl -0777 -MData::Dumper -ne 'my @a = map { sprintf "0x%x", $_ } unpack "C*"; print Dumper \@a'
>>     $VAR1 = [
>>               '0xff',
>>               '0xfe',
>>               '0x78',
>>               '0x0',
>>               '0xa',
>>               '0x0'
>>             ];
>> 
>> 
>> On the AIX box to get the same I need to do that as:
>> 
>>     (printf '\376\377'; echo x | iconv -f UTF-8 -t UTF-16LE) | [...]
>
> FWIW, my Linux with musl-libc also need to be done like this.
>
>> I.e. we omit the BOM *and* AIX's idea of our UTF-16 is little-endian
>> UTF-16, a plain UTF-16 gives you the big-endian version.
>
> Per spec, plain UTF-16 *is* big-endian. [1]
>
> 	In the table <BOM> indicates that the byte order is determined
> 	by a byte order mark, if present at the beginning of the data
> 	stream, otherwise it is big-endian.
>
>> To make things
>> worse the same is true of UTF-32, except "iconv -l" lists no UTF-32LE
>> version. So it seems we can't get the same result at all for that one.
>
> Ditto for UTF-32
>
>> So from the outset the code added around 79444c92943 (utf8: handle
>> systems that don't write BOM for UTF-16, 2019-02-12) needs to be more
>> careful (although this looked broken before), i.e. we should test exact
>> known-good bytes and see if UTF-16 is really what we think it is,
>> etc. This is likely broken on any big-endian non-GNUish iconv
>> implementation.
>
> Linux with musl-libc on little endian also thinks UTF-16 without BOM is UTF-16-BE
>
> I still think we should strip UTF-8 BOM after reencode_string_len
> I.e. something like this, I can't test this, though, since I don't have any AIX box.
> And my Linux with musl-libc doesn't output BOM for utf-8
> It doesn't write BOM for utf-16be and utf-32be, anyway.
>
> -----8<----
> diff --git a/utf8.c b/utf8.c
> index de4ce5c0e6..73631632bd 100644
> --- a/utf8.c
> +++ b/utf8.c
> @@ -8,6 +8,7 @@ static const char utf16_be_bom[] = {'\xFE', '\xFF'};
>  static const char utf16_le_bom[] = {'\xFF', '\xFE'};
>  static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
>  static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};
> +const char utf8_bom[] = "\357\273\277";
>  
>  struct interval {
>  	ucs_char_t first;
> @@ -28,6 +29,12 @@ size_t display_mode_esc_sequence_len(const char *s)
>  	return p - s;
>  }
>  
> +static int has_utf8_bom(const char *text, size_t len)
> +{
> +	return len >= strlen(utf8_bom) &&
> +		memcmp(text, utf8_bom, strlen(utf8_bom)) == 0;
> +}
> +
>  /* auxiliary function for binary search in interval table */
>  static int bisearch(ucs_char_t ucs, const struct interval *table, int max)
>  {
> @@ -539,12 +546,13 @@ static const char *fallback_encoding(const char *name)
>  
>  char *reencode_string_len(const char *in, size_t insz,
>  			  const char *out_encoding, const char *in_encoding,
> -			  size_t *outsz)
> +			  size_t *outsz_p)
>  {
>  	iconv_t conv;
>  	char *out;
>  	const char *bom_str = NULL;
>  	size_t bom_len = 0;
> +	size_t outsz = 0;
>  
>  	if (!in_encoding)
>  		return NULL;
> @@ -590,10 +598,16 @@ char *reencode_string_len(const char *in, size_t insz,
>  		if (conv == (iconv_t) -1)
>  			return NULL;
>  	}
> -	out = reencode_string_iconv(in, insz, conv, bom_len, outsz);
> +	out = reencode_string_iconv(in, insz, conv, bom_len, &outsz);
>  	iconv_close(conv);
>  	if (out && bom_str && bom_len)
>  		memcpy(out, bom_str, bom_len);
> +	if (is_encoding_utf8(out_encoding) && has_utf8_bom(out, outsz)) {
> +		outsz -= strlen(utf8_bom);
> +		memmove(out, out + strlen(utf8_bom), outsz + 1);
> +	}
> +	if (outsz_p)
> +		*outsz_p = outsz;
>  	return out;
>  }
>  #endif
> @@ -782,12 +796,9 @@ int is_hfs_dotmailmap(const char *path)
>  	return is_hfs_dot_str(path, "mailmap");
>  }
>  
> -const char utf8_bom[] = "\357\273\277";
> -
>  int skip_utf8_bom(char **text, size_t len)
>  {
> -	if (len < strlen(utf8_bom) ||
> -	    memcmp(*text, utf8_bom, strlen(utf8_bom)))
> +	if (!has_utf8_bom(*text, len))
>  		return 0;
>  	*text += strlen(utf8_bom);
>  	return 1;
> ---->8------
>
> 1: https://unicode.org/faq/utf_bom.html

That's getting us there, now we don't fail on the 2nd test, but do start
failing on the third "re-encode to UTF-16 on checkout" and other
"checkout" tests.

The "test_cmp" at the end of that 3rd tests shows that the difference in
test.utf16.raw and test.utf16 is now that the "raw" one has the BOM, but
not the "test.utf16" file.
Đoàn Trần Công Danh June 2, 2021, 1:50 p.m. UTC | #11
On 2021-06-02 15:36:57+0200, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> That's getting us there, now we don't fail on the 2nd test, but do start
> failing on the third "re-encode to UTF-16 on checkout" and other
> "checkout" tests.
> 
> The "test_cmp" at the end of that 3rd tests shows that the difference in
> test.utf16.raw and test.utf16 is now that the "raw" one has the BOM, but
> not the "test.utf16" file.

That meant we need: ICONV_OMITS_BOM=UnfortunatelyYes for AIX?
I can replicate that test failure when building for musl libc without
ICONV_OMITS_BOM undefined.
Torsten Bögershausen June 2, 2021, 7:13 p.m. UTC | #12
On Wed, Jun 02, 2021 at 03:36:57PM +0200, Ævar Arnfjörð Bjarmason wrote:

> >> >> There's still a failure[1] in t2082-parallel-checkout-attributes.sh
> >> >> though, which is new in 2.32.0-rc*. The difference is in an unexpected
> >> >> BOM:
> >> >>
> >> >>     avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/A.internal
> >> >>     efbbbf74657874
> >> >>     avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/utf8-text
> >> >>     74657874
> >> >>
> >> >> I.e. the A.internal starts with 0xefbbbf. The 2nd test of t0028*.sh also
> >> >> fails similarly[2], so perhaps it's some old/iconv/whatever issue not
> >> >> per-se related to any change of yours.
> >> >
> >> > The 0xefbbbf looks interesting, it's BOM for utf-8.
> >> >
> >> >> I tried compiling with both NO_ICONV=Y and ICONV_OMITS_BOM=Y, both have
> >> >> the same failure.
> >> >
> >> > I didn't check the code-path for NO_ICONV=Y but ICONV_OMITS_BOM=Y only
> >> > affects output of converting *to* utf-16 and utf-32.
> >> >
> >> > So, I think AIX iconv implementation automatically add BOM to utf-8?
> >> >
> >> > Perhap we need to call skip_utf8_bom somewhere?
> >>
> >> I debugged this a bit more, it's probably *also* an issue in our use of
> >> libiconv, but it goes wrong just with our test setup with
> >> iconv(1). I.e. on my boring linux box:
> >>
> >>     echo x | iconv -f UTF-8 -t UTF-16 | perl -0777 -MData::Dumper -ne 'my @a = map { sprintf "0x%x", $_ } unpack "C*"; print Dumper \@a'
> >>     $VAR1 = [
> >>               '0xff',
> >>               '0xfe',
> >>               '0x78',
> >>               '0x0',
> >>               '0xa',
> >>               '0x0'
> >>             ];
> >>
> >>
> >> On the AIX box to get the same I need to do that as:
> >>
> >>     (printf '\376\377'; echo x | iconv -f UTF-8 -t UTF-16LE) | [...]
> >
> > FWIW, my Linux with musl-libc also need to be done like this.
> >
> >> I.e. we omit the BOM *and* AIX's idea of our UTF-16 is little-endian
> >> UTF-16, a plain UTF-16 gives you the big-endian version.
> >
> > Per spec, plain UTF-16 *is* big-endian. [1]
> >
> > 	In the table <BOM> indicates that the byte order is determined
> > 	by a byte order mark, if present at the beginning of the data
> > 	stream, otherwise it is big-endian.
> >
> >> To make things
> >> worse the same is true of UTF-32, except "iconv -l" lists no UTF-32LE
> >> version. So it seems we can't get the same result at all for that one.
> >
> > Ditto for UTF-32
> >
> >> So from the outset the code added around 79444c92943 (utf8: handle
> >> systems that don't write BOM for UTF-16, 2019-02-12) needs to be more
> >> careful (although this looked broken before), i.e. we should test exact
> >> known-good bytes and see if UTF-16 is really what we think it is,
> >> etc. This is likely broken on any big-endian non-GNUish iconv
> >> implementation.
> >
> > Linux with musl-libc on little endian also thinks UTF-16 without BOM is UTF-16-BE
> >
> > I still think we should strip UTF-8 BOM after reencode_string_len
> > I.e. something like this, I can't test this, though, since I don't have any AIX box.
> > And my Linux with musl-libc doesn't output BOM for utf-8
> > It doesn't write BOM for utf-16be and utf-32be, anyway.
> >
> > -----8<----
> > diff --git a/utf8.c b/utf8.c
> > index de4ce5c0e6..73631632bd 100644
> > --- a/utf8.c
> > +++ b/utf8.c
> > @@ -8,6 +8,7 @@ static const char utf16_be_bom[] = {'\xFE', '\xFF'};
> >  static const char utf16_le_bom[] = {'\xFF', '\xFE'};
> >  static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
> >  static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};
> > +const char utf8_bom[] = "\357\273\277";
> >
> >  struct interval {
> >  	ucs_char_t first;
> > @@ -28,6 +29,12 @@ size_t display_mode_esc_sequence_len(const char *s)
> >  	return p - s;
> >  }
> >
> > +static int has_utf8_bom(const char *text, size_t len)
> > +{
> > +	return len >= strlen(utf8_bom) &&
> > +		memcmp(text, utf8_bom, strlen(utf8_bom)) == 0;
> > +}
> > +
> >  /* auxiliary function for binary search in interval table */
> >  static int bisearch(ucs_char_t ucs, const struct interval *table, int max)
> >  {
> > @@ -539,12 +546,13 @@ static const char *fallback_encoding(const char *name)
> >
> >  char *reencode_string_len(const char *in, size_t insz,
> >  			  const char *out_encoding, const char *in_encoding,
> > -			  size_t *outsz)
> > +			  size_t *outsz_p)
> >  {
> >  	iconv_t conv;
> >  	char *out;
> >  	const char *bom_str = NULL;
> >  	size_t bom_len = 0;
> > +	size_t outsz = 0;
> >
> >  	if (!in_encoding)
> >  		return NULL;
> > @@ -590,10 +598,16 @@ char *reencode_string_len(const char *in, size_t insz,
> >  		if (conv == (iconv_t) -1)
> >  			return NULL;
> >  	}
> > -	out = reencode_string_iconv(in, insz, conv, bom_len, outsz);
> > +	out = reencode_string_iconv(in, insz, conv, bom_len, &outsz);
> >  	iconv_close(conv);
> >  	if (out && bom_str && bom_len)
> >  		memcpy(out, bom_str, bom_len);
> > +	if (is_encoding_utf8(out_encoding) && has_utf8_bom(out, outsz)) {
> > +		outsz -= strlen(utf8_bom);
> > +		memmove(out, out + strlen(utf8_bom), outsz + 1);
> > +	}
> > +	if (outsz_p)
> > +		*outsz_p = outsz;
> >  	return out;
> >  }
> >  #endif
> > @@ -782,12 +796,9 @@ int is_hfs_dotmailmap(const char *path)
> >  	return is_hfs_dot_str(path, "mailmap");
> >  }
> >
> > -const char utf8_bom[] = "\357\273\277";
> > -
> >  int skip_utf8_bom(char **text, size_t len)
> >  {
> > -	if (len < strlen(utf8_bom) ||
> > -	    memcmp(*text, utf8_bom, strlen(utf8_bom)))
> > +	if (!has_utf8_bom(*text, len))
> >  		return 0;
> >  	*text += strlen(utf8_bom);
> >  	return 1;
> > ---->8------
> >
> > 1: https://unicode.org/faq/utf_bom.html
>
> That's getting us there, now we don't fail on the 2nd test, but do start
> failing on the third "re-encode to UTF-16 on checkout" and other
> "checkout" tests.
>
> The "test_cmp" at the end of that 3rd tests shows that the difference in
> test.utf16.raw and test.utf16 is now that the "raw" one has the BOM, but
> not the "test.utf16" file.

What I can read from all of this, is that "the iconv" does not handle BOMS
correcttly. When going from UTF-16 or UTF-32 to UTF-8 the BOM should be removed.
But that is not the case here, as it seams.
The patch from above for utf8.c in Git will fix this - OK so far.

For t2082-parallel-checkout-attributes.sh and t0028 we may be able to prepare
the "right" versions of the expected data on e.g. Linux box and add that material
to the test case.
Aad remove the invocation of a potential broken iconv binary completely from
the test scripts.
brian m. carlson June 3, 2021, 12:07 a.m. UTC | #13
On 2021-06-02 at 10:50:53, Ævar Arnfjörð Bjarmason wrote:
> I debugged this a bit more, it's probably *also* an issue in our use of
> libiconv, but it goes wrong just with our test setup with
> iconv(1). I.e. on my boring linux box:
>     
>     echo x | iconv -f UTF-8 -t UTF-16 | perl -0777 -MData::Dumper -ne 'my @a = map { sprintf "0x%x", $_ } unpack "C*"; print Dumper \@a'
>     $VAR1 = [
>               '0xff',
>               '0xfe',
>               '0x78',
>               '0x0',
>               '0xa',
>               '0x0'
>             ];
> 

This is a little-endian encoding of UTF-16 with a BOM.  The BOM is
required here since the default, if no BOM is provided, is big endian.
However, as I alluded to in 79444c92943, while the standard permits the
BOM to be omitted, doing so is generally improvident because that leads
to breakage when interoperating with Windows machines, many programs for
which assume little endian.

I mean, I don't use Windows and I think those programs are broken and
their authors rightfully should have known better, but practically,
using a BOM solves the problem easily, and if we can be slightly nicer
to the poor, hapless users of those programs, why not?

> On the AIX box to get the same I need to do that as:
> 
>     (printf '\376\377'; echo x | iconv -f UTF-8 -t UTF-16LE) | [...]
> 
> I.e. we omit the BOM *and* AIX's idea of our UTF-16 is little-endian
> UTF-16, a plain UTF-16 gives you the big-endian version. To make things
> worse the same is true of UTF-32, except "iconv -l" lists no UTF-32LE
> version. So it seems we can't get the same result at all for that one.

But what do you get if you just use UTF-16?  Is it little endian with
BOM, big endian with BOM, or big endian without BOM?  If it's big endian
without BOM, did you set ICONV_OMITS_BOM when building?

> So from the outset the code added around 79444c92943 (utf8: handle
> systems that don't write BOM for UTF-16, 2019-02-12) needs to be more
> careful (although this looked broken before), i.e. we should test exact
> known-good bytes and see if UTF-16 is really what we think it is,
> etc. This is likely broken on any big-endian non-GNUish iconv
> implementation.

We probably could have been more careful here.  Part of the problem is
that I don't have access to any affected systems here, so it's not in
general easy for me to write a test (or even a patch) for this case.

We also did use iconv(1) before that, but I _think_ it's possible to
remove it.  The thing that's tricky is the use of SHIFT-JIS, which has
known round-tripping problems, but I don't think we rely on using the
system iconv(3) there and encoding any valid SHIFT-JIS sequence is
probably fine.
Đoàn Trần Công Danh June 3, 2021, 12:34 p.m. UTC | #14
On 2021-06-02 20:50:39+0700, Đoàn Trần Công Danh <congdanhqx@gmail.com> wrote:
> On 2021-06-02 15:36:57+0200, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> > That's getting us there, now we don't fail on the 2nd test, but do start
> > failing on the third "re-encode to UTF-16 on checkout" and other
> > "checkout" tests.
> > 
> > The "test_cmp" at the end of that 3rd tests shows that the difference in
> > test.utf16.raw and test.utf16 is now that the "raw" one has the BOM, but
> > not the "test.utf16" file.
> 
> That meant we need: ICONV_OMITS_BOM=UnfortunatelyYes for AIX?
> I can replicate that test failure when building for musl libc without
> ICONV_OMITS_BOM undefined.

Applying my patch and build with ICONV_OMITS_BOM=Yes, t0028.3 passed
but t0028.4 and t0028.21 run into failure. Here is the dump of first
10 characters of test.utf16lebom:

          '0xff',
          '0xfe',
          '0xfe',
          '0xff',
          '0x0',
          '0x68',
          '0x0',
          '0x61',
          '0x0',
          '0x6c',

Digging a bit more, it seems like iconv(3) from utf-16-le-bom to utf-8
there is broken, iconv(3) thinks it's converting from utf-16-be to
utf-8:

source (test.utf16lebom, considered UTF-16LE-BOM):
|  0: ff   |  1: fe   |  2: 68 h |  3:  0   |  4: 61 a |  5:  0   |  6: 6c l |  7:  0
|  8: 6c l |  9:  0   | 10: 6f o | 11:  0   | 12: 20   | 13:  0   | 14: 74 t | 15:  0
| 16: 68 h | 17:  0   | 18: 65 e | 19:  0   | 20: 72 r | 21:  0   | 22: 65 e | 23:  0
| 24: 21 ! | 25:  0   | 26:  a   | 27:  0   | 28: 63 c | 29:  0   | 30: 61 a | 31:  0
| 32: 6e n | 33:  0   | 34: 20   | 35:  0   | 36: 79 y | 37:  0   | 38: 6f o | 39:  0
| 40: 75 u | 41:  0   | 42: 20   | 43:  0   | 44: 72 r | 45:  0   | 46: 65 e | 47:  0
| 48: 61 a | 49:  0   | 50: 64 d | 51:  0   | 52: 20   | 53:  0   | 54: 6d m | 55:  0
| 56: 65 e | 57:  0   | 58: 3f ? | 59:  0

destination (test.utf16lebom, considered UTF-8):
|  0: ef   |  1: bf   |  2: be   |  3: e6   |  4: a0   |  5: 80   |  6: e6   |  7: 84
|  8: 80   |  9: e6   | 10: b0   | 11: 80   | 12: e6   | 13: b0   | 14: 80   | 15: e6
| 16: bc   | 17: 80   | 18: e2   | 19: 80   | 20: 80   | 21: e7   | 22: 90   | 23: 80
| 24: e6   | 25: a0   | 26: 80   | 27: e6   | 28: 94   | 29: 80   | 30: e7   | 31: 88
| 32: 80   | 33: e6   | 34: 94   | 35: 80   | 36: e2   | 37: 84   | 38: 80   | 39: e0
| 40: a8   | 41: 80   | 42: e6   | 43: 8c   | 44: 80   | 45: e6   | 46: 84   | 47: 80
| 48: e6   | 49: b8   | 50: 80   | 51: e2   | 52: 80   | 53: 80   | 54: e7   | 55: a4
| 56: 80   | 57: e6   | 58: bc   | 59: 80   | 60: e7   | 61: 94   | 62: 80   | 63: e2
| 64: 80   | 65: 80   | 66: e7   | 67: 88   | 68: 80   | 69: e6   | 70: 94   | 71: 80
| 72: e6   | 73: 84   | 74: 80   | 75: e6   | 76: 90   | 77: 80   | 78: e2   | 79: 80
| 80: 80   | 81: e6   | 82: b4   | 83: 80   | 84: e6   | 85: 94   | 86: 80   | 87: e3
| 88: bc   | 89: 80
diff mbox series

Patch

diff --git a/t/t2080-parallel-checkout-basics.sh b/t/t2080-parallel-checkout-basics.sh
index 7087818550..3e0f8c675f 100755
--- a/t/t2080-parallel-checkout-basics.sh
+++ b/t/t2080-parallel-checkout-basics.sh
@@ -114,7 +114,7 @@  do
 
 	test_expect_success "$mode checkout" '
 		repo=various_$mode &&
-		cp -R various $repo &&
+		cp -R -P various $repo &&
 
 		# The just copied files have more recent timestamps than their
 		# associated index entries. So refresh the cached timestamps