diff mbox series

[v2,25/27] t/lib-unicode-nfc-nfd: helper prereqs for testing unicode nfc/nfd

Message ID 5a0c1b7a2873accc6db4b34493962378819eacd4.1646777728.git.gitgitgadget@gmail.com
State Superseded
Series Builtin FSMonitor Part 3

Commit Message

Jeff Hostetler March 8, 2022, 10:15 p.m. UTC
From: Jeff Hostetler <jeffhost@microsoft.com>

Create a set of prereqs to help understand how file names
are handled by the filesystem when they contain NFC and NFD
Unicode characters.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 t/lib-unicode-nfc-nfd.sh | 159 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 159 insertions(+)
 create mode 100755 t/lib-unicode-nfc-nfd.sh
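
For readers unfamiliar with the two normalization forms named in the commit message, the following is a minimal sketch (not part of the patch) that prints the UTF-8 bytes of the NFC and NFD spellings of "e with acute accent". It assumes a POSIX sh with od available. The helper name hex_bytes is hypothetical, and octal escapes are used on the assumption that not every printf implementation accepts \x escapes.

```shell
#!/bin/sh
# Hypothetical helper, not part of the patch: print the bytes of "$1"
# as lowercase hex separated by single spaces.
hex_bytes () {
	# Unquoted on purpose: word splitting discards od's variable padding.
	set -- $(printf "%s" "$1" | od -An -t x1)
	echo "$*"
}

# NFC: U+00e9 LATIN SMALL LETTER E WITH ACUTE
nfc=$(printf "\303\251")
# NFD: U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT
nfd=$(printf "\145\314\201")

hex_bytes "$nfc"
hex_bytes "$nfd"
```

The two strings render identically on most terminals but differ byte for byte, which is why a filesystem may either keep them apart or treat them as the same name.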

Comments

Derrick Stolee March 9, 2022, 6:40 p.m. UTC | #1
On 3/8/2022 5:15 PM, Jeff Hostetler via GitGitGadget wrote:
> From: Jeff Hostetler <jeffhost@microsoft.com>
> 
> Create a set of prereqs to help understand how file names
> are handled by the filesystem when they contain NFC and NFD
> Unicode characters.

Prereqs look good and are well documented.

> +if test $unicode_debug = 1

Is this $unicode_debug something I should know from a previous
patch? or is it a leftover from local debugging?

Thanks,
-Stolee
Derrick Stolee March 9, 2022, 6:42 p.m. UTC | #2
On 3/9/2022 1:40 PM, Derrick Stolee wrote:
> On 3/8/2022 5:15 PM, Jeff Hostetler via GitGitGadget wrote:
>> From: Jeff Hostetler <jeffhost@microsoft.com>
>>
>> Create a set of prereqs to help understand how file names
>> are handled by the filesystem when they contain NFC and NFD
>> Unicode characters.
> 
> Prereqs look good and are well documented.
> 
>> +if test $unicode_debug = 1
> 
> Is this $unicode_debug something I should know from a previous
> patch? or is it a leftover from local debugging?

I see that you set unicode_debug = 0 in a later patch, but I
suppose that we might want this output no matter what. Or, do
we think it will interrupt the output parsing of 'prove' and
other tools?

Thanks,
-Stolee
Jeff Hostetler March 10, 2022, 2:23 p.m. UTC | #3
On 3/9/22 1:40 PM, Derrick Stolee wrote:
> On 3/8/2022 5:15 PM, Jeff Hostetler via GitGitGadget wrote:
>> From: Jeff Hostetler <jeffhost@microsoft.com>
>>
>> Create a set of prereqs to help understand how file names
>> are handled by the filesystem when they contain NFC and NFD
>> Unicode characters.
> 
> Prereqs look good and are well documented.
> 
>> +if test $unicode_debug = 1
> 
> Is this $unicode_debug something I should know from a previous
> patch? or is it a leftover from local debugging?


I added that and all of the print statements to help
describe the characteristics of the (OS, FS) pair,
for example, what happens on (MacOS, FAT32) and whether
that is any different from (MacOS, APFS).  I found this
very useful in trying to decipher the docs.

However, it is kinda noisy and appears directly on the
console.  Since most people don't need to see it (unless
they are working on Unicode/UTF8 issues), I decided to
turn it off for now.

I'm not sure if we have a way to handle such output or
not.  I thought about maybe hooking it into the -d or -x
options, but I'm not sure if that helps or not.  So I
just turned it off.

Also, by not always testing the prereqs just to print
the result here, we avoid actually doing the lazy evals
until a real test wants to use one of them.


I'll add a comment in the script documenting it.

Thanks
Jeff
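
A minimal sketch of such a gate, assuming plain POSIX sh. This is not the code from the patch: the helper name unicode_report is hypothetical, and "true" stands in for a real probe such as test_have_prereq. It defaults the flag to off when unset and runs the probe only when debugging is enabled, so nothing is evaluated early when the gate is closed.

```shell
#!/bin/sh
# Hypothetical sketch, not the code from the patch.
# Default to "off" when the caller never set unicode_debug.
unicode_debug=${unicode_debug:-0}

# unicode_report <label> <command>...
# Runs the probe command only when debugging is enabled, then prints
# one line describing the result.
unicode_report () {
	test "$unicode_debug" = 1 || return 0
	label=$1
	shift
	if "$@"
	then
		echo "$label maintains original spelling."
	else
		echo "$label is modified."
	fi
}

# "true" stands in for a real probe; prints nothing unless
# unicode_debug=1 was set before this point.
unicode_report NFC true
```

Quoting "$unicode_debug" keeps the test well-formed even if the variable is empty.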
Jeff Hostetler March 10, 2022, 2:28 p.m. UTC | #4
On 3/9/22 1:42 PM, Derrick Stolee wrote:
> On 3/9/2022 1:40 PM, Derrick Stolee wrote:
>> On 3/8/2022 5:15 PM, Jeff Hostetler via GitGitGadget wrote:
>>> From: Jeff Hostetler <jeffhost@microsoft.com>
>>>
>>> Create a set of prereqs to help understand how file names
>>> are handled by the filesystem when they contain NFC and NFD
>>> Unicode characters.
>>
>> Prereqs look good and are well documented.
>>
>>> +if test $unicode_debug = 1
>>
>> Is this $unicode_debug something I should know from a previous
>> patch? or is it a leftover from local debugging?
> 
> I see that you set unicode_debug = 0 in a later patch, but I
> suppose that we might want this output no matter what. Or, do
> we think it will interrupt the output parsing of 'prove' and
> other tools?

I was afraid that it might interrupt tools like prove, but
I just tried it and it didn't.  But yeah it would be safer
to turn it off until someone actually wants to do some debugging
in this area.

Jeff
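
One way to make such output explicitly harmless to TAP consumers is to emit it as diagnostics, since TAP treats lines that begin with "#" as comments. The following is a minimal sketch under that assumption; the helper name tap_diag is hypothetical and is not part of the patch or of the test library.

```shell
#!/bin/sh
# Hypothetical helper: turn arbitrary lines on stdin into TAP
# diagnostics by prefixing each one with "# ".
tap_diag () {
	sed -e 's/^/# /'
}

echo "NFC and NFD are aliases on this OS/filesystem." | tap_diag
```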

Patch

diff --git a/t/lib-unicode-nfc-nfd.sh b/t/lib-unicode-nfc-nfd.sh
new file mode 100755
index 00000000000..a09e910c302
--- /dev/null
+++ b/t/lib-unicode-nfc-nfd.sh
@@ -0,0 +1,159 @@ 
+# Help detect how Unicode NFC and NFD are handled on the filesystem.
+
+# A simple character that has a NFD form.
+#
+# NFC:       U+00e9 LATIN SMALL LETTER E WITH ACUTE
+# UTF8(NFC): \xc3 \xa9
+#
+# NFD:       U+0065 LATIN SMALL LETTER E
+#            U+0301 COMBINING ACUTE ACCENT
+# UTF8(NFD): \x65  +  \xcc \x81
+#
+utf8_nfc=$(printf "\xc3\xa9")
+utf8_nfd=$(printf "\x65\xcc\x81")
+
+# Is the OS or the filesystem "Unicode composition sensitive"?
+#
+# That is, does the OS or the filesystem allow files to exist with
+# both the NFC and NFD spellings?  Or, does the OS/FS lie to us and
+# tell us that the NFC and NFD forms are equivalent?
+#
+# This may be independent of what type of filesystem we have,
+# since it might be handled by the OS at a layer above the FS.
+# For example, testing shows that MacOS reports a collision on
+# APFS, HFS+, and FAT32.
+#
+# This does not tell us how the Unicode pathname will be spelled
+# on disk, but rather only that the two spellings "collide".  We
+# will examine the actual on-disk spelling in a later prereq.
+#
+test_lazy_prereq UNICODE_COMPOSITION_SENSITIVE '
+	mkdir trial_${utf8_nfc} &&
+	mkdir trial_${utf8_nfd}
+'
+
+# Is the spelling of an NFC pathname preserved on disk?
+#
+# On MacOS with HFS+ and FAT32, NFC paths are converted into NFD
+# and on APFS, NFC paths are preserved.  As we have established
+# above, this is independent of "composition sensitivity".
+#
+# 0000000 63 5f c3 a9
+#
+# (/usr/bin/od output contains different amounts of whitespace
+# on different platforms, so we need the wildcards here.)
+#
+test_lazy_prereq UNICODE_NFC_PRESERVED '
+	mkdir c_${utf8_nfc} &&
+	ls | od -t x1 | grep "63 *5f *c3 *a9"
+'
+
+# Is the spelling of an NFD pathname preserved on disk?
+#
+# 0000000 64 5f 65 cc 81
+#
+test_lazy_prereq UNICODE_NFD_PRESERVED '
+	mkdir d_${utf8_nfd} &&
+	ls | od -t x1 | grep "64 *5f *65 *cc *81"
+'
+	mkdir c_${utf8_nfc} &&
+	mkdir d_${utf8_nfd} &&
+
+# The following _DOUBLE_ forms are more for my curiosity,
+# but there may be quirks lurking when there are multiple
+# combining characters in non-canonical order.
+
+# Unicode also allows multiple combining characters
+# that can be decomposed in pieces.
+#
+# NFC:        U+1f67 GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI
+# UTF8(NFC):  \xe1 \xbd \xa7
+#
+# NFD1:       U+1f61 GREEK SMALL LETTER OMEGA WITH DASIA
+#             U+0342 COMBINING GREEK PERISPOMENI
+# UTF8(NFD1): \xe1 \xbd \xa1  +  \xcd \x82
+#
+# But U+1f61 decomposes into
+# NFD2:       U+03c9 GREEK SMALL LETTER OMEGA
+#             U+0314 COMBINING REVERSED COMMA ABOVE
+# UTF8(NFD2): \xcf \x89  +  \xcc \x94
+#
+# Yielding:   \xcf \x89  +  \xcc \x94  +  \xcd \x82
+#
+# Note that I've used the canonical ordering of the
+# combining characters.  It is also possible to
+# swap them.  My testing shows that that non-standard
+# ordering also causes a collision in mkdir.  However,
+# the resulting names don't draw correctly on the
+# terminal (implying that the on-disk format also has
+# them out of order).
+#
+greek_nfc=$(printf "\xe1\xbd\xa7")
+greek_nfd1=$(printf "\xe1\xbd\xa1\xcd\x82")
+greek_nfd2=$(printf "\xcf\x89\xcc\x94\xcd\x82")
+
+# See if a double decomposition also collides.
+#
+test_lazy_prereq UNICODE_DOUBLE_COMPOSITION_SENSITIVE '
+	mkdir trial_${greek_nfc} &&
+	mkdir trial_${greek_nfd2}
+'
+
+# See if the NFC spelling appears on the disk.
+#
+test_lazy_prereq UNICODE_DOUBLE_NFC_PRESERVED '
+	mkdir c_${greek_nfc} &&
+	ls | od -t x1 | grep "63 *5f *e1 *bd *a7"
+'
+
+# See if the NFD spelling appears on the disk.
+#
+test_lazy_prereq UNICODE_DOUBLE_NFD_PRESERVED '
+	mkdir d_${greek_nfd2} &&
+	ls | od -t x1 | grep "64 *5f *cf *89 *cc *94 *cd *82"
+'
+
+if test $unicode_debug = 1
+then
+	if test_have_prereq UNICODE_COMPOSITION_SENSITIVE
+	then
+		echo NFC and NFD are distinct on this OS/filesystem.
+	else
+		echo NFC and NFD are aliases on this OS/filesystem.
+	fi
+
+	if test_have_prereq UNICODE_NFC_PRESERVED
+	then
+		echo NFC maintains original spelling.
+	else
+		echo NFC is modified.
+	fi
+
+	if test_have_prereq UNICODE_NFD_PRESERVED
+	then
+		echo NFD maintains original spelling.
+	else
+		echo NFD is modified.
+	fi
+
+	if test_have_prereq UNICODE_DOUBLE_COMPOSITION_SENSITIVE
+	then
+		echo DOUBLE NFC and NFD are distinct on this OS/filesystem.
+	else
+		echo DOUBLE NFC and NFD are aliases on this OS/filesystem.
+	fi
+
+	if test_have_prereq UNICODE_DOUBLE_NFC_PRESERVED
+	then
+		echo Double NFC maintains original spelling.
+	else
+		echo Double NFC is modified.
+	fi
+
+	if test_have_prereq UNICODE_DOUBLE_NFD_PRESERVED
+	then
+		echo Double NFD maintains original spelling.
+	else
+		echo Double NFD is modified.
+	fi
+fi
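
The composition-sensitivity check can also be tried outside the test suite. The following is a minimal sketch, not the patch's code: the function name probe_composition_sensitive is hypothetical, and it assumes mktemp -d is available. Note that it probes the filesystem holding the temporary directory, which may differ from the one holding the working tree.

```shell
#!/bin/sh
# Hypothetical standalone probe, not part of the patch.
# Returns 0 if directories with the NFC and NFD spellings can coexist,
# 1 if the second mkdir collides, 2 if no scratch directory is available.
probe_composition_sensitive () {
	probe_dir=$(mktemp -d) || return 2
	if (
		cd "$probe_dir" &&
		mkdir "trial_$(printf '\303\251')" &&
		mkdir "trial_$(printf '\145\314\201')"
	)
	then
		probe_status=0
	else
		probe_status=1
	fi
	rm -rf "$probe_dir"
	return $probe_status
}

if probe_composition_sensitive
then
	echo "NFC and NFD are distinct on this OS/filesystem."
else
	echo "NFC and NFD are aliases on this OS/filesystem."
fi
```

The result depends on the OS and filesystem, so no particular output is expected.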