Message ID | 20181227021734.528629-1-sandals@crustytoothpaste.net
---|---
Series | Improve documentation on UTF-16
On 27.12.18 at 03:17, brian m. carlson wrote:
> We've recently fielded several reports from unhappy Windows users about
> our handling of UTF-16, UTF-16LE, and UTF-16BE, none of which seem to
> be suitable for certain Windows programs.
>
> In an effort to communicate the reasons for our behavior more
> effectively, explain in the documentation that the UTF-16 variant that
> people have been asking for hasn't been standardized, and therefore
> hasn't been implemented in iconv(3). Mention what each of the variants
> does, so that people can decide which one meets their needs best.
>
> In addition, add a comment in the code about why we must, for
> correctness reasons, reject a UTF-16LE or UTF-16BE sequence that begins
> with U+FEFF: such a code point semantically represents a ZWNBSP, not a
> BOM, but that code point at the beginning of a UTF-8 sequence (as
> encoded in the object store) would be misinterpreted as a BOM instead.
>
> This comment is in the code because I think it needs to be somewhere,
> but I'm not sure the documentation is the right place for it. If
> desired, I can add it to the documentation, although I feel the lurid
> details are not interesting to most users. If the wording is confusing,
> I'm very open to suggestions for how to improve it.
>
> I don't use Windows, so I don't know what MSVCRT does. If it requires a
> BOM but doesn't accept big-endian encoding, then perhaps we should
> report that as a bug to Microsoft so it can be fixed in a future
> version. That would probably make a lot more programs work right out of
> the box and dramatically improve the user experience.

It worries me that theoretical correctness is regarded more highly than
existing practice. I do not care much what some RFC says programs should
do if the majority of software does something different and that
behavior has proven useful in practice.

My understanding is that there is no such thing as a "byte order
marker". It just so happens that when some UTF-16 text file begins with
a ZWNBSP, it is possible to derive the endianness of the file
automatically. Other than that, that very first code point U+FEFF *is
part of the data* and must not be removed when the data is re-encoded.
If Git does something different, it is bogus, IMO.

-- Hannes
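A minimal sketch of the misinterpretation brian's cover letter
describes, using Python's codecs in place of Git and iconv (an
illustration, not code from this series; "utf-8-sig" stands in for any
BOM-stripping UTF-8 reader):

    # UTF-16BE bytes whose first code point is U+FEFF. Since UTF-16BE
    # permits no BOM, that U+FEFF is text: a ZWNBSP.
    data = b"\xfe\xff\x00\x0a"
    text = data.decode("utf-16-be")        # '\ufeff\n' -- ZWNBSP, NL

    # Stored as UTF-8, the ZWNBSP becomes a leading EF BB BF...
    blob = text.encode("utf-8")            # b'\xef\xbb\xbf\x0a'

    # ...which a BOM-stripping reader then misreads as a BOM:
    print(repr(blob.decode("utf-8-sig")))  # '\n' -- the ZWNBSP is lost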
On Thu, Dec 27, 2018 at 11:06:17AM +0100, Johannes Sixt wrote:
> It worries me that theoretical correctness is regarded more highly than
> existing practice. I do not care much what some RFC says programs
> should do if the majority of software does something different and that
> behavior has proven useful in practice.

The majority of OSes produce the behavior I document here, and they are
the majority of systems on the Internet. Windows is the outlier here,
although a significant one. It is a common user of UTF-16 and its
variants, but so are Java and JavaScript, and they're present on a lot
of devices. Swallowing the U+FEFF would break compatibility with those
systems.

The issue that Windows users are seeing is that libiconv always produces
big-endian data for UTF-16, and they always want little-endian. glibc
produces native-endian data, which is what Windows users want. Git for
Windows could patch libiconv to do that (and that is the simple,
five-minute solution to this problem), but we'd still want to warn
people that they're relying on unspecified behavior, hence this series.

I would even be willing to patch Git for Windows's libiconv if somebody
could point me to the repo (although I obviously cannot test it, not
being a Windows user). I feel strongly, though, that fixing this is
outside the scope of Git proper, and it's not something we should be
handling here.

> My understanding is that there is no such thing as a "byte order
> marker". It just so happens that when some UTF-16 text file begins with
> a ZWNBSP, it is possible to derive the endianness of the file
> automatically. Other than that, that very first code point U+FEFF *is
> part of the data* and must not be removed when the data is re-encoded.
> If Git does something different, it is bogus, IMO.

You've got part of this. For UTF-16LE and UTF-16BE, a U+FEFF is part of
the text, as a second one would be if we had two at the beginning of a
UTF-16 or UTF-8 sequence. If someone produces UTF-16LE and places a
U+FEFF at the beginning of it, when we encode to UTF-8, we emit only one
U+FEFF, which has the wrong semantics.

To be correct here and accept a U+FEFF, we'd need to check for a U+FEFF
at the beginning of a UTF-16LE or UTF-16BE sequence, ensure we encode an
extra U+FEFF at the beginning of the UTF-8 data (one for the BOM and one
for the text), and then strip it off when we decode. That's kind of
ugly, and since iconv doesn't do that itself, we'd have to.
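The "extra U+FEFF" scheme described in the last paragraph might look
like the following hypothetical helper pair (a sketch in Python, not
anything Git actually implements):

    # Encode UTF-16LE content for a UTF-8 object store, shielding a
    # text-initial ZWNBSP behind an explicit BOM; assumes the blob is
    # only ever decoded by the matching helper below.
    def utf16le_to_store(data: bytes) -> bytes:
        text = data.decode("utf-16-le")    # a leading U+FEFF here is text
        if text.startswith("\ufeff"):
            text = "\ufeff" + text         # prepend a BOM to protect it
        return text.encode("utf-8")

    def store_to_utf16le(blob: bytes) -> bytes:
        text = blob.decode("utf-8")
        if text.startswith("\ufeff"):
            text = text[1:]                # strip the shielding BOM
        return text.encode("utf-16-le")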
On 27.12.18 at 17:43, brian m. carlson wrote:
> On Thu, Dec 27, 2018 at 11:06:17AM +0100, Johannes Sixt wrote:
>> It worries me that theoretical correctness is regarded more highly
>> than existing practice. I do not care much what some RFC says programs
>> should do if the majority of software does something different and
>> that behavior has proven useful in practice.
>
> The majority of OSes produce the behavior I document here, and they are
> the majority of systems on the Internet. Windows is the outlier here,
> although a significant one. It is a common user of UTF-16 and its
> variants, but so are Java and JavaScript, and they're present on a lot
> of devices. Swallowing the U+FEFF would break compatibility with those
> systems.
>
> The issue that Windows users are seeing is that libiconv always
> produces big-endian data for UTF-16, and they always want
> little-endian. glibc produces native-endian data, which is what Windows
> users want. Git for Windows could patch libiconv to do that (and that
> is the simple, five-minute solution to this problem), but we'd still
> want to warn people that they're relying on unspecified behavior, hence
> this series.
>
> I would even be willing to patch Git for Windows's libiconv if somebody
> could point me to the repo (although I obviously cannot test it, not
> being a Windows user). I feel strongly, though, that fixing this is
> outside the scope of Git proper, and it's not something we should be
> handling here.

Please forgive me for leaving the majority of what you said uncommented,
as I am not deep in the matter and don't have a firm understanding of
all the issues. I'll just trust that what you said is sound.

Just one thing: please do the count by *users* (or existing files, or
number of characters exchanged, or something similar); do not just count
OSes. I mean, Windows is *not* the outlier if it handles 90% of the
UTF-16 data in the world. (I'm just making up numbers here, but I think
you get the point.)

>> My understanding is that there is no such thing as a "byte order
>> marker". It just so happens that when some UTF-16 text file begins
>> with a ZWNBSP, it is possible to derive the endianness of the file
>> automatically. Other than that, that very first code point U+FEFF *is
>> part of the data* and must not be removed when the data is re-encoded.
>> If Git does something different, it is bogus, IMO.
>
> You've got part of this. For UTF-16LE and UTF-16BE, a U+FEFF is part of
> the text, as a second one would be if we had two at the beginning of a
> UTF-16 or UTF-8 sequence. If someone produces UTF-16LE and places a
> U+FEFF at the beginning of it, when we encode to UTF-8, we emit only
> one U+FEFF, which has the wrong semantics.
>
> To be correct here and accept a U+FEFF, we'd need to check for a U+FEFF
> at the beginning of a UTF-16LE or UTF-16BE sequence, ensure we encode
> an extra U+FEFF at the beginning of the UTF-8 data (one for the BOM and
> one for the text), and then strip it off when we decode. That's kind of
> ugly, and since iconv doesn't do that itself, we'd have to.

But why do you add another U+FEFF on the way to UTF-8? There is one in
the incoming UTF-16 data, and only *that* one must be converted. If
there is no U+FEFF in the UTF-16 data, there should not be one in UTF-8,
either. Puzzled...

-- Hannes
On Thu, Dec 27, 2018 at 08:55:27PM +0100, Johannes Sixt wrote:
> On 27.12.18 at 17:43, brian m. carlson wrote:
>> You've got part of this. For UTF-16LE and UTF-16BE, a U+FEFF is part
>> of the text, as a second one would be if we had two at the beginning
>> of a UTF-16 or UTF-8 sequence. If someone produces UTF-16LE and places
>> a U+FEFF at the beginning of it, when we encode to UTF-8, we emit only
>> one U+FEFF, which has the wrong semantics.
>>
>> To be correct here and accept a U+FEFF, we'd need to check for a
>> U+FEFF at the beginning of a UTF-16LE or UTF-16BE sequence, ensure we
>> encode an extra U+FEFF at the beginning of the UTF-8 data (one for the
>> BOM and one for the text), and then strip it off when we decode.
>> That's kind of ugly, and since iconv doesn't do that itself, we'd have
>> to.
>
> But why do you add another U+FEFF on the way to UTF-8? There is one in
> the incoming UTF-16 data, and only *that* one must be converted. If
> there is no U+FEFF in the UTF-16 data, there should not be one in
> UTF-8, either. Puzzled...

So for UTF-16, there must be a BOM. For UTF-16LE and UTF-16BE, there
must not be a BOM. So if we do this:

    $ printf '\xfe\xff\x00\x0a' | iconv -f UTF-16BE -t UTF-16 | xxd -g1
    00000000: ff fe ff fe 0a 00                                ......

That U+FEFF we have in the input is part of the text as a ZWNBSP; it is
not a BOM. We end up with two U+FEFF values. The first is the BOM that's
required as part of UTF-16. The second is semantically part of the text
and has the semantics of a zero-width non-breaking space.

In UTF-8, if the sequence starts with U+FEFF, it has the semantics of a
BOM just like in UTF-16 (except that it's optional): it's not part of
the text, and should be stripped off. So when we receive a UTF-16LE or
UTF-16BE sequence and it contains a U+FEFF (which is part of the text),
we need to insert a BOM in front of the U+FEFF that is part of the text
to keep the semantics. Essentially, we have this situation:

    Text (in memory):  U+FEFF U+000A
    Semantics of text: ZWNBSP NL

    UTF-16BE:  FE FF 00 0A
    Semantics: ZWNBSP NL

    UTF-16:    FE FF FE FF 00 0A
    Semantics: BOM ZWNBSP NL

    UTF-8:     EF BB BF EF BB BF 0A
    Semantics: BOM ZWNBSP NL

If you don't have a U+FEFF, then things can be simpler:

    Text (in memory):  U+0041 U+0042 U+0043
    Semantics of text: A B C

    UTF-16BE:  00 41 00 42 00 43
    Semantics: A B C

    UTF-16:    FE FF 00 41 00 42 00 43
    Semantics: BOM A B C

    UTF-8:     41 42 43
    Semantics: A B C

    UTF-8 (optional BOM): EF BB BF 41 42 43
    Semantics:            BOM A B C

(I have picked big-endian UTF-16 here, but little-endian is fine, too;
this is just easier for me to type.)

This is all a huge edge case involving correctly serializing code
points. By rejecting U+FEFF in UTF-16BE and UTF-16LE, we don't have to
deal with any of it.

As mentioned, I think patching Git for Windows's iconv is the smallest,
most achievable solution to this, because it means we don't have to
handle any of this edge case ourselves. Windows and WSL users can both
write "UTF-16" and get a BOM and little-endian behavior, while we can
delegate all the rest of the encoding stuff to libiconv.
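The byte layouts above can be reproduced with Python's codecs (an
illustration only, assuming a little-endian host; the BOM-adding
"utf-16" and "utf-8-sig" codecs stand in for iconv's UTF-16 and a
BOM-prefixed UTF-8, respectively):

    text = "\ufeff\n"                         # in memory: ZWNBSP, NL
    print(text.encode("utf-16-be").hex(" "))  # fe ff 00 0a          (ZWNBSP NL)
    print(text.encode("utf-16").hex(" "))     # ff fe ff fe 0a 00    (BOM ZWNBSP NL)
    print(text.encode("utf-8-sig").hex(" "))  # ef bb bf ef bb bf 0a (BOM ZWNBSP NL)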
On Thu, Dec 27 2018, brian m. carlson wrote:
> We've recently fielded several reports from unhappy Windows users about
> our handling of UTF-16, UTF-16LE, and UTF-16BE, none of which seem to
> be suitable for certain Windows programs.

Just for context, is "we" here $DAYJOB or a reference to some previous
ML thread(s) on this list, or something else?
On 28.12.18 at 00:45, brian m. carlson wrote:
> On Thu, Dec 27, 2018 at 08:55:27PM +0100, Johannes Sixt wrote:
>> But why do you add another U+FEFF on the way to UTF-8? There is one in
>> the incoming UTF-16 data, and only *that* one must be converted. If
>> there is no U+FEFF in the UTF-16 data, there should not be one in
>> UTF-8, either. Puzzled...
>
> So for UTF-16, there must be a BOM. For UTF-16LE and UTF-16BE, there
> must not be a BOM. So if we do this:
>
>     $ printf '\xfe\xff\x00\x0a' | iconv -f UTF-16BE -t UTF-16 | xxd -g1
>     00000000: ff fe ff fe 0a 00                                ......

What sort of braindamage is this? Fix iconv.

But as I said, I'm not an expert. I just vented my worries that
widespread existing practice would be ignored under the excuse "you are
the outlier".

-- Hannes
On 28/12/2018 08:59, Johannes Sixt wrote:
> On 28.12.18 at 00:45, brian m. carlson wrote:
>> On Thu, Dec 27, 2018 at 08:55:27PM +0100, Johannes Sixt wrote:
>>> But why do you add another U+FEFF on the way to UTF-8? There is one
>>> in the incoming UTF-16 data, and only *that* one must be converted.
>>> If there is no U+FEFF in the UTF-16 data, there should not be one in
>>> UTF-8, either. Puzzled...
>>
>> So for UTF-16, there must be a BOM. For UTF-16LE and UTF-16BE, there
>> must not be a BOM. So if we do this:
>>
>>     $ printf '\xfe\xff\x00\x0a' | iconv -f UTF-16BE -t UTF-16 | xxd -g1
>>     00000000: ff fe ff fe 0a 00                                ......
>
> What sort of braindamage is this? Fix iconv.
>
> But as I said, I'm not an expert. I just vented my worries that
> widespread existing practice would be ignored under the excuse "you are
> the outlier".
>
> -- Hannes

For reference, I dug out a Microsoft document [1] on its view of BOMs,
which can be compared to the reference [0] Brian gave.

[1] https://docs.microsoft.com/en-us/windows/desktop/intl/using-byte-order-marks
[0] https://unicode.org/faq/utf_bom.html#bom9

Maybe the documentation patch ([PATCH 1/2] Documentation: document
UTF-16-related behavior) should include the phrase ", because we encode
into UTF-8 internally,", along with a link to reference [0], and maybe
to [1] as well.

Whether the various Windows programs actually follow the Microsoft
convention is another matter altogether.
On 28/12/2018 08:46, Ævar Arnfjörð Bjarmason wrote:
> On Thu, Dec 27 2018, brian m. carlson wrote:
>
>> We've recently fielded several reports from unhappy Windows users
>> about our handling of UTF-16, UTF-16LE, and UTF-16BE, none of which
>> seem to be suitable for certain Windows programs.
>
> Just for context, is "we" here $DAYJOB or a reference to some previous
> ML thread(s) on this list, or something else?

I think
https://public-inbox.org/git/CADN+U_PUfnYWb-wW6drRANv-ZaYBEk3gWHc7oJtxohA5Vc3NEg@mail.gmail.com/
was the most recent on the Git list.
On Fri, Dec 28, 2018 at 09:46:18AM +0100, Ævar Arnfjörð Bjarmason wrote:
> On Thu, Dec 27 2018, brian m. carlson wrote:
>
>> We've recently fielded several reports from unhappy Windows users
>> about our handling of UTF-16, UTF-16LE, and UTF-16BE, none of which
>> seem to be suitable for certain Windows programs.
>
> Just for context, is "we" here $DAYJOB or a reference to some previous
> ML thread(s) on this list, or something else?

"We" in this case is the Git list. I think the list has seen at least
three threads in recent months.