Message ID | 20181227021734.528629-1-sandals@crustytoothpaste.net
---|---
Series | Improve documentation on UTF-16
On 27.12.18 at 03:17, brian m. carlson wrote:
> We've recently fielded several reports from unhappy Windows users about
> our handling of UTF-16, UTF-16LE, and UTF-16BE, none of which seem to
> be suitable for certain Windows programs.
>
> In an effort to communicate the reasons for our behavior more
> effectively, explain in the documentation that the UTF-16 variant that
> people have been asking for hasn't been standardized, and therefore
> hasn't been implemented in iconv(3). Mention what each of the variants
> does, so that people can decide which one meets their needs best.
>
> In addition, add a comment in the code about why we must, for
> correctness reasons, reject a UTF-16LE or UTF-16BE sequence that begins
> with U+FEFF: such a code point semantically represents a ZWNBSP, not a
> BOM, but that code point at the beginning of a UTF-8 sequence (as
> encoded in the object store) would be misinterpreted as a BOM instead.
>
> This comment is in the code because I think it needs to be somewhere,
> but I'm not sure the documentation is the right place for it. If
> desired, I can add it to the documentation, although I feel the lurid
> details are not interesting to most users. If the wording is confusing,
> I'm very open to suggestions for how to improve it.
>
> I don't use Windows, so I don't know what MSVCRT does. If it requires a
> BOM but doesn't accept big-endian encoding, then perhaps we should
> report that as a bug to Microsoft so it can be fixed in a future
> version. That would probably make a lot more programs work right out of
> the box and dramatically improve the user experience.

It worries me that theoretical correctness is regarded more highly than
existing practice. I do not care much what some RFC says programs should
do if the majority of software does something different and that
behavior has proven useful in practice.

My understanding is that there is no such thing as a "byte order
marker". It just so happens that when some UTF-16 text file begins with
a ZWNBSP, it is possible to derive the endianness of the file
automatically. Other than that, that very first code point U+FEFF *is
part of the data* and must not be removed when the data is re-encoded.
If Git does something different, it is bogus, IMO.

-- Hannes
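A minimal sketch of the misinterpretation brian's cover letter
describes, using Python's codecs in place of Git and iconv (an
illustration, not code from this series; "utf-8-sig" stands in for any
BOM-stripping UTF-8 reader):

    # UTF-16BE bytes whose first code point is U+FEFF. Since UTF-16BE
    # permits no BOM, that U+FEFF is text: a ZWNBSP.
    data = b"\xfe\xff\x00\x0a"
    text = data.decode("utf-16-be")        # '\ufeff\n' -- ZWNBSP, NL

    # Stored as UTF-8, the ZWNBSP becomes a leading EF BB BF...
    blob = text.encode("utf-8")            # b'\xef\xbb\xbf\x0a'

    # ...which a BOM-stripping reader then misreads as a BOM:
    print(repr(blob.decode("utf-8-sig")))  # '\n' -- the ZWNBSP is lost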
On Thu, Dec 27, 2018 at 11:06:17AM +0100, Johannes Sixt wrote:
> It worries me that theoretical correctness is regarded more highly than
> existing practice. I do not care much what some RFC says programs
> should do if the majority of software does something different and that
> behavior has proven useful in practice.

The majority of OSes produce the behavior I document here, and they are
the majority of systems on the Internet. Windows is the outlier here,
although a significant one. It is a common user of UTF-16 and its
variants, but so are Java and JavaScript, and they're present on a lot
of devices. Swallowing the U+FEFF would break compatibility with those
systems.

The issue that Windows users are seeing is that libiconv always produces
big-endian data for UTF-16, and they always want little-endian. glibc
produces native-endian data, which is what Windows users want. Git for
Windows could patch libiconv to do that (and that is the simple,
five-minute solution to this problem), but we'd still want to warn
people that they're relying on unspecified behavior, hence this series.

I would even be willing to patch Git for Windows's libiconv if somebody
could point me to the repo (although I obviously cannot test it, not
being a Windows user). I feel strongly, though, that fixing this is
outside the scope of Git proper, and it's not something we should be
handling here.

> My understanding is that there is no such thing as a "byte order
> marker". It just so happens that when some UTF-16 text file begins with
> a ZWNBSP, it is possible to derive the endianness of the file
> automatically. Other than that, that very first code point U+FEFF *is
> part of the data* and must not be removed when the data is re-encoded.
> If Git does something different, it is bogus, IMO.

You've got part of this. For UTF-16LE and UTF-16BE, a U+FEFF is part of
the text, as a second one would be if we had two at the beginning of a
UTF-16 or UTF-8 sequence. If someone produces UTF-16LE and places a
U+FEFF at the beginning of it, when we encode to UTF-8, we emit only one
U+FEFF, which has the wrong semantics.

To be correct here and accept a U+FEFF, we'd need to check for a U+FEFF
at the beginning of a UTF-16LE or UTF-16BE sequence, ensure we encode an
extra U+FEFF at the beginning of the UTF-8 data (one for the BOM and one
for the text), and then strip it off when we decode. That's kind of
ugly, and since iconv doesn't do that itself, we'd have to.
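The "extra U+FEFF" scheme described in the last paragraph might look
like the following hypothetical helper pair (a sketch in Python, not
anything Git actually implements):

    # Encode UTF-16LE content for a UTF-8 object store, shielding a
    # text-initial ZWNBSP behind an explicit BOM; assumes the blob is
    # only ever decoded by the matching helper below.
    def utf16le_to_store(data: bytes) -> bytes:
        text = data.decode("utf-16-le")    # a leading U+FEFF here is text
        if text.startswith("\ufeff"):
            text = "\ufeff" + text         # prepend a BOM to protect it
        return text.encode("utf-8")

    def store_to_utf16le(blob: bytes) -> bytes:
        text = blob.decode("utf-8")
        if text.startswith("\ufeff"):
            text = text[1:]                # strip the shielding BOM
        return text.encode("utf-16-le")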
On 27.12.18 at 17:43, brian m. carlson wrote:
> On Thu, Dec 27, 2018 at 11:06:17AM +0100, Johannes Sixt wrote:
>> It worries me that theoretical correctness is regarded more highly
>> than existing practice. I do not care much what some RFC says programs
>> should do if the majority of software does something different and
>> that behavior has proven useful in practice.
>
> The majority of OSes produce the behavior I document here, and they are
> the majority of systems on the Internet. Windows is the outlier here,
> although a significant one. It is a common user of UTF-16 and its
> variants, but so are Java and JavaScript, and they're present on a lot
> of devices. Swallowing the U+FEFF would break compatibility with those
> systems.
>
> The issue that Windows users are seeing is that libiconv always
> produces big-endian data for UTF-16, and they always want
> little-endian. glibc produces native-endian data, which is what Windows
> users want. Git for Windows could patch libiconv to do that (and that
> is the simple, five-minute solution to this problem), but we'd still
> want to warn people that they're relying on unspecified behavior, hence
> this series.
>
> I would even be willing to patch Git for Windows's libiconv if somebody
> could point me to the repo (although I obviously cannot test it, not
> being a Windows user). I feel strongly, though, that fixing this is
> outside the scope of Git proper, and it's not something we should be
> handling here.

Please forgive me for leaving the majority of what you said uncommented,
as I am not deep in the matter and don't have a firm understanding of
all the issues. I'll just trust that what you said is sound.

Just one thing: please do the count by *users* (or existing files, or
number of characters exchanged, or something similar); do not just count
OSes. I mean, Windows is *not* the outlier if it handles 90% of the
UTF-16 data in the world. (I'm just making up numbers here, but I think
you get the point.)

>> My understanding is that there is no such thing as a "byte order
>> marker". It just so happens that when some UTF-16 text file begins
>> with a ZWNBSP, it is possible to derive the endianness of the file
>> automatically. Other than that, that very first code point U+FEFF *is
>> part of the data* and must not be removed when the data is re-encoded.
>> If Git does something different, it is bogus, IMO.
>
> You've got part of this. For UTF-16LE and UTF-16BE, a U+FEFF is part of
> the text, as a second one would be if we had two at the beginning of a
> UTF-16 or UTF-8 sequence. If someone produces UTF-16LE and places a
> U+FEFF at the beginning of it, when we encode to UTF-8, we emit only
> one U+FEFF, which has the wrong semantics.
>
> To be correct here and accept a U+FEFF, we'd need to check for a U+FEFF
> at the beginning of a UTF-16LE or UTF-16BE sequence, ensure we encode
> an extra U+FEFF at the beginning of the UTF-8 data (one for the BOM and
> one for the text), and then strip it off when we decode. That's kind of
> ugly, and since iconv doesn't do that itself, we'd have to.

But why do you add another U+FEFF on the way to UTF-8? There is one in
the incoming UTF-16 data, and only *that* one must be converted. If
there is no U+FEFF in the UTF-16 data, there should not be one in UTF-8,
either. Puzzled...

-- Hannes
On Thu, Dec 27, 2018 at 08:55:27PM +0100, Johannes Sixt wrote:
> On 27.12.18 at 17:43, brian m. carlson wrote:
>> You've got part of this. For UTF-16LE and UTF-16BE, a U+FEFF is part
>> of the text, as a second one would be if we had two at the beginning
>> of a UTF-16 or UTF-8 sequence. If someone produces UTF-16LE and places
>> a U+FEFF at the beginning of it, when we encode to UTF-8, we emit only
>> one U+FEFF, which has the wrong semantics.
>>
>> To be correct here and accept a U+FEFF, we'd need to check for a
>> U+FEFF at the beginning of a UTF-16LE or UTF-16BE sequence, ensure we
>> encode an extra U+FEFF at the beginning of the UTF-8 data (one for the
>> BOM and one for the text), and then strip it off when we decode.
>> That's kind of ugly, and since iconv doesn't do that itself, we'd have
>> to.
>
> But why do you add another U+FEFF on the way to UTF-8? There is one in
> the incoming UTF-16 data, and only *that* one must be converted. If
> there is no U+FEFF in the UTF-16 data, there should not be one in
> UTF-8, either. Puzzled...

So for UTF-16, there must be a BOM. For UTF-16LE and UTF-16BE, there
must not be a BOM. So if we do this:

    $ printf '\xfe\xff\x00\x0a' | iconv -f UTF-16BE -t UTF-16 | xxd -g1
    00000000: ff fe ff fe 0a 00                                ......

That U+FEFF we have in the input is part of the text as a ZWNBSP; it is
not a BOM. We end up with two U+FEFF values. The first is the BOM that's
required as part of UTF-16. The second is semantically part of the text
and has the semantics of a zero-width non-breaking space.

In UTF-8, if the sequence starts with U+FEFF, it has the semantics of a
BOM just like in UTF-16 (except that it's optional): it's not part of
the text, and should be stripped off. So when we receive a UTF-16LE or
UTF-16BE sequence and it contains a U+FEFF (which is part of the text),
we need to insert a BOM in front of the U+FEFF that is part of the text
to keep the semantics. Essentially, we have this situation:

    Text (in memory):  U+FEFF U+000A
    Semantics of text: ZWNBSP NL

    UTF-16BE:  FE FF 00 0A
    Semantics: ZWNBSP NL

    UTF-16:    FE FF FE FF 00 0A
    Semantics: BOM ZWNBSP NL

    UTF-8:     EF BB BF EF BB BF 0A
    Semantics: BOM ZWNBSP NL

If you don't have a U+FEFF, then things can be simpler:

    Text (in memory):  U+0041 U+0042 U+0043
    Semantics of text: A B C

    UTF-16BE:  00 41 00 42 00 43
    Semantics: A B C

    UTF-16:    FE FF 00 41 00 42 00 43
    Semantics: BOM A B C

    UTF-8:     41 42 43
    Semantics: A B C

    UTF-8 (optional BOM): EF BB BF 41 42 43
    Semantics:            BOM A B C

(I have picked big-endian UTF-16 here, but little-endian is fine, too;
this is just easier for me to type.)

This is all a huge edge case involving correctly serializing code
points. By rejecting U+FEFF in UTF-16BE and UTF-16LE, we don't have to
deal with any of it.

As mentioned, I think patching Git for Windows's iconv is the smallest,
most achievable solution to this, because it means we don't have to
handle any of this edge case ourselves. Windows and WSL users can both
write "UTF-16" and get a BOM and little-endian behavior, while we can
delegate all the rest of the encoding stuff to libiconv.
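The byte layouts above can be reproduced with Python's codecs (an
illustration only, assuming a little-endian host; the BOM-adding
"utf-16" and "utf-8-sig" codecs stand in for iconv's UTF-16 and a
BOM-prefixed UTF-8, respectively):

    text = "\ufeff\n"                         # in memory: ZWNBSP, NL
    print(text.encode("utf-16-be").hex(" "))  # fe ff 00 0a          (ZWNBSP NL)
    print(text.encode("utf-16").hex(" "))     # ff fe ff fe 0a 00    (BOM ZWNBSP NL)
    print(text.encode("utf-8-sig").hex(" "))  # ef bb bf ef bb bf 0a (BOM ZWNBSP NL)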
On Thu, Dec 27 2018, brian m. carlson wrote:
> We've recently fielded several reports from unhappy Windows users about
> our handling of UTF-16, UTF-16LE, and UTF-16BE, none of which seem to
> be suitable for certain Windows programs.

Just for context, is "we" here $DAYJOB or a reference to some previous
ML thread(s) on this list, or something else?
On 28.12.18 at 00:45, brian m. carlson wrote:
> On Thu, Dec 27, 2018 at 08:55:27PM +0100, Johannes Sixt wrote:
>> But why do you add another U+FEFF on the way to UTF-8? There is one in
>> the incoming UTF-16 data, and only *that* one must be converted. If
>> there is no U+FEFF in the UTF-16 data, there should not be one in
>> UTF-8, either. Puzzled...
>
> So for UTF-16, there must be a BOM. For UTF-16LE and UTF-16BE, there
> must not be a BOM. So if we do this:
>
>     $ printf '\xfe\xff\x00\x0a' | iconv -f UTF-16BE -t UTF-16 | xxd -g1
>     00000000: ff fe ff fe 0a 00                                ......

What sort of braindamage is this? Fix iconv.

But as I said, I'm not an expert. I just vented my worries that
widespread existing practice would be ignored under the excuse "you are
the outlier".

-- Hannes
On 28/12/2018 08:59, Johannes Sixt wrote:
> On 28.12.18 at 00:45, brian m. carlson wrote:
>> On Thu, Dec 27, 2018 at 08:55:27PM +0100, Johannes Sixt wrote:
>>> But why do you add another U+FEFF on the way to UTF-8? There is one
>>> in the incoming UTF-16 data, and only *that* one must be converted.
>>> If there is no U+FEFF in the UTF-16 data, there should not be one in
>>> UTF-8, either. Puzzled...
>>
>> So for UTF-16, there must be a BOM. For UTF-16LE and UTF-16BE, there
>> must not be a BOM. So if we do this:
>>
>>     $ printf '\xfe\xff\x00\x0a' | iconv -f UTF-16BE -t UTF-16 | xxd -g1
>>     00000000: ff fe ff fe 0a 00                                ......
>
> What sort of braindamage is this? Fix iconv.
>
> But as I said, I'm not an expert. I just vented my worries that
> widespread existing practice would be ignored under the excuse "you are
> the outlier".
>
> -- Hannes

For reference, I dug out a Microsoft document [1] on its view of BOMs,
which can be compared to the reference [0] Brian gave.

[1] https://docs.microsoft.com/en-us/windows/desktop/intl/using-byte-order-marks
[0] https://unicode.org/faq/utf_bom.html#bom9

Maybe the documentation patch ([PATCH 1/2] Documentation: document
UTF-16-related behavior) should include the phrase ", because we encode
into UTF-8 internally,", along with a link to reference [0], and maybe
to [1] as well.

Whether the various Windows programs actually follow the Microsoft
convention is another matter altogether.
On 28/12/2018 08:46, Ævar Arnfjörð Bjarmason wrote:
> On Thu, Dec 27 2018, brian m. carlson wrote:
>
>> We've recently fielded several reports from unhappy Windows users
>> about our handling of UTF-16, UTF-16LE, and UTF-16BE, none of which
>> seem to be suitable for certain Windows programs.
>
> Just for context, is "we" here $DAYJOB or a reference to some previous
> ML thread(s) on this list, or something else?

I think
https://public-inbox.org/git/CADN+U_PUfnYWb-wW6drRANv-ZaYBEk3gWHc7oJtxohA5Vc3NEg@mail.gmail.com/
was the most recent on the Git list.
On Fri, Dec 28, 2018 at 09:46:18AM +0100, Ævar Arnfjörð Bjarmason wrote:
> On Thu, Dec 27 2018, brian m. carlson wrote:
>
>> We've recently fielded several reports from unhappy Windows users
>> about our handling of UTF-16, UTF-16LE, and UTF-16BE, none of which
>> seem to be suitable for certain Windows programs.
>
> Just for context, is "we" here $DAYJOB or a reference to some previous
> ML thread(s) on this list, or something else?

"We" in this case is the Git list. I think the list has seen at least
three threads in recent months.