Message ID: cover.1620823573.git.mchehab+huawei@kernel.org
Series:     Use ASCII subset instead of UTF-8 alternate symbols
On Wed, May 12, 2021 at 02:50:04PM +0200, Mauro Carvalho Chehab wrote:
> v2:
>   - removed EM/EN DASH conversion from this patchset;

Are you still thinking about doing the

    EN DASH --> "--"
    EM DASH --> "---"

conversion? That's not going to change what the documentation will look like in the HTML and PDF output forms, and I think it would make life easier for people who are reading and editing the Documentation/* files in text form.

- Ted
On Wed, 12 May 2021 10:14:44 -0400, "Theodore Ts'o" <tytso@mit.edu> wrote:

> On Wed, May 12, 2021 at 02:50:04PM +0200, Mauro Carvalho Chehab wrote:
> > v2:
> >   - removed EM/EN DASH conversion from this patchset;
>
> Are you still thinking about doing the
>
>     EN DASH --> "--"
>     EM DASH --> "---"
>
> conversion?

Yes, but I intend to submit it as a separate patch series, probably after having this one merged. Let's first clean up the large part of the conversion-generated UTF-8 char noise ;-)

> That's not going to change what the documentation will look like in the HTML and PDF output forms, and I think it would make life easier for people who are reading and editing the Documentation/* files in text form.

Agreed. I'm also considering adding a couple of cases of this char:

    - U+2026 ('…'): HORIZONTAL ELLIPSIS

as Sphinx also replaces "..." with HORIZONTAL ELLIPSIS.

Anyway, I'm opting to submit those separately, because it seems that at least some maintainers added EM/EN DASH intentionally, so it may generate case-per-case discussions. Also, IMO, at least a couple of EN/EM DASH cases would be better served by a single hyphen.

Thanks,
Mauro
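For readers following along, a rough sketch of how such a dash conversion could be scripted is shown below. This is not the actual patch series being discussed; the file glob and the GNU-sed byte escapes (matching the raw UTF-8 byte sequences for the two dashes) are illustrative assumptions only:

    $ # illustrative only: map EM DASH (U+2014) to "---" and EN DASH (U+2013) to "--"
    $ find Documentation/ -name '*.rst' -exec \
          sed -i -e 's/\xe2\x80\x94/---/g' -e 's/\xe2\x80\x93/--/g' {} +

In practice any such script would need case-by-case review, as noted above, since at least some of the dashes appear to have been added intentionally.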
Your title 'Use ASCII subset' is now at least a bit *closer* to describing what the patches are actually doing, but it's still a bit misleading because you're only doing it for *some* characters.

And the wording is still indicative of a fundamentally *misguided* motivation for doing any of this. Your commit comments should be about fixing a specific thing, nothing to do with "use ASCII subset", which is pointless in itself.

On Wed, 2021-05-12 at 14:50 +0200, Mauro Carvalho Chehab wrote:
> Such conversion tools - plus some text editors like LibreOffice or similar - have a set of rules that turn some typed ASCII characters into UTF-8 alternatives, for instance converting commas into curly commas and adding non-breakable spaces. All of those are meant to produce better results when the text is displayed in HTML or PDF formats.

And don't we render our documentation into HTML or PDF formats? Are some of those non-breaking spaces not actually *useful* for their intended purpose?

> While it is perfectly fine to use UTF-8 characters in Linux, and especially in the documentation, it is better to stick to the ASCII subset in this particular case, for a couple of reasons:
>
>   1. it makes life easier for tools like grep;

Barely, as noted, because of things like line feeds.

>   2. they are easier to edit with some commonly used text/source code editors.

That is nonsense. Any but the most broken and/or anachronistic environments and editors will be just fine.
On Wed, 2021-05-12 at 17:17 +0200, Mauro Carvalho Chehab wrote:
> On Wed, 12 May 2021 10:14:44 -0400, "Theodore Ts'o" <tytso@mit.edu> wrote:
>
> > On Wed, May 12, 2021 at 02:50:04PM +0200, Mauro Carvalho Chehab wrote:
> > > v2:
> > >   - removed EM/EN DASH conversion from this patchset;
> >
> > Are you still thinking about doing the
> >
> >     EN DASH --> "--"
> >     EM DASH --> "---"
> >
> > conversion?
>
> Yes, but I intend to submit it as a separate patch series, probably after having this one merged. Let's first clean up the large part of the conversion-generated UTF-8 char noise ;-)
>
> > That's not going to change what the documentation will look like in the HTML and PDF output forms, and I think it would make life easier for people who are reading and editing the Documentation/* files in text form.
>
> Agreed. I'm also considering adding a couple of cases of this char:
>
>     - U+2026 ('…'): HORIZONTAL ELLIPSIS
>
> as Sphinx also replaces "..." with HORIZONTAL ELLIPSIS.

Er, what?

The *only* part of this whole enterprise that actually seemed to make even a tiny bit of sense — rather than seeming like a thinly veiled retrospective excuse for dragging us back in time by 30 years — was the bit about making it easier to grep.

But if I understand you correctly, you're talking about using something like C trigraphs to represent the perfectly reasonable text emdash character ("—") as two hyphen-minuses ("--") in the source code of the documentation?

Isn't that going to achieve precisely the *opposite*? If I select some text in the HTML output of the docs and then search for it in the source code, that's going to *stop* it matching my search?
On Wed, 12 May 2021 18:07:04 +0100, David Woodhouse <dwmw2@infradead.org> wrote:

> On Wed, 2021-05-12 at 14:50 +0200, Mauro Carvalho Chehab wrote:
> > Such conversion tools - plus some text editors like LibreOffice or similar - have a set of rules that turn some typed ASCII characters into UTF-8 alternatives, for instance converting commas into curly commas and adding non-breakable spaces. All of those are meant to produce better results when the text is displayed in HTML or PDF formats.
>
> And don't we render our documentation into HTML or PDF formats?

Yes.

> Are some of those non-breaking spaces not actually *useful* for their intended purpose?

No.

The thing is: non-breaking spaces can cause a lot of problems.

We even had to disable Sphinx's use of non-breaking spaces for PDF output, as it was producing bad LaTeX/PDF results.

See commit 3b4c963243b1 ("docs: conf.py: adjust the LaTeX document output").

The aforementioned patch disables Sphinx's default behavior of using NON-BREAKABLE SPACE in literal blocks and strings, via this special setting: "parsedliteralwraps=true".

When NON-BREAKABLE SPACE was used in PDF output, several parts of the media uAPI docs were violating the document margins by far, causing text to be truncated.

So, please **don't add NON-BREAKABLE SPACE**, unless you test (and keep testing from time to time) whether the output in all formats properly supports it on different Sphinx versions.

Also, most of those came from conversion tools, together with other eccentricities, like the usage of the U+FEFF (BOM) character at the start of some documents. The remaining ones seem to have come from cut-and-paste. For instance, bibliographic references (there are a couple of those in media) sometimes have NON-BREAKABLE SPACE. I'm pretty sure those came from cut-and-pasting the document titles from the original PDF documents or web pages that are referenced.

> > While it is perfectly fine to use UTF-8 characters in Linux, and especially in the documentation, it is better to stick to the ASCII subset in this particular case, for a couple of reasons:
> >
> >   1. it makes life easier for tools like grep;
>
> Barely, as noted, because of things like line feeds.

You can use grep with "-z" to seek multi-line strings(*), like:

    $ grep -Pzl 'grace period started,\s*then' $(find Documentation/ -type f)
    Documentation/RCU/Design/Data-Structures/Data-Structures.rst

(*) Unfortunately, while "git grep" also has a "-z" flag, it seems that this is (currently?) broken with regard to handling multilines:

    $ git grep -Pzl 'grace period started,\s*then'
    $

> >   2. they are easier to edit with some commonly used text/source code editors.
>
> That is nonsense. Any but the most broken and/or anachronistic environments and editors will be just fine.

Not really.

I do use a lot of UTF-8 here, as I type texts in Portuguese, but I rely on the US-intl keyboard settings, which allow me to type "'a" for á. However, there's no shortcut for non-Latin UTF codes, as far as I know.

So, if I needed to type a curly comma in the text editors I normally use for development (vim, nano, kate), I would need to cut-and-paste it from somewhere[1].

[1] If I have a table with UTF-8 codes handy, I could type the UTF-8 number manually... However, it seems that this is currently broken at least on Fedora 33 (with Mate Desktop and US intl keyboard with dead keys).

Here, <CTRL><SHIFT>U is not working. No idea why.
I haven't tested it for *years*, as I didn't see any reason why I would need to type UTF-8 characters by number until we started this thread.

In practice, on the very rare occasions where I needed to write non-Latin UTF-8 chars (maybe once a year or so, like when I need a Greek letter or some weird symbol), the chances are high that I wouldn't remember its UTF-8 code. So, if I need to spend time seeking a specific symbol, after finding it, I just cut-and-paste it.

But even in the best-case scenario where I know the UTF-8 code and <CTRL><SHIFT>U works, if I wanted to use, for instance, a curly comma, the keystroke sequence would be:

    <CTRL><SHIFT>U201csome string<CTRL><SHIFT>U201d

That's a lot harder than typing, and has a higher chance of mistakenly adding a wrong symbol, than just typing:

    "some string"

Knowing that both will produce *exactly* the same output, why should I bother doing it the hard way?

Now, I'm not arguing that you can't use whatever UTF-8 symbol you want in your docs. I'm just saying that, now that the conversion is over and a lot of documents ended up getting some UTF-8 characters by accident, it is time for a cleanup.

Thanks,
Mauro
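As an aside, locating the characters being discussed here is itself a one-liner. The commands below are an illustrative sketch (not part of the patch series): they match the raw UTF-8 byte sequences for NO-BREAK SPACE (U+00A0) and the byte order mark (U+FEFF) with GNU grep's -P option, with LC_ALL=C forcing byte-wise matching:

    $ # files containing a NO-BREAK SPACE anywhere
    $ LC_ALL=C grep -rlP '\xc2\xa0' Documentation/
    $ # files containing a UTF-8 byte order mark
    $ LC_ALL=C grep -rlP '\xef\xbb\xbf' Documentation/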
On Fri, 2021-05-14 at 10:21 +0200, Mauro Carvalho Chehab wrote:
> On Wed, 12 May 2021 18:07:04 +0100, David Woodhouse <dwmw2@infradead.org> wrote:
>
> > On Wed, 2021-05-12 at 14:50 +0200, Mauro Carvalho Chehab wrote:
> > > Such conversion tools - plus some text editors like LibreOffice or similar - have a set of rules that turn some typed ASCII characters into UTF-8 alternatives, for instance converting commas into curly commas and adding non-breakable spaces. All of those are meant to produce better results when the text is displayed in HTML or PDF formats.
> >
> > And don't we render our documentation into HTML or PDF formats?
>
> Yes.
>
> > Are some of those non-breaking spaces not actually *useful* for their intended purpose?
>
> No.
>
> The thing is: non-breaking spaces can cause a lot of problems.
>
> We even had to disable Sphinx's use of non-breaking spaces for PDF output, as it was producing bad LaTeX/PDF results.
>
> See commit 3b4c963243b1 ("docs: conf.py: adjust the LaTeX document output").
>
> The aforementioned patch disables Sphinx's default behavior of using NON-BREAKABLE SPACE in literal blocks and strings, via this special setting: "parsedliteralwraps=true".
>
> When NON-BREAKABLE SPACE was used in PDF output, several parts of the media uAPI docs were violating the document margins by far, causing text to be truncated.
>
> So, please **don't add NON-BREAKABLE SPACE**, unless you test (and keep testing from time to time) whether the output in all formats properly supports it on different Sphinx versions.

And there you have a specific change with a specific fix. Nothing to do with whether NON-BREAKABLE SPACE is ∉ ASCII, and *certainly* nothing to do with the fact that, like *every* character in every kernel file except the *binary* files, it's representable in UTF-8.

By all means fix the specific characters which are typographically wrong or which, like NON-BREAKABLE SPACE, cause problems for rendering the documentation.

> Also, most of those came from conversion tools, together with other eccentricities, like the usage of the U+FEFF (BOM) character at the start of some documents. The remaining ones seem to have come from cut-and-paste.

... or which are just entirely redundant and gratuitous, like a BOM in an environment where all files are UTF-8 and never 16-bit encodings anyway.

> > > While it is perfectly fine to use UTF-8 characters in Linux, and especially in the documentation, it is better to stick to the ASCII subset in this particular case, for a couple of reasons:
> > >
> > >   1. it makes life easier for tools like grep;
> >
> > Barely, as noted, because of things like line feeds.
>
> You can use grep with "-z" to seek multi-line strings(*), like:
>
>     $ grep -Pzl 'grace period started,\s*then' $(find Documentation/ -type f)
>     Documentation/RCU/Design/Data-Structures/Data-Structures.rst

Yeah, right. That works if you don't just use the text that you'll have seen in the HTML/PDF ("grace period started, then"), and if you instead craft a *regex* for it, replacing the spaces with '\s*'. Or is that [[:space:]]* if you don't want to use the experimental Perl regex feature?

    $ grep -zlr 'grace[[:space:]]\+period[[:space:]]\+started,[[:space:]]\+then' Documentation/RCU
    Documentation/RCU/Design/Data-Structures/Data-Structures.rst

And without '-l' it'll obviously just give you the whole file. No '-A5 -B5' to see the surroundings... it's hardly a useful thing, is it?
> (*) Unfortunately, while "git grep" also has a "-z" flag, it seems that this is (currently?) broken with regard to handling multilines:
>
>     $ git grep -Pzl 'grace period started,\s*then'
>     $

Even better. So no, multiline grep isn't really a commonly usable feature at all. This is why we prefer to put user-visible strings on one line in C source code, even if it takes the lines over 80 characters — to allow grep to find them.

> > >   2. they are easier to edit with some commonly used text/source code editors.
> >
> > That is nonsense. Any but the most broken and/or anachronistic environments and editors will be just fine.
>
> Not really.
>
> I do use a lot of UTF-8 here, as I type texts in Portuguese, but I rely on the US-intl keyboard settings, which allow me to type "'a" for á. However, there's no shortcut for non-Latin UTF codes, as far as I know.
>
> So, if I needed to type a curly comma in the text editors I normally use for development (vim, nano, kate), I would need to cut-and-paste it from somewhere[1].

That's entirely irrelevant. You don't need to be able to *type* every character that you see in front of you, as long as your editor will render it correctly and perhaps let you cut/paste it as you're editing the document if you're moving things around.

> [1] If I have a table with UTF-8 codes handy, I could type the UTF-8 number manually... However, it seems that this is currently broken at least on Fedora 33 (with Mate Desktop and US intl keyboard with dead keys).
>
> Here, <CTRL><SHIFT>U is not working. No idea why. I haven't tested it for *years*, as I didn't see any reason why I would need to type UTF-8 characters by number until we started this thread.

Please provide the bug number for this; I'd like to track it.

> But even in the best-case scenario where I know the UTF-8 code and <CTRL><SHIFT>U works, if I wanted to use, for instance, a curly comma, the keystroke sequence would be:
>
>     <CTRL><SHIFT>U201csome string<CTRL><SHIFT>U201d
>
> That's a lot harder than typing, and has a higher chance of mistakenly adding a wrong symbol, than just typing:
>
>     "some string"
>
> Knowing that both will produce *exactly* the same output, why should I bother doing it the hard way?

Nobody's asked you to do it the "hard way". That's completely irrelevant to the discussion we were having.

> Now, I'm not arguing that you can't use whatever UTF-8 symbol you want in your docs. I'm just saying that, now that the conversion is over and a lot of documents ended up getting some UTF-8 characters by accident, it is time for a cleanup.

All text documents are *full* of UTF-8 characters. If there is a file in the source code which has *any* non-UTF-8, we call that a 'binary file'.

Again, if you want to make specific fixes like removing non-breaking spaces and byte order marks, with specific reasons, then those make sense. But it's got very little to do with UTF-8 and how easy it is to type them. And the excuse you've put in the commit comment for your patches is utterly bogus.
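To make the wrapping problem being argued over concrete: a sentence that renders as one line in the HTML output may be wrapped across lines in the .rst source, so searching for the rendered text with a plain single-line grep can come up empty. This is an illustrative sketch only; the exact wrapping in any given file may differ:

    $ # the rendered sentence, pasted verbatim from the HTML output
    $ grep -rn 'grace period started, then' Documentation/RCU/
    $ # (no output if the sentence is wrapped across two source lines)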
> On Fri, 2021-05-14 at 10:21 +0200, Mauro Carvalho Chehab wrote:
>> I do use a lot of UTF-8 here, as I type texts in Portuguese, but I rely on the US-intl keyboard settings, which allow me to type "'a" for á. However, there's no shortcut for non-Latin UTF codes, as far as I know.
>>
>> So, if I needed to type a curly comma in the text editors I normally use for development (vim, nano, kate), I would need to cut-and-paste it from somewhere

For anyone who doesn't know about it: X has this wonderful thing called the Compose key[1]. For instance, type ⎄--- to get —, or ⎄<" for “. Much more mnemonic than Unicode codepoints; and you can extend it with user-defined sequences in your ~/.XCompose file. (I assume Wayland supports all this too, but don't know the details.)

On 14/05/2021 10:06, David Woodhouse wrote:
> Again, if you want to make specific fixes like removing non-breaking spaces and byte order marks, with specific reasons, then those make sense. But it's got very little to do with UTF-8 and how easy it is to type them. And the excuse you've put in the commit comment for your patches is utterly bogus.

+1

-ed

[1] https://en.wikipedia.org/wiki/Compose_key
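Building on the ~/.XCompose pointer above, here is a hypothetical sketch of what such user-defined sequences could look like. The two shortcuts below (Compose e m and Compose e n) are invented purely for illustration and are not part of any default compose table:

    $ cat >> ~/.XCompose <<'EOF'
    # keep the locale's default compose sequences
    include "%L"
    # hypothetical extra shortcuts for the dashes discussed in this thread
    <Multi_key> <e> <m> : "—"   emdash
    <Multi_key> <e> <n> : "–"   endash
    EOF

New sequences typically take effect once the application (or the X session) is restarted.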
On Fri, 14 May 2021 12:08:36 +0100, Edward Cree <ecree.xilinx@gmail.com> wrote:

> For anyone who doesn't know about it: X has this wonderful thing called the Compose key[1]. For instance, type ⎄--- to get —, or ⎄<" for “. Much more mnemonic than Unicode codepoints; and you can extend it with user-defined sequences in your ~/.XCompose file.

Good tip. I haven't used Compose for years, as US-intl with dead keys is enough for 99.999% of my needs.

Btw, at least on Fedora with Mate, Compose is disabled by default. It has to be enabled first, using the same tool that allows changing the keyboard layout[1].

Yet typing an EN DASH, for example, would be "<compose>--.", which is 4 keystrokes instead of just two ('--'). It means twice the effort ;-)

[1] KDE, GNOME, Mate, ... have different ways to enable it and to select which key would be considered <compose>:

    https://dry.sailingissues.com/us-international-keyboard-layout.html
    https://help.ubuntu.com/community/ComposeKey

Thanks,
Mauro