Message ID | pull.1071.v3.git.1636031609982.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Accepted |
Commit | bc756bc586a67619ec1cf44c753f6744b61f27ca |
Headers | show |
Series | [v3] ci: disallow directional formatting | expand |
On Thu, Nov 04 2021, Johannes Schindelin via GitGitGadget wrote: > From: Johannes Schindelin <johannes.schindelin@gmx.de> > [...] > * Corrected a code comment: my custom git grep was not PCRE-enabled, > but Ubuntu's isn't. But git grep -P still does not understand \uNNNN. > * Even if the *.po files currently pass the check, the script is now > future-proof by excluding them. > [...] > +# This is intended to run on an Ubuntu agent in a GitHub workflow. > +# > +# To allow translated messages to introduce such directional formatting in the > +# future, we exclude the `.po` files from this validation. > +# > +# Neither GNU grep nor `git grep` (not even with `-P`) handle `\u` as a way to > +# specify UTF-8. > +# > +# To work around that, we use `printf` to produce the pattern as a byte > +# sequence, and then feed that to `git grep` as a byte sequence (setting > +# `LC_CTYPE` to make sure that the arguments are interpreted as intended). > +# > +# Note: we need to use Bash here because its `printf` interprets `\uNNNN` as > +# UTF-8 code points, as desired. Running this script through Ubuntu's `dash`, > +# for example, would use a `printf` that does not understand that syntax. > + > +# U+202a..U+2a2e: LRE, RLE, PDF, LRO and RLO > +# U+2066..U+2069: LRI, RLI, FSI and PDI > +regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)' > + > +! LC_CTYPE=C git grep -El "$(LC_CTYPE=C.UTF-8 printf "$regex")" \ > + -- ':(exclude,attr:binary)' ':(exclude)*.po' > > base-commit: 0cddd84c9f3e9c3d793ec93034ef679335f35e49 So this is still using not-PCRE instead of the much simpler PCRE-enabled one I suggested in[1] because your locally-built git doesn't link to libpcre? At the very least that comment is still quite confusing. I.e. it starts out by saying that it's meant to run in CI where we've got PCRE, but then goes on to describe an elaborate workaround that's only needed with a not-PCRE grep. Would be less confusing to understand why it's so complex as: # This is intended to run on an Ubuntu agent in a GitHub # workflow. We've got PCRE there, so all of the below could also be: # # ! git -P grep -nP '[\N{U+202a}-\N{U+202e}\N{U+2066}-\N{U+2069}]' ':!(attr:binary)' ':!**po/**' # # # But the author wanted to run this locally on a system that didn't # have PCRE, So [go on to describe elaborate bash / "git grep -E" / # LC_* tweaking workaround....] [...] B.t.w. I think that **po/** exclusion is closer to what you want, i.e. "a 'po' dir anywhere in our tree". It'll also exclude this if we e.g. end up using these in language-specific README files there or whatever. 1. https://lore.kernel.org/git/211103.86zgqlhzvz.gmgdl@evledraar.gmail.com/ or: ! git -P grep -nP '[\N{U+202a}-\N{U+202e}\N{U+2066}-\N{U+2069}]' ':!(attr:binary)'
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: >> +# U+202a..U+2a2e: LRE, RLE, PDF, LRO and RLO >> +# U+2066..U+2069: LRI, RLI, FSI and PDI >> +regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)' >> + >> +! LC_CTYPE=C git grep -El "$(LC_CTYPE=C.UTF-8 printf "$regex")" \ >> + -- ':(exclude,attr:binary)' ':(exclude)*.po' > ... > ! git -P grep -nP '[\N{U+202a}-\N{U+202e}\N{U+2066}-\N{U+2069}]' ':!(attr:binary)' So you are comparing * requiring bash and C.UTF-8 locale to be available vs * requiring git built with PCRE assuming that "Dscho says doesn't work with PCRE and you say it works with PCRE" is resolved? They seem roughly the same difficulty to me.
On Thu, Nov 04 2021, Junio C Hamano wrote: > Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > >>> +# U+202a..U+2a2e: LRE, RLE, PDF, LRO and RLO >>> +# U+2066..U+2069: LRI, RLI, FSI and PDI >>> +regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)' >>> + >>> +! LC_CTYPE=C git grep -El "$(LC_CTYPE=C.UTF-8 printf "$regex")" \ >>> + -- ':(exclude,attr:binary)' ':(exclude)*.po' >> ... >> ! git -P grep -nP '[\N{U+202a}-\N{U+202e}\N{U+2066}-\N{U+2069}]' ':!(attr:binary)' > > So you are comparing > > * requiring bash and C.UTF-8 locale to be available > > vs > > * requiring git built with PCRE > > assuming that "Dscho says doesn't work with PCRE and you say it > works with PCRE" is resolved? They seem roughly the same > difficulty to me. We can hard depend on a git built with PCRE, since the point of this thing is to run in GitHub CI, Ubuntu builds git with PCRE, and that's unlikely to ever change. The caveats around PCRE that still somewhat persist around PCRE are due to a misunderstanding, i.e. I think Johannes tried the \uXXXX syntax, which we don't opt git-grep into, but as shown above you can just use the other universally supported PCRE syntax for referring to Unicode codepoints.
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: >> So you are comparing >> >> * requiring bash and C.UTF-8 locale to be available >> >> vs >> >> * requiring git built with PCRE >> >> assuming that "Dscho says doesn't work with PCRE and you say it >> works with PCRE" is resolved? They seem roughly the same >> difficulty to me. > > We can hard depend on a git built with PCRE, since the point of this > thing is to run in GitHub CI, Ubuntu builds git with PCRE, and that's > unlikely to ever change. Yes, so is the availability of bash and C.UTF-8 for the same reason: we are talking about controlled environment. That is what I meant by "roughly the same difficulty to me". FWIW, I am OK with either approach, as I find the patch in question is just as readable as any rewrite that would use "grep -P", so...
On Mon, Nov 08 2021, Junio C Hamano wrote: > Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > >>> So you are comparing >>> >>> * requiring bash and C.UTF-8 locale to be available >>> >>> vs >>> >>> * requiring git built with PCRE >>> >>> assuming that "Dscho says doesn't work with PCRE and you say it >>> works with PCRE" is resolved? They seem roughly the same >>> difficulty to me. >> >> We can hard depend on a git built with PCRE, since the point of this >> thing is to run in GitHub CI, Ubuntu builds git with PCRE, and that's >> unlikely to ever change. > > Yes, so is the availability of bash and C.UTF-8 for the same reason: > we are talking about controlled environment. That is what I meant > by "roughly the same difficulty to me". > > FWIW, I am OK with either approach, as I find the patch in question > is just as readable as any rewrite that would use "grep -P", so... To each his own I guess :) I do find the simple regex of: '[\N{U+202a}-\N{U+202e}\N{U+2066}-\N{U+2069}]' Much easier to understand than something using printf, shell interpolation, and needing to switch around LC_CTYPE to two different values twice on one line. But anyway, that's a matter of taste. What isn't is the issue I noted at the bottom of [1], i.e. if we're relying on '(attr:binary)' we should probably start with an assertion that our idea of "binary" matches reality, or perhaps go back to the -I heuristic. Because now we've got at least one binary non-attribute-marked file, and if files like that ever get updated they might start matching this pattern. Maybe not a big deal, but someone updating those might confusingly trip over this otherwise... 1. https://lore.kernel.org/git/211103.86zgqlhzvz.gmgdl@evledraar.gmail.com/
diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml index 6ed6a9e8076..deda12db3a9 100644 --- a/.github/workflows/main.yml +++ b/.github/workflows/main.yml @@ -289,6 +289,7 @@ jobs: - uses: actions/checkout@v2 - run: ci/install-dependencies.sh - run: ci/run-static-analysis.sh + - run: ci/check-directional-formatting.bash sparse: needs: ci-config if: needs.ci-config.outputs.enabled == 'yes' diff --git a/ci/check-directional-formatting.bash b/ci/check-directional-formatting.bash new file mode 100755 index 00000000000..e6211b141aa --- /dev/null +++ b/ci/check-directional-formatting.bash @@ -0,0 +1,27 @@ +#!/bin/bash + +# This script verifies that the non-binary files tracked in the Git index do +# not contain any Unicode directional formatting: such formatting could be used +# to deceive reviewers into interpreting code differently from the compiler. +# This is intended to run on an Ubuntu agent in a GitHub workflow. +# +# To allow translated messages to introduce such directional formatting in the +# future, we exclude the `.po` files from this validation. +# +# Neither GNU grep nor `git grep` (not even with `-P`) handle `\u` as a way to +# specify UTF-8. +# +# To work around that, we use `printf` to produce the pattern as a byte +# sequence, and then feed that to `git grep` as a byte sequence (setting +# `LC_CTYPE` to make sure that the arguments are interpreted as intended). +# +# Note: we need to use Bash here because its `printf` interprets `\uNNNN` as +# UTF-8 code points, as desired. Running this script through Ubuntu's `dash`, +# for example, would use a `printf` that does not understand that syntax. + +# U+202a..U+2a2e: LRE, RLE, PDF, LRO and RLO +# U+2066..U+2069: LRI, RLI, FSI and PDI +regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)' + +! LC_CTYPE=C git grep -El "$(LC_CTYPE=C.UTF-8 printf "$regex")" \ + -- ':(exclude,attr:binary)' ':(exclude)*.po'