[v3] ci: disallow directional formatting

From: Johannes Schindelin <johannes.schindelin@gmx.de>

From: Johannes Schindelin <johannes.schindelin@gmx.de>

As described in https://trojansource.codes/trojan-source.pdf, it is
possible to abuse directional formatting (a feature of Unicode) to
deceive human readers into interpreting code differently from compilers.

For example, an "if ()" expression could be enclosed in a comment, but
rendered as if it was outside of that comment. In effect, this could
fool a reviewer into misinterpreting the code flow as benign when it is
not.

It is highly unlikely that Git's source code wants to contain such
directional formatting in the first place, so let's just disallow it.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
    ci: disallow directional formatting

    A couple days ago, I stumbled over
    https://siliconangle.com/2021/11/01/trojan-source-technique-can-inject-malware-source-code-without-detection/,
    which details an interesting social-engineering attack: it uses
    directional formatting in source code to pretend to human readers that
    the code does something different than it actually does.

    It is highly unlikely that Git's source code wants to contain such
    directional formatting in the first place, so let's disallow it.

    Technically, this is not exactly -rc material, but the paper was just
    published, and I want us to be safe.

    Changes since v2:

     * The pathspec excluding binary files is now used directly instead of
       doing the ls-files | xargs dance.
     * Corrected a code comment: my custom git grep was not PCRE-enabled,
       but Ubuntu's isn't. But git grep -P still does not understand \uNNNN.
     * Even if the *.po files currently pass the check, the script is now
       future-proof by excluding them.
     * Renamed the script to have the .bash extension, to indicate that it
       requires a Bashism (i.e. a printf that understands the \uNNNN
       syntax).

    Changes since v1:

     * The code was moved into a script in ci/.
     * We use git ls-files now to exclude files marked as binary in the Git
       attributes.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1071%2Fdscho%2Fcheck-for-utf-8-directional-formatting-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1071/dscho/check-for-utf-8-directional-formatting-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/1071

Range-diff vs v2:

 1:  bbf963695ba ! 1:  80447819de3 ci: disallow directional formatting
     @@ .github/workflows/main.yml: jobs:
           - uses: actions/checkout@v2
           - run: ci/install-dependencies.sh
           - run: ci/run-static-analysis.sh
     -+    - run: ci/check-directional-formatting.sh
     ++    - run: ci/check-directional-formatting.bash
         sparse:
           needs: ci-config
           if: needs.ci-config.outputs.enabled == 'yes'

     - ## ci/check-directional-formatting.sh (new) ##
     + ## ci/check-directional-formatting.bash (new) ##
      @@
      +#!/bin/bash
      +
     @@ ci/check-directional-formatting.sh (new)
      +# to deceive reviewers into interpreting code differently from the compiler.
      +# This is intended to run on an Ubuntu agent in a GitHub workflow.
      +#
     -+# `git grep` as well as GNU grep do not handle `\u` as a way to specify UTF-8.
     -+# A PCRE-enabled `git grep` would handle `\u` as desired, but Ubuntu does
     -+# not build its `git` packages with PCRE support.
     ++# To allow translated messages to introduce such directional formatting in the
     ++# future, we exclude the `.po` files from this validation.
     ++#
     ++# Neither GNU grep nor `git grep` (not even with `-P`) handle `\u` as a way to
     ++# specify UTF-8.
      +#
      +# To work around that, we use `printf` to produce the pattern as a byte
      +# sequence, and then feed that to `git grep` as a byte sequence (setting
     @@ ci/check-directional-formatting.sh (new)
      +# U+2066..U+2069: LRI, RLI, FSI and PDI
      +regex='(\u202a|\u202b|\u202c|\u202d|\u202e|\u2066|\u2067|\u2068|\u2069)'
      +
     -+! git ls-files -z ':(attr:!binary)' |
     -+LC_CTYPE=C xargs -0r git grep -Ele "$(LC_CTYPE=C.UTF-8 printf "$regex")" --
     ++! LC_CTYPE=C git grep -El "$(LC_CTYPE=C.UTF-8 printf "$regex")" \
     ++	-- ':(exclude,attr:binary)' ':(exclude)*.po'

 .github/workflows/main.yml           |  1 +
 ci/check-directional-formatting.bash | 27 +++++++++++++++++++++++++++
 2 files changed, 28 insertions(+)
 create mode 100755 ci/check-directional-formatting.bash

base-commit: 0cddd84c9f3e9c3d793ec93034ef679335f35e49

Message ID	pull.1071.v3.git.1636031609982.gitgitgadget@gmail.com (mailing list archive)
State	Accepted
Commit	bc756bc586a67619ec1cf44c753f6744b61f27ca
Headers	show Return-Path: <git-owner@kernel.org> Message-Id: <pull.1071.v3.git.1636031609982.gitgitgadget@gmail.com> In-Reply-To: <pull.1071.v2.git.1635942236065.gitgitgadget@gmail.com> References: <pull.1071.v2.git.1635942236065.gitgitgadget@gmail.com> Date: Thu, 04 Nov 2021 13:13:29 +0000 Subject: [PATCH v3] ci: disallow directional formatting Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsA==?= Bjarmason <avarab@gmail.com>, Taylor Blau <me@ttaylorr.com>, Johannes Schindelin <johannes.schindelin@gmx.de>, Johannes Schindelin <johannes.schindelin@gmx.de> Precedence: bulk From: Johannes Schindelin <johannes.schindelin@gmx.de>
Series	[v3] ci: disallow directional formatting \| expand [v3] ci: disallow directional formatting

[v3] ci: disallow directional formatting

Commit Message

Comments

Patch