[v2,0/5] Fun with cpp word regex

Message ID	pull.1054.v2.git.1633720197.gitgitgadget@gmail.com (mailing list archive)
Headers	show Return-Path: <git-owner@kernel.org> Message-Id: <pull.1054.v2.git.1633720197.gitgitgadget@gmail.com> In-Reply-To: <pull.1054.git.1633589461.gitgitgadget@gmail.com> References: <pull.1054.git.1633589461.gitgitgadget@gmail.com> From: "Johannes Sixt via GitGitGadget" <gitgitgadget@gmail.com> Date: Fri, 08 Oct 2021 19:09:52 +0000 Subject: [PATCH v2 0/5] Fun with cpp word regex Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsA==?= Bjarmason <avarab@gmail.com>, Johannes Sixt <j6t@kdbg.org> Precedence: bulk
Series	Fun with cpp word regex \| expand [v2,0/5] Fun with cpp word regex [v2,1/5] t4034/cpp: actually test that operator tokens are not split [v2,2/5] t4034: add tests showing problematic cpp tokenizations [v2,3/5] userdiff-cpp: tighten word regex [v2,4/5] userdiff-cpp: permit the digit-separating single-quote in numbers [v2,5/5] userdiff-cpp: learn the C++ spaceship operator

Message ID

pull.1054.v2.git.1633720197.gitgitgadget@gmail.com (mailing list archive)

Headers

Message-Id: <pull.1054.v2.git.1633720197.gitgitgadget@gmail.com>
In-Reply-To: <pull.1054.git.1633589461.gitgitgadget@gmail.com>
References: <pull.1054.git.1633589461.gitgitgadget@gmail.com>
From: "Johannes Sixt via GitGitGadget" <gitgitgadget@gmail.com>
Date: Fri, 08 Oct 2021 19:09:52 +0000
Subject: [PATCH v2 0/5] Fun with cpp word regex
Fcc: Sent
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
To: git@vger.kernel.org
Cc: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsA==?= Bjarmason  <avarab@gmail.com>,
 Johannes Sixt <j6t@kdbg.org>
Precedence: bulk

Series

Fun with cpp word regex | expand

Message

Philippe Blain via GitGitGadget Oct. 8, 2021, 7:09 p.m. UTC

The cpp word regex driver is a bit too loose and can match too much text
where the intent is to match only a number.

The first patch makes the cpp word regex tests more effective.

The second patch adds problematic test cases. The third patch fixes these
problems.

The final two patches add support for digit separators and the spaceship
operator <=> (generalized comparison operator).

I left out support for hexadecimal floating point constants because that
would require to tighten the regex even more to avoid that entire
expressions are treated as single tokens.

Changes since V1:

 * Tests, tests, tests.
 * Polished commit messages.

Johannes Sixt (5):
  t4034/cpp: actually test that operator tokens are not split
  t4034: add tests showing problematic cpp tokenizations
  userdiff-cpp: tighten word regex
  userdiff-cpp: permit the digit-separating single-quote in numbers
  userdiff-cpp: learn the C++ spaceship operator

 t/t4034/cpp/expect | 63 +++++++++++++++++++++++-----------------------
 t/t4034/cpp/post   | 47 +++++++++++++++++++++-------------
 t/t4034/cpp/pre    | 41 +++++++++++++++++++-----------
 userdiff.c         | 10 ++++++--
 4 files changed, 94 insertions(+), 67 deletions(-)


base-commit: 225bc32a989d7a22fa6addafd4ce7dcd04675dbf
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1054%2Fj6t%2Ffun-with-cpp-word-regex-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1054/j6t/fun-with-cpp-word-regex-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1054

Range-diff vs v1:

 -:  ----------- > 1:  dd9f82ba712 t4034/cpp: actually test that operator tokens are not split
 -:  ----------- > 2:  5a84fc9cf71 t4034: add tests showing problematic cpp tokenizations
 1:  a47ab9ba20e ! 3:  d4ebe45fddc userdiff: tighten cpp word regex
     @@ Metadata
      Author: Johannes Sixt <j6t@kdbg.org>
      
       ## Commit message ##
     -    userdiff: tighten cpp word regex
     +    userdiff-cpp: tighten word regex
      
          Generally, word regex can be written such that they match tokens
          liberally and need not model the actual syntax because it can be assumed
     @@ Commit message
      
             .l      as in str.length
             .f      as in str.find
     +       .e      as in str.erase
      
          Tighten the regex in the following way:
      
     @@ Commit message
      
          For readability, factor hex- and binary numbers into an own term.
      
     -    As a drive-by, this fixes that floatingpoint numbers such as 12E5
     +    As a drive-by, this fixes that floating point numbers such as 12E5
          (with upper-case E) were split into two tokens.
      
          Signed-off-by: Johannes Sixt <j6t@kdbg.org>
      
     + ## t/t4034/cpp/expect ##
     +@@
     + <BOLD>--- a/pre<RESET>
     + <BOLD>+++ b/post<RESET>
     + <CYAN>@@ -1,30 +1,30 @@<RESET>
     +-Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>(x<RED>.f<RESET><GREEN>.F<RESET>ind); }
     ++Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>(x.<RED>find<RESET><GREEN>Find<RESET>); }
     + cout<<"Hello World<RED>!<RESET><GREEN>?<RESET>\n"<<endl;
     +-<GREEN>(<RESET>1 <RED>-1e10<RESET><GREEN>+1e10<RESET> 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
     ++<GREEN>(<RESET>1 <RED>-<RESET><GREEN>+<RESET>1e10 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
     + // long double<RESET>
     + <RED>3.141592653e-10l<RESET><GREEN>3.141592654e+10l<RESET>
     + // float<RESET>
     +-120<RED>E5f<RESET><GREEN>E6f<RESET>
     ++<RED>120E5f<RESET><GREEN>120E6f<RESET>
     + // hex<RESET>
     +-<RED>0xdeadbeaf+8<RESET><GREEN>0xdeadBeaf+7<RESET>ULL
     ++<RED>0xdeadbeaf<RESET><GREEN>0xdeadBeaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
     + // octal<RESET>
     + <RED>01234567<RESET><GREEN>01234560<RESET>
     + // binary<RESET>
     + <RED>0b1000<RESET><GREEN>0b1100<RESET>+e1
     + // expression<RESET>
     +-<RED>1.5-e+2+f<RESET><GREEN>1.5-e+3+f<RESET>
     ++1.5-e+<RED>2<RESET><GREEN>3<RESET>+f
     + // another one<RESET>
     +-str<RED>.e+65<RESET><GREEN>.e+75<RESET>
     +-[a] b<RED>-><RESET><GREEN>->*<RESET>v d<RED>.e<RESET><GREEN>.*e<RESET>
     ++str.e+<RED>65<RESET><GREEN>75<RESET>
     ++[a] b<RED>-><RESET><GREEN>->*<RESET>v d<RED>.<RESET><GREEN>.*<RESET>e
     + <GREEN>~<RESET>!a <GREEN>!<RESET>~b c<RED>++<RESET><GREEN>+<RESET> d<RED>--<RESET><GREEN>-<RESET> e*<GREEN>*<RESET>f g<RED>&<RESET><GREEN>&&<RESET>h
     + a<RED>*<RESET><GREEN>*=<RESET>b c<RED>/<RESET><GREEN>/=<RESET>d e<RED>%<RESET><GREEN>%=<RESET>f
     + a<RED>+<RESET><GREEN>++<RESET>b c<RED>-<RESET><GREEN>--<RESET>d
     +@@ t/t4034/cpp/expect: a<RED>==<RESET><GREEN>!=<RESET>b c<RED>!=<RESET><GREEN>=<RESET>d
     + a<RED>^<RESET><GREEN>^=<RESET>b c<RED>|<RESET><GREEN>|=<RESET>d e<RED>&&<RESET><GREEN>&=<RESET>f
     + a<RED>||<RESET><GREEN>|<RESET>b
     + a?<GREEN>:<RESET>b
     +-a<RED>=<RESET><GREEN>==<RESET>b c<RED>+=<RESET><GREEN>+<RESET>d <RED>e-=f<RESET><GREEN>e-f<RESET> g<RED>*=<RESET><GREEN>*<RESET>h i<RED>/=<RESET><GREEN>/<RESET>j k<RED>%=<RESET><GREEN>%<RESET>l m<RED><<=<RESET><GREEN><<<RESET>n o<RED>>>=<RESET><GREEN>>><RESET>p q<RED>&=<RESET><GREEN>&<RESET>r s<RED>^=<RESET><GREEN>^<RESET>t u<RED>|=<RESET><GREEN>|<RESET>v
     ++a<RED>=<RESET><GREEN>==<RESET>b c<RED>+=<RESET><GREEN>+<RESET>d e<RED>-=<RESET><GREEN>-<RESET>f g<RED>*=<RESET><GREEN>*<RESET>h i<RED>/=<RESET><GREEN>/<RESET>j k<RED>%=<RESET><GREEN>%<RESET>l m<RED><<=<RESET><GREEN><<<RESET>n o<RED>>>=<RESET><GREEN>>><RESET>p q<RED>&=<RESET><GREEN>&<RESET>r s<RED>^=<RESET><GREEN>^<RESET>t u<RED>|=<RESET><GREEN>|<RESET>v
     + a,b<RESET>
     + a<RED>::<RESET><GREEN>:<RESET>b
     +
       ## userdiff.c ##
      @@ userdiff.c: PATTERNS("cpp",
       	 /* functions/methods, variables, and compounds at top level */
 2:  9d1c05f5f41 ! 4:  dd75d19cee9 userdiff: permit the digit-separating single-quote in numbers
     @@ Metadata
      Author: Johannes Sixt <j6t@kdbg.org>
      
       ## Commit message ##
     -    userdiff: permit the digit-separating single-quote in numbers
     +    userdiff-cpp: permit the digit-separating single-quote in numbers
      
          Since C++17, the single-quote can be used as digit separator:
      
     @@ Commit message
             1'000'000
             0xdead'beaf
      
     -    Make it known to the word regex of the cpp driver, so that numbers are not
     -    split into separate tokens at the single-quotes.
     +    Make it known to the word regex of the cpp driver, so that numbers are
     +    not split into separate tokens at the single-quotes.
      
          Signed-off-by: Johannes Sixt <j6t@kdbg.org>
      
     + ## t/t4034/cpp/expect ##
     +@@
     + <BOLD>diff --git a/pre b/post<RESET>
     +-<BOLD>index 1229cdb..3feae6f 100644<RESET>
     ++<BOLD>index 60f3640..f6fbf7b 100644<RESET>
     + <BOLD>--- a/pre<RESET>
     + <BOLD>+++ b/post<RESET>
     + <CYAN>@@ -1,30 +1,30 @@<RESET>
     +@@ t/t4034/cpp/expect: Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>
     + cout<<"Hello World<RED>!<RESET><GREEN>?<RESET>\n"<<endl;
     + <GREEN>(<RESET>1 <RED>-<RESET><GREEN>+<RESET>1e10 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
     + // long double<RESET>
     +-<RED>3.141592653e-10l<RESET><GREEN>3.141592654e+10l<RESET>
     ++<RED>3.141'592'653e-10l<RESET><GREEN>3.141'592'654e+10l<RESET>
     + // float<RESET>
     + <RED>120E5f<RESET><GREEN>120E6f<RESET>
     + // hex<RESET>
     +-<RED>0xdeadbeaf<RESET><GREEN>0xdeadBeaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
     ++<RED>0xdead'beaf<RESET><GREEN>0xdead'Beaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
     + // octal<RESET>
     +-<RED>01234567<RESET><GREEN>01234560<RESET>
     ++<RED>0123'4567<RESET><GREEN>0123'4560<RESET>
     + // binary<RESET>
     +-<RED>0b1000<RESET><GREEN>0b1100<RESET>+e1
     ++<RED>0b10'00<RESET><GREEN>0b11'00<RESET>+e1
     + // expression<RESET>
     + 1.5-e+<RED>2<RESET><GREEN>3<RESET>+f
     + // another one<RESET>
     +
     + ## t/t4034/cpp/post ##
     +@@ t/t4034/cpp/post: Foo() : x(0&42) { bar(x.Find); }
     + cout<<"Hello World?\n"<<endl;
     + (1 +1e10 0xabcdef) 'y'
     + // long double
     +-3.141592654e+10l
     ++3.141'592'654e+10l
     + // float
     + 120E6f
     + // hex
     +-0xdeadBeaf+7ULL
     ++0xdead'Beaf+7ULL
     + // octal
     +-01234560
     ++0123'4560
     + // binary
     +-0b1100+e1
     ++0b11'00+e1
     + // expression
     + 1.5-e+3+f
     + // another one
     +
     + ## t/t4034/cpp/pre ##
     +@@ t/t4034/cpp/pre: Foo():x(0&&1){ foo0( x.find); }
     + cout<<"Hello World!\n"<<endl;
     + 1 -1e10 0xabcdef 'x'
     + // long double
     +-3.141592653e-10l
     ++3.141'592'653e-10l
     + // float
     + 120E5f
     + // hex
     +-0xdeadbeaf+8ULL
     ++0xdead'beaf+8ULL
     + // octal
     +-01234567
     ++0123'4567
     + // binary
     +-0b1000+e1
     ++0b10'00+e1
     + // expression
     + 1.5-e+2+f
     + // another one
     +
       ## userdiff.c ##
      @@ userdiff.c: PATTERNS("cpp",
       	 /* identifiers and keywords */
 3:  df66485e7f0 ! 5:  43a701f5ffd userdiff: learn the C++ spaceship operator
     @@ Metadata
      Author: Johannes Sixt <j6t@kdbg.org>
      
       ## Commit message ##
     -    userdiff: learn the C++ spaceship operator
     +    userdiff-cpp: learn the C++ spaceship operator
      
     -    Since C++20, the language has a generalized comparison operator. Teach
     -    the cpp driver not to separate it into <= and > tokens.
     +    Since C++20, the language has a generalized comparison operator <=>.
     +    Teach the cpp driver not to separate it into <= and > tokens.
      
          Signed-off-by: Johannes Sixt <j6t@kdbg.org>
      
     + ## t/t4034/cpp/expect ##
     +@@
     + <BOLD>diff --git a/pre b/post<RESET>
     +-<BOLD>index 60f3640..f6fbf7b 100644<RESET>
     ++<BOLD>index 144cd98..244f79c 100644<RESET>
     + <BOLD>--- a/pre<RESET>
     + <BOLD>+++ b/post<RESET>
     + <CYAN>@@ -1,30 +1,30 @@<RESET>
     +@@ t/t4034/cpp/expect: str.e+<RED>65<RESET><GREEN>75<RESET>
     + a<RED>*<RESET><GREEN>*=<RESET>b c<RED>/<RESET><GREEN>/=<RESET>d e<RED>%<RESET><GREEN>%=<RESET>f
     + a<RED>+<RESET><GREEN>++<RESET>b c<RED>-<RESET><GREEN>--<RESET>d
     + a<RED><<<RESET><GREEN><<=<RESET>b c<RED>>><RESET><GREEN>>>=<RESET>d
     +-a<RED><<RESET><GREEN><=<RESET>b c<RED><=<RESET><GREEN><<RESET>d e<RED>><RESET><GREEN>>=<RESET>f g<RED>>=<RESET><GREEN>><RESET>h
     ++a<RED><<RESET><GREEN><=<RESET>b c<RED><=<RESET><GREEN><<RESET>d e<RED>><RESET><GREEN>>=<RESET>f g<RED>>=<RESET><GREEN>><RESET>h i<RED><=<RESET><GREEN><=><RESET>j
     + a<RED>==<RESET><GREEN>!=<RESET>b c<RED>!=<RESET><GREEN>=<RESET>d
     + a<RED>^<RESET><GREEN>^=<RESET>b c<RED>|<RESET><GREEN>|=<RESET>d e<RED>&&<RESET><GREEN>&=<RESET>f
     + a<RED>||<RESET><GREEN>|<RESET>b
     +
     + ## t/t4034/cpp/post ##
     +@@ t/t4034/cpp/post: str.e+75
     + a*=b c/=d e%=f
     + a++b c--d
     + a<<=b c>>=d
     +-a<=b c<d e>=f g>h
     ++a<=b c<d e>=f g>h i<=>j
     + a!=b c=d
     + a^=b c|=d e&=f
     + a|b
     +
     + ## t/t4034/cpp/pre ##
     +@@ t/t4034/cpp/pre: str.e+65
     + a*b c/d e%f
     + a+b c-d
     + a<<b c>>d
     +-a<b c<=d e>f g>=h
     ++a<b c<=d e>f g>=h i<=j
     + a==b c!=d
     + a^b c|d e&&f
     + a||b
     +
       ## userdiff.c ##
      @@ userdiff.c: PATTERNS("cpp",
       	 "|0[xXbB][0-9a-fA-F']+[lLuU]*"

Comments

Ævar Arnfjörð Bjarmason Oct. 8, 2021, 8:07 p.m. UTC | #1

On Fri, Oct 08 2021, Johannes Sixt via GitGitGadget wrote:

> The cpp word regex driver is a bit too loose and can match too much text
> where the intent is to match only a number.
>
> The first patch makes the cpp word regex tests more effective.
>
> The second patch adds problematic test cases. The third patch fixes these
> problems.
>
> The final two patches add support for digit separators and the spaceship
> operator <=> (generalized comparison operator).
>
> I left out support for hexadecimal floating point constants because that
> would require to tighten the regex even more to avoid that entire
> expressions are treated as single tokens.
>
> Changes since V1:
>
>  * Tests, tests, tests.
>  * Polished commit messages.

I've read this much improved v2 over, thanks a lot. This looks much
better.

Just some general comments, in my mind this is more than good enough
already and doesn't require a re-roll, just food for thought:

 * I wonder if it isn't time to split up "cpp" into a "c" driver,
   e.g. git.git's .gitattributes has "cpp" for *.[ch] files, but as C++
   adds more syntax sugar.

   So e.g. if you use "<=>" after this series we'll tokenize it
   differently in *.c files, but it's a C++-only operator, on the other
   hand probably nobody cares that much...

 * I found myself back-porting some of your tests (manually mostly),
   maybe you disagree, but in cases like 123'123, <=> etc. I'd find it
   easier to follow if we first added the test data, and then the
   changed behavior.

   Because after all, we're going to change how we highlight existing
   data, so testing for that would be informative.

 * This pre-dates your much improved tests, but these test files could
   really use some test comments, as in:

   /* Now that we're going to understand the "'" character somehow, will any of this change? */
   /* We haven't written code like this since the 1960's ... */
   /* Run & free */

   I.e. we don't just highlight code the compiler likes to eat, but also
   comments. So particularly for smaller tokens that also occur in
   natural language like "'" and "&" are we getting expected results?

   It looked like it from skimming your changes, i.e. the ' rule is
   checked by surrounding digits, just might be handy to test it ... :)

> Johannes Sixt (5):
>   t4034/cpp: actually test that operator tokens are not split
>   t4034: add tests showing problematic cpp tokenizations
>   userdiff-cpp: tighten word regex
>   userdiff-cpp: permit the digit-separating single-quote in numbers
>   userdiff-cpp: learn the C++ spaceship operator
>
>  t/t4034/cpp/expect | 63 +++++++++++++++++++++++-----------------------
>  t/t4034/cpp/post   | 47 +++++++++++++++++++++-------------
>  t/t4034/cpp/pre    | 41 +++++++++++++++++++-----------
>  userdiff.c         | 10 ++++++--
>  4 files changed, 94 insertions(+), 67 deletions(-)
>
>
> base-commit: 225bc32a989d7a22fa6addafd4ce7dcd04675dbf
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1054%2Fj6t%2Ffun-with-cpp-word-regex-v2
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1054/j6t/fun-with-cpp-word-regex-v2
> Pull-Request: https://github.com/gitgitgadget/git/pull/1054
>
> Range-diff vs v1:
>
>  -:  ----------- > 1:  dd9f82ba712 t4034/cpp: actually test that operator tokens are not split
>  -:  ----------- > 2:  5a84fc9cf71 t4034: add tests showing problematic cpp tokenizations
>  1:  a47ab9ba20e ! 3:  d4ebe45fddc userdiff: tighten cpp word regex
>      @@ Metadata
>       Author: Johannes Sixt <j6t@kdbg.org>
>       
>        ## Commit message ##
>      -    userdiff: tighten cpp word regex
>      +    userdiff-cpp: tighten word regex
>       
>           Generally, word regex can be written such that they match tokens
>           liberally and need not model the actual syntax because it can be assumed
>      @@ Commit message
>       
>              .l      as in str.length
>              .f      as in str.find
>      +       .e      as in str.erase
>       
>           Tighten the regex in the following way:
>       
>      @@ Commit message
>       
>           For readability, factor hex- and binary numbers into an own term.
>       
>      -    As a drive-by, this fixes that floatingpoint numbers such as 12E5
>      +    As a drive-by, this fixes that floating point numbers such as 12E5
>           (with upper-case E) were split into two tokens.
>       
>           Signed-off-by: Johannes Sixt <j6t@kdbg.org>
>       
>      + ## t/t4034/cpp/expect ##
>      +@@
>      + <BOLD>--- a/pre<RESET>
>      + <BOLD>+++ b/post<RESET>
>      + <CYAN>@@ -1,30 +1,30 @@<RESET>
>      +-Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>(x<RED>.f<RESET><GREEN>.F<RESET>ind); }
>      ++Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>(x.<RED>find<RESET><GREEN>Find<RESET>); }
>      + cout<<"Hello World<RED>!<RESET><GREEN>?<RESET>\n"<<endl;
>      +-<GREEN>(<RESET>1 <RED>-1e10<RESET><GREEN>+1e10<RESET> 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
>      ++<GREEN>(<RESET>1 <RED>-<RESET><GREEN>+<RESET>1e10 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
>      + // long double<RESET>
>      + <RED>3.141592653e-10l<RESET><GREEN>3.141592654e+10l<RESET>
>      + // float<RESET>
>      +-120<RED>E5f<RESET><GREEN>E6f<RESET>
>      ++<RED>120E5f<RESET><GREEN>120E6f<RESET>
>      + // hex<RESET>
>      +-<RED>0xdeadbeaf+8<RESET><GREEN>0xdeadBeaf+7<RESET>ULL
>      ++<RED>0xdeadbeaf<RESET><GREEN>0xdeadBeaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
>      + // octal<RESET>
>      + <RED>01234567<RESET><GREEN>01234560<RESET>
>      + // binary<RESET>
>      + <RED>0b1000<RESET><GREEN>0b1100<RESET>+e1
>      + // expression<RESET>
>      +-<RED>1.5-e+2+f<RESET><GREEN>1.5-e+3+f<RESET>
>      ++1.5-e+<RED>2<RESET><GREEN>3<RESET>+f
>      + // another one<RESET>
>      +-str<RED>.e+65<RESET><GREEN>.e+75<RESET>
>      +-[a] b<RED>-><RESET><GREEN>->*<RESET>v d<RED>.e<RESET><GREEN>.*e<RESET>
>      ++str.e+<RED>65<RESET><GREEN>75<RESET>
>      ++[a] b<RED>-><RESET><GREEN>->*<RESET>v d<RED>.<RESET><GREEN>.*<RESET>e
>      + <GREEN>~<RESET>!a <GREEN>!<RESET>~b c<RED>++<RESET><GREEN>+<RESET> d<RED>--<RESET><GREEN>-<RESET> e*<GREEN>*<RESET>f g<RED>&<RESET><GREEN>&&<RESET>h
>      + a<RED>*<RESET><GREEN>*=<RESET>b c<RED>/<RESET><GREEN>/=<RESET>d e<RED>%<RESET><GREEN>%=<RESET>f
>      + a<RED>+<RESET><GREEN>++<RESET>b c<RED>-<RESET><GREEN>--<RESET>d
>      +@@ t/t4034/cpp/expect: a<RED>==<RESET><GREEN>!=<RESET>b c<RED>!=<RESET><GREEN>=<RESET>d
>      + a<RED>^<RESET><GREEN>^=<RESET>b c<RED>|<RESET><GREEN>|=<RESET>d e<RED>&&<RESET><GREEN>&=<RESET>f
>      + a<RED>||<RESET><GREEN>|<RESET>b
>      + a?<GREEN>:<RESET>b
>      +-a<RED>=<RESET><GREEN>==<RESET>b c<RED>+=<RESET><GREEN>+<RESET>d <RED>e-=f<RESET><GREEN>e-f<RESET> g<RED>*=<RESET><GREEN>*<RESET>h i<RED>/=<RESET><GREEN>/<RESET>j k<RED>%=<RESET><GREEN>%<RESET>l m<RED><<=<RESET><GREEN><<<RESET>n o<RED>>>=<RESET><GREEN>>><RESET>p q<RED>&=<RESET><GREEN>&<RESET>r s<RED>^=<RESET><GREEN>^<RESET>t u<RED>|=<RESET><GREEN>|<RESET>v
>      ++a<RED>=<RESET><GREEN>==<RESET>b c<RED>+=<RESET><GREEN>+<RESET>d e<RED>-=<RESET><GREEN>-<RESET>f g<RED>*=<RESET><GREEN>*<RESET>h i<RED>/=<RESET><GREEN>/<RESET>j k<RED>%=<RESET><GREEN>%<RESET>l m<RED><<=<RESET><GREEN><<<RESET>n o<RED>>>=<RESET><GREEN>>><RESET>p q<RED>&=<RESET><GREEN>&<RESET>r s<RED>^=<RESET><GREEN>^<RESET>t u<RED>|=<RESET><GREEN>|<RESET>v
>      + a,b<RESET>
>      + a<RED>::<RESET><GREEN>:<RESET>b
>      +
>        ## userdiff.c ##
>       @@ userdiff.c: PATTERNS("cpp",
>        	 /* functions/methods, variables, and compounds at top level */
>  2:  9d1c05f5f41 ! 4:  dd75d19cee9 userdiff: permit the digit-separating single-quote in numbers
>      @@ Metadata
>       Author: Johannes Sixt <j6t@kdbg.org>
>       
>        ## Commit message ##
>      -    userdiff: permit the digit-separating single-quote in numbers
>      +    userdiff-cpp: permit the digit-separating single-quote in numbers
>       
>           Since C++17, the single-quote can be used as digit separator:
>       
>      @@ Commit message
>              1'000'000
>              0xdead'beaf
>       
>      -    Make it known to the word regex of the cpp driver, so that numbers are not
>      -    split into separate tokens at the single-quotes.
>      +    Make it known to the word regex of the cpp driver, so that numbers are
>      +    not split into separate tokens at the single-quotes.
>       
>           Signed-off-by: Johannes Sixt <j6t@kdbg.org>
>       
>      + ## t/t4034/cpp/expect ##
>      +@@
>      + <BOLD>diff --git a/pre b/post<RESET>
>      +-<BOLD>index 1229cdb..3feae6f 100644<RESET>
>      ++<BOLD>index 60f3640..f6fbf7b 100644<RESET>
>      + <BOLD>--- a/pre<RESET>
>      + <BOLD>+++ b/post<RESET>
>      + <CYAN>@@ -1,30 +1,30 @@<RESET>
>      +@@ t/t4034/cpp/expect: Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>
>      + cout<<"Hello World<RED>!<RESET><GREEN>?<RESET>\n"<<endl;
>      + <GREEN>(<RESET>1 <RED>-<RESET><GREEN>+<RESET>1e10 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
>      + // long double<RESET>
>      +-<RED>3.141592653e-10l<RESET><GREEN>3.141592654e+10l<RESET>
>      ++<RED>3.141'592'653e-10l<RESET><GREEN>3.141'592'654e+10l<RESET>
>      + // float<RESET>
>      + <RED>120E5f<RESET><GREEN>120E6f<RESET>
>      + // hex<RESET>
>      +-<RED>0xdeadbeaf<RESET><GREEN>0xdeadBeaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
>      ++<RED>0xdead'beaf<RESET><GREEN>0xdead'Beaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
>      + // octal<RESET>
>      +-<RED>01234567<RESET><GREEN>01234560<RESET>
>      ++<RED>0123'4567<RESET><GREEN>0123'4560<RESET>
>      + // binary<RESET>
>      +-<RED>0b1000<RESET><GREEN>0b1100<RESET>+e1
>      ++<RED>0b10'00<RESET><GREEN>0b11'00<RESET>+e1
>      + // expression<RESET>
>      + 1.5-e+<RED>2<RESET><GREEN>3<RESET>+f
>      + // another one<RESET>
>      +
>      + ## t/t4034/cpp/post ##
>      +@@ t/t4034/cpp/post: Foo() : x(0&42) { bar(x.Find); }
>      + cout<<"Hello World?\n"<<endl;
>      + (1 +1e10 0xabcdef) 'y'
>      + // long double
>      +-3.141592654e+10l
>      ++3.141'592'654e+10l
>      + // float
>      + 120E6f
>      + // hex
>      +-0xdeadBeaf+7ULL
>      ++0xdead'Beaf+7ULL
>      + // octal
>      +-01234560
>      ++0123'4560
>      + // binary
>      +-0b1100+e1
>      ++0b11'00+e1
>      + // expression
>      + 1.5-e+3+f
>      + // another one
>      +
>      + ## t/t4034/cpp/pre ##
>      +@@ t/t4034/cpp/pre: Foo():x(0&&1){ foo0( x.find); }
>      + cout<<"Hello World!\n"<<endl;
>      + 1 -1e10 0xabcdef 'x'
>      + // long double
>      +-3.141592653e-10l
>      ++3.141'592'653e-10l
>      + // float
>      + 120E5f
>      + // hex
>      +-0xdeadbeaf+8ULL
>      ++0xdead'beaf+8ULL
>      + // octal
>      +-01234567
>      ++0123'4567
>      + // binary
>      +-0b1000+e1
>      ++0b10'00+e1
>      + // expression
>      + 1.5-e+2+f
>      + // another one
>      +
>        ## userdiff.c ##
>       @@ userdiff.c: PATTERNS("cpp",
>        	 /* identifiers and keywords */
>  3:  df66485e7f0 ! 5:  43a701f5ffd userdiff: learn the C++ spaceship operator
>      @@ Metadata
>       Author: Johannes Sixt <j6t@kdbg.org>
>       
>        ## Commit message ##
>      -    userdiff: learn the C++ spaceship operator
>      +    userdiff-cpp: learn the C++ spaceship operator
>       
>      -    Since C++20, the language has a generalized comparison operator. Teach
>      -    the cpp driver not to separate it into <= and > tokens.
>      +    Since C++20, the language has a generalized comparison operator <=>.
>      +    Teach the cpp driver not to separate it into <= and > tokens.
>       
>           Signed-off-by: Johannes Sixt <j6t@kdbg.org>
>       
>      + ## t/t4034/cpp/expect ##
>      +@@
>      + <BOLD>diff --git a/pre b/post<RESET>
>      +-<BOLD>index 60f3640..f6fbf7b 100644<RESET>
>      ++<BOLD>index 144cd98..244f79c 100644<RESET>
>      + <BOLD>--- a/pre<RESET>
>      + <BOLD>+++ b/post<RESET>
>      + <CYAN>@@ -1,30 +1,30 @@<RESET>
>      +@@ t/t4034/cpp/expect: str.e+<RED>65<RESET><GREEN>75<RESET>
>      + a<RED>*<RESET><GREEN>*=<RESET>b c<RED>/<RESET><GREEN>/=<RESET>d e<RED>%<RESET><GREEN>%=<RESET>f
>      + a<RED>+<RESET><GREEN>++<RESET>b c<RED>-<RESET><GREEN>--<RESET>d
>      + a<RED><<<RESET><GREEN><<=<RESET>b c<RED>>><RESET><GREEN>>>=<RESET>d
>      +-a<RED><<RESET><GREEN><=<RESET>b c<RED><=<RESET><GREEN><<RESET>d e<RED>><RESET><GREEN>>=<RESET>f g<RED>>=<RESET><GREEN>><RESET>h
>      ++a<RED><<RESET><GREEN><=<RESET>b c<RED><=<RESET><GREEN><<RESET>d e<RED>><RESET><GREEN>>=<RESET>f g<RED>>=<RESET><GREEN>><RESET>h i<RED><=<RESET><GREEN><=><RESET>j
>      + a<RED>==<RESET><GREEN>!=<RESET>b c<RED>!=<RESET><GREEN>=<RESET>d
>      + a<RED>^<RESET><GREEN>^=<RESET>b c<RED>|<RESET><GREEN>|=<RESET>d e<RED>&&<RESET><GREEN>&=<RESET>f
>      + a<RED>||<RESET><GREEN>|<RESET>b
>      +
>      + ## t/t4034/cpp/post ##
>      +@@ t/t4034/cpp/post: str.e+75
>      + a*=b c/=d e%=f
>      + a++b c--d
>      + a<<=b c>>=d
>      +-a<=b c<d e>=f g>h
>      ++a<=b c<d e>=f g>h i<=>j
>      + a!=b c=d
>      + a^=b c|=d e&=f
>      + a|b
>      +
>      + ## t/t4034/cpp/pre ##
>      +@@ t/t4034/cpp/pre: str.e+65
>      + a*b c/d e%f
>      + a+b c-d
>      + a<<b c>>d
>      +-a<b c<=d e>f g>=h
>      ++a<b c<=d e>f g>=h i<=j
>      + a==b c!=d
>      + a^b c|d e&&f
>      + a||b
>      +
>        ## userdiff.c ##
>       @@ userdiff.c: PATTERNS("cpp",
>        	 "|0[xXbB][0-9a-fA-F']+[lLuU]*"

Johannes Sixt Oct. 8, 2021, 10:11 p.m. UTC | #2

Am 08.10.21 um 22:07 schrieb Ævar Arnfjörð Bjarmason:
>  * I wonder if it isn't time to split up "cpp" into a "c" driver,
>    e.g. git.git's .gitattributes has "cpp" for *.[ch] files, but as C++
>    adds more syntax sugar.
> 
>    So e.g. if you use "<=>" after this series we'll tokenize it
>    differently in *.c files, but it's a C++-only operator, on the other
>    hand probably nobody cares that much...

Yes, it is that: <=> won't appear in a correct C file (outside of
comments), so no-one will care. As far as tokenization is concerned, C
is a subset of C++. I don't think we need to separate the drivers.

>  * I found myself back-porting some of your tests (manually mostly),
>    maybe you disagree, but in cases like 123'123, <=> etc. I'd find it
>    easier to follow if we first added the test data, and then the
>    changed behavior.
> 
>    Because after all, we're going to change how we highlight existing
>    data, so testing for that would be informative.

Good point. I'll work a bit more on that.

>  * This pre-dates your much improved tests, but these test files could
>    really use some test comments, as in:
> 
>    /* Now that we're going to understand the "'" character somehow, will any of this change? */
>    /* We haven't written code like this since the 1960's ... */
>    /* Run & free */
> 
>    I.e. we don't just highlight code the compiler likes to eat, but also
>    comments. So particularly for smaller tokens that also occur in
>    natural language like "'" and "&" are we getting expected results?

Comments are free text. Anything can happen. There is no such thing as
"correct tokenization" in comments. Not interested.

Thank you for the review.
-- Hannes

Ævar Arnfjörð Bjarmason Oct. 9, 2021, midnight UTC | #3

On Sat, Oct 09 2021, Johannes Sixt wrote:

> Am 08.10.21 um 22:07 schrieb Ævar Arnfjörð Bjarmason:

[re-arranged a bit]

> [...]As far as tokenization is concerned, C
> is a subset of C++. I don't think we need to separate the drivers.
>
>>  * I found myself back-porting some of your tests (manually mostly),
>>    maybe you disagree, but in cases like 123'123, <=> etc. I'd find it
>>    easier to follow if we first added the test data, and then the
>>    changed behavior.
>> 
>>    Because after all, we're going to change how we highlight existing
>>    data, so testing for that would be informative.
>
> Good point. I'll work a bit more on that.

Great!

>>  * I wonder if it isn't time to split up "cpp" into a "c" driver,
>>    e.g. git.git's .gitattributes has "cpp" for *.[ch] files, but as C++
>>    adds more syntax sugar.
>> 
>>    So e.g. if you use "<=>" after this series we'll tokenize it
>>    differently in *.c files, but it's a C++-only operator, on the other
>>    hand probably nobody cares that much...
>
> Yes, it is that: <=> won't appear in a correct C file (outside of
> comments), so no-one will care. [...]

..mmm..

>>  * This pre-dates your much improved tests, but these test files could
>>    really use some test comments, as in:
>> 
>>    /* Now that we're going to understand the "'" character somehow, will any of this change? */
>>    /* We haven't written code like this since the 1960's ... */
>>    /* Run & free */
>> 
>>    I.e. we don't just highlight code the compiler likes to eat, but also
>>    comments. So particularly for smaller tokens that also occur in
>>    natural language like "'" and "&" are we getting expected results?
>
> Comments are free text. Anything can happen. There is no such thing as
> "correct tokenization" in comments. Not interested.

Sure there is, just because the problem is fuzzy doesn't mean there
aren't more and less correct things to do.

But most importantly the output of "git diff" is made for human
consumption, people who use --word-diff are going to be looking at code
that contains comments, embedded natural language in C-strings etc.

I've got no reason to think that your changes here make it worse, but
just as a general matter we absolutely should consider that one of the
top priorities when it comes to these language drivers.

E.g. in some languages (like CJK) it's common for characters or words
not to have any unicode whitespace between them, so even a word-diff
mode for C can benefit from recognizing those character ranges and
splitting them appropriately, or at least trying.

So to take a comment of yours where you changed a comment at random:

    $ git log -U0 --oneline -1 --word-diff -p af920e36977 --word-diff-regex='[ a-zA-Z]*'
    [...]
    [- Note that this seemingly redundant second declaration is required-]{+ Note that this redundant forward declaration is required+}

Don't you think that would suck v.s. the now-behavior of:

    [...]
    Note that this[-seemingly-] redundant [-second-]{+forward+} declaration is required

The former is exactly the sort of thing you'd get in CJK languages with
a word-diff driver thought the line we should stop at was the same as a
comiler tokenizer.

Anyway, the cpp driver seems to do just fine on CJK.

I'm just saying that as a general thing it's definitely a priority for
these drivers to not only handle the narrow cases of tokens a compiler
would know about. Text that people commonly use should be presented in
some way that isn't line noise.

For an example of something we do a bit badly with the cpp driver is
parts of my 66f5f6dca95 (C style: use standard style for "TRANSLATORS"
comments, 2017-05-11).

I.e. there I was changing a comment format, and added a full stop to a
sentence, the word-diff is:

        /*
         {+*+} TRANSLATORS: here is a comment that explains the string to
         {+*+} be translated, that follows immediately after [-it-]{+it.+}
         */

Even though it has nothing to do with C syntax per-se that would be much more useful as:

        /*
         {+*+} TRANSLATORS: here is a comment that explains the string to
         {+*+} be translated, that follows immediately after it{+.+}
         */

I.e. treating a "." at the end of a word specially isn't C or C++
syntax, but it's absolutely input that the cpp driver *is* getting and
should be if possible be handling well.

I just did that by experimenting with
--word-diff-regex='([A-Za-z:]*|\*|\.)', that example is unchanged with
your series, but maybe low-hanging fruit....

Thanks for working on this, just an unsolicited braindump :)

Johannes Sixt Oct. 10, 2021, 8:15 p.m. UTC | #4

Am 09.10.21 um 02:00 schrieb Ævar Arnfjörð Bjarmason:
> For an example of something we do a bit badly with the cpp driver is
> parts of my 66f5f6dca95 (C style: use standard style for "TRANSLATORS"
> comments, 2017-05-11).
> 
> I.e. there I was changing a comment format, and added a full stop to a
> sentence, the word-diff is:
> 
>         /*
>          {+*+} TRANSLATORS: here is a comment that explains the string to
>          {+*+} be translated, that follows immediately after [-it-]{+it.+}
>          */
> 
> Even though it has nothing to do with C syntax per-se that would be much more useful as:
> 
>         /*
>          {+*+} TRANSLATORS: here is a comment that explains the string to
>          {+*+} be translated, that follows immediately after it{+.+}
>          */
> 
> I.e. treating a "." at the end of a word specially isn't C or C++
> syntax, but it's absolutely input that the cpp driver *is* getting and
> should be if possible be handling well.

FWIW, I wondered why the cpp driver should not handle this case as you
expect it, and the answer is that it actually does when this comment
appears in a C file. This particular word-diff occurs in
Documentation/CodingGuidelines and that is not covered by the cpp driver.

-- Hannes