From patchwork Fri Oct 8 19:09:52 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Philippe Blain via GitGitGadget
X-Patchwork-Id: 12546223
From: "Johannes Sixt via GitGitGadget"
Date: Fri, 08 Oct 2021 19:09:52 +0000
Subject: [PATCH v2 0/5] Fun with cpp word regex
To: git@vger.kernel.org
Cc: Ævar Arnfjörð Bjarmason, Johannes Sixt
X-Mailing-List: git@vger.kernel.org

The word regex of the cpp userdiff driver is a bit too loose and can match
too much text where the intent is to match only a number.

The first patch makes the cpp word regex tests more effective. The second
patch adds problematic test cases. The third patch fixes these problems.
The final two patches add support for digit separators and the spaceship
operator <=> (generalized comparison operator).

I left out support for hexadecimal floating point constants because that
would require tightening the regex even more to keep entire expressions
from being treated as single tokens.

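To make the over-matching concrete, here is a minimal Python sketch. The
patterns IDENT, LOOSE_NUMBER, TIGHT_NUMBER and OPERATORS and the helper
tokenize() are simplified stand-ins invented for this illustration; they are
not the expressions in userdiff.c. Python's re module also takes the first
alternative that matches rather than the longest one, so the order of the
alternatives below is part of the assumption.

```python
import re

# Hypothetical, simplified patterns for illustration only.
IDENT = r"[A-Za-z_][A-Za-z0-9_]*"

# Loose: any run of sign, digit, dot and 'e' characters counts as a "number".
LOOSE_NUMBER = r"[-+0-9.e]+[fFlL]?"

# Tighter: a number must start with a digit (optionally after a dot), an
# exponent must be followed by digits, and ' is accepted as digit separator.
TIGHT_NUMBER = (
    r"0[xXbB][0-9a-fA-F']+[lLuU]*"
    r"|\.?[0-9][0-9']*(?:\.[0-9']*)?(?:[eE][-+]?[0-9]+)?[fFlLuU]*"
)

# Longer operators come first so that <=> is not split into <= and >.
OPERATORS = r"<=>|<=|>=|==|!=|->|[-+*/<>=.]"


def tokenize(number_pattern, text):
    """Split text into word-diff style tokens using the given number pattern."""
    word = re.compile(f"{IDENT}|{number_pattern}|{OPERATORS}")
    return [m.group(0) for m in word.finditer(text)]


for sample in ("1.5-e+2+f", "str.e+65", "1'000'000", "i<=>j"):
    print(sample, tokenize(LOOSE_NUMBER, sample), tokenize(TIGHT_NUMBER, sample))
```

With the loose pattern the whole expression 1.5-e+2+f comes out as a single
token and 1'000'000 is split at the quotes; with the tighter pattern the
expression is split into its operands and operators while the number with
digit separators stays in one piece.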
Changes since V1:

 * Tests, tests, tests.
 * Polished commit messages.

Johannes Sixt (5):
  t4034/cpp: actually test that operator tokens are not split
  t4034: add tests showing problematic cpp tokenizations
  userdiff-cpp: tighten word regex
  userdiff-cpp: permit the digit-separating single-quote in numbers
  userdiff-cpp: learn the C++ spaceship operator

 t/t4034/cpp/expect | 63 +++++++++++++++++++++++-----------------------
 t/t4034/cpp/post   | 47 +++++++++++++++++++++-------------
 t/t4034/cpp/pre    | 41 +++++++++++++++++++-----------
 userdiff.c         | 10 ++++++--
 4 files changed, 94 insertions(+), 67 deletions(-)

base-commit: 225bc32a989d7a22fa6addafd4ce7dcd04675dbf
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1054%2Fj6t%2Ffun-with-cpp-word-regex-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1054/j6t/fun-with-cpp-word-regex-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1054

Range-diff vs v1:

 -:  ----------- > 1:  dd9f82ba712 t4034/cpp: actually test that operator tokens are not split
 -:  ----------- > 2:  5a84fc9cf71 t4034: add tests showing problematic cpp tokenizations
 1:  a47ab9ba20e ! 3:  d4ebe45fddc userdiff: tighten cpp word regex
    @@ Metadata
     Author: Johannes Sixt

      ## Commit message ##
    -    userdiff: tighten cpp word regex
    +    userdiff-cpp: tighten word regex

         Generally, word regex can be written such that they match tokens
         liberally and need not model the actual syntax because it can be assumed
    @@ Commit message
             .l   as in str.length
             .f   as in str.find
    +        .e   as in str.erase

         Tighten the regex in the following way:

    @@ Commit message

         For readability, factor hex- and binary numbers into an own term.

    -    As a drive-by, this fixes that floatingpoint numbers such as 12E5
    +    As a drive-by, this fixes that floating point numbers such as 12E5
         (with upper-case E) were split into two tokens.

         Signed-off-by: Johannes Sixt

    + ## t/t4034/cpp/expect ##
    +@@
    + --- a/pre
    + +++ b/post
    + @@ -1,30 +1,30 @@
    +-Foo() : x(0&&1&42) { foo0bar(x.f.Find); }
    ++Foo() : x(0&&1&42) { foo0bar(x.findFind); }
    + cout<<"Hello World!?\n"<(1 -1e10+1e10 0xabcdef) 'xy'
    ++(1 -+1e10 0xabcdef) 'xy'
    + // long double
    + 3.141592653e-10l3.141592654e+10l
    + // float
    +-120E5fE6f
    ++120E5f120E6f
    + // hex
    +-0xdeadbeaf+80xdeadBeaf+7ULL
    ++0xdeadbeaf0xdeadBeaf+8ULL7ULL
    + // octal
    + 0123456701234560
    + // binary
    + 0b10000b1100+e1
    + // expression
    +-1.5-e+2+f1.5-e+3+f
    ++1.5-e+23+f
    + // another one
    +-str.e+65.e+75
    +-[a] b->->*v d.e.*e
    ++str.e+6575
    ++[a] b->->*v d..*e
    + ~!a !~b c+++ d--- e**f g&&&h
    + a**=b c//=d e%%=f
    + a+++b c---d
    +@@ t/t4034/cpp/expect: a==!=b c!==d
    + a^^=b c||=d e&&&=f
    + a|||b
    + a?:b
    +-a===b c+=+d e-=fe-f g*=*h i/=/j k%=%l m<<=<<n o>>=>>p q&=&r s^=^t u|=|v
    ++a===b c+=+d e-=-f g*=*h i/=/j k%=%l m<<=<<n o>>=>>p q&=&r s^=^t u|=|v
    + a,b
    + a:::b
    +
      ## userdiff.c ##
     @@ userdiff.c: PATTERNS("cpp",
          /* functions/methods, variables, and compounds at top level */
 2:  9d1c05f5f41 ! 4:  dd75d19cee9 userdiff: permit the digit-separating single-quote in numbers
    @@ Metadata
     Author: Johannes Sixt

      ## Commit message ##
    -    userdiff: permit the digit-separating single-quote in numbers
    +    userdiff-cpp: permit the digit-separating single-quote in numbers

         Since C++14, the single-quote can be used as digit separator:

    @@ Commit message
             1'000'000
             0xdead'beaf

    -    Make it known to the word regex of the cpp driver, so that numbers are not
    -    split into separate tokens at the single-quotes.
    +    Make it known to the word regex of the cpp driver, so that numbers are
    +    not split into separate tokens at the single-quotes.

         Signed-off-by: Johannes Sixt

    + ## t/t4034/cpp/expect ##
    +@@
    + diff --git a/pre b/post
    +-index 1229cdb..3feae6f 100644
    ++index 60f3640..f6fbf7b 100644
    + --- a/pre
    + +++ b/post
    + @@ -1,30 +1,30 @@
    +@@ t/t4034/cpp/expect: Foo() : x(0&&1&42) { foo0bar
    + cout<<"Hello World!?\n"<(1 -+1e10 0xabcdef) 'xy'
    + // long double
    +-3.141592653e-10l3.141592654e+10l
    ++3.141'592'653e-10l3.141'592'654e+10l
    + // float
    + 120E5f120E6f
    + // hex
    +-0xdeadbeaf0xdeadBeaf+8ULL7ULL
    ++0xdead'beaf0xdead'Beaf+8ULL7ULL
    + // octal
    +-0123456701234560
    ++0123'45670123'4560
    + // binary
    +-0b10000b1100+e1
    ++0b10'000b11'00+e1
    + // expression
    + 1.5-e+23+f
    + // another one
    +
    + ## t/t4034/cpp/post ##
    +@@ t/t4034/cpp/post: Foo() : x(0&42) { bar(x.Find); }
    + cout<<"Hello World?\n"<
      ## Commit message ##
    -    userdiff: learn the C++ spaceship operator
    +    userdiff-cpp: learn the C++ spaceship operator

    -    Since C++20, the language has a generalized comparison operator. Teach
    -    the cpp driver not to separate it into <= and > tokens.
    +    Since C++20, the language has a generalized comparison operator <=>.
    +    Teach the cpp driver not to separate it into <= and > tokens.

         Signed-off-by: Johannes Sixt

    + ## t/t4034/cpp/expect ##
    +@@
    + diff --git a/pre b/post
    +-index 60f3640..f6fbf7b 100644
    ++index 144cd98..244f79c 100644
    + --- a/pre
    + +++ b/post
    + @@ -1,30 +1,30 @@
    +@@ t/t4034/cpp/expect: str.e+6575
    + a**=b c//=d e%%=f
    + a+++b c---d
    + a<<<<=b c>>>>=d
    +-a<<=b c<=<d e>>=f g>=>h
    ++a<<=b c<=<d e>>=f g>=>h i<=<=>j
    + a==!=b c!==d
    + a^^=b c||=d e&&&=f
    + a|||b
    +
    + ## t/t4034/cpp/post ##
    +@@ t/t4034/cpp/post: str.e+75
    + a*=b c/=d e%=f
    + a++b c--d
    + a<<=b c>>=d
    +-a<=b c=f g>h
    ++a<=b c=f g>h i<=>j
    + a!=b c=d
    + a^=b c|=d e&=f
    + a|b
    +
    + ## t/t4034/cpp/pre ##
    +@@ t/t4034/cpp/pre: str.e+65
    + a*b c/d e%f
    + a+b c-d
    + a<>d
    +-af g>=h
    ++af g>=h i<=j
    + a==b c!=d
    + a^b c|d e&&f
    + a||b
    +
      ## userdiff.c ##
     @@ userdiff.c: PATTERNS("cpp",
          "|0[xXbB][0-9a-fA-F']+[lLuU]*"