From patchwork Sun Oct 10 17:03:01 2021
X-Patchwork-Submitter: Johannes Sixt
X-Patchwork-Id: 12548669
Date: Sun, 10 Oct 2021 17:03:01 +0000
Subject: [PATCH v3 3/6] userdiff-cpp: tighten word regex
To: git@vger.kernel.org
Cc: Ævar Arnfjörð Bjarmason, Johannes Sixt, Johannes Sixt
From: Johannes Sixt

From: Johannes Sixt

Generally, word regexes can be written such that they match tokens
liberally and need not model the actual syntax because it can be
assumed that the regex will only be applied to syntactically correct
text.

The regex for cpp (C/C++) is too liberal, though. It regards these
sequences as single tokens:

   1+2
   1.5-e+2+f

and the following amalgams as one token:

   .l   as in str.length
   .f   as in str.find
   .e   as in str.erase

Tighten the regex in the following way:

- Accept + and - only in one position in the exponent.
  + and - are no longer regarded as the sign of a number and are
  treated by the catch-all that is not visible in the driver's regex.

- Accept a leading decimal point only when it is followed by a digit.

For readability, factor hex and binary numbers into a term of their
own.

As a drive-by, this fixes that floating point numbers such as 12E5
(with upper-case E) were split into two tokens.

Signed-off-by: Johannes Sixt
---
 t/t4034/cpp/expect | 16 ++++++++--------
 userdiff.c         |  8 +++++++-
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/t/t4034/cpp/expect b/t/t4034/cpp/expect
index 63e53a61e62..46c9460a968 100644
--- a/t/t4034/cpp/expect
+++ b/t/t4034/cpp/expect
@@ -3,24 +3,24 @@
 --- a/pre
 +++ b/post
 @@ -1,30 +1,30 @@
-Foo() : x(0&&1&42) { foo0bar(x.f.Find); }
+Foo() : x(0&&1&42) { foo0bar(x.findFind); }
 cout<<"Hello World!?\n"<(1 -1e10+1e10 0xabcdef) 'xy'
+(1 -+1e10 0xabcdef) 'xy'
 // long double
 3.141592653e-10l3.141592654e+10l
 // float
-120E5fE6f
+120E5f120E6f
 // hex
-0xdeadbeaf+80xdeadBeaf+7ULL
+0xdeadbeaf0xdeadBeaf+8ULL7ULL
 // octal
 0123456701234560
 // binary
 0b10000b1100+e1
 // expression
-1.5-e+2+f1.5-e+3+f
+1.5-e+23+f
 // another one
-str.e+65.e+75
-[a] b->->*v d.e.*e
+str.e+6575
+[a] b->->*v d..*e
 ~!a !~b c+++ d--- e**f g&&&h
 a**=b c//=d e%%=f
 a+++b c---d
@@ -30,6 +30,6 @@ a==!=b c!==d
 a^^=b c||=d e&&&=f
 a|||b
 a?:b
-a===b c+=+d e-=fe-f g*=*h i/=/j k%=%l m<<=<<n o>>=>>p q&=&r s^=^t u|=|v
+a===b c+=+d e-=-f g*=*h i/=/j k%=%l m<<=<<n o>>=>>p q&=&r s^=^t u|=|v
 a,b
 a:::b
diff --git a/userdiff.c b/userdiff.c
index d9b2ba752f0..ce2a9230703 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -54,8 +54,14 @@ PATTERNS("cpp",
 	 /* functions/methods, variables, and compounds at top level */
 	 "^((::[[:space:]]*)?[A-Za-z_].*)$",
 	 /* -- */
+	 /* identifiers and keywords */
 	 "[a-zA-Z_][a-zA-Z0-9_]*"
-	 "|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lLuU]*"
+	 /* decimal and octal integers as well as floatingpoint numbers */
+	 "|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*"
+	 /* hexadecimal and binary integers */
+	 "|0[xXbB][0-9a-fA-F]+[lLuU]*"
+	 /* floatingpoint numbers that begin with a decimal point */
+	 "|\\.[0-9]+([Ee][-+]?[0-9]+)?[fFlL]?"
 	 "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*"),
 PATTERNS("csharp",
 	 /* Keywords */
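
To see the effect of the tightened regex on the examples from the
commit message, the following Python sketch may help. It is not part
of the patch and not how git applies the regex. The alternatives are
copied from the removed and added lines in userdiff.c; everything else
is an assumption made for illustration: the helper name tokenize, the
use of \S as a stand-in for the catch-all that the driver's regex does
not show, and the emulation of POSIX leftmost-longest matching by
trying every alternative at each position and keeping the longest hit.

```python
import re

# Number and identifier alternatives before the patch (removed line).
OLD_ALTERNATIVES = [
    r"[a-zA-Z_][a-zA-Z0-9_]*",
    r"[-+0-9.e]+[fFlL]?",
    r"0[xXbB]?[0-9a-fA-F]+[lLuU]*",
]

# Number and identifier alternatives after the patch (added lines).
NEW_ALTERNATIVES = [
    r"[a-zA-Z_][a-zA-Z0-9_]*",
    r"[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*",
    r"0[xXbB][0-9a-fA-F]+[lLuU]*",
    r"\.[0-9]+([Ee][-+]?[0-9]+)?[fFlL]?",
]

# Operator alternatives, unchanged by the patch, split at each '|'.
OPERATORS = [
    r"[-+*/<>%&^|=!]=", r"--", r"\+\+", r"<<=?", r">>=?",
    r"&&", r"\|\|", r"::", r"->\*?", r"\.\*",
]

# Assumption: simplified stand-in for the catch-all, one non-space character.
CATCH_ALL = [r"\S"]


def tokenize(text, alternatives):
    """Split text into words, taking the longest alternative at each position."""
    patterns = [re.compile(p) for p in alternatives + OPERATORS + CATCH_ALL]
    tokens = []
    pos = 0
    while pos < len(text):
        longest = pos
        for pattern in patterns:
            match = pattern.match(text, pos)
            if match and match.end() > longest:
                longest = match.end()
        if longest == pos:
            pos += 1  # whitespace: nothing matches here, skip it
        else:
            tokens.append(text[pos:longest])
            pos = longest
    return tokens


if __name__ == "__main__":
    for sample in ["1+2", "1.5-e+2+f", "str.length", "12E5"]:
        print(sample, tokenize(sample, OLD_ALTERNATIVES),
              tokenize(sample, NEW_ALTERNATIVES))
```

With the old alternatives, 1+2 stays a single word and str.length
splits into str, .l and ength; with the new ones, 1+2 becomes three
words, str.length splits at the dot only, and 12E5 is kept whole.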