From patchwork Fri Apr 21 21:11:48 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Paul Eggert
X-Patchwork-Id: 13220683
Message-ID: <508ca102-63a9-6334-fee8-7a1ae84c7a23@cs.ucla.edu>
Date: Fri, 21 Apr 2023 14:11:48 -0700
To: Carlo Marcelo Arenas Belón, Jim Meyering
Cc: grep-devel@gnu.org, demerphq, pcre2-dev@googlegroups.com,
 Ævar Arnfjörð Bjarmason, Junio C Hamano, git@vger.kernel.org
References: <4322c414-2bb7-924f-0f6d-dbf517599c3f@cs.ucla.edu>
From: Paul Eggert
Organization: UCLA Computer Science Department
Subject: Compatibility between GNU and Git grep -P
X-Mailing-List: git@vger.kernel.org

Carlo Marcelo Arenas Belón wrote:

> After using this for a while think the following will be better suited
> for a release because:
>
> * the unreleased PCRE2 code is still changing and is unlikely to be released
>   for a couple of months.
> * the current way to configure PCRE2 make it difficult to link with the
>   unreleased code (this might be an independent bug), but it is likely that
>   the wrong headers might be used by mistake.
> * the tests and documentation were not completely accurate.

Thanks for looking into this.
I'm concerned about the resulting patches, though, because I see recent
activity on the Git grep -P side here:

https://lore.kernel.org/git/xmqqzgaf2zpt.fsf@gitster.g/

Bleeding-edge (i.e., "master") GNU grep uses PCRE2_UCP |
PCRE2_EXTRA_ASCII_BSD with unreleased PCRE2 (which introduces
PCRE2_EXTRA_ASCII_BSD), and it uses neither flag with the current PCRE2
release. You're proposing to change GNU grep to never use either flag,
regardless of PCRE2 release.

In contrast, bleeding-edge (i.e., "next") Git grep -P always uses
PCRE2_UCP and never uses PCRE2_EXTRA_ASCII_BSD. That is, it disagrees
with GNU grep whether or not your proposed changes are adopted.

Given Jim's strong desire that \d should match only ASCII digits, I
doubt whether GNU grep will simply use PCRE2_UCP without
PCRE2_EXTRA_ASCII_BSD.

If we want the two grep -P's to stay compatible, I see two ways forward:

1. Leave GNU grep alone and modify Git grep to behave like GNU grep
   (see attached patch to Git).

2. Adopt your proposed change to GNU grep, and revert the recent change
   to Git grep so that it never uses PCRE2_UCP.

Either way, we should see what the Git folks say about this.

From 5f5e54157a01c540bde02c305c8ee5e1a39d4f1c Mon Sep 17 00:00:00 2001
From: Paul Eggert
Date: Fri, 21 Apr 2023 14:06:25 -0700
Subject: [PATCH] grep: be compatible with GNU grep -P

Use PCRE2_UCP only when PCRE2_EXTRA_ASCII_BSD is defined, for
compatibility with GNU grep.
---
 grep.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/grep.c b/grep.c
index 073559f2cd..e9dc8dc0bc 100644
--- a/grep.c
+++ b/grep.c
@@ -320,8 +320,13 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		}
 		options |= PCRE2_CASELESS;
 	}
-	if (!opt->ignore_locale && is_utf8_locale() && !literal)
-		options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
+	if (!opt->ignore_locale && is_utf8_locale() && !literal) {
+		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
+#ifdef PCRE2_EXTRA_ASCII_BSD
+		/* Be compatible with GNU grep -P '\d'. */
+		options |= (PCRE2_UCP | PCRE2_EXTRA_ASCII_BSD);
+#endif
+	}
 
 #ifndef GIT_PCRE2_VERSION_10_35_OR_HIGHER
 	/*
-- 
2.39.2
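
The flag combinations compared in the message can be laid out side by
side. What follows is only a rough sketch in C of which of the two flags
each variant requests; it does not call PCRE2. The SKETCH_* constants and
all function names are invented for illustration and are not the real
PCRE2 values or any GNU grep or Git code. The have_ascii_bsd parameter
stands in for the "#ifdef PCRE2_EXTRA_ASCII_BSD" test, and for Git the
sketch assumes the UTF-8, non-literal case that the patched "if" already
guards.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in bits for this sketch only; NOT the real PCRE2 values. */
enum {
	SKETCH_UCP       = 1 << 0, /* models PCRE2_UCP */
	SKETCH_ASCII_BSD = 1 << 1  /* models PCRE2_EXTRA_ASCII_BSD */
};

/* GNU grep "master": both flags if ASCII_BSD is available, else neither. */
unsigned gnu_grep_master(bool have_ascii_bsd)
{
	return have_ascii_bsd ? (SKETCH_UCP | SKETCH_ASCII_BSD) : 0;
}

/* Proposed GNU grep change: never use either flag. */
unsigned gnu_grep_proposed(bool have_ascii_bsd)
{
	(void)have_ascii_bsd;
	return 0;
}

/* Git "next": always UCP, never ASCII_BSD. */
unsigned git_grep_next(bool have_ascii_bsd)
{
	(void)have_ascii_bsd;
	return SKETCH_UCP;
}

/* Git with the attached patch: same rule as GNU grep "master". */
unsigned git_grep_patched(bool have_ascii_bsd)
{
	return have_ascii_bsd ? (SKETCH_UCP | SKETCH_ASCII_BSD) : 0;
}

/* \d can match non-ASCII digits only with UCP on and ASCII_BSD off. */
bool backslash_d_matches_non_ascii(unsigned flags)
{
	return (flags & SKETCH_UCP) && !(flags & SKETCH_ASCII_BSD);
}

/* Option 1 from the message: patched Git agrees with GNU grep "master". */
void check_patch_restores_agreement(void)
{
	assert(gnu_grep_master(true) == git_grep_patched(true));
	assert(gnu_grep_master(false) == git_grep_patched(false));
}
```

Under these assumptions, git_grep_next() differs from both GNU grep
variants whether or not PCRE2_EXTRA_ASCII_BSD is available, and it is the
only variant for which backslash_d_matches_non_ascii() returns true.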