[RFC] grep: default to posix digits with -P

Message ID	20240101150336.89098-1-carenas@gmail.com (mailing list archive)
State	New, archived
Headers	show Received: from mail-ot1-f47.google.com (mail-ot1-f47.google.com [209.85.210.47]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7C785610B for <git@vger.kernel.org>; Mon, 1 Jan 2024 15:04:15 +0000 (UTC) From: =?utf-8?q?Carlo_Marcelo_Arenas_Bel=C3=B3n?= <carenas@gmail.com> To: git@vger.kernel.org Cc: =?utf-8?q?Carlo_Marcelo_Arenas_Bel=C3=B3n?= <carenas@gmail.com> Subject: [RFC PATCH] grep: default to posix digits with -P Date: Mon, 1 Jan 2024 07:03:36 -0800 Message-Id: <20240101150336.89098-1-carenas@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit
Series	[RFC] grep: default to posix digits with -P \| expand [RFC] grep: default to posix digits with -P

Carlo Marcelo Arenas Belón Jan. 1, 2024, 3:03 p.m. UTC

Since acabd2048e (grep: correctly identify utf-8 characters with
\{b,w} in -P, 2023-01-08), PCRE2's UCP mode has been enabled when
UTF content was expected and therefore muktibyte characters are
included as part of all character classes (when defined).

note that if the locale used is not UTF enabled (specially: C, POSIX)
or uses an extended charmap binary that is not unicode compatible, binary
match will be used instead.

It was argued that doing so, at least for \d, was not a good idea,
as that might not be what the user expected based on its historical
meaning and was also slower, and indeed a similar change that was done
to GNU grep was reverted and required further tweaks.

At that time, PCRE2 didn't have a way to disable UCP's character
expansion selectively, but flags to do so, and that will be
available in the next release (planned soon) were added, and
one of them has been in use by GNU grep since their last release
in May (only if built and linked against the prereleased PCRE2
library though).

Add flags to make sure that both \d and [:digit:] only include
ASCII digits so that `git grep` will be closer to GNU grep and
improve performance as a side effect, but add a configuration flag
to allow keeping the current behaviour (which is closer to perl
and ripgrep).

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
---
 Documentation/config/grep.txt       |  4 ++++
 grep.c                              | 37 ++++++++++++++++++++++-------
 grep.h                              |  5 ++++
 t/perf/p7822-grep-perl-character.sh | 11 +++++++--
 t/t7818-grep-digit.sh               | 32 +++++++++++++++++++++++++
 5 files changed, 78 insertions(+), 11 deletions(-)
 create mode 100755 t/t7818-grep-digit.sh

René Scharfe Jan. 1, 2024, 5:18 p.m. UTC | #1

Am 01.01.24 um 16:03 schrieb Carlo Marcelo Arenas Belón:
> Since acabd2048e (grep: correctly identify utf-8 characters with
> \{b,w} in -P, 2023-01-08), PCRE2's UCP mode has been enabled when
> UTF content was expected and therefore muktibyte characters are
> included as part of all character classes (when defined).
>
> note that if the locale used is not UTF enabled (specially: C, POSIX)
> or uses an extended charmap binary that is not unicode compatible, binary
> match will be used instead.
>
> It was argued that doing so, at least for \d, was not a good idea,
> as that might not be what the user expected based on its historical
> meaning and was also slower, and indeed a similar change that was done
> to GNU grep was reverted and required further tweaks.
>
> At that time, PCRE2 didn't have a way to disable UCP's character
> expansion selectively, but flags to do so, and that will be
> available in the next release (planned soon) were added, and
> one of them has been in use by GNU grep since their last release
> in May (only if built and linked against the prereleased PCRE2
> library though).
>
> Add flags to make sure that both \d and [:digit:] only include
> ASCII digits so that `git grep` will be closer to GNU grep and
> improve performance as a side effect, but add a configuration flag
> to allow keeping the current behaviour (which is closer to perl
> and ripgrep).
>
> Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
> ---
>  Documentation/config/grep.txt       |  4 ++++
>  grep.c                              | 37 ++++++++++++++++++++++-------
>  grep.h                              |  5 ++++
>  t/perf/p7822-grep-perl-character.sh | 11 +++++++--
>  t/t7818-grep-digit.sh               | 32 +++++++++++++++++++++++++
>  5 files changed, 78 insertions(+), 11 deletions(-)
>  create mode 100755 t/t7818-grep-digit.sh
>
> diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
> index e521f20390..4e405c8ad1 100644
> --- a/Documentation/config/grep.txt
> +++ b/Documentation/config/grep.txt
> @@ -26,3 +26,7 @@ grep.fullName::
>  grep.fallbackToNoIndex::
>  	If set to true, fall back to git grep --no-index if git grep
>  	is executed outside of a git repository.  Defaults to false.
> +
> +grep.perl.digit::
> +	If set to true, use the perl definitions for \d and [:digit:].
> +	Defaults to false.
> diff --git a/grep.c b/grep.c
> index fc2d0c837a..fec36ccb30 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -88,6 +88,11 @@ int grep_config(const char *var, const char *value,
>  		return 0;
>  	}
>
> +	if (!strcmp(var, "grep.perl.digit")) {
> +		opt->perl_digit = git_config_bool(var, value);
> +		return 0;
> +	}
> +
>  	if (!strcmp(var, "color.grep"))
>  		opt->color = git_config_colorbool(var, value);
>  	if (!strcmp(var, "color.grep.match")) {
> @@ -301,6 +306,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>  	int patinforet;
>  	size_t jitsizearg;
>  	int literal = !opt->ignore_case && (p->fixed || p->is_fixed);
> +	uint32_t xoptions = 0;
>
>  	/*
>  	 * Call pcre2_general_context_create() before calling any
> @@ -321,17 +327,22 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>  		}
>  		options |= PCRE2_CASELESS;
>  	}
> -	if (!opt->ignore_locale && is_utf8_locale() && !literal)
> -		options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
> +	if (!opt->ignore_locale && is_utf8_locale() && !literal) {
> +		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>
> -#ifndef GIT_PCRE2_VERSION_10_35_OR_HIGHER
> -	/*
> -	 * Work around a JIT bug related to invalid Unicode character handling
> -	 * fixed in 10.35:
> -	 * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d
> -	 */
> -	options &= ~PCRE2_UCP;
> +#ifdef GIT_PCRE2_VERSION_10_35_OR_HIGHER
> +		/*
> +		 * Work around a JIT bug related to invalid Unicode character handling
> +		 * fixed in 10.35:
> +		 * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d
> +		 */
> +		options |= PCRE2_UCP;
> +#ifdef GIT_PCRE2_VERSION_10_43_OR_HIGHER
> +		if (!opt->perl_digit)
> +			xoptions |= (PCRE2_EXTRA_ASCII_BSD | PCRE2_EXTRA_ASCII_DIGIT);
>  #endif
> +#endif

Why do we need that extra level of indentation?

The old code turned PCRE2_UCP on by default and turned it off for older
versions.  The new code enables PCRE2_UCP only for newer versions.  The
result should be the same, no?  So why change that part at all?

But the comment is now off, isn't it?  The workaround was turning
PCRE2_UCP off for older versions (because those were broken), not
turning it on for newer versions (because it would be required by some
unfixed regression).

Do we need to nest the checks for GIT_PCRE2_VERSION_10_35_OR_HIGHER and
GIT_PCRE2_VERSION_10_43_OR_HIGHER?  Keeping them separate would help
keep the #ifdef parts as short as possible and maintainable
independently.

> +	}
>
>  #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
>  	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
> @@ -339,6 +350,14 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>  		options |= PCRE2_NO_START_OPTIMIZE;
>  #endif
>
> +	if (xoptions) {
> +		if (!p->pcre2_compile_context)
> +			p->pcre2_compile_context = pcre2_compile_context_create(p->pcre2_general_context);
> +
> +		pcre2_set_compile_extra_options(p->pcre2_compile_context,
> +						xoptions);
> +	}
> +
>  	p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern,
>  					 p->patternlen, options, &error, &erroffset,
>  					 p->pcre2_compile_context);
> diff --git a/grep.h b/grep.h
> index 926c0875c4..cd5c416a0a 100644
> --- a/grep.h
> +++ b/grep.h
> @@ -4,6 +4,9 @@
>  #ifdef USE_LIBPCRE2
>  #define PCRE2_CODE_UNIT_WIDTH 8
>  #include <pcre2.h>
> +#if (PCRE2_MAJOR >= 10 && PCRE2_MINOR >= 43) || PCRE2_MAJOR >= 11
> +#define GIT_PCRE2_VERSION_10_43_OR_HIGHER
> +#endif
>  #if (PCRE2_MAJOR >= 10 && PCRE2_MINOR >= 36) || PCRE2_MAJOR >= 11
>  #define GIT_PCRE2_VERSION_10_36_OR_HIGHER
>  #endif
> @@ -178,6 +181,8 @@ struct grep_opt {
>
>  	void (*output)(struct grep_opt *opt, const void *data, size_t size);
>  	void *output_priv;
> +
> +	unsigned perl_digit:1;
>  };
>
>  #define GREP_OPT_INIT { \
> diff --git a/t/perf/p7822-grep-perl-character.sh b/t/perf/p7822-grep-perl-character.sh
> index 87009c60df..cc88d5a695 100755
> --- a/t/perf/p7822-grep-perl-character.sh
> +++ b/t/perf/p7822-grep-perl-character.sh
> @@ -8,6 +8,13 @@ etc.) we will test the patterns under those numbers of threads.
>
>  . ./perf-lib.sh
>
> +# setting a LOCALE is needed, but not yet supported by :
> +#. "$TEST_DIRECTORY"/lib-gettext.sh
> +
> +# Invoke like:
> +#
> +# LC_ALL=is_IS.utf8 ./p7822-grep-perl-character.sh
> +
>  test_perf_large_repo
>  test_checkout_worktree
>
> @@ -27,13 +34,13 @@ do
>  	if ! test_have_prereq PERF_GREP_ENGINES_THREADS
>  	then
>  		test_perf "grep -P '$pattern'" --prereq PCRE "
> -			git -P grep -f pat || :
> +			git grep -P -f pat || :
>  		"
>  	else
>  		for threads in $GIT_PERF_GREP_THREADS
>  		do
>  			test_perf "grep -P '$pattern' with $threads threads" --prereq PTHREADS,PCRE "
> -				git -c grep.threads=$threads -P grep -f pat || :
> +				git -c grep.threads=$threads grep -P -f pat || :

What's going on here?  You remove the -P option of git (--no-pager) and
add the -P option of git grep (--perl-regexp).  So this perf test never
tested PCRE despite its name?  Seems worth a separate patch.

Do the performance numbers in acabd2048e (grep: correctly identify utf-8
characters with \{b,w} in -P, 2023-01-08) still hold up in that light?

>  			"
>  		done
>  	fi

René

Eric Sunshine Jan. 1, 2024, 6:07 p.m. UTC | #2

On Mon, Jan 1, 2024 at 10:04 AM Carlo Marcelo Arenas Belón
<carenas@gmail.com> wrote:
> Since acabd2048e (grep: correctly identify utf-8 characters with
> \{b,w} in -P, 2023-01-08), PCRE2's UCP mode has been enabled when
> UTF content was expected and therefore muktibyte characters are
> included as part of all character classes (when defined).

s/muktibyte/multibyte/

Carlo Marcelo Arenas Belón Jan. 2, 2024, 7:02 p.m. UTC | #3

On Mon, Jan 1, 2024 at 9:18 AM René Scharfe <l.s.r@web.de> wrote:
>
> Am 01.01.24 um 16:03 schrieb Carlo Marcelo Arenas Belón:
> > @@ -321,17 +327,22 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
> >               }
> >               options |= PCRE2_CASELESS;
> >       }
> > -     if (!opt->ignore_locale && is_utf8_locale() && !literal)
> > -             options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
> > +     if (!opt->ignore_locale && is_utf8_locale() && !literal) {
> > +             options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
> >
> > -#ifndef GIT_PCRE2_VERSION_10_35_OR_HIGHER
> > -     /*
> > -      * Work around a JIT bug related to invalid Unicode character handling
> > -      * fixed in 10.35:
> > -      * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d
> > -      */
> > -     options &= ~PCRE2_UCP;
> > +#ifdef GIT_PCRE2_VERSION_10_35_OR_HIGHER
> > +             /*
> > +              * Work around a JIT bug related to invalid Unicode character handling
> > +              * fixed in 10.35:
> > +              * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d
> > +              */
> > +             options |= PCRE2_UCP;
> > +#ifdef GIT_PCRE2_VERSION_10_43_OR_HIGHER
> > +             if (!opt->perl_digit)
> > +                     xoptions |= (PCRE2_EXTRA_ASCII_BSD | PCRE2_EXTRA_ASCII_DIGIT);
> >  #endif
> > +#endif
>
> Why do we need that extra level of indentation?

I was just trying to simplify the code by including all the logic in
one single set.

The original lack of indentation that was introduced by later fixes to
the code was IMHO also misguided since the obvious "objective" as set
in the original code that added PCRE2_UCP was that it should be used
whenever UTF was also in use as shown by
acabd2048ee0ee53728100408970ab45a6dab65e.

Of course, we soon found out that the original implementation of
PCRE2_MATCH_INVALID_UTF that came with PCRE2 10.34 was buggy and so an
exception was introduced in 14b9a044798ebb3858a1f1a1377309a3d6054ac8.

Note that the problematic code is only relevant when JIT is also
enabled, but JIT is almost always enabled.

> The old code turned PCRE2_UCP on by default and turned it off for older
> versions. The new code enables PCRE2_UCP only for newer versions.  The
> result should be the same, no?  So why change that part at all?

Because it gets us a little closer to the real reason why we need to
disable UCP for anything older than 10.35, and that is that there is a
bug there that is ONLY relevant if we are using JIT.

My hope though is that with the release of 10.43 (currently in RC1),
10.34 will become irrelevant soon enough and this whole code could be
cleaned up further.

> But the comment is now off, isn't it?  The workaround was turning
> PCRE2_UCP off for older versions (because those were broken), not
> turning it on for newer versions (because it would be required by some
> unfixed regression).

The comment was never correct, because it was turning it off, because
the combination of JIT + MATCH_INVALID_UTF (which was released in
10.34) + UCP is broken.

> Do we need to nest the checks for GIT_PCRE2_VERSION_10_35_OR_HIGHER and
> GIT_PCRE2_VERSION_10_43_OR_HIGHER?  Keeping them separate would help
> keep the #ifdef parts as short as possible and maintainable
> independently.

I thought that nesting them would make it simpler to maintain, since
there are somehow connected anyway.

The additional flags that are added in PCRE 10.43 are ONLY needed and
useful on top of PCRE2_UCP, which itself only makes sense on top of
UTF.

> > @@ -27,13 +34,13 @@ do
> >       if ! test_have_prereq PERF_GREP_ENGINES_THREADS
> >       then
> >               test_perf "grep -P '$pattern'" --prereq PCRE "
> > -                     git -P grep -f pat || :
> > +                     git grep -P -f pat || :
> >               "
> >       else
> >               for threads in $GIT_PERF_GREP_THREADS
> >               do
> >                       test_perf "grep -P '$pattern' with $threads threads" --prereq PTHREADS,PCRE "
> > -                             git -c grep.threads=$threads -P grep -f pat || :
> > +                             git -c grep.threads=$threads grep -P -f pat || :
>
> What's going on here?  You remove the -P option of git (--no-pager) and
> add the -P option of git grep (--perl-regexp).  So this perf test never
> tested PCRE despite its name?  Seems worth a separate patch.

My original code was a dud.  This would make that at least more obvious,

> Do the performance numbers in acabd2048e (grep: correctly identify utf-8
> characters with \{b,w} in -P, 2023-01-08) still hold up in that light?

FWIW the performance numbers still hold up, but just because I did the
tests independently using hyperfine, and indeed when I did the first
version of this patch, I had a nice reproducible description that
showed how to get 20% better performance while searching the git
repository itself for something like 4 digits (as used in Copyright
dates).

My idea was to reply to my own post with instructions on how to test
this, which basically require to also compile the recently released
10.43RC1 and build against that.

Indeed, AFAIK the test would fail if built with an older version of
PCRE, but wasn't sure if making a prerequisite that hardcoded the
version in test-tool might be the best approach to prevent that,
especially with all the libification work.

Carlo

PS. your reply got lost somehow in my SPAM folder, so I apologize for
the late reply

René Scharfe Jan. 4, 2024, 9:14 p.m. UTC | #4

Am 02.01.24 um 20:02 schrieb Carlo Arenas:
> On Mon, Jan 1, 2024 at 9:18 AM René Scharfe <l.s.r@web.de> wrote:
>>
>> Am 01.01.24 um 16:03 schrieb Carlo Marcelo Arenas Belón:
>>> @@ -321,17 +327,22 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>>>               }
>>>               options |= PCRE2_CASELESS;
>>>       }
>>> -     if (!opt->ignore_locale && is_utf8_locale() && !literal)
>>> -             options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
>>> +     if (!opt->ignore_locale && is_utf8_locale() && !literal) {
>>> +             options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>>>
>>> -#ifndef GIT_PCRE2_VERSION_10_35_OR_HIGHER
>>> -     /*
>>> -      * Work around a JIT bug related to invalid Unicode character handling
>>> -      * fixed in 10.35:
>>> -      * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d
>>> -      */
>>> -     options &= ~PCRE2_UCP;
>>> +#ifdef GIT_PCRE2_VERSION_10_35_OR_HIGHER
>>> +             /*
>>> +              * Work around a JIT bug related to invalid Unicode character handling
>>> +              * fixed in 10.35:
>>> +              * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d
>>> +              */
>>> +             options |= PCRE2_UCP;
>>> +#ifdef GIT_PCRE2_VERSION_10_43_OR_HIGHER
>>> +             if (!opt->perl_digit)
>>> +                     xoptions |= (PCRE2_EXTRA_ASCII_BSD | PCRE2_EXTRA_ASCII_DIGIT);
>>>  #endif
>>> +#endif
>>
>> Why do we need that extra level of indentation?
>
> I was just trying to simplify the code by including all the logic in
> one single set.
>
> The original lack of indentation that was introduced by later fixes to
> the code was IMHO also misguided since the obvious "objective" as set
> in the original code that added PCRE2_UCP was that it should be used
> whenever UTF was also in use as shown by
> acabd2048ee0ee53728100408970ab45a6dab65e.

One level of indentation is correct because the code in question is part
of a function and it's not nested in a loop or conditional.  And
preprocessor directives don't affect the indentation of C code.  Am I
missing something?

> Of course, we soon found out that the original implementation of
> PCRE2_MATCH_INVALID_UTF that came with PCRE2 10.34 was buggy and so an
> exception was introduced in 14b9a044798ebb3858a1f1a1377309a3d6054ac8.
>
> Note that the problematic code is only relevant when JIT is also
> enabled, but JIT is almost always enabled.
>
>> The old code turned PCRE2_UCP on by default and turned it off for older
>> versions. The new code enables PCRE2_UCP only for newer versions.  The
>> result should be the same, no?  So why change that part at all?
>
> Because it gets us a little closer to the real reason why we need to
> disable UCP for anything older than 10.35, and that is that there is a
> bug there that is ONLY relevant if we are using JIT.

How so?  If the same flags are set in the end then it seems like a
lateral movement to me.

Do you want to disable JIT compilation for older versions?

> My hope though is that with the release of 10.43 (currently in RC1),
> 10.34 will become irrelevant soon enough and this whole code could be
> cleaned up further.

Cleanup is good, but if we package the workarounds nicely we can keep
them around indefinitely without them bothering us.  Debian buster
still has a few months of support left and comes with PCRE2 10.32..

>> But the comment is now off, isn't it?  The workaround was turning
>> PCRE2_UCP off for older versions (because those were broken), not
>> turning it on for newer versions (because it would be required by some
>> unfixed regression).
>
> The comment was never correct, because it was turning it off, because
> the combination of JIT + MATCH_INVALID_UTF (which was released in
> 10.34) + UCP is broken.

The code disabled PCRE2_UCP for PCRE2 10.34 and older.  The comment
claimed that this was done as a workaround for a bug fixed in 10.35.
You seem to say the same.  What was incorrect about the comment?

>> Do we need to nest the checks for GIT_PCRE2_VERSION_10_35_OR_HIGHER and
>> GIT_PCRE2_VERSION_10_43_OR_HIGHER?  Keeping them separate would help
>> keep the #ifdef parts as short as possible and maintainable
>> independently.
>
> I thought that nesting them would make it simpler to maintain, since
> there are somehow connected anyway.
>
> The additional flags that are added in PCRE 10.43 are ONLY needed and
> useful on top of PCRE2_UCP, which itself only makes sense on top of
> UTF.

This conditional stacking is complicated.  I find the old model where
we say which features we want up front and then disable buggy ones
for specific versions easier to grasp.

>>> @@ -27,13 +34,13 @@ do
>>>       if ! test_have_prereq PERF_GREP_ENGINES_THREADS
>>>       then
>>>               test_perf "grep -P '$pattern'" --prereq PCRE "
>>> -                     git -P grep -f pat || :
>>> +                     git grep -P -f pat || :
>>>               "
>>>       else
>>>               for threads in $GIT_PERF_GREP_THREADS
>>>               do
>>>                       test_perf "grep -P '$pattern' with $threads threads" --prereq PTHREADS,PCRE "
>>> -                             git -c grep.threads=$threads -P grep -f pat || :
>>> +                             git -c grep.threads=$threads grep -P -f pat || :
>>
>> What's going on here?  You remove the -P option of git (--no-pager) and
>> add the -P option of git grep (--perl-regexp).  So this perf test never
>> tested PCRE despite its name?  Seems worth a separate patch.
>
> My original code was a dud.  This would make that at least more obvious,

The change is good, but I don't see any connection to the
grep.perl.digit feature that this patch is primarily about,
so I (still) think it deserves its own patch.

>> Do the performance numbers in acabd2048e (grep: correctly identify utf-8
>> characters with \{b,w} in -P, 2023-01-08) still hold up in that light?
>
> FWIW the performance numbers still hold up, but just because I did the
> tests independently using hyperfine, and indeed when I did the first
> version of this patch, I had a nice reproducible description that
> showed how to get 20% better performance while searching the git
> repository itself for something like 4 digits (as used in Copyright
> dates).

Great!

> My idea was to reply to my own post with instructions on how to test
> this, which basically require to also compile the recently released
> 10.43RC1 and build against that.

OK, so this is about the grep.perl.digit feature, right?

What I meant above was: Is the statement in acabd2048e (grep: correctly
identify utf-8 characters with \{b,w} in -P, 2023-01-08) about 20-40%
performance impact (in which direction, by the way) still measurable
with the fixed perf test script?

With the fix and PCRE2 10.42 and PCRE2_UCP I get:

Test                           this tree
----------------------------------------------
7822.1: grep -P '\bhow'        0.05(0.02+0.30)
7822.2: grep -P '\bÆvar'       0.05(0.02+0.29)
7822.3: grep -P '\d+ \bÆvar'   0.05(0.02+0.27)
7822.4: grep -P '\bBelón\b'    0.04(0.02+0.23)
7822.5: grep -P '\w{12}\b'     0.03(0.02+0.13)

... and without PCRE2_UCP:

Test                           this tree
----------------------------------------------
7822.1: grep -P '\bhow'        0.05(0.02+0.26)
7822.2: grep -P '\bÆvar'       0.04(0.02+0.18)
7822.3: grep -P '\d+ \bÆvar'   0.05(0.02+0.26)
7822.4: grep -P '\bBelón\b'    0.05(0.02+0.27)
7822.5: grep -P '\w{12}\b'     0.04(0.02+0.18)

Hmm, inconclusive.  Perhaps the test data is too small?

> Indeed, AFAIK the test would fail if built with an older version of
> PCRE, but wasn't sure if making a prerequisite that hardcoded the
> version in test-tool might be the best approach to prevent that,
> especially with all the libification work.

You could extend test-pcre2-config to report whether that feature is
active.  Performance tests could set prerequisites based on its output.

> PS. your reply got lost somehow in my SPAM folder, so I apologize for
> the late reply

No worries, I need days to reply without any detour through the
spam folder..

René

Mathias Krause July 10, 2024, 1:49 p.m. UTC | #5

On 02.01.24 20:02, Carlo Arenas wrote:
> On Mon, Jan 1, 2024 at 9:18 AM René Scharfe <l.s.r@web.de> wrote:
>> Am 01.01.24 um 16:03 schrieb Carlo Marcelo Arenas Belón:
>>> @@ -321,17 +327,22 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>>>               }
>>>               options |= PCRE2_CASELESS;
>>>       }
>>> -     if (!opt->ignore_locale && is_utf8_locale() && !literal)
>>> -             options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
>>> +     if (!opt->ignore_locale && is_utf8_locale() && !literal) {
>>> +             options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>>>
>>> -#ifndef GIT_PCRE2_VERSION_10_35_OR_HIGHER
>>> -     /*
>>> -      * Work around a JIT bug related to invalid Unicode character handling
>>> -      * fixed in 10.35:
>>> -      * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d
>>> -      */
>>> -     options &= ~PCRE2_UCP;
>>> +#ifdef GIT_PCRE2_VERSION_10_35_OR_HIGHER
>>> +             /*
>>> +              * Work around a JIT bug related to invalid Unicode character handling
>>> +              * fixed in 10.35:
>>> +              * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d
>>> +              */
>>> +             options |= PCRE2_UCP;
>>> +#ifdef GIT_PCRE2_VERSION_10_43_OR_HIGHER
>>> +             if (!opt->perl_digit)
>>> +                     xoptions |= (PCRE2_EXTRA_ASCII_BSD | PCRE2_EXTRA_ASCII_DIGIT);
>>>  #endif
>>> +#endif
>>
>> Why do we need that extra level of indentation?
> 
> I was just trying to simplify the code by including all the logic in
> one single set.
> 
> The original lack of indentation that was introduced by later fixes to
> the code was IMHO also misguided since the obvious "objective" as set
> in the original code that added PCRE2_UCP was that it should be used
> whenever UTF was also in use as shown by
> acabd2048ee0ee53728100408970ab45a6dab65e.
> 
> Of course, we soon found out that the original implementation of
> PCRE2_MATCH_INVALID_UTF that came with PCRE2 10.34 was buggy and so an
> exception was introduced in 14b9a044798ebb3858a1f1a1377309a3d6054ac8.

My early fix attempt[1] also had it indented but Junio argued against
it[2]. I see no reason why we should change that now?

[1]
https://lore.kernel.org/git/20230323144000.21146-1-minipli@grsecurity.net/
[2] https://lore.kernel.org/git/xmqq355va1a2.fsf@gitster.g/

> 
> Note that the problematic code is only relevant when JIT is also
> enabled, but JIT is almost always enabled.

Right. But it doesn't hurt to mask a bit that isn't set, the compiler
will figure, I guess.

> 
>> The old code turned PCRE2_UCP on by default and turned it off for older
>> versions. The new code enables PCRE2_UCP only for newer versions.  The
>> result should be the same, no?  So why change that part at all?
> 
> Because it gets us a little closer to the real reason why we need to
> disable UCP for anything older than 10.35, and that is that there is a
> bug there that is ONLY relevant if we are using JIT.
> 
> My hope though is that with the release of 10.43 (currently in RC1),
> 10.34 will become irrelevant soon enough and this whole code could be
> cleaned up further.
> 
>> But the comment is now off, isn't it?  The workaround was turning
>> PCRE2_UCP off for older versions (because those were broken), not
>> turning it on for newer versions (because it would be required by some
>> unfixed regression).
> 
> The comment was never correct, because it was turning it off, because
> the combination of JIT + MATCH_INVALID_UTF (which was released in
> 10.34) + UCP is broken.

And what makes the comment wrong? It's mentioning "JIT", "invalid
Unicode character handling", "bug" and even the URL to the PCRE2 commit
fixing the bug.

Moreover is your proposed change making the comment look wrong as it's
negating the preprocessor test and sets the PCRE2_UCP bit instead of
masking it, suggesting *this* makes it work around the bug, while it's
actually the opposite.

So, yes, IMHO we should leave that part as-is.

> [snip]

Cheers,
Mathias

[RFC] grep: default to posix digits with -P

Commit Message

Comments

Patch