Message ID | 20240101150336.89098-1-carenas@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC] grep: default to posix digits with -P | expand |
Am 01.01.24 um 16:03 schrieb Carlo Marcelo Arenas Belón: > Since acabd2048e (grep: correctly identify utf-8 characters with > \{b,w} in -P, 2023-01-08), PCRE2's UCP mode has been enabled when > UTF content was expected and therefore muktibyte characters are > included as part of all character classes (when defined). > > note that if the locale used is not UTF enabled (specially: C, POSIX) > or uses an extended charmap binary that is not unicode compatible, binary > match will be used instead. > > It was argued that doing so, at least for \d, was not a good idea, > as that might not be what the user expected based on its historical > meaning and was also slower, and indeed a similar change that was done > to GNU grep was reverted and required further tweaks. > > At that time, PCRE2 didn't have a way to disable UCP's character > expansion selectively, but flags to do so, and that will be > available in the next release (planned soon) were added, and > one of them has been in use by GNU grep since their last release > in May (only if built and linked against the prereleased PCRE2 > library though). > > Add flags to make sure that both \d and [:digit:] only include > ASCII digits so that `git grep` will be closer to GNU grep and > improve performance as a side effect, but add a configuration flag > to allow keeping the current behaviour (which is closer to perl > and ripgrep). > > Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com> > --- > Documentation/config/grep.txt | 4 ++++ > grep.c | 37 ++++++++++++++++++++++------- > grep.h | 5 ++++ > t/perf/p7822-grep-perl-character.sh | 11 +++++++-- > t/t7818-grep-digit.sh | 32 +++++++++++++++++++++++++ > 5 files changed, 78 insertions(+), 11 deletions(-) > create mode 100755 t/t7818-grep-digit.sh > > diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt > index e521f20390..4e405c8ad1 100644 > --- a/Documentation/config/grep.txt > +++ b/Documentation/config/grep.txt > @@ -26,3 +26,7 @@ grep.fullName:: > grep.fallbackToNoIndex:: > If set to true, fall back to git grep --no-index if git grep > is executed outside of a git repository. Defaults to false. > + > +grep.perl.digit:: > + If set to true, use the perl definitions for \d and [:digit:]. > + Defaults to false. > diff --git a/grep.c b/grep.c > index fc2d0c837a..fec36ccb30 100644 > --- a/grep.c > +++ b/grep.c > @@ -88,6 +88,11 @@ int grep_config(const char *var, const char *value, > return 0; > } > > + if (!strcmp(var, "grep.perl.digit")) { > + opt->perl_digit = git_config_bool(var, value); > + return 0; > + } > + > if (!strcmp(var, "color.grep")) > opt->color = git_config_colorbool(var, value); > if (!strcmp(var, "color.grep.match")) { > @@ -301,6 +306,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt > int patinforet; > size_t jitsizearg; > int literal = !opt->ignore_case && (p->fixed || p->is_fixed); > + uint32_t xoptions = 0; > > /* > * Call pcre2_general_context_create() before calling any > @@ -321,17 +327,22 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt > } > options |= PCRE2_CASELESS; > } > - if (!opt->ignore_locale && is_utf8_locale() && !literal) > - options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF); > + if (!opt->ignore_locale && is_utf8_locale() && !literal) { > + options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); > > -#ifndef GIT_PCRE2_VERSION_10_35_OR_HIGHER > - /* > - * Work around a JIT bug related to invalid Unicode character handling > - * fixed in 10.35: > - * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d > - */ > - options &= ~PCRE2_UCP; > +#ifdef GIT_PCRE2_VERSION_10_35_OR_HIGHER > + /* > + * Work around a JIT bug related to invalid Unicode character handling > + * fixed in 10.35: > + * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d > + */ > + options |= PCRE2_UCP; > +#ifdef GIT_PCRE2_VERSION_10_43_OR_HIGHER > + if (!opt->perl_digit) > + xoptions |= (PCRE2_EXTRA_ASCII_BSD | PCRE2_EXTRA_ASCII_DIGIT); > #endif > +#endif Why do we need that extra level of indentation? The old code turned PCRE2_UCP on by default and turned it off for older versions. The new code enables PCRE2_UCP only for newer versions. The result should be the same, no? So why change that part at all? But the comment is now off, isn't it? The workaround was turning PCRE2_UCP off for older versions (because those were broken), not turning it on for newer versions (because it would be required by some unfixed regression). Do we need to nest the checks for GIT_PCRE2_VERSION_10_35_OR_HIGHER and GIT_PCRE2_VERSION_10_43_OR_HIGHER? Keeping them separate would help keep the #ifdef parts as short as possible and maintainable independently. > + } > > #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER > /* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */ > @@ -339,6 +350,14 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt > options |= PCRE2_NO_START_OPTIMIZE; > #endif > > + if (xoptions) { > + if (!p->pcre2_compile_context) > + p->pcre2_compile_context = pcre2_compile_context_create(p->pcre2_general_context); > + > + pcre2_set_compile_extra_options(p->pcre2_compile_context, > + xoptions); > + } > + > p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern, > p->patternlen, options, &error, &erroffset, > p->pcre2_compile_context); > diff --git a/grep.h b/grep.h > index 926c0875c4..cd5c416a0a 100644 > --- a/grep.h > +++ b/grep.h > @@ -4,6 +4,9 @@ > #ifdef USE_LIBPCRE2 > #define PCRE2_CODE_UNIT_WIDTH 8 > #include <pcre2.h> > +#if (PCRE2_MAJOR >= 10 && PCRE2_MINOR >= 43) || PCRE2_MAJOR >= 11 > +#define GIT_PCRE2_VERSION_10_43_OR_HIGHER > +#endif > #if (PCRE2_MAJOR >= 10 && PCRE2_MINOR >= 36) || PCRE2_MAJOR >= 11 > #define GIT_PCRE2_VERSION_10_36_OR_HIGHER > #endif > @@ -178,6 +181,8 @@ struct grep_opt { > > void (*output)(struct grep_opt *opt, const void *data, size_t size); > void *output_priv; > + > + unsigned perl_digit:1; > }; > > #define GREP_OPT_INIT { \ > diff --git a/t/perf/p7822-grep-perl-character.sh b/t/perf/p7822-grep-perl-character.sh > index 87009c60df..cc88d5a695 100755 > --- a/t/perf/p7822-grep-perl-character.sh > +++ b/t/perf/p7822-grep-perl-character.sh > @@ -8,6 +8,13 @@ etc.) we will test the patterns under those numbers of threads. > > . ./perf-lib.sh > > +# setting a LOCALE is needed, but not yet supported by : > +#. "$TEST_DIRECTORY"/lib-gettext.sh > + > +# Invoke like: > +# > +# LC_ALL=is_IS.utf8 ./p7822-grep-perl-character.sh > + > test_perf_large_repo > test_checkout_worktree > > @@ -27,13 +34,13 @@ do > if ! test_have_prereq PERF_GREP_ENGINES_THREADS > then > test_perf "grep -P '$pattern'" --prereq PCRE " > - git -P grep -f pat || : > + git grep -P -f pat || : > " > else > for threads in $GIT_PERF_GREP_THREADS > do > test_perf "grep -P '$pattern' with $threads threads" --prereq PTHREADS,PCRE " > - git -c grep.threads=$threads -P grep -f pat || : > + git -c grep.threads=$threads grep -P -f pat || : What's going on here? You remove the -P option of git (--no-pager) and add the -P option of git grep (--perl-regexp). So this perf test never tested PCRE despite its name? Seems worth a separate patch. Do the performance numbers in acabd2048e (grep: correctly identify utf-8 characters with \{b,w} in -P, 2023-01-08) still hold up in that light? > " > done > fi René
On Mon, Jan 1, 2024 at 10:04 AM Carlo Marcelo Arenas Belón <carenas@gmail.com> wrote: > Since acabd2048e (grep: correctly identify utf-8 characters with > \{b,w} in -P, 2023-01-08), PCRE2's UCP mode has been enabled when > UTF content was expected and therefore muktibyte characters are > included as part of all character classes (when defined). s/muktibyte/multibyte/
On Mon, Jan 1, 2024 at 9:18 AM René Scharfe <l.s.r@web.de> wrote: > > Am 01.01.24 um 16:03 schrieb Carlo Marcelo Arenas Belón: > > @@ -321,17 +327,22 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt > > } > > options |= PCRE2_CASELESS; > > } > > - if (!opt->ignore_locale && is_utf8_locale() && !literal) > > - options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF); > > + if (!opt->ignore_locale && is_utf8_locale() && !literal) { > > + options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); > > > > -#ifndef GIT_PCRE2_VERSION_10_35_OR_HIGHER > > - /* > > - * Work around a JIT bug related to invalid Unicode character handling > > - * fixed in 10.35: > > - * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d > > - */ > > - options &= ~PCRE2_UCP; > > +#ifdef GIT_PCRE2_VERSION_10_35_OR_HIGHER > > + /* > > + * Work around a JIT bug related to invalid Unicode character handling > > + * fixed in 10.35: > > + * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d > > + */ > > + options |= PCRE2_UCP; > > +#ifdef GIT_PCRE2_VERSION_10_43_OR_HIGHER > > + if (!opt->perl_digit) > > + xoptions |= (PCRE2_EXTRA_ASCII_BSD | PCRE2_EXTRA_ASCII_DIGIT); > > #endif > > +#endif > > Why do we need that extra level of indentation? I was just trying to simplify the code by including all the logic in one single set. The original lack of indentation that was introduced by later fixes to the code was IMHO also misguided since the obvious "objective" as set in the original code that added PCRE2_UCP was that it should be used whenever UTF was also in use as shown by acabd2048ee0ee53728100408970ab45a6dab65e. Of course, we soon found out that the original implementation of PCRE2_MATCH_INVALID_UTF that came with PCRE2 10.34 was buggy and so an exception was introduced in 14b9a044798ebb3858a1f1a1377309a3d6054ac8. Note that the problematic code is only relevant when JIT is also enabled, but JIT is almost always enabled. > The old code turned PCRE2_UCP on by default and turned it off for older > versions. The new code enables PCRE2_UCP only for newer versions. The > result should be the same, no? So why change that part at all? Because it gets us a little closer to the real reason why we need to disable UCP for anything older than 10.35, and that is that there is a bug there that is ONLY relevant if we are using JIT. My hope though is that with the release of 10.43 (currently in RC1), 10.34 will become irrelevant soon enough and this whole code could be cleaned up further. > But the comment is now off, isn't it? The workaround was turning > PCRE2_UCP off for older versions (because those were broken), not > turning it on for newer versions (because it would be required by some > unfixed regression). The comment was never correct, because it was turning it off, because the combination of JIT + MATCH_INVALID_UTF (which was released in 10.34) + UCP is broken. > Do we need to nest the checks for GIT_PCRE2_VERSION_10_35_OR_HIGHER and > GIT_PCRE2_VERSION_10_43_OR_HIGHER? Keeping them separate would help > keep the #ifdef parts as short as possible and maintainable > independently. I thought that nesting them would make it simpler to maintain, since there are somehow connected anyway. The additional flags that are added in PCRE 10.43 are ONLY needed and useful on top of PCRE2_UCP, which itself only makes sense on top of UTF. > > @@ -27,13 +34,13 @@ do > > if ! test_have_prereq PERF_GREP_ENGINES_THREADS > > then > > test_perf "grep -P '$pattern'" --prereq PCRE " > > - git -P grep -f pat || : > > + git grep -P -f pat || : > > " > > else > > for threads in $GIT_PERF_GREP_THREADS > > do > > test_perf "grep -P '$pattern' with $threads threads" --prereq PTHREADS,PCRE " > > - git -c grep.threads=$threads -P grep -f pat || : > > + git -c grep.threads=$threads grep -P -f pat || : > > What's going on here? You remove the -P option of git (--no-pager) and > add the -P option of git grep (--perl-regexp). So this perf test never > tested PCRE despite its name? Seems worth a separate patch. My original code was a dud. This would make that at least more obvious, > Do the performance numbers in acabd2048e (grep: correctly identify utf-8 > characters with \{b,w} in -P, 2023-01-08) still hold up in that light? FWIW the performance numbers still hold up, but just because I did the tests independently using hyperfine, and indeed when I did the first version of this patch, I had a nice reproducible description that showed how to get 20% better performance while searching the git repository itself for something like 4 digits (as used in Copyright dates). My idea was to reply to my own post with instructions on how to test this, which basically require to also compile the recently released 10.43RC1 and build against that. Indeed, AFAIK the test would fail if built with an older version of PCRE, but wasn't sure if making a prerequisite that hardcoded the version in test-tool might be the best approach to prevent that, especially with all the libification work. Carlo PS. your reply got lost somehow in my SPAM folder, so I apologize for the late reply
Am 02.01.24 um 20:02 schrieb Carlo Arenas: > On Mon, Jan 1, 2024 at 9:18 AM René Scharfe <l.s.r@web.de> wrote: >> >> Am 01.01.24 um 16:03 schrieb Carlo Marcelo Arenas Belón: >>> @@ -321,17 +327,22 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt >>> } >>> options |= PCRE2_CASELESS; >>> } >>> - if (!opt->ignore_locale && is_utf8_locale() && !literal) >>> - options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF); >>> + if (!opt->ignore_locale && is_utf8_locale() && !literal) { >>> + options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); >>> >>> -#ifndef GIT_PCRE2_VERSION_10_35_OR_HIGHER >>> - /* >>> - * Work around a JIT bug related to invalid Unicode character handling >>> - * fixed in 10.35: >>> - * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d >>> - */ >>> - options &= ~PCRE2_UCP; >>> +#ifdef GIT_PCRE2_VERSION_10_35_OR_HIGHER >>> + /* >>> + * Work around a JIT bug related to invalid Unicode character handling >>> + * fixed in 10.35: >>> + * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d >>> + */ >>> + options |= PCRE2_UCP; >>> +#ifdef GIT_PCRE2_VERSION_10_43_OR_HIGHER >>> + if (!opt->perl_digit) >>> + xoptions |= (PCRE2_EXTRA_ASCII_BSD | PCRE2_EXTRA_ASCII_DIGIT); >>> #endif >>> +#endif >> >> Why do we need that extra level of indentation? > > I was just trying to simplify the code by including all the logic in > one single set. > > The original lack of indentation that was introduced by later fixes to > the code was IMHO also misguided since the obvious "objective" as set > in the original code that added PCRE2_UCP was that it should be used > whenever UTF was also in use as shown by > acabd2048ee0ee53728100408970ab45a6dab65e. One level of indentation is correct because the code in question is part of a function and it's not nested in a loop or conditional. And preprocessor directives don't affect the indentation of C code. Am I missing something? > Of course, we soon found out that the original implementation of > PCRE2_MATCH_INVALID_UTF that came with PCRE2 10.34 was buggy and so an > exception was introduced in 14b9a044798ebb3858a1f1a1377309a3d6054ac8. > > Note that the problematic code is only relevant when JIT is also > enabled, but JIT is almost always enabled. > >> The old code turned PCRE2_UCP on by default and turned it off for older >> versions. The new code enables PCRE2_UCP only for newer versions. The >> result should be the same, no? So why change that part at all? > > Because it gets us a little closer to the real reason why we need to > disable UCP for anything older than 10.35, and that is that there is a > bug there that is ONLY relevant if we are using JIT. How so? If the same flags are set in the end then it seems like a lateral movement to me. Do you want to disable JIT compilation for older versions? > My hope though is that with the release of 10.43 (currently in RC1), > 10.34 will become irrelevant soon enough and this whole code could be > cleaned up further. Cleanup is good, but if we package the workarounds nicely we can keep them around indefinitely without them bothering us. Debian buster still has a few months of support left and comes with PCRE2 10.32.. >> But the comment is now off, isn't it? The workaround was turning >> PCRE2_UCP off for older versions (because those were broken), not >> turning it on for newer versions (because it would be required by some >> unfixed regression). > > The comment was never correct, because it was turning it off, because > the combination of JIT + MATCH_INVALID_UTF (which was released in > 10.34) + UCP is broken. The code disabled PCRE2_UCP for PCRE2 10.34 and older. The comment claimed that this was done as a workaround for a bug fixed in 10.35. You seem to say the same. What was incorrect about the comment? >> Do we need to nest the checks for GIT_PCRE2_VERSION_10_35_OR_HIGHER and >> GIT_PCRE2_VERSION_10_43_OR_HIGHER? Keeping them separate would help >> keep the #ifdef parts as short as possible and maintainable >> independently. > > I thought that nesting them would make it simpler to maintain, since > there are somehow connected anyway. > > The additional flags that are added in PCRE 10.43 are ONLY needed and > useful on top of PCRE2_UCP, which itself only makes sense on top of > UTF. This conditional stacking is complicated. I find the old model where we say which features we want up front and then disable buggy ones for specific versions easier to grasp. >>> @@ -27,13 +34,13 @@ do >>> if ! test_have_prereq PERF_GREP_ENGINES_THREADS >>> then >>> test_perf "grep -P '$pattern'" --prereq PCRE " >>> - git -P grep -f pat || : >>> + git grep -P -f pat || : >>> " >>> else >>> for threads in $GIT_PERF_GREP_THREADS >>> do >>> test_perf "grep -P '$pattern' with $threads threads" --prereq PTHREADS,PCRE " >>> - git -c grep.threads=$threads -P grep -f pat || : >>> + git -c grep.threads=$threads grep -P -f pat || : >> >> What's going on here? You remove the -P option of git (--no-pager) and >> add the -P option of git grep (--perl-regexp). So this perf test never >> tested PCRE despite its name? Seems worth a separate patch. > > My original code was a dud. This would make that at least more obvious, The change is good, but I don't see any connection to the grep.perl.digit feature that this patch is primarily about, so I (still) think it deserves its own patch. >> Do the performance numbers in acabd2048e (grep: correctly identify utf-8 >> characters with \{b,w} in -P, 2023-01-08) still hold up in that light? > > FWIW the performance numbers still hold up, but just because I did the > tests independently using hyperfine, and indeed when I did the first > version of this patch, I had a nice reproducible description that > showed how to get 20% better performance while searching the git > repository itself for something like 4 digits (as used in Copyright > dates). Great! > My idea was to reply to my own post with instructions on how to test > this, which basically require to also compile the recently released > 10.43RC1 and build against that. OK, so this is about the grep.perl.digit feature, right? What I meant above was: Is the statement in acabd2048e (grep: correctly identify utf-8 characters with \{b,w} in -P, 2023-01-08) about 20-40% performance impact (in which direction, by the way) still measurable with the fixed perf test script? With the fix and PCRE2 10.42 and PCRE2_UCP I get: Test this tree ---------------------------------------------- 7822.1: grep -P '\bhow' 0.05(0.02+0.30) 7822.2: grep -P '\bÆvar' 0.05(0.02+0.29) 7822.3: grep -P '\d+ \bÆvar' 0.05(0.02+0.27) 7822.4: grep -P '\bBelón\b' 0.04(0.02+0.23) 7822.5: grep -P '\w{12}\b' 0.03(0.02+0.13) ... and without PCRE2_UCP: Test this tree ---------------------------------------------- 7822.1: grep -P '\bhow' 0.05(0.02+0.26) 7822.2: grep -P '\bÆvar' 0.04(0.02+0.18) 7822.3: grep -P '\d+ \bÆvar' 0.05(0.02+0.26) 7822.4: grep -P '\bBelón\b' 0.05(0.02+0.27) 7822.5: grep -P '\w{12}\b' 0.04(0.02+0.18) Hmm, inconclusive. Perhaps the test data is too small? > Indeed, AFAIK the test would fail if built with an older version of > PCRE, but wasn't sure if making a prerequisite that hardcoded the > version in test-tool might be the best approach to prevent that, > especially with all the libification work. You could extend test-pcre2-config to report whether that feature is active. Performance tests could set prerequisites based on its output. > PS. your reply got lost somehow in my SPAM folder, so I apologize for > the late reply No worries, I need days to reply without any detour through the spam folder.. René
On 02.01.24 20:02, Carlo Arenas wrote: > On Mon, Jan 1, 2024 at 9:18 AM René Scharfe <l.s.r@web.de> wrote: >> Am 01.01.24 um 16:03 schrieb Carlo Marcelo Arenas Belón: >>> @@ -321,17 +327,22 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt >>> } >>> options |= PCRE2_CASELESS; >>> } >>> - if (!opt->ignore_locale && is_utf8_locale() && !literal) >>> - options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF); >>> + if (!opt->ignore_locale && is_utf8_locale() && !literal) { >>> + options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); >>> >>> -#ifndef GIT_PCRE2_VERSION_10_35_OR_HIGHER >>> - /* >>> - * Work around a JIT bug related to invalid Unicode character handling >>> - * fixed in 10.35: >>> - * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d >>> - */ >>> - options &= ~PCRE2_UCP; >>> +#ifdef GIT_PCRE2_VERSION_10_35_OR_HIGHER >>> + /* >>> + * Work around a JIT bug related to invalid Unicode character handling >>> + * fixed in 10.35: >>> + * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d >>> + */ >>> + options |= PCRE2_UCP; >>> +#ifdef GIT_PCRE2_VERSION_10_43_OR_HIGHER >>> + if (!opt->perl_digit) >>> + xoptions |= (PCRE2_EXTRA_ASCII_BSD | PCRE2_EXTRA_ASCII_DIGIT); >>> #endif >>> +#endif >> >> Why do we need that extra level of indentation? > > I was just trying to simplify the code by including all the logic in > one single set. > > The original lack of indentation that was introduced by later fixes to > the code was IMHO also misguided since the obvious "objective" as set > in the original code that added PCRE2_UCP was that it should be used > whenever UTF was also in use as shown by > acabd2048ee0ee53728100408970ab45a6dab65e. > > Of course, we soon found out that the original implementation of > PCRE2_MATCH_INVALID_UTF that came with PCRE2 10.34 was buggy and so an > exception was introduced in 14b9a044798ebb3858a1f1a1377309a3d6054ac8. My early fix attempt[1] also had it indented but Junio argued against it[2]. I see no reason why we should change that now? [1] https://lore.kernel.org/git/20230323144000.21146-1-minipli@grsecurity.net/ [2] https://lore.kernel.org/git/xmqq355va1a2.fsf@gitster.g/ > > Note that the problematic code is only relevant when JIT is also > enabled, but JIT is almost always enabled. Right. But it doesn't hurt to mask a bit that isn't set, the compiler will figure, I guess. > >> The old code turned PCRE2_UCP on by default and turned it off for older >> versions. The new code enables PCRE2_UCP only for newer versions. The >> result should be the same, no? So why change that part at all? > > Because it gets us a little closer to the real reason why we need to > disable UCP for anything older than 10.35, and that is that there is a > bug there that is ONLY relevant if we are using JIT. > > My hope though is that with the release of 10.43 (currently in RC1), > 10.34 will become irrelevant soon enough and this whole code could be > cleaned up further. > >> But the comment is now off, isn't it? The workaround was turning >> PCRE2_UCP off for older versions (because those were broken), not >> turning it on for newer versions (because it would be required by some >> unfixed regression). > > The comment was never correct, because it was turning it off, because > the combination of JIT + MATCH_INVALID_UTF (which was released in > 10.34) + UCP is broken. And what makes the comment wrong? It's mentioning "JIT", "invalid Unicode character handling", "bug" and even the URL to the PCRE2 commit fixing the bug. Moreover is your proposed change making the comment look wrong as it's negating the preprocessor test and sets the PCRE2_UCP bit instead of masking it, suggesting *this* makes it work around the bug, while it's actually the opposite. So, yes, IMHO we should leave that part as-is. > [snip] Cheers, Mathias
diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt index e521f20390..4e405c8ad1 100644 --- a/Documentation/config/grep.txt +++ b/Documentation/config/grep.txt @@ -26,3 +26,7 @@ grep.fullName:: grep.fallbackToNoIndex:: If set to true, fall back to git grep --no-index if git grep is executed outside of a git repository. Defaults to false. + +grep.perl.digit:: + If set to true, use the perl definitions for \d and [:digit:]. + Defaults to false. diff --git a/grep.c b/grep.c index fc2d0c837a..fec36ccb30 100644 --- a/grep.c +++ b/grep.c @@ -88,6 +88,11 @@ int grep_config(const char *var, const char *value, return 0; } + if (!strcmp(var, "grep.perl.digit")) { + opt->perl_digit = git_config_bool(var, value); + return 0; + } + if (!strcmp(var, "color.grep")) opt->color = git_config_colorbool(var, value); if (!strcmp(var, "color.grep.match")) { @@ -301,6 +306,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt int patinforet; size_t jitsizearg; int literal = !opt->ignore_case && (p->fixed || p->is_fixed); + uint32_t xoptions = 0; /* * Call pcre2_general_context_create() before calling any @@ -321,17 +327,22 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt } options |= PCRE2_CASELESS; } - if (!opt->ignore_locale && is_utf8_locale() && !literal) - options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF); + if (!opt->ignore_locale && is_utf8_locale() && !literal) { + options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); -#ifndef GIT_PCRE2_VERSION_10_35_OR_HIGHER - /* - * Work around a JIT bug related to invalid Unicode character handling - * fixed in 10.35: - * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d - */ - options &= ~PCRE2_UCP; +#ifdef GIT_PCRE2_VERSION_10_35_OR_HIGHER + /* + * Work around a JIT bug related to invalid Unicode character handling + * fixed in 10.35: + * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d + */ + options |= PCRE2_UCP; +#ifdef GIT_PCRE2_VERSION_10_43_OR_HIGHER + if (!opt->perl_digit) + xoptions |= (PCRE2_EXTRA_ASCII_BSD | PCRE2_EXTRA_ASCII_DIGIT); #endif +#endif + } #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER /* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */ @@ -339,6 +350,14 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt options |= PCRE2_NO_START_OPTIMIZE; #endif + if (xoptions) { + if (!p->pcre2_compile_context) + p->pcre2_compile_context = pcre2_compile_context_create(p->pcre2_general_context); + + pcre2_set_compile_extra_options(p->pcre2_compile_context, + xoptions); + } + p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern, p->patternlen, options, &error, &erroffset, p->pcre2_compile_context); diff --git a/grep.h b/grep.h index 926c0875c4..cd5c416a0a 100644 --- a/grep.h +++ b/grep.h @@ -4,6 +4,9 @@ #ifdef USE_LIBPCRE2 #define PCRE2_CODE_UNIT_WIDTH 8 #include <pcre2.h> +#if (PCRE2_MAJOR >= 10 && PCRE2_MINOR >= 43) || PCRE2_MAJOR >= 11 +#define GIT_PCRE2_VERSION_10_43_OR_HIGHER +#endif #if (PCRE2_MAJOR >= 10 && PCRE2_MINOR >= 36) || PCRE2_MAJOR >= 11 #define GIT_PCRE2_VERSION_10_36_OR_HIGHER #endif @@ -178,6 +181,8 @@ struct grep_opt { void (*output)(struct grep_opt *opt, const void *data, size_t size); void *output_priv; + + unsigned perl_digit:1; }; #define GREP_OPT_INIT { \ diff --git a/t/perf/p7822-grep-perl-character.sh b/t/perf/p7822-grep-perl-character.sh index 87009c60df..cc88d5a695 100755 --- a/t/perf/p7822-grep-perl-character.sh +++ b/t/perf/p7822-grep-perl-character.sh @@ -8,6 +8,13 @@ etc.) we will test the patterns under those numbers of threads. . ./perf-lib.sh +# setting a LOCALE is needed, but not yet supported by : +#. "$TEST_DIRECTORY"/lib-gettext.sh + +# Invoke like: +# +# LC_ALL=is_IS.utf8 ./p7822-grep-perl-character.sh + test_perf_large_repo test_checkout_worktree @@ -27,13 +34,13 @@ do if ! test_have_prereq PERF_GREP_ENGINES_THREADS then test_perf "grep -P '$pattern'" --prereq PCRE " - git -P grep -f pat || : + git grep -P -f pat || : " else for threads in $GIT_PERF_GREP_THREADS do test_perf "grep -P '$pattern' with $threads threads" --prereq PTHREADS,PCRE " - git -c grep.threads=$threads -P grep -f pat || : + git -c grep.threads=$threads grep -P -f pat || : " done fi diff --git a/t/t7818-grep-digit.sh b/t/t7818-grep-digit.sh new file mode 100755 index 0000000000..44007e6be6 --- /dev/null +++ b/t/t7818-grep-digit.sh @@ -0,0 +1,32 @@ +#!/bin/sh + +test_description='git grep -P with digits' + +TEST_PASSES_SANITIZE_LEAK=true +. ./lib-gettext.sh + +test_expect_success 'setup' ' + echo 2023 >ascii && + printf "\357\274\222\357\274\220\357\274\222\357\274\223\n" >fullwidth && + printf "\331\241\331\244\331\244\331\245\n" >multibyte && + git add . && + git commit -m. && + LC_ALL="$is_IS_locale" && + export LC_ALL +' + +test_expect_success PCRE 'grep -P "\d"' ' + echo "ascii:2023" >expected && + git grep -P "\d{2}[[:digit:]]{2}" >actual && + test_cmp expected actual +' + +test_expect_success PCRE 'git -c grep.perl.digit' ' + test_config grep.perl.digit true && + git grep -P "\d{2}[[:digit:]]{2}" >actual && + grep fullwidth actual && + grep multibyte actual && + test_line_count = 3 actual +' + +test_done
Since acabd2048e (grep: correctly identify utf-8 characters with \{b,w} in -P, 2023-01-08), PCRE2's UCP mode has been enabled when UTF content was expected and therefore muktibyte characters are included as part of all character classes (when defined). note that if the locale used is not UTF enabled (specially: C, POSIX) or uses an extended charmap binary that is not unicode compatible, binary match will be used instead. It was argued that doing so, at least for \d, was not a good idea, as that might not be what the user expected based on its historical meaning and was also slower, and indeed a similar change that was done to GNU grep was reverted and required further tweaks. At that time, PCRE2 didn't have a way to disable UCP's character expansion selectively, but flags to do so, and that will be available in the next release (planned soon) were added, and one of them has been in use by GNU grep since their last release in May (only if built and linked against the prereleased PCRE2 library though). Add flags to make sure that both \d and [:digit:] only include ASCII digits so that `git grep` will be closer to GNU grep and improve performance as a side effect, but add a configuration flag to allow keeping the current behaviour (which is closer to perl and ripgrep). Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com> --- Documentation/config/grep.txt | 4 ++++ grep.c | 37 ++++++++++++++++++++++------- grep.h | 5 ++++ t/perf/p7822-grep-perl-character.sh | 11 +++++++-- t/t7818-grep-digit.sh | 32 +++++++++++++++++++++++++ 5 files changed, 78 insertions(+), 11 deletions(-) create mode 100755 t/t7818-grep-digit.sh