Message ID | ba503995-866d-fc80-4e38-b4dfd0e5c5bc@web.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/2] grep/pcre2: use PCRE2_UTF even with ASCII patterns | expand |
On Sat, Dec 18 2021, René Scharfe wrote: > Patterns that contain no wildcards and don't have to be case-folded are > literal. Give this condition a name to increase the readability of the > boolean expression for enabling the option PCRE2_UTF. > > Signed-off-by: René Scharfe <l.s.r@web.de> > --- > grep.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/grep.c b/grep.c > index 5badb6d851..2b6ac3205d 100644 > --- a/grep.c > +++ b/grep.c > @@ -362,6 +362,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt > int jitret; > int patinforet; > size_t jitsizearg; > + int literal = !opt->ignore_case && (p->fixed || p->is_fixed); > > /* > * Call pcre2_general_context_create() before calling any > @@ -382,8 +383,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt > } > options |= PCRE2_CASELESS; > } > - if (!opt->ignore_locale && is_utf8_locale() && > - !(!opt->ignore_case && (p->fixed || p->is_fixed))) > + if (!opt->ignore_locale && is_utf8_locale() && !literal) > options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); > > #ifdef GIT_PCRE2_VERSION_10_36_OR_HIGHER I think for this and 1/2 it would be really nice to pick up a version of Hamza's CI changes: https://lore.kernel.org/git/20211118084143.279174-2-someguy@effective-light.com/ Aside: Not needed for this change, but I wonder if we could benefit minutely from: #ifdef PCRE2_LITERAL options |= PCRE2_LITERAL; #endif It'll save PCRE2 the small effort of finding that we've got no metacharacters.
René Scharfe <l.s.r@web.de> writes: > Patterns that contain no wildcards and don't have to be case-folded are > literal. Give this condition a name to increase the readability of the > boolean expression for enabling the option PCRE2_UTF. Makes sense. Quite a lot. Thanks.
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > I think for this and 1/2 it would be really nice to pick up a version of > Hamza's CI changes: > https://lore.kernel.org/git/20211118084143.279174-2-someguy@effective-light.com/ Are there so many incompatible versions of pcre2 library whose usage are subtly different that we need to protect ourselves with multiple variants in CI from breakages? I doubt pcre2 library was _that_ bad. Adding a special task that builds with the minimum version we support may not be too bad, but the library should be stable enough to allow us to declare it sufficient to test the most common version with the most common build options in our ordinary build job(s).
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > Aside: Not needed for this change, but I wonder if we could benefit minutely > from: > > #ifdef PCRE2_LITERAL > options |= PCRE2_LITERAL; > #endif > > It'll save PCRE2 the small effort of finding that we've got no metacharacters. After the dust settles, it would be a good addition in a separate patch. Thanks.
On Mon, Dec 20 2021, Junio C Hamano wrote: > Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > >> I think for this and 1/2 it would be really nice to pick up a version of >> Hamza's CI changes: >> https://lore.kernel.org/git/20211118084143.279174-2-someguy@effective-light.com/ > > Are there so many incompatible versions of pcre2 library whose usage > are subtly different that we need to protect ourselves with multiple > variants in CI from breakages? > > I doubt pcre2 library was _that_ bad. It's really not, but: * We have an optional >=10.34 feature we use (albeit trivial) * We have an optional >=10.36 feature we use (major, and directly related) * We might be targeting JIT or not, and the error handling isn't the same (known PCRE caveat) * We might be targeting a PCRE that knows about Unicode, or not * We use it in a mode where we might feed UTF-8 invalid data into the UTF-8 mode Any update to the relevant code really needs to test the combination of those, so it's a perfect target for CI to make that less tedious. > Adding a special task that builds with the minimum version we > support may not be too bad, but the library should be stable enough > to allow us to declare it sufficient to test the most common version > with the most common build options in our ordinary build job(s). That's a nice idea, but not the reality of the situation. Unless we're willing to bump the version requirement & insist on JIT && Unicode support before using it. The CI itself should be realtively cheap, and just runs the few tests that would spot any breakages with the above.
diff --git a/grep.c b/grep.c index 5badb6d851..2b6ac3205d 100644 --- a/grep.c +++ b/grep.c @@ -362,6 +362,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt int jitret; int patinforet; size_t jitsizearg; + int literal = !opt->ignore_case && (p->fixed || p->is_fixed); /* * Call pcre2_general_context_create() before calling any @@ -382,8 +383,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt } options |= PCRE2_CASELESS; } - if (!opt->ignore_locale && is_utf8_locale() && - !(!opt->ignore_case && (p->fixed || p->is_fixed))) + if (!opt->ignore_locale && is_utf8_locale() && !literal) options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); #ifdef GIT_PCRE2_VERSION_10_36_OR_HIGHER
Patterns that contain no wildcards and don't have to be case-folded are literal. Give this condition a name to increase the readability of the boolean expression for enabling the option PCRE2_UTF. Signed-off-by: René Scharfe <l.s.r@web.de> --- grep.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) -- 2.34.0