Message ID | 20211008224918.603392-3-someguy@effective-light.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | [v12,1/3] grep: refactor next_match() and match_one_pattern() for external use | expand |
On Fri, Oct 08 2021, Hamza Mahfooz wrote: > If we attempt to grep non-ascii log message text with an ascii pattern, we > run into the following issue: > > $ git log --color --author='.var.*Bjar' -1 origin/master | grep ^Author > grep: (standard input): binary file matches > > So, to fix this teach the grep code to mark the pattern as UTF-8 (even if > the pattern is composed of only ascii characters), so long as the log > output is encoded using UTF-8. > > Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> > Signed-off-by: Hamza Mahfooz <someguy@effective-light.com> > --- > v12: get rid of utf8_all_the_things and fix an issue with one of the unit > tests. > --- > grep.c | 6 +++-- > t/t7812-grep-icase-non-ascii.sh | 48 +++++++++++++++++++++++++++++++++ > 2 files changed, 52 insertions(+), 2 deletions(-) > > diff --git a/grep.c b/grep.c > index fe847a0111..f6e113e9f0 100644 > --- a/grep.c > +++ b/grep.c > @@ -382,8 +382,10 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt > } > options |= PCRE2_CASELESS; > } > - if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && > - !(!opt->ignore_case && (p->fixed || p->is_fixed))) > + if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) || > + (!opt->ignore_locale && is_utf8_locale() && > + has_non_ascii(p->pattern) && !(!opt->ignore_case && > + (p->fixed || p->is_fixed)))) > options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); I think at least some of that existing "if" is my fault, and I *think* your patch works here, but FWIW I'd find something like this way more readable: @@ -382,8 +383,16 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt } options |= PCRE2_CASELESS; } + if (opt->utf8_all_the_things) { + options |= PCRE2_UCP; + do_utf8 = 1; + } + if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && !(!opt->ignore_case && (p->fixed || p->is_fixed))) + do_utf8 = 1; + + if (do_utf8) options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); Well, without the "utf8_all_the_things" probably. That's a reference to a popular meme, probably better to name it differently, and the PCRE2_UCP is just something I was experimenting with. It's late here, but I've got to admit that I'm still a bit confused by this. Let's see if I can try to sum up why: Ultimately whether we use PCRE2_UTF *should* have nothing do to with whether the pattern is UTF-8 or not, because even an expression like: /.*/ Will behave differently under UTF-8, i.e. character classes change, byte boundaries change to "character" boundaries etc. That the existing code has has_non_ascii() and the like is a trade-off that had to be made for the grep-a-file case, because you might run into arbitrary binary data, but logs are cleaner/encoded/re-encoded etc. If you run PCRE in UTF-8 mode it will die on some of that data (as you'll see from our test suite if you turn it on unconditionally). Are there cases where my "utf8_all_the_things" POC patch would have turned on PCRE2_UTF, but yours doesn't? IOW is there a 1=1 mapping still between the encoding mode log/revision.c thinks it's in & PCRE2_UTF? Anyway, I've tried to break things with this patch and haven't succeeded, so maybe it's all OK now, thanks a lot for working on this again, it's a really neat feature. Just wondering if there's any remaining edge cases we know about...
On Sat, Oct 9 2021 at 03:37:58 AM +0200, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: > Are there cases where my "utf8_all_the_things" POC patch would have > turned on PCRE2_UTF, but yours doesn't? IOW is there a 1=1 mapping > still > between the encoding mode log/revision.c thinks it's in & PCRE2_UTF? As far as I can tell, the only place where ignore_locale is mutated is in setup_revisions(), so there should be a 1=1 mapping.
diff --git a/grep.c b/grep.c index fe847a0111..f6e113e9f0 100644 --- a/grep.c +++ b/grep.c @@ -382,8 +382,10 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt } options |= PCRE2_CASELESS; } - if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && - !(!opt->ignore_case && (p->fixed || p->is_fixed))) + if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) || + (!opt->ignore_locale && is_utf8_locale() && + has_non_ascii(p->pattern) && !(!opt->ignore_case && + (p->fixed || p->is_fixed)))) options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); #ifdef GIT_PCRE2_VERSION_10_36_OR_HIGHER diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh index e5d1e4ea68..22487d90fd 100755 --- a/t/t7812-grep-icase-non-ascii.sh +++ b/t/t7812-grep-icase-non-ascii.sh @@ -53,6 +53,54 @@ test_expect_success REGEX_LOCALE 'pickaxe -i on non-ascii' ' test_cmp expected actual ' +test_expect_success GETTEXT_LOCALE,PCRE 'log --author with an ascii pattern on UTF-8 data' ' + cat >expected <<-\EOF && + Author: <BOLD;RED>À Ú Thor<RESET> <author@example.com> + EOF + test_write_lines "forth" >file4 && + git add file4 && + git commit --author="À Ú Thor <author@example.com>" -m sécond && + git log -1 --color=always --perl-regexp --author=".*Thor" >log && + grep Author log >actual.raw && + test_decode_color <actual.raw >actual && + test_cmp expected actual +' + +test_expect_success GETTEXT_LOCALE,PCRE 'log --committer with an ascii pattern on ISO-8859-1 data' ' + cat >expected <<-\EOF && + Commit: Ç<BOLD;RED> O Mîtter <committer@example.com><RESET> + EOF + test_write_lines "fifth" >file5 && + git add file5 && + GIT_COMMITTER_NAME="Ç O Mîtter" && + GIT_COMMITTER_EMAIL="committer@example.com" && + git -c i18n.commitEncoding=latin1 commit -m thïrd && + git -c i18n.logOutputEncoding=latin1 log -1 --pretty=fuller --color=always --perl-regexp --committer=" O.*" >log && + grep Commit: log >actual.raw && + test_decode_color <actual.raw >actual && + test_cmp expected actual +' + +test_expect_success GETTEXT_LOCALE,PCRE 'log --grep with an ascii pattern on UTF-8 data' ' + cat >expected <<-\EOF && + sé<BOLD;RED>con<RESET>d + EOF + git log -1 --color=always --perl-regexp --grep="con" >log && + grep con log >actual.raw && + test_decode_color <actual.raw >actual && + test_cmp expected actual +' + +test_expect_success GETTEXT_LOCALE,PCRE 'log --grep with an ascii pattern on ISO-8859-1 data' ' + cat >expected <<-\EOF && + <BOLD;RED>thïrd<RESET> + EOF + git -c i18n.logOutputEncoding=latin1 log -1 --color=always --perl-regexp --grep="th.*rd" >log && + grep "th.*rd" log >actual.raw && + test_decode_color <actual.raw >actual && + test_cmp expected actual +' + test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: setup invalid UTF-8 data' ' printf "\\200\\n" >invalid-0x80 && echo "ævar" >expected &&