diff mbox series

[v12,3/3] grep: fix an edge case concerning ascii patterns and UTF-8 data

Message ID 20211008224918.603392-3-someguy@effective-light.com (mailing list archive)
State Superseded
Headers show
Series [v12,1/3] grep: refactor next_match() and match_one_pattern() for external use | expand

Commit Message

Hamza Mahfooz Oct. 8, 2021, 10:49 p.m. UTC
If we attempt to grep non-ascii log message text with an ascii pattern, we
run into the following issue:

    $ git log --color --author='.var.*Bjar' -1 origin/master | grep ^Author
    grep: (standard input): binary file matches

So, to fix this teach the grep code to mark the pattern as UTF-8 (even if
the pattern is composed of only ascii characters), so long as the log
output is encoded using UTF-8.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Hamza Mahfooz <someguy@effective-light.com>
---
v12: get rid of utf8_all_the_things and fix an issue with one of the unit
     tests.
---
 grep.c                          |  6 +++--
 t/t7812-grep-icase-non-ascii.sh | 48 +++++++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+), 2 deletions(-)

Comments

Ævar Arnfjörð Bjarmason Oct. 9, 2021, 1:37 a.m. UTC | #1
On Fri, Oct 08 2021, Hamza Mahfooz wrote:

> If we attempt to grep non-ascii log message text with an ascii pattern, we
> run into the following issue:
>
>     $ git log --color --author='.var.*Bjar' -1 origin/master | grep ^Author
>     grep: (standard input): binary file matches
>
> So, to fix this teach the grep code to mark the pattern as UTF-8 (even if
> the pattern is composed of only ascii characters), so long as the log
> output is encoded using UTF-8.
>
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> Signed-off-by: Hamza Mahfooz <someguy@effective-light.com>
> ---
> v12: get rid of utf8_all_the_things and fix an issue with one of the unit
>      tests.
> ---
>  grep.c                          |  6 +++--
>  t/t7812-grep-icase-non-ascii.sh | 48 +++++++++++++++++++++++++++++++++
>  2 files changed, 52 insertions(+), 2 deletions(-)
>
> diff --git a/grep.c b/grep.c
> index fe847a0111..f6e113e9f0 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -382,8 +382,10 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>  		}
>  		options |= PCRE2_CASELESS;
>  	}
> -	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
> -	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
> +	if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) ||
> +	    (!opt->ignore_locale && is_utf8_locale() &&
> +	     has_non_ascii(p->pattern) && !(!opt->ignore_case &&
> +					    (p->fixed || p->is_fixed))))
>  		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);

I think at least some of that existing "if" is my fault, and I *think*
your patch works here, but FWIW I'd find something like this way more
readable:
	
	@@ -382,8 +383,16 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
	                }
	                options |= PCRE2_CASELESS;
	        }
	+       if (opt->utf8_all_the_things) {
	+               options |= PCRE2_UCP;
	+               do_utf8 = 1;
	+       }
	+               
	        if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
	            !(!opt->ignore_case && (p->fixed || p->is_fixed)))
	+               do_utf8 = 1;
	+
	+       if (do_utf8)
	                options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);

Well, without the "utf8_all_the_things" probably. That's a reference to
a popular meme, probably better to name it differently, and the
PCRE2_UCP is just something I was experimenting with.

It's late here, but I've got to admit that I'm still a bit confused by
this. Let's see if I can try to sum up why:

Ultimately whether we use PCRE2_UTF *should* have nothing do to with
whether the pattern is UTF-8 or not, because even an expression like:

    /.*/

Will behave differently under UTF-8, i.e. character classes change, byte
boundaries change to "character" boundaries etc.

That the existing code has has_non_ascii() and the like is a trade-off
that had to be made for the grep-a-file case, because you might run into
arbitrary binary data, but logs are cleaner/encoded/re-encoded etc.

If you run PCRE in UTF-8 mode it will die on some of that data (as
you'll see from our test suite if you turn it on unconditionally).

Are there cases where my "utf8_all_the_things" POC patch would have
turned on PCRE2_UTF, but yours doesn't? IOW is there a 1=1 mapping still
between the encoding mode log/revision.c thinks it's in & PCRE2_UTF?

Anyway, I've tried to break things with this patch and haven't
succeeded, so maybe it's all OK now, thanks a lot for working on this
again, it's a really neat feature. Just wondering if there's any
remaining edge cases we know about...
Hamza Mahfooz Oct. 12, 2021, 11:01 p.m. UTC | #2
On Sat, Oct 9 2021 at 03:37:58 AM +0200, Ævar Arnfjörð Bjarmason 
<avarab@gmail.com> wrote:
> Are there cases where my "utf8_all_the_things" POC patch would have
> turned on PCRE2_UTF, but yours doesn't? IOW is there a 1=1 mapping 
> still
> between the encoding mode log/revision.c thinks it's in & PCRE2_UTF?

As far as I can tell, the only place where ignore_locale is mutated is 
in
setup_revisions(), so there should be a 1=1 mapping.
diff mbox series

Patch

diff --git a/grep.c b/grep.c
index fe847a0111..f6e113e9f0 100644
--- a/grep.c
+++ b/grep.c
@@ -382,8 +382,10 @@  static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		}
 		options |= PCRE2_CASELESS;
 	}
-	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
-	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
+	if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) ||
+	    (!opt->ignore_locale && is_utf8_locale() &&
+	     has_non_ascii(p->pattern) && !(!opt->ignore_case &&
+					    (p->fixed || p->is_fixed))))
 		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
 
 #ifdef GIT_PCRE2_VERSION_10_36_OR_HIGHER
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index e5d1e4ea68..22487d90fd 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -53,6 +53,54 @@  test_expect_success REGEX_LOCALE 'pickaxe -i on non-ascii' '
 	test_cmp expected actual
 '
 
+test_expect_success GETTEXT_LOCALE,PCRE 'log --author with an ascii pattern on UTF-8 data' '
+	cat >expected <<-\EOF &&
+	Author: <BOLD;RED>À Ú Thor<RESET> <author@example.com>
+	EOF
+	test_write_lines "forth" >file4 &&
+	git add file4 &&
+	git commit --author="À Ú Thor <author@example.com>" -m sécond &&
+	git log -1 --color=always --perl-regexp --author=".*Thor" >log &&
+	grep Author log >actual.raw &&
+	test_decode_color <actual.raw >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success GETTEXT_LOCALE,PCRE 'log --committer with an ascii pattern on ISO-8859-1 data' '
+	cat >expected <<-\EOF &&
+	Commit:     Ç<BOLD;RED> O Mîtter <committer@example.com><RESET>
+	EOF
+	test_write_lines "fifth" >file5 &&
+	git add file5 &&
+	GIT_COMMITTER_NAME="Ç O Mîtter" &&
+	GIT_COMMITTER_EMAIL="committer@example.com" &&
+	git -c i18n.commitEncoding=latin1 commit -m thïrd &&
+	git -c i18n.logOutputEncoding=latin1 log -1 --pretty=fuller --color=always --perl-regexp --committer=" O.*" >log &&
+	grep Commit: log >actual.raw &&
+	test_decode_color <actual.raw >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success GETTEXT_LOCALE,PCRE 'log --grep with an ascii pattern on UTF-8 data' '
+	cat >expected <<-\EOF &&
+	    sé<BOLD;RED>con<RESET>d
+	EOF
+	git log -1 --color=always --perl-regexp --grep="con" >log &&
+	grep con log >actual.raw &&
+	test_decode_color <actual.raw >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success GETTEXT_LOCALE,PCRE 'log --grep with an ascii pattern on ISO-8859-1 data' '
+	cat >expected <<-\EOF &&
+	    <BOLD;RED>thïrd<RESET>
+	EOF
+	git -c i18n.logOutputEncoding=latin1 log -1 --color=always --perl-regexp --grep="th.*rd" >log &&
+	grep "th.*rd" log >actual.raw &&
+	test_decode_color <actual.raw >actual &&
+	test_cmp expected actual
+'
+
 test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: setup invalid UTF-8 data' '
 	printf "\\200\\n" >invalid-0x80 &&
 	echo "ævar" >expected &&