[0/5] const-correctness in grep.c

Message ID	YUlVZk1xXulAqdef@coredump.intra.peff.net (mailing list archive)
Headers	Return-Path: <git-owner@kernel.org>
	Date: Mon, 20 Sep 2021 23:45:42 -0400
	From: Jeff King <peff@peff.net>
	To: git@vger.kernel.org
	Cc: Hamza Mahfooz <someguy@effective-light.com>
	Subject: [PATCH 0/5] const-correctness in grep.c
	Message-ID: <YUlVZk1xXulAqdef@coredump.intra.peff.net>
	MIME-Version: 1.0
	Content-Type: text/plain; charset=utf-8
	Content-Disposition: inline
	Precedence: bulk
Series	const-correctness in grep.c
	[0/5] const-correctness in grep.c
	[1/5] grep: stop modifying buffer in strip_timestamp
	[2/5] grep: stop modifying buffer in show_line()
	[3/5] grep: stop modifying buffer in grep_source_1()
	[4/5] grep: mark "haystack" buffers as const
	[5/5] grep: store grep_source buffer as const
	[6/5] grep.c: mark eol/bol and derived as "const char * const"

Message

Jeff King Sept. 21, 2021, 3:45 a.m. UTC

While discussing [1], I noticed that the grep code mostly takes
non-const buffers, even though it is conceptually a read-only operation
to search in them. The culprit is a handful of spots that temporarily
tie off NUL-terminated strings by overwriting a byte of the buffer and
then restoring it. But I think we no longer need to do so these days,
now that we have a regexec_buf() that can take a ptr/size pair.
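
To make the contrast concrete, here is a minimal sketch. It is not code
from the series: both function names are hypothetical, and a plain
substring search stands in for the regex call. The first function needs
a writable buffer only because it ties off the line with a NUL; the
second takes a ptr/size pair and can therefore accept a const buffer.

```c
#include <stddef.h>
#include <string.h>

/*
 * Tie-off style (hypothetical helper): temporarily overwrite buf[len]
 * with a NUL so a NUL-terminated API (strstr) can be used, then restore
 * it. Assumes buf[len] is a valid, writable byte, e.g. the newline.
 */
static int line_has_tieoff(char *buf, size_t len, const char *needle)
{
	char saved = buf[len];
	int hit;

	buf[len] = '\0';
	hit = strstr(buf, needle) != NULL;
	buf[len] = saved;	/* put the overwritten byte back */
	return hit;
}

/*
 * Bounded style (hypothetical helper): search exactly len bytes and
 * never write to the buffer, so the parameter can be const.
 */
static int line_has_bounded(const char *buf, size_t len, const char *needle)
{
	size_t nlen = strlen(needle);
	size_t i;

	if (nlen > len)
		return 0;
	for (i = 0; i + nlen <= len; i++)
		if (!memcmp(buf + i, needle, nlen))
			return 1;
	return 0;
}
```

Both return the same answer for a line such as the first 7 bytes of
"foo bar\nbaz\n", but only the bounded version can be called on a
string literal or any other read-only buffer.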

The first three patches are a bit repetitive, but I broke them up
individually because they're the high-risk part. I.e., if my assumptions
about needing the NUL are wrong, it could introduce a bug. But based on
my reading of the code, plus running the test suite with ASan/UBSan, I
feel reasonably confident.

The last two are the bigger cleanups, but should obviously avoid any
behavior changes.

  [1/5]: grep: stop modifying buffer in strip_timestamp
  [2/5]: grep: stop modifying buffer in show_line()
  [3/5]: grep: stop modifying buffer in grep_source_1()
  [4/5]: grep: mark "haystack" buffers as const
  [5/5]: grep: store grep_source buffer as const

 grep.c | 87 +++++++++++++++++++++++++++++-----------------------------
 grep.h |  4 +--
 2 files changed, 45 insertions(+), 46 deletions(-)

-Peff

[1] https://lore.kernel.org/git/YUk3zwuse56v76ze@coredump.intra.peff.net/

Comments

Taylor Blau Sept. 21, 2021, 4:30 a.m. UTC | #1

On Mon, Sep 20, 2021 at 11:45:42PM -0400, Jeff King wrote:
> While discussing [1], I noticed that the grep code mostly takes
> non-const buffers, even though it is conceptually a read-only operation
> to search in them. The culprit is a handful of spots that temporarily
> tie off NUL-terminated strings by overwriting a byte of the buffer and
> then restoring it. But I think we no longer need to do so these days,
> now that we have a regexec_buf() that can take a ptr/size pair.

This all looks very reasonable to me. I appreciated the way you broke up
each spot that unnecessarily modified a buffer into its own patch with
its own explanation.

I looked through each of the three spots you mentioned with a close eye
and concurred with your reasoning. (To the extent it was possible, I
tried to ignore most of your commentary until I had generated my own
understanding while searching through my copy of grep.c).

Thanks for an enjoyable set of patches to read.

    Reviewed-by: Taylor Blau <me@ttaylorr.com>

> [1] https://lore.kernel.org/git/YUk3zwuse56v76ze@coredump.intra.peff.net/

Thanks,
Taylor

Ævar Arnfjörð Bjarmason Sept. 21, 2021, 12:07 p.m. UTC | #2

On Mon, Sep 20 2021, Jeff King wrote:

> While discussing [1], I noticed that the grep code mostly takes
> non-const buffers, even though it is conceptually a read-only operation
> to search in them. The culprit is a handful of spots that temporarily
> tie off NUL-terminated strings by overwriting a byte of the buffer and
> then restoring it. But I think we no longer need to do so these days,
> now that we have a regexec_buf() that can take a ptr/size pair.
>
> The first three patches are a bit repetitive, but I broke them up
> individually because they're the high-risk part. I.e., if my assumptions
> about needing the NUL are wrong, it could introduce a bug. But based on
> my reading of the code, plus running the test suite with ASan/UBSan, I
> feel reasonably confident.
>
> The last two are the bigger cleanups, but should obviously avoid any
> behavior changes.
>
>   [1/5]: grep: stop modifying buffer in strip_timestamp
>   [2/5]: grep: stop modifying buffer in show_line()
>   [3/5]: grep: stop modifying buffer in grep_source_1()
>   [4/5]: grep: mark "haystack" buffers as const
>   [5/5]: grep: store grep_source buffer as const
>
>  grep.c | 87 +++++++++++++++++++++++++++++-----------------------------
>  grep.h |  4 +--
>  2 files changed, 45 insertions(+), 46 deletions(-)
>
> -Peff
>
> [1] https://lore.kernel.org/git/YUk3zwuse56v76ze@coredump.intra.peff.net/

This whole thing looks good to me. I only found a small whitespace nit
in one of the patches. Did you consider following up by having this code
take const char*/const size_t pairs? E.g. starting with something like
the below.

Callers of this API already pass such pairs, and the regex functions
at the bottom expect them, but we have all the bol/eol twiddling in the
middle, which is often confusing because some functions pass the
pointers along as-is, and some modify them. So not for now, but I think
rolling with what I started below would make sense for this file
eventually:

diff --git a/grep.c b/grep.c
index 14fe8a0fd23..f55ec5c0e09 100644
--- a/grep.c
+++ b/grep.c
@@ -436,7 +436,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 	}
 }
 
-static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
+static int pcre2match(struct grep_pat *p, const char *line, const size_t len,
 		regmatch_t *match, int eflags)
 {
 	int ret, flags = 0;
@@ -448,11 +448,11 @@ static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
 
 	if (p->pcre2_jit_on)
 		ret = pcre2_jit_match(p->pcre2_pattern, (unsigned char *)line,
-				      eol - line, 0, flags, p->pcre2_match_data,
+				      len, 0, flags, p->pcre2_match_data,
 				      NULL);
 	else
 		ret = pcre2_match(p->pcre2_pattern, (unsigned char *)line,
-				  eol - line, 0, flags, p->pcre2_match_data,
+				  len, 0, flags, p->pcre2_match_data,
 				  NULL);
 
 	if (ret < 0 && ret != PCRE2_ERROR_NOMATCH) {
@@ -909,15 +909,15 @@ static void show_name(struct grep_opt *opt, const char *name)
 }
 
 static int patmatch(struct grep_pat *p,
-		    const char *line, const char *eol,
+		    const char *line, const size_t len,
 		    regmatch_t *match, int eflags)
 {
 	int hit;
 
 	if (p->pcre2_pattern)
-		hit = !pcre2match(p, line, eol, match, eflags);
+		hit = !pcre2match(p, line, len, match, eflags);
 	else
-		hit = !regexec_buf(&p->regexp, line, eol - line, 1, match,
+		hit = !regexec_buf(&p->regexp, line, len, 1, match,
 				   eflags);
 
 	return hit;
@@ -976,7 +976,7 @@ static int match_one_pattern(struct grep_pat *p,
 	}
 
  again:
-	hit = patmatch(p, bol, eol, pmatch, eflags);
+	hit = patmatch(p, bol, eol - bol, pmatch, eflags);
 
 	if (hit && p->word_regexp) {
 		if ((pmatch[0].rm_so < 0) ||
@@ -1447,7 +1447,7 @@ static int look_ahead(struct grep_opt *opt,
 		int hit;
 		regmatch_t m;
 
-		hit = patmatch(p, bol, bol + *left_p, &m, 0);
+		hit = patmatch(p, bol, *left_p, &m, 0);
 		if (!hit || m.rm_so < 0 || m.rm_eo < 0)
 			continue;
 		if (earliest < 0 || m.rm_so < earliest)

Jeff King Sept. 21, 2021, 2:49 p.m. UTC | #3

On Tue, Sep 21, 2021 at 02:07:08PM +0200, Ævar Arnfjörð Bjarmason wrote:

> This whole thing looks good to me. I only found a small whitespace nit
> in one of the patches. Did you consider following up by having this code
> take const char*/const size_t pairs? E.g. starting with something like
> the below.

I do generally find ptr/len pairs to be easier to read, but they also
make it really easy to introduce subtle bugs. E.g., if you consume part
of a buffer, you have to tweak both the ptr and the len. So the current:

	while (word_char(bol[-1]) && bol < eol)
		bol++;

has to become:

	while (word_char(bol[-1]) && len > 0) {
		bol++;
		len--;
	}

So I'd be hesitant to churn battle-tested code in such a way for what I
consider to be a pretty minor benefit.
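
As a self-contained illustration of that trade-off (a simplified
sketch, not the grep.c loop itself: is_word_char() is a hypothetical
stand-in for word_char(), and both helpers just skip leading word
characters), the two styles might look like this:

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for grep.c's word_char(). */
static int is_word_char(char ch)
{
	return isalnum((unsigned char)ch) || ch == '_';
}

/* Pointer-pair style: only bol changes as input is consumed. */
static const char *skip_word_ptrs(const char *bol, const char *eol)
{
	while (bol < eol && is_word_char(*bol))
		bol++;
	return bol;
}

/*
 * ptr/len style: bol and len must be updated together. Dropping the
 * len-- would still compile, but the loop could then read past the
 * end of the buffer.
 */
static const char *skip_word_len(const char *bol, size_t len)
{
	while (len > 0 && is_word_char(*bol)) {
		bol++;
		len--;
	}
	return bol;
}
```

The two helpers agree on every input; the difference is only in how
many variables a later edit has to keep in sync.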

I did notice the ugly use of "unsigned long" here in a few places
(rather than size_t). I do think it is worth fixing, but it seemed a
little too far to try to cram into this series (it's obviously touching
the same lines, but it's quite orthogonal semantically).

The other hesitation I had is that the source of this "unsigned long"
pattern is almost certainly the object code (which is much more
important to convert, as it blocks people from having >4GB objects on
Windows). So we might want to just wait for a larger conversion there.
OTOH, I don't think there is any downside to a partial conversion here
in the meantime (because size_t will always be at least as long as
"unsigned long" in practice).

> -static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
> +static int pcre2match(struct grep_pat *p, const char *line, const size_t len,
>  		regmatch_t *match, int eflags)
>  {
>  	int ret, flags = 0;
> @@ -448,11 +448,11 @@ static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
>  
>  	if (p->pcre2_jit_on)
>  		ret = pcre2_jit_match(p->pcre2_pattern, (unsigned char *)line,
> -				      eol - line, 0, flags, p->pcre2_match_data,
> +				      len, 0, flags, p->pcre2_match_data,
>  				      NULL);
>  	else
>  		ret = pcre2_match(p->pcre2_pattern, (unsigned char *)line,
> -				  eol - line, 0, flags, p->pcre2_match_data,
> +				  len, 0, flags, p->pcre2_match_data,
>  				  NULL);

Not related to your point, but these casts are funny now. They are meant
to cast to "unsigned char" pointers to match pcre's signature, but now
they are casting away const-ness, too. That might be worth fixing as
part of this series.

Though should they really be casting to PCRE2_SPTR? The types are opaque
in their API because of the weird multi-width thing, though I find it
hard to imagine us ever using the wider versions of the library.
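
A sketch of the cast issue, under stated assumptions: subject_ptr and
count_byte() below are hypothetical stand-ins for a PCRE2-style API
whose subject parameter points to const bytes (which, as far as I know,
is what PCRE2_SPTR is in the 8-bit library). Nothing here includes or
calls the real library.

```c
#include <stddef.h>

/* Hypothetical stand-in for a const subject-pointer type. */
typedef const unsigned char *subject_ptr;

/* Hypothetical stand-in for a matcher that only reads the subject. */
static size_t count_byte(subject_ptr subject, size_t len, unsigned char want)
{
	size_t i, n = 0;

	for (i = 0; i < len; i++)
		if (subject[i] == want)
			n++;
	return n;
}

static size_t count_in_line(const char *line, size_t len, char want)
{
	/*
	 * Casting with (unsigned char *)line would also compile, but it
	 * silently drops the const qualifier. Casting to the const
	 * pointer type changes signedness only.
	 */
	return count_byte((subject_ptr)line, len, (unsigned char)want);
}
```

With the const-preserving cast, a compiler warning such as -Wcast-qual
stays meaningful for the rest of the file.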

-Peff

Junio C Hamano Sept. 22, 2021, 6:57 p.m. UTC | #4

Jeff King <peff@peff.net> writes:

> While discussing [1], I noticed that the grep code mostly takes
> non-const buffers, even though it is conceptually a read-only operation
> to search in them. The culprit is a handful of spots that temporarily
> tie off NUL-terminated strings by overwriting a byte of the buffer and
> then restoring it. But I think we no longer need to do so these days,
> now that we have a regexec_buf() that can take a ptr/size pair.

Yes, the haystack has not been read-only exactly because we didn't
have a <ptr,size> based regexec variant when the grep machinery was
written, and there is no longer any reason to use the "temporarily
terminate by swapping the byte with a NUL" trick.

It always is a pleasure to read such a concise and to-the-point
summary.  With a clear summary like that, a reader almost does not
have to see the patch to guess how the rest of the story goes ;-)