Message ID | YUlVZk1xXulAqdef@coredump.intra.peff.net (mailing list archive) |
---|---|
Series | const-correctness in grep.c |
On Mon, Sep 20, 2021 at 11:45:42PM -0400, Jeff King wrote:
> While discussing [1], I noticed that the grep code mostly takes
> non-const buffers, even though it is conceptually a read-only operation
> to search in them. The culprit is a handful of spots that temporarily
> tie off NUL-terminated strings by overwriting a byte of the buffer and
> then restoring it. But I think we no longer need to do so these days,
> now that we have a regexec_buf() that can take a ptr/size pair.

This all looks very reasonable to me. I appreciated the way you broke up
each spot that unnecessarily modified a buffer into its own patch with
its own explanation.

I looked through each of the three spots you mentioned with a close eye
and concurred with your reasoning. (To the extent it was possible, I
tried to ignore most of your commentary until I had generated my own
understanding while searching through my copy of grep.c).

Thanks for an enjoyable set of patches to read.

Reviewed-by: Taylor Blau <me@ttaylorr.com>

> [1] https://lore.kernel.org/git/YUk3zwuse56v76ze@coredump.intra.peff.net/

Thanks,
Taylor
On Mon, Sep 20 2021, Jeff King wrote:

> While discussing [1], I noticed that the grep code mostly takes
> non-const buffers, even though it is conceptually a read-only operation
> to search in them. The culprit is a handful of spots that temporarily
> tie off NUL-terminated strings by overwriting a byte of the buffer and
> then restoring it. But I think we no longer need to do so these days,
> now that we have a regexec_buf() that can take a ptr/size pair.
>
> The first three patches are a bit repetitive, but I broke them up
> individually because they're the high-risk part. I.e., if my assumptions
> about needing the NUL are wrong, it could introduce a bug. But based on
> my reading of the code, plus running the test suite with ASan/UBSan, I
> feel reasonably confident.
>
> The last two are the bigger cleanups, but should obviously avoid any
> behavior changes.
>
>   [1/5]: grep: stop modifying buffer in strip_timestamp
>   [2/5]: grep: stop modifying buffer in show_line()
>   [3/5]: grep: stop modifying buffer in grep_source_1()
>   [4/5]: grep: mark "haystack" buffers as const
>   [5/5]: grep: store grep_source buffer as const
>
>  grep.c | 87 +++++++++++++++++++++++++++++-----------------------------
>  grep.h |  4 +--
>  2 files changed, 45 insertions(+), 46 deletions(-)
>
> -Peff
>
> [1] https://lore.kernel.org/git/YUk3zwuse56v76ze@coredump.intra.peff.net/

This whole thing looks good to me. I only found a small whitespace nit
in one of the patches. Did you consider following up by having this code
take const char*/const size_t pairs? E.g. starting with something like
the below.

When this API is called, it's called like that, and the regex functions
at the bottom expect that, but we have all the bol/eol twiddling in the
middle, which is often confusing because some functions pass the
pointers along as-is, and some modify them.
So not for now, but I think rolling with what I started here below would
make sense for this file eventually:

diff --git a/grep.c b/grep.c
index 14fe8a0fd23..f55ec5c0e09 100644
--- a/grep.c
+++ b/grep.c
@@ -436,7 +436,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 	}
 }

-static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
+static int pcre2match(struct grep_pat *p, const char *line, const size_t len,
 		regmatch_t *match, int eflags)
 {
 	int ret, flags = 0;
@@ -448,11 +448,11 @@ static int pcre2match(struct grep_pat *p, const char *line, const char *eol,

 	if (p->pcre2_jit_on)
 		ret = pcre2_jit_match(p->pcre2_pattern, (unsigned char *)line,
-				eol - line, 0, flags, p->pcre2_match_data,
+				len, 0, flags, p->pcre2_match_data,
 				NULL);
 	else
 		ret = pcre2_match(p->pcre2_pattern, (unsigned char *)line,
-				eol - line, 0, flags, p->pcre2_match_data,
+				len, 0, flags, p->pcre2_match_data,
 				NULL);

 	if (ret < 0 && ret != PCRE2_ERROR_NOMATCH) {
@@ -909,15 +909,15 @@ static void show_name(struct grep_opt *opt, const char *name)
 }

 static int patmatch(struct grep_pat *p,
-		    const char *line, const char *eol,
+		    const char *line, const size_t len,
 		    regmatch_t *match, int eflags)
 {
 	int hit;

 	if (p->pcre2_pattern)
-		hit = !pcre2match(p, line, eol, match, eflags);
+		hit = !pcre2match(p, line, len, match, eflags);
 	else
-		hit = !regexec_buf(&p->regexp, line, eol - line, 1, match,
+		hit = !regexec_buf(&p->regexp, line, len, 1, match,
 				   eflags);

 	return hit;
@@ -976,7 +976,7 @@ static int match_one_pattern(struct grep_pat *p,
 	}

  again:
-	hit = patmatch(p, bol, eol, pmatch, eflags);
+	hit = patmatch(p, bol, eol - bol, pmatch, eflags);
 	if (hit && p->word_regexp) {
 		if ((pmatch[0].rm_so < 0) ||
@@ -1447,7 +1447,7 @@ static int look_ahead(struct grep_opt *opt,
 		int hit;
 		regmatch_t m;

-		hit = patmatch(p, bol, bol + *left_p, &m, 0);
+		hit = patmatch(p, bol, *left_p, &m, 0);
 		if (!hit || m.rm_so < 0 || m.rm_eo < 0)
 			continue;
 		if (earliest < 0 || m.rm_so < earliest)
On Tue, Sep 21, 2021 at 02:07:08PM +0200, Ævar Arnfjörð Bjarmason wrote:

> This whole thing looks good to me. I only found a small whitespace nit
> in one of the patches. Did you consider following-up by having this code
> take const char*/const size_t pairs. E.g. starting with something like
> the below.

I do generally find ptr/len pairs to be easier to read, but they also
make it really easy to introduce subtle bugs. E.g., if you consume part
of a buffer, you have to tweak both the ptr and the len. So the current:

  while (word_char(bol[-1]) && bol < eol)
          bol++;

has to become:

  while (word_char(bol[-1]) && len > 0) {
          bol++;
          len--;
  }

So I'd be hesitant to churn battle-tested code in such a way for what I
consider to be a pretty minor benefit.

I did notice the ugly use of "unsigned long" here in a few places
(rather than size_t). I do think it is worth fixing, but it seemed a
little too far to try to cram into this series (it's obviously touching
the same lines, but it's quite orthogonal semantically).

The other hesitation I had is that the source of this "unsigned long"
pattern is almost certainly the object code (which is much more
important to convert, as it blocks people from having >4GB objects on
Windows). So we might want to just wait for a larger conversion there.
OTOH, I don't think there is any downside to a partial conversion here
in the meantime (because size_t will always be at least as long as
"unsigned long" in practice).
> -static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
> +static int pcre2match(struct grep_pat *p, const char *line, const size_t len,
>  		regmatch_t *match, int eflags)
>  {
>  	int ret, flags = 0;
> @@ -448,11 +448,11 @@ static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
>
>  	if (p->pcre2_jit_on)
>  		ret = pcre2_jit_match(p->pcre2_pattern, (unsigned char *)line,
> -				eol - line, 0, flags, p->pcre2_match_data,
> +				len, 0, flags, p->pcre2_match_data,
>  				NULL);
>  	else
>  		ret = pcre2_match(p->pcre2_pattern, (unsigned char *)line,
> -				eol - line, 0, flags, p->pcre2_match_data,
> +				len, 0, flags, p->pcre2_match_data,
>  				NULL);

Not related to your point, but these casts are funny now. They are meant
to cast to "unsigned char" pointers to match pcre's signature, but now
they are casting away const-ness, too. That might be worth fixing as
part of this series.

Though should they really be casting to PCRE2_SPTR? The types are opaque
in their API because of the weird multi-width thing, though I find it
hard to imagine us ever using the wider versions of the library.

-Peff
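The ptr/len hazard raised in the reply above can be seen in a small standalone sketch. This is not code from grep.c: `is_word_char()`, `skip_word_ptr()`, and `skip_word_len()` are hypothetical helpers invented for illustration, and the loop tests the current byte rather than `bol[-1]` so that it is safe on any buffer. The point is only that the ptr/end form moves one variable, while the ptr/len form must keep two in sync.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for grep.c's word_char(); not the real definition. */
static int is_word_char(char ch)
{
	return isalnum((unsigned char)ch) || ch == '_';
}

/* ptr/end style: only the start pointer moves; the end stays fixed. */
static const char *skip_word_ptr(const char *bol, const char *eol)
{
	assert(bol <= eol);
	while (bol < eol && is_word_char(*bol))
		bol++;
	return bol;
}

/*
 * ptr/len style: every step must advance the pointer AND shrink the
 * length. Forgetting the (*len)-- leaves the caller with a length that
 * runs past the end of the buffer.
 */
static const char *skip_word_len(const char *bol, size_t *len)
{
	while (*len > 0 && is_word_char(*bol)) {
		bol++;
		(*len)--;
	}
	return bol;
}
```

Both helpers stop at the same byte for the same input; the second additionally reports the remaining length through `len`, which is exactly the extra bookkeeping the reply is wary of.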
Jeff King <peff@peff.net> writes:

> While discussing [1], I noticed that the grep code mostly takes
> non-const buffers, even though it is conceptually a read-only operation
> to search in them. The culprit is a handful of spots that temporarily
> tie off NUL-terminated strings by overwriting a byte of the buffer and
> then restoring it. But I think we no longer need to do so these days,
> now that we have a regexec_buf() that can take a ptr/size pair.

Yes, the haystack has not been read-only exactly because we didn't have
a <ptr,size>-based regexec variant when the grep machinery was written,
and there is no longer any reason to use the "temporarily terminate by
swapping the byte with a NUL" trick.

It is always a pleasure to read such a concise and to-the-point summary.
With a clear summary like that, a reader almost does not have to see the
patch to guess how the rest of the story goes ;-)
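For readers who have not seen the two approaches side by side, here is a minimal sketch of the "swap in a NUL" trick next to a bounded match on a const buffer. It is not taken from grep.c: `match_nul_swap()` and `match_bounded()` are hypothetical names, and the bounded version assumes the platform's `regexec()` supports the non-POSIX `REG_STARTEND` flag (as glibc and the BSDs do), which is the kind of mechanism a ptr/size wrapper such as `regexec_buf()` can be built on.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Old style: the buffer must be writable, and buf[len] must be
 * addressable, so that a NUL can be planted and then restored.
 */
static int match_nul_swap(const regex_t *re, char *buf, size_t len)
{
	char saved = buf[len];
	int ret;

	buf[len] = '\0';
	ret = regexec(re, buf, 0, NULL, 0);
	buf[len] = saved;	/* put the original byte back */
	return ret == 0;
}

/*
 * Bounded style: the haystack can be const. Assumes REG_STARTEND is
 * available; with it, rm_so/rm_eo of pmatch[0] delimit the search range.
 */
static int match_bounded(const regex_t *re, const char *buf, size_t len)
{
	regmatch_t m;

	assert(re != NULL && buf != NULL);
	m.rm_so = 0;
	m.rm_eo = (regoff_t)len;
	return regexec(re, buf, 1, &m, REG_STARTEND) == 0;
}
```

With the bounded form, nothing in the call chain needs a non-const pointer, which is what allows the "haystack" parameters to be marked const.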