Message ID | 20240307092638.GK2080210@coredump.intra.peff.net (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | allow multi-byte core.commentChar | expand |
On Thu, Mar 07, 2024 at 04:26:38AM -0500, Jeff King wrote: > IMHO this is the trickiest commit of the whole series, as it would be > easy to get the length computations subtly wrong. And sure enough... > diff --git a/trailer.c b/trailer.c > index fe18faf6c5..f59c90b4b5 100644 > --- a/trailer.c > +++ b/trailer.c > @@ -882,7 +882,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len) > > /* The first paragraph is the title and cannot be trailers */ > for (s = buf; s < buf + len; s = next_line(s)) { > - if (s[0] == comment_line_char) > + if (starts_with_mem(s, buf + len - s, comment_line_str)) > continue; > if (is_blank_line(s)) > break; > @@ -902,7 +902,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len) > const char **p; > ssize_t separator_pos; > > - if (bol[0] == comment_line_char) { > + if (starts_with_mem(bol, buf + end_of_title - bol, comment_line_str)) { > non_trailer_lines += possible_continuation_lines; > possible_continuation_lines = 0; > continue; This second hunk needs: diff --git a/trailer.c b/trailer.c index f59c90b4b5..fdb0b8137e 100644 --- a/trailer.c +++ b/trailer.c @@ -902,7 +902,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len) const char **p; ssize_t separator_pos; - if (starts_with_mem(bol, buf + end_of_title - bol, comment_line_str)) { + if (starts_with_mem(bol, buf + len - bol, comment_line_str)) { non_trailer_lines += possible_continuation_lines; possible_continuation_lines = 0; continue; I was trying to bound the size based on the loop, which is: for (l = last_line(buf, len); l >= end_of_title; l = last_line(buf, l)) { const char *bol = buf + l; but I misread "end_of_title" as an upper bound, not a lower one. Which makes sense because we're iterating backwards over the lines. So I suppose we could bound it by the previous "bol" value. But in practice, your prefix won't cross such a boundary anyway, as it won't have a newline in it (maybe that's something we should enforce? I guess you could set core.commentChar to '\n' even before my series, which would be slightly insane). So just bounding ourselves to "buf + len" seems reasonable, as that makes sure we don't step outside the buffer passed into the function. Curiously, this was found by the sanitizer job in CI, where UBSan complains of integer overflow in the pointer computation. I had run with both ASan/UBSan locally, but just using gcc, which doesn't seem to find it (the CI job uses clang). So I'll that to my mental tally of "clang seems to be better with sanitizers". -Peff
Am 07.03.24 um 12:08 schrieb Jeff King: > On Thu, Mar 07, 2024 at 04:26:38AM -0500, Jeff King wrote: > >> IMHO this is the trickiest commit of the whole series, as it would be >> easy to get the length computations subtly wrong. > > And sure enough... > >> diff --git a/trailer.c b/trailer.c >> index fe18faf6c5..f59c90b4b5 100644 >> --- a/trailer.c >> +++ b/trailer.c >> @@ -882,7 +882,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len) >> >> /* The first paragraph is the title and cannot be trailers */ >> for (s = buf; s < buf + len; s = next_line(s)) { >> - if (s[0] == comment_line_char) >> + if (starts_with_mem(s, buf + len - s, comment_line_str)) >> continue; >> if (is_blank_line(s)) >> break; >> @@ -902,7 +902,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len) >> const char **p; >> ssize_t separator_pos; >> >> - if (bol[0] == comment_line_char) { >> + if (starts_with_mem(bol, buf + end_of_title - bol, comment_line_str)) { >> non_trailer_lines += possible_continuation_lines; >> possible_continuation_lines = 0; >> continue; > > This second hunk needs: > > diff --git a/trailer.c b/trailer.c > index f59c90b4b5..fdb0b8137e 100644 > --- a/trailer.c > +++ b/trailer.c > @@ -902,7 +902,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len) > const char **p; > ssize_t separator_pos; > > - if (starts_with_mem(bol, buf + end_of_title - bol, comment_line_str)) { > + if (starts_with_mem(bol, buf + len - bol, comment_line_str)) { > non_trailer_lines += possible_continuation_lines; > possible_continuation_lines = 0; > continue; > > I was trying to bound the size based on the loop, which is: > > for (l = last_line(buf, len); > l >= end_of_title; > l = last_line(buf, l)) { > const char *bol = buf + l; > > but I misread "end_of_title" as an upper bound, not a lower one. Which > makes sense because we're iterating backwards over the lines. So I > suppose we could bound it by the previous "bol" value. But in practice, > your prefix won't cross such a boundary anyway, as it won't have a > newline in it (maybe that's something we should enforce? I guess you > could set core.commentChar to '\n' even before my series, which would be > slightly insane). > > So just bounding ourselves to "buf + len" seems reasonable, as that > makes sure we don't step outside the buffer passed into the function. > > Curiously, this was found by the sanitizer job in CI, where UBSan > complains of integer overflow in the pointer computation. I had run with > both ASan/UBSan locally, but just using gcc, which doesn't seem to find > it (the CI job uses clang). So I'll that to my mental tally of "clang > seems to be better with sanitizers". > > -Peff
Am 07.03.24 um 10:26 schrieb Jeff King: > As with the previous patch, we need to swap out single-byte matching for > something like starts_with() to match all bytes of a multi-byte comment > character. But for cases where the buffer is not NUL-terminated (and we > instead have an explicit size or end pointer), it's not safe to use > starts_with(), as it might walk off the end of the buffer. > > Let's introduce a new starts_with_mem() that does the same thing but > also accepts the length of the "haystack" str and makes sure not to walk > past it. > > Note that in most cases the existing code did not need a length check at > all, since it was written in a way that knew we had at least one byte > available (and that was all we checked). So I had to read each one to > find the appropriate bounds. The one exception is sequencer.c's > add_commented_lines(), where we can actually get rid of the length > check. Just like starts_with(), our starts_with_mem() handles an empty > haystack variable by not matching (assuming a non-empty prefix). > > A few notes on the implementation of starts_with_mem(): > > - it would be equally correct to take an "end" pointer (and indeed, > many of the callers have this and have to subtract to come up with > the length). I think taking a ptr/size combo is a more usual > interface for our codebase, though, and has the added benefit that > the function signature makes it harder to mix up the three > parameters. > > - we could obviously build starts_with() on top of this by passing > strlen(str) as the length. But it's possible that starts_with() is a > relatively hot code path, and it should not pay that penalty (it can > generally return an answer proportional to the size of the prefix, > not the whole string). > > - it naively feels like xstrncmpz() should be able to do the same > thing, but that's not quite true. If you pass the length of the > haystack buffer, then strncmp() finds that a shorter prefix string > is "less than" than the haystack, even if the haystack starts with > the prefix. If you pass the length of the prefix, then you risk > reading past the end of the haystack if it is shorter than the > prefix. So I think we really do need a new function. Yes. xstrncmpz() compares a NUL-terminated string and a length-limited string. If you want to check whether the former is a prefix of the latter then you need to stop comparing when reaching its NUL, and also after exhausting the latter. So you need to take both lengths into account: int starts_with_mem(const char *str, size_t len, const char *prefix) { size_t prefixlen = strlen(prefix); return prefixlen <= len && !xstrncmpz(prefix, str, prefixlen); } Using memcmp() here is equivalent and simpler: int starts_with_mem(const char *str, size_t len, const char *prefix) { size_t prefixlen = strlen(prefix); return prefixlen <= len && !memcmp(str, prefix, prefixlen); } And your version below avoids function calls and avoids traversing the strings beyond their common prefix, of course. > > Signed-off-by: Jeff King <peff@peff.net> > --- > Arguably starts_with() and this new function should both be inlined, > like we do for skip_prefix(), but I think that's out of scope for this > series. Inlining would allow the compiler to unroll the loop for string constants. I doubt it would do that for variables, as in the code below. Inlining the strlen()+memcmp() version above might allow the compiler to push the strlen() call out of a loop. Would any of that improve performance noticeably? For the call sites below I doubt it. But it would probably increase the object text size. > And it's possible I was simply too dumb to figure out xstrncmpz() here. > I'm waiting for René to show up and tell me how to do it. ;) Nah, it's not a good fit, as it requires the two strings to have the same length. > > IMHO this is the trickiest commit of the whole series, as it would be > easy to get the length computations subtly wrong. > > commit.c | 3 ++- > sequencer.c | 4 ++-- > strbuf.c | 11 +++++++++++ > strbuf.h | 1 + > trailer.c | 4 ++-- > 5 files changed, 18 insertions(+), 5 deletions(-) > > diff --git a/commit.c b/commit.c > index ef679a0b93..531a666cba 100644 > --- a/commit.c > +++ b/commit.c > @@ -1796,7 +1796,8 @@ size_t ignored_log_message_bytes(const char *buf, size_t len) > else > next_line++; > > - if (buf[bol] == comment_line_char || buf[bol] == '\n') { > + if (starts_with_mem(buf + bol, cutoff - bol, comment_line_str) || > + buf[bol] == '\n') { > /* is this the first of the run of comments? */ > if (!boc) > boc = bol; > diff --git a/sequencer.c b/sequencer.c > index 991a2dbe96..664986e3b2 100644 > --- a/sequencer.c > +++ b/sequencer.c > @@ -1840,7 +1840,7 @@ static int is_fixup_flag(enum todo_command command, unsigned flag) > static void add_commented_lines(struct strbuf *buf, const void *str, size_t len) > { > const char *s = str; > - while (len > 0 && s[0] == comment_line_char) { > + while (starts_with_mem(s, len, comment_line_str)) { > size_t count; > const char *n = memchr(s, '\n', len); > if (!n) > @@ -2562,7 +2562,7 @@ static int parse_insn_line(struct repository *r, struct todo_item *item, > /* left-trim */ > bol += strspn(bol, " \t"); > > - if (bol == eol || *bol == '\r' || *bol == comment_line_char) { > + if (bol == eol || *bol == '\r' || starts_with_mem(bol, eol - bol, comment_line_str)) { If the strspn() call is safe (which it is, as the caller expects the string to be NUL-terminated) then you could use starts_with() here and avoid the length calculation. But that would also match comment_line_str values that contain LF, which the _mem version does not and that's better. Not sure why lines that start with CR are considered comment lines, though. > item->command = TODO_COMMENT; > item->commit = NULL; > item->arg_offset = bol - buf; > diff --git a/strbuf.c b/strbuf.c > index 7c8f582127..291bdc2a65 100644 > --- a/strbuf.c > +++ b/strbuf.c > @@ -24,6 +24,17 @@ int istarts_with(const char *str, const char *prefix) > return 0; > } > > +int starts_with_mem(const char *str, size_t len, const char *prefix) > +{ > + const char *end = str + len; > + for (; ; str++, prefix++) { > + if (!*prefix) > + return 1; > + else if (str == end || *str != *prefix) > + return 0; > + } > +} So this checks whether a length-limited string has a prefix given as a NUL-terminated string. I'd have called it mem_starts_with() and have expected starts_with_mem() to check a NUL-terminated string for a length-limited prefix (think !strncmp(str, prefix, prefixlen)). > + > int skip_to_optional_arg_default(const char *str, const char *prefix, > const char **arg, const char *def) > { > diff --git a/strbuf.h b/strbuf.h > index 58dddf2777..3156d6ea8c 100644 > --- a/strbuf.h > +++ b/strbuf.h > @@ -673,6 +673,7 @@ char *xstrfmt(const char *fmt, ...); > > int starts_with(const char *str, const char *prefix); > int istarts_with(const char *str, const char *prefix); > +int starts_with_mem(const char *str, size_t len, const char *prefix); > > /* > * If the string "str" is the same as the string in "prefix", then the "arg" > diff --git a/trailer.c b/trailer.c > index fe18faf6c5..f59c90b4b5 100644 > --- a/trailer.c > +++ b/trailer.c > @@ -882,7 +882,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len) > > /* The first paragraph is the title and cannot be trailers */ > for (s = buf; s < buf + len; s = next_line(s)) { > - if (s[0] == comment_line_char) > + if (starts_with_mem(s, buf + len - s, comment_line_str)) > continue; > if (is_blank_line(s)) Another case where starts_with() would be safe to use, as is_blank_line() expects (and gets) a NUL-terminated string, but it would allow matching comment_line_str values that contain LF. > break; > @@ -902,7 +902,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len) > const char **p; > ssize_t separator_pos; > > - if (bol[0] == comment_line_char) { > + if (starts_with_mem(bol, buf + end_of_title - bol, comment_line_str)) { We're in the same buffer, so the above comment applies here as well. > non_trailer_lines += possible_continuation_lines; > possible_continuation_lines = 0; > continue;
Am 07.03.24 um 20:41 schrieb René Scharfe: Sorry, sent too early. > Am 07.03.24 um 12:08 schrieb Jeff King: >> On Thu, Mar 07, 2024 at 04:26:38AM -0500, Jeff King wrote: >> >>> IMHO this is the trickiest commit of the whole series, as it would be >>> easy to get the length computations subtly wrong. >> >> And sure enough... >> >>> diff --git a/trailer.c b/trailer.c >>> index fe18faf6c5..f59c90b4b5 100644 >>> --- a/trailer.c >>> +++ b/trailer.c >>> @@ -882,7 +882,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len) >>> >>> /* The first paragraph is the title and cannot be trailers */ >>> for (s = buf; s < buf + len; s = next_line(s)) { >>> - if (s[0] == comment_line_char) >>> + if (starts_with_mem(s, buf + len - s, comment_line_str)) >>> continue; >>> if (is_blank_line(s)) >>> break; >>> @@ -902,7 +902,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len) >>> const char **p; >>> ssize_t separator_pos; >>> >>> - if (bol[0] == comment_line_char) { >>> + if (starts_with_mem(bol, buf + end_of_title - bol, comment_line_str)) { >>> non_trailer_lines += possible_continuation_lines; >>> possible_continuation_lines = 0; >>> continue; >> >> This second hunk needs: >> >> diff --git a/trailer.c b/trailer.c >> index f59c90b4b5..fdb0b8137e 100644 >> --- a/trailer.c >> +++ b/trailer.c >> @@ -902,7 +902,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len) >> const char **p; >> ssize_t separator_pos; >> >> - if (starts_with_mem(bol, buf + end_of_title - bol, comment_line_str)) { >> + if (starts_with_mem(bol, buf + len - bol, comment_line_str)) { >> non_trailer_lines += possible_continuation_lines; >> possible_continuation_lines = 0; >> continue; >> >> I was trying to bound the size based on the loop, which is: >> >> for (l = last_line(buf, len); >> l >= end_of_title; >> l = last_line(buf, l)) { >> const char *bol = buf + l; >> >> but I misread "end_of_title" as an upper bound, not a lower one. Which >> makes sense because we're iterating backwards over the lines. So I >> suppose we could bound it by the previous "bol" value. But in practice, >> your prefix won't cross such a boundary anyway, as it won't have a >> newline in it (maybe that's something we should enforce? I guess you >> could set core.commentChar to '\n' even before my series, which would be >> slightly insane). >> >> So just bounding ourselves to "buf + len" seems reasonable, as that >> makes sure we don't step outside the buffer passed into the function. If you don't want or expect LF in comment_line_str, better check it. And if you do that, most callers of starts_with_mem() -- including this one -- can use starts_with() instead, as mentioned in my reply to your patch. Less calculations, less errors.. René
Hi Peff and René On 07/03/2024 19:42, René Scharfe wrote: > Am 07.03.24 um 10:26 schrieb Jeff King: >> diff --git a/sequencer.c b/sequencer.c >> index 991a2dbe96..664986e3b2 100644 >> --- a/sequencer.c >> +++ b/sequencer.c >> @@ -1840,7 +1840,7 @@ static int is_fixup_flag(enum todo_command command, unsigned flag) >> static void add_commented_lines(struct strbuf *buf, const void *str, size_t len) >> { >> const char *s = str; >> - while (len > 0 && s[0] == comment_line_char) { >> + while (starts_with_mem(s, len, comment_line_str)) { >> size_t count; >> const char *n = memchr(s, '\n', len); >> if (!n) >> @@ -2562,7 +2562,7 @@ static int parse_insn_line(struct repository *r, struct todo_item *item, >> /* left-trim */ >> bol += strspn(bol, " \t"); >> >> - if (bol == eol || *bol == '\r' || *bol == comment_line_char) { >> + if (bol == eol || *bol == '\r' || starts_with_mem(bol, eol - bol, comment_line_str)) { > > If the strspn() call is safe (which it is, as the caller expects the > string to be NUL-terminated) then you could use starts_with() here and > avoid the length calculation. But that would also match > comment_line_str values that contain LF, which the _mem version does not > and that's better. I agree with your analysis. I do wonder though if we should reject whitespace and control characters when parsing core.commentChar, it feels like accepting them is a bug waiting to happen. If comment_line_char starts with ' ' or '\t' that part will be eaten by the strspn() above and so starts_with_mem() wont match. Also we will never match a comment if comment_line_str contains '\n'. > Not sure why lines that start with CR are considered comment lines, > though. I think it is a lazy way of looking for an empty line ending in CR LF, it should really be || (bol[0] == '\r' && bol[1] == '\n') || Best Wishes Phillip
Phillip Wood <phillip.wood123@gmail.com> writes: > I agree with your analysis. I do wonder though if we should reject > whitespace and control characters when parsing core.commentChar, it > feels like accepting them is a bug waiting to happen. If > comment_line_char starts with ' ' or '\t' that part will be eaten by > the strspn() above and so starts_with_mem() wont match. Also we will > never match a comment if comment_line_str contains '\n'. Another thing I was wondering is what we want to do a random byte-sequence that may match from the middle of a multi-byte UTF-8 character. The reason I haven't mentioned these "nonsense input" is because they will at worst only lead to self-denial-of-service to those who are too curious, and will fall into "don't do it then" category. Also, what exactly is the definition of "nonsense" will become can of worms. I can sympathise if somebody wants to use "#\t" to give themselves a bit more room than usual on the left for visibility, for example, so there might be a case to want whitespace characters. >> Not sure why lines that start with CR are considered comment lines, >> though. > > I think it is a lazy way of looking for an empty line ending in CR LF, > it should really be > > || (bol[0] == '\r' && bol[1] == '\n') || My recollection matches your speculation. IIRC the lazy persono was probably me but I didn't run "git blame".
On 08/03/2024 15:58, Junio C Hamano wrote: > Phillip Wood <phillip.wood123@gmail.com> writes: > >> I agree with your analysis. I do wonder though if we should reject >> whitespace and control characters when parsing core.commentChar, it >> feels like accepting them is a bug waiting to happen. If >> comment_line_char starts with ' ' or '\t' that part will be eaten by >> the strspn() above and so starts_with_mem() wont match. Also we will >> never match a comment if comment_line_str contains '\n'. > > Another thing I was wondering is what we want to do a random > byte-sequence that may match from the middle of a multi-byte UTF-8 > character. > > The reason I haven't mentioned these "nonsense input" is because > they will at worst only lead to self-denial-of-service to those who > are too curious, and will fall into "don't do it then" category. We could certainly leave it as-is and tell users they are only hurting themselves if they complain when it does not work. > Also, what exactly is the definition of "nonsense" will become can > of worms. I can sympathise if somebody wants to use "#\t" to give > themselves a bit more room than usual on the left for visibility, > for example, so there might be a case to want whitespace characters. That's fair, maybe we could just ban leading whitespace if we do decide to restrict core.commentChar Best Wishes Phillip >>> Not sure why lines that start with CR are considered comment lines, >>> though. >> >> I think it is a lazy way of looking for an empty line ending in CR LF, >> it should really be >> >> || (bol[0] == '\r' && bol[1] == '\n') || > > My recollection matches your speculation. > > IIRC the lazy persono was probably me but I didn't run "git blame".
On Thu, Mar 07, 2024 at 08:42:22PM +0100, René Scharfe wrote: > > Arguably starts_with() and this new function should both be inlined, > > like we do for skip_prefix(), but I think that's out of scope for this > > series. > > Inlining would allow the compiler to unroll the loop for string > constants. I doubt it would do that for variables, as in the code > below. > > Inlining the strlen()+memcmp() version above might allow the compiler > to push the strlen() call out of a loop. > > Would any of that improve performance noticeably? For the call sites > below I doubt it. But it would probably increase the object text size. Good point. With non-constant prefixes in these cases, it probably wouldn't buy much. There are a lot of other cases with actual string constants. A compiler in theory could turn starts_with(str, "foo") into a few instructions. But it's not even clear that it's in very many hot paths. It would definitely be something we'd have to measure. > > And it's possible I was simply too dumb to figure out xstrncmpz() here. > > I'm waiting for René to show up and tell me how to do it. ;) > > Nah, it's not a good fit, as it requires the two strings to have the > same length. Thanks for confirming I wasn't missing anything. :) > > @@ -2562,7 +2562,7 @@ static int parse_insn_line(struct repository *r, struct todo_item *item, > > /* left-trim */ > > bol += strspn(bol, " \t"); > > > > - if (bol == eol || *bol == '\r' || *bol == comment_line_char) { > > + if (bol == eol || *bol == '\r' || starts_with_mem(bol, eol - bol, comment_line_str)) { > > If the strspn() call is safe (which it is, as the caller expects the > string to be NUL-terminated) then you could use starts_with() here and > avoid the length calculation. But that would also match > comment_line_str values that contain LF, which the _mem version does not > and that's better. I try not to read too much into the use of string functions on what otherwise appears to be an unterminated buffer. While in Git it is quite often terminated at allocation time (coming from a strbuf, etc) I feel like I've fixed a number of out-of-bounds reads simply due to sloppy practices. And even if something is correct today, it is easy for it to change, since the assumption is made far away from allocation. So I dunno. Like you said, fewer computations is fewer opportunity to mess things up. I don't like the idea of introducing a new hand-grenade that might blow up later, but maybe if it's right next to a strspn() call that's already a problem, it's not materially making anything worse. > > +int starts_with_mem(const char *str, size_t len, const char *prefix) > > +{ > > + const char *end = str + len; > > + for (; ; str++, prefix++) { > > + if (!*prefix) > > + return 1; > > + else if (str == end || *str != *prefix) > > + return 0; > > + } > > +} > > So this checks whether a length-limited string has a prefix given as a > NUL-terminated string. I'd have called it mem_starts_with() and have > expected starts_with_mem() to check a NUL-terminated string for a > length-limited prefix (think !strncmp(str, prefix, prefixlen)). I was going for consistency with skip_prefix_mem() and strip_suffix_mem(). To be fair, I probably also named those ones, but I think it's pretty established. We've never needed the length-limited prefix variant yet, so I don't know that we're squatting on anything too valuable. > > @@ -882,7 +882,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len) > > > > /* The first paragraph is the title and cannot be trailers */ > > for (s = buf; s < buf + len; s = next_line(s)) { > > - if (s[0] == comment_line_char) > > + if (starts_with_mem(s, buf + len - s, comment_line_str)) > > continue; > > if (is_blank_line(s)) > > Another case where starts_with() would be safe to use, as > is_blank_line() expects (and gets) a NUL-terminated string, but it would > allow matching comment_line_str values that contain LF. Hmm. Yes, it is a NUL-terminated string always, but the caller has told us not to look past end_of_log_message(). I suspect that if there is no newline in comment_line_str() it's probably impossible to go past "len" (just because the end of the log surely ends with either a NUL or a newline). But it feels iffy to me. I dunno. -Peff
On Fri, Mar 08, 2024 at 04:20:12PM +0000, Phillip Wood wrote: > On 08/03/2024 15:58, Junio C Hamano wrote: > > Phillip Wood <phillip.wood123@gmail.com> writes: > > > > > I agree with your analysis. I do wonder though if we should reject > > > whitespace and control characters when parsing core.commentChar, it > > > feels like accepting them is a bug waiting to happen. If > > > comment_line_char starts with ' ' or '\t' that part will be eaten by > > > the strspn() above and so starts_with_mem() wont match. Also we will > > > never match a comment if comment_line_str contains '\n'. > > > > Another thing I was wondering is what we want to do a random > > byte-sequence that may match from the middle of a multi-byte UTF-8 > > character. > > > > The reason I haven't mentioned these "nonsense input" is because > > they will at worst only lead to self-denial-of-service to those who > > are too curious, and will fall into "don't do it then" category. > > We could certainly leave it as-is and tell users they are only hurting > themselves if they complain when it does not work. That was mostly my plan. To some degree I think this is orthogonal to my series. You can already set core.commentChar to space or newline, and I'm sure the results are not very good. Actually, I guess it is easy to try: git -c core.commentChar=$'\n' commit --allow-empty treats everything as not-a-comment. Maybe it's worth forbidding this at the start of the series, and then carrying it through. I really do think newline is the most special character here, just because it's obviously going to be meaningful to all of our line-oriented parsing. So you'll get weird results, as opposed to broken multibyte characters, where things would still work if you choose to consistently use them (and arguably we cannot even define "broken" as the user can use a different encoding). Likewise, I guess people might complain that their core.commentChar is NFD and their editor writes out NFC characters or something, and we don't match. I was hoping we could just punt on that and nobody would ever notice (certainly I think it is OK to punt for now and somebody who truly cares can make a utf8_starts_with() or similar). > > Also, what exactly is the definition of "nonsense" will become can > > of worms. I can sympathise if somebody wants to use "#\t" to give > > themselves a bit more room than usual on the left for visibility, > > for example, so there might be a case to want whitespace characters. > > That's fair, maybe we could just ban leading whitespace if we do decide to > restrict core.commentChar Leading whitespace actually does work, though I think you'd be slightly insane to use it. I'm currently using "! COMMENT !" (after using a unicode char for a few days). It's horribly ugly, but I wanted to see if any bugs cropped up (and vim's built-in git syntax highlighting colors it correctly ;) ). -Peff
Hi Peff On 12/03/2024 08:19, Jeff King wrote: > On Fri, Mar 08, 2024 at 04:20:12PM +0000, Phillip Wood wrote: >> We could certainly leave it as-is and tell users they are only hurting >> themselves if they complain when it does not work. > > That was mostly my plan. To some degree I think this is orthogonal to my > series. You can already set core.commentChar to space or newline, and > I'm sure the results are not very good. Actually, I guess it is easy to > try: > > git -c core.commentChar=$'\n' commit --allow-empty > > treats everything as not-a-comment. > > Maybe it's worth forbidding this at the start of the series, and then > carrying it through. I really do think newline is the most special > character here, just because it's obviously going to be meaningful to > all of our line-oriented parsing. So you'll get weird results, as > opposed to broken multibyte characters, where things would still work if > you choose to consistently use them (and arguably we cannot even define > "broken" as the user can use a different encoding). I agree newline is a special case compared to broken multibyte characters, I see you've disallowed it in v2 which seems like a good idea. > Likewise, I guess people might complain that their core.commentChar is > NFD and their editor writes out NFC characters or something, and we > don't match. I was hoping we could just punt on that and nobody would > ever notice (certainly I think it is OK to punt for now and somebody who > truly cares can make a utf8_starts_with() or similar). > >>> Also, what exactly is the definition of "nonsense" will become can >>> of worms. I can sympathise if somebody wants to use "#\t" to give >>> themselves a bit more room than usual on the left for visibility, >>> for example, so there might be a case to want whitespace characters. >> >> That's fair, maybe we could just ban leading whitespace if we do decide to >> restrict core.commentChar > > Leading whitespace actually does work, though I think you'd be slightly > insane to use it. For "git rebase" in only works if you edit the todo list with "git rebase --edit-todo" which calls strbuf_stripspace() and therefore parse_insn_line() never sees the comments. If you edit the todo list directly then it will error out. You can see this with git -c core.commentChar=' ' rebase -x 'echo " this is a comment" >>"$(git rev-parse --git-path rebase-merge/git-rebase-todo)"' HEAD^ which successfully picks HEAD but then gives error: invalid command 'this' when it tries to parse the todo list after the exec command is run. Given it is broken already I'm not sure we should worry about it here. In any case it is not clear how much we should worry about problems caused by users editing the todo list without using "git rebase --edit-todo". There is code in parse_insn_line() which is clearly there to handle direct editing of the file but I don't think it is tested and directly editing the file probably bypasses the rebase.missingCommitsCheck checks as well. Best Wishes Phillip > I'm currently using "! COMMENT !" (after using a unicode char for a few > days). It's horribly ugly, but I wanted to see if any bugs cropped up > (and vim's built-in git syntax highlighting colors it correctly ;) ). > > -Peff
On Tue, Mar 12, 2024 at 02:36:36PM +0000, phillip.wood123@gmail.com wrote: > > Leading whitespace actually does work, though I think you'd be slightly > > insane to use it. > > For "git rebase" in only works if you edit the todo list with "git rebase > --edit-todo" which calls strbuf_stripspace() and therefore parse_insn_line() > never sees the comments. If you edit the todo list directly then it will > error out. You can see this with > > git -c core.commentChar=' ' rebase -x 'echo " this is a comment" > >>"$(git rev-parse --git-path rebase-merge/git-rebase-todo)"' HEAD^ > > which successfully picks HEAD but then gives > > error: invalid command 'this' > > when it tries to parse the todo list after the exec command is run. Given it > is broken already I'm not sure we should worry about it here. In any case it > is not clear how much we should worry about problems caused by users editing > the todo list without using "git rebase --edit-todo". There is code in > parse_insn_line() which is clearly there to handle direct editing of the > file but I don't think it is tested and directly editing the file probably > bypasses the rebase.missingCommitsCheck checks as well. Ah, thanks for the example. I guess it's not too surprising that it can cause confusion. Given that it's an existing issue, I think my preference would be to leave it out of the series under discussion (given how long and complicated it is already), but I'd have no objection to tightening things further on top as a separate series. -Peff
Am 12.03.24 um 09:05 schrieb Jeff King: > On Thu, Mar 07, 2024 at 08:42:22PM +0100, René Scharfe wrote: > >>> @@ -2562,7 +2562,7 @@ static int parse_insn_line(struct repository *r, struct todo_item *item, >>> /* left-trim */ >>> bol += strspn(bol, " \t"); >>> >>> - if (bol == eol || *bol == '\r' || *bol == comment_line_char) { >>> + if (bol == eol || *bol == '\r' || starts_with_mem(bol, eol - bol, comment_line_str)) { >> >> If the strspn() call is safe (which it is, as the caller expects the >> string to be NUL-terminated) then you could use starts_with() here and >> avoid the length calculation. But that would also match >> comment_line_str values that contain LF, which the _mem version does not >> and that's better. > > I try not to read too much into the use of string functions on what > otherwise appears to be an unterminated buffer. While in Git it is quite > often terminated at allocation time (coming from a strbuf, etc) I feel > like I've fixed a number of out-of-bounds reads simply due to sloppy > practices. And even if something is correct today, it is easy for it to > change, since the assumption is made far away from allocation. > > So I dunno. Like you said, fewer computations is fewer opportunity to > mess things up. I don't like the idea of introducing a new hand-grenade > that might blow up later, but maybe if it's right next to a strspn() > call that's already a problem, it's not materially making anything > worse. Yeah, and my logic was flawed: If the caller somehow guarantees that a space or tab occurs before eol then the strspn() call is safe. Its presence doesn't guarantee NUL termination. parse_insn_line() would not be safe to use without that prerequisite, but that's a different matter.. >>> @@ -882,7 +882,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len) >>> >>> /* The first paragraph is the title and cannot be trailers */ >>> for (s = buf; s < buf + len; s = next_line(s)) { >>> - if (s[0] == comment_line_char) >>> + if (starts_with_mem(s, buf + len - s, comment_line_str)) >>> continue; >>> if (is_blank_line(s)) >> >> Another case where starts_with() would be safe to use, as >> is_blank_line() expects (and gets) a NUL-terminated string, but it would >> allow matching comment_line_str values that contain LF. > > Hmm. Yes, it is a NUL-terminated string always, but the caller has told > us not to look past end_of_log_message(). I suspect that if there is no > newline in comment_line_str() it's probably impossible to go past "len" > (just because the end of the log surely ends with either a NUL or a > newline). But it feels iffy to me. I dunno. Same flawed thinking on my part: As long as we're guaranteed a blank line in the buffer we won't walk past its end. That doesn't mean we can assume a NUL is present. But that's fragile. The code should use memchr() instead of strchrnul(). That's not the problem you set out to solve in your series, though, and you avoid making it worse by respecting the length limit in the code you change. #leftoverbits Keeping track of the remaining length increases code size and adds opportunities for mistakes. Not sure how to avoid it, however. Using eol instead of len at least avoids subtractions. tl;dr: Good patch (in v2). René
diff --git a/commit.c b/commit.c index ef679a0b93..531a666cba 100644 --- a/commit.c +++ b/commit.c @@ -1796,7 +1796,8 @@ size_t ignored_log_message_bytes(const char *buf, size_t len) else next_line++; - if (buf[bol] == comment_line_char || buf[bol] == '\n') { + if (starts_with_mem(buf + bol, cutoff - bol, comment_line_str) || + buf[bol] == '\n') { /* is this the first of the run of comments? */ if (!boc) boc = bol; diff --git a/sequencer.c b/sequencer.c index 991a2dbe96..664986e3b2 100644 --- a/sequencer.c +++ b/sequencer.c @@ -1840,7 +1840,7 @@ static int is_fixup_flag(enum todo_command command, unsigned flag) static void add_commented_lines(struct strbuf *buf, const void *str, size_t len) { const char *s = str; - while (len > 0 && s[0] == comment_line_char) { + while (starts_with_mem(s, len, comment_line_str)) { size_t count; const char *n = memchr(s, '\n', len); if (!n) @@ -2562,7 +2562,7 @@ static int parse_insn_line(struct repository *r, struct todo_item *item, /* left-trim */ bol += strspn(bol, " \t"); - if (bol == eol || *bol == '\r' || *bol == comment_line_char) { + if (bol == eol || *bol == '\r' || starts_with_mem(bol, eol - bol, comment_line_str)) { item->command = TODO_COMMENT; item->commit = NULL; item->arg_offset = bol - buf; diff --git a/strbuf.c b/strbuf.c index 7c8f582127..291bdc2a65 100644 --- a/strbuf.c +++ b/strbuf.c @@ -24,6 +24,17 @@ int istarts_with(const char *str, const char *prefix) return 0; } +int starts_with_mem(const char *str, size_t len, const char *prefix) +{ + const char *end = str + len; + for (; ; str++, prefix++) { + if (!*prefix) + return 1; + else if (str == end || *str != *prefix) + return 0; + } +} + int skip_to_optional_arg_default(const char *str, const char *prefix, const char **arg, const char *def) { diff --git a/strbuf.h b/strbuf.h index 58dddf2777..3156d6ea8c 100644 --- a/strbuf.h +++ b/strbuf.h @@ -673,6 +673,7 @@ char *xstrfmt(const char *fmt, ...); int starts_with(const char *str, const char *prefix); int istarts_with(const char *str, const char *prefix); +int starts_with_mem(const char *str, size_t len, const char *prefix); /* * If the string "str" is the same as the string in "prefix", then the "arg" diff --git a/trailer.c b/trailer.c index fe18faf6c5..f59c90b4b5 100644 --- a/trailer.c +++ b/trailer.c @@ -882,7 +882,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len) /* The first paragraph is the title and cannot be trailers */ for (s = buf; s < buf + len; s = next_line(s)) { - if (s[0] == comment_line_char) + if (starts_with_mem(s, buf + len - s, comment_line_str)) continue; if (is_blank_line(s)) break; @@ -902,7 +902,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len) const char **p; ssize_t separator_pos; - if (bol[0] == comment_line_char) { + if (starts_with_mem(bol, buf + end_of_title - bol, comment_line_str)) { non_trailer_lines += possible_continuation_lines; possible_continuation_lines = 0; continue;
As with the previous patch, we need to swap out single-byte matching for something like starts_with() to match all bytes of a multi-byte comment character. But for cases where the buffer is not NUL-terminated (and we instead have an explicit size or end pointer), it's not safe to use starts_with(), as it might walk off the end of the buffer. Let's introduce a new starts_with_mem() that does the same thing but also accepts the length of the "haystack" str and makes sure not to walk past it. Note that in most cases the existing code did not need a length check at all, since it was written in a way that knew we had at least one byte available (and that was all we checked). So I had to read each one to find the appropriate bounds. The one exception is sequencer.c's add_commented_lines(), where we can actually get rid of the length check. Just like starts_with(), our starts_with_mem() handles an empty haystack variable by not matching (assuming a non-empty prefix). A few notes on the implementation of starts_with_mem(): - it would be equally correct to take an "end" pointer (and indeed, many of the callers have this and have to subtract to come up with the length). I think taking a ptr/size combo is a more usual interface for our codebase, though, and has the added benefit that the function signature makes it harder to mix up the three parameters. - we could obviously build starts_with() on top of this by passing strlen(str) as the length. But it's possible that starts_with() is a relatively hot code path, and it should not pay that penalty (it can generally return an answer proportional to the size of the prefix, not the whole string). - it naively feels like xstrncmpz() should be able to do the same thing, but that's not quite true. If you pass the length of the haystack buffer, then strncmp() finds that a shorter prefix string is "less than" than the haystack, even if the haystack starts with the prefix. If you pass the length of the prefix, then you risk reading past the end of the haystack if it is shorter than the prefix. So I think we really do need a new function. Signed-off-by: Jeff King <peff@peff.net> --- Arguably starts_with() and this new function should both be inlined, like we do for skip_prefix(), but I think that's out of scope for this series. And it's possible I was simply too dumb to figure out xstrncmpz() here. I'm waiting for René to show up and tell me how to do it. ;) IMHO this is the trickiest commit of the whole series, as it would be easy to get the length computations subtly wrong. commit.c | 3 ++- sequencer.c | 4 ++-- strbuf.c | 11 +++++++++++ strbuf.h | 1 + trailer.c | 4 ++-- 5 files changed, 18 insertions(+), 5 deletions(-)