Message ID | 20200418035449.9450-1-congdanhqx@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mailinfo.c::convert_to_utf8: reuse strlen info | expand |
Hi Danh (I think that's how you want to be addressed?), On Sat, 18 Apr 2020 at 06:00, Đoàn Trần Công Danh <congdanhqx@gmail.com> wrote: > @@ -447,19 +447,21 @@ static int convert_to_utf8(struct mailinfo *mi, > struct strbuf *line, const char *charset) > { > char *out; > + size_t out_len; > > if (!mi->metainfo_charset || !charset || !*charset) > return 0; > > if (same_encoding(mi->metainfo_charset, charset)) > return 0; > - out = reencode_string(line->buf, mi->metainfo_charset, charset); > + out = reencode_string_len(line->buf, line->len, > + mi->metainfo_charset, charset, &out_len); This is equivalent as long as `line->len` is equal to `strlen(line->buf)`, which it will be (should be) because it's a strbuf. Ok. > if (!out) { > mi->input_error = -1; > return error("cannot convert from %s to %s", > charset, mi->metainfo_charset); > } > - strbuf_attach(line, out, strlen(out), strlen(out)); > + strbuf_attach(line, out, out_len, out_len); This conversion is ok as such. I wondered why we pass in the same value twice (before and after this patch). Turns out this usage is wrong (as per the documentation in strbuf.h) but safe (as per my understanding of the implementation in strbuf.c). I'll follow up with a series that fell out of that investigation. > return 0; > } All in all, this conversion is correct and it doesn't leave the use of `strbuf_attach()` any less correct than it already was. Martin
Martin Ågren <martin.agren@gmail.com> writes: > This is equivalent as long as `line->len` is equal to > `strlen(line->buf)`, which it will be (should be) because it's a > strbuf. Ok. For the guarantee to hold true, line->buf[0..line->len] should not have any '\0' byte in it. This helper has two callers, but in either case, it needs to be prepared to work on output of decode_[bq]_segment(). Is there code anywhere that guarntees that the decoding won't stuff '\0' in the line?
On 2020-04-18 16:12:06-0700, Junio C Hamano <gitster@pobox.com> wrote: > Martin Ågren <martin.agren@gmail.com> writes: > > > This is equivalent as long as `line->len` is equal to > > `strlen(line->buf)`, which it will be (should be) because it's a > > strbuf. Ok. > > For the guarantee to hold true, line->buf[0..line->len] should not > have any '\0' byte in it. > > This helper has two callers, but in either case, it needs to be > prepared to work on output of decode_[bq]_segment(). Is there code > anywhere that guarntees that the decoding won't stuff '\0' in the > line? - First caller: `decode_header`, we don't have such guarantee, The old code will discard everything after first NUL characters. I'm _not_ really familiar with email standard, does email allow UTF-16 (albeit in [bq]-encoded) in header? If yes, the current code is likely disallow it. The new code accidentally, accept [bq]-encoded-utf-16 header and reencode to utf-8 for commit. Yes, it's very likely only UTF-8, ISO-8859-1, some variant of ISO-2022 is used nowaday in email. *If* we would like to not exclude UTF-16, the new question is should we trust the length of newly converted utf-8 string. - Second caller: `handle_commit_msg`: everything get interesting from here. Get back to the question of trusting the length of newly converted utf-8 string. I _think_ we should, because I _think_ there shouldn't be any discrimination against any encoding (be it utf-8, ISO-8859-1, etc...), the current code allows NUL character in utf-8 [bq]-encoded string in this function (early return) and its caller, and report an error later: error: a NUL byte in commit log message not allowed. meanwhile, if the email was sent in other encoding, the current code discards everything after NUL in that line, thus silently accept broken commit message. Attached is the faulty patch in ISO-8859-1, which was used to demonstrate my words. The current code will make a commit with only "Áb" in the body, while the new code rightly claims we have a faulty email.
On Sun, 19 Apr 2020 at 04:48, Danh Doan <congdanhqx@gmail.com> wrote: > > On 2020-04-18 16:12:06-0700, Junio C Hamano <gitster@pobox.com> wrote: > > Martin Ågren <martin.agren@gmail.com> writes: > > > > > This is equivalent as long as `line->len` is equal to > > > `strlen(line->buf)`, which it will be (should be) because it's a > > > strbuf. Ok. > > > > For the guarantee to hold true, line->buf[0..line->len] should not > > have any '\0' byte in it. Yes. I sort of assumed it shouldn't ("because strbuf"), but it's a good question. Especially considering this is about various encodings.. This assumption that "strlen is length so the conversion is a no-op" could potentially be broken both for input (line->len) and output (out_len). > > This helper has two callers, but in either case, it needs to be > > prepared to work on output of decode_[bq]_segment(). Is there code > > anywhere that guarntees that the decoding won't stuff '\0' in the > > line? > > the current code allows NUL character in utf-8 [bq]-encoded string > in this function (early return) and its caller, > and report an error later: > > error: a NUL byte in commit log message not allowed. > > meanwhile, if the email was sent in other encoding, the current code > discards everything after NUL in that line, > thus silently accept broken commit message. My knee-jerk reaction to Junio's question was along the same line: surely if we could have a NUL in there, the current `strlen()` would use it as an excuse to silently truncate the string, either before processing or afterwards. Thanks for looking into that more. > Attached is the faulty patch in ISO-8859-1, which was used to > demonstrate my words. > The current code will make a commit with only "Áb" in the body, > while the new code rightly claims we have a faulty email. Good find. Martin
Martin Ågren <martin.agren@gmail.com> writes: > My knee-jerk reaction to Junio's question was along the same line: > surely if we could have a NUL in there, the current `strlen()` would use > it as an excuse to silently truncate the string, either before > processing or afterwards. Thanks for looking into that more. Exactly. The change introduces a behaviour change, and we should describe it in the log message to help future developers know that we did this change, knowingly that we are changing the behaviour, and believing that the new behaviour is a better one. Thanks.
diff --git a/mailinfo.c b/mailinfo.c index 742fa376ab..0e9911df6d 100644 --- a/mailinfo.c +++ b/mailinfo.c @@ -447,19 +447,21 @@ static int convert_to_utf8(struct mailinfo *mi, struct strbuf *line, const char *charset) { char *out; + size_t out_len; if (!mi->metainfo_charset || !charset || !*charset) return 0; if (same_encoding(mi->metainfo_charset, charset)) return 0; - out = reencode_string(line->buf, mi->metainfo_charset, charset); + out = reencode_string_len(line->buf, line->len, + mi->metainfo_charset, charset, &out_len); if (!out) { mi->input_error = -1; return error("cannot convert from %s to %s", charset, mi->metainfo_charset); } - strbuf_attach(line, out, strlen(out), strlen(out)); + strbuf_attach(line, out, out_len, out_len); return 0; }
We're passing buffer from strbuf to reencode_string, which will call strlen(3) on that buffer, and discard the length of newly created buffer. Then, we compute the length of the return buffer to attach to strbuf. We can do better by reusing all the available information, and call the underlying lower level function that will be called indirectly by reencode_string. Signed-off-by: Đoàn Trần Công Danh <congdanhqx@gmail.com> --- mailinfo.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)