mailinfo.c::convert_to_utf8: reuse strlen info

Message ID	20200418035449.9450-1-congdanhqx@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=qItC=6C=vger.kernel.org=git-owner@kernel.org> From: =?utf-8?b?xJBvw6BuIFRy4bqnbiBDw7RuZyBEYW5o?= <congdanhqx@gmail.com> To: git@vger.kernel.org Cc: =?utf-8?b?xJBvw6BuIFRy4bqnbiBDw7RuZyBEYW5o?= <congdanhqx@gmail.com> Subject: [PATCH] mailinfo.c::convert_to_utf8: reuse strlen info Date: Sat, 18 Apr 2020 10:54:49 +0700 Message-Id: <20200418035449.9450-1-congdanhqx@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: git-owner@vger.kernel.org Precedence: bulk
Series	mailinfo.c::convert_to_utf8: reuse strlen info \| expand mailinfo.c::convert_to_utf8: reuse strlen info

Đoàn Trần Công Danh April 18, 2020, 3:54 a.m. UTC

We're passing buffer from strbuf to reencode_string,
which will call strlen(3) on that buffer,
and discard the length of newly created buffer.

Then, we compute the length of the return buffer to attach to strbuf.

We can do better by reusing all the available information,
and call the underlying lower level function that will be called
indirectly by reencode_string.

Signed-off-by: Đoàn Trần Công Danh <congdanhqx@gmail.com>
---
 mailinfo.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Martin Ågren April 18, 2020, 7:56 p.m. UTC | #1

Hi Danh (I think that's how you want to be addressed?),

On Sat, 18 Apr 2020 at 06:00, Đoàn Trần Công Danh <congdanhqx@gmail.com> wrote:
> @@ -447,19 +447,21 @@ static int convert_to_utf8(struct mailinfo *mi,
>                            struct strbuf *line, const char *charset)
>  {
>         char *out;
> +       size_t out_len;
>
>         if (!mi->metainfo_charset || !charset || !*charset)
>                 return 0;
>
>         if (same_encoding(mi->metainfo_charset, charset))
>                 return 0;
> -       out = reencode_string(line->buf, mi->metainfo_charset, charset);
> +       out = reencode_string_len(line->buf, line->len,
> +                                 mi->metainfo_charset, charset, &out_len);

This is equivalent as long as `line->len` is equal to
`strlen(line->buf)`, which it will be (should be) because it's a
strbuf. Ok.

>         if (!out) {
>                 mi->input_error = -1;
>                 return error("cannot convert from %s to %s",
>                              charset, mi->metainfo_charset);
>         }
> -       strbuf_attach(line, out, strlen(out), strlen(out));
> +       strbuf_attach(line, out, out_len, out_len);

This conversion is ok as such. I wondered why we pass in the same
value twice (before and after this patch). Turns out this usage is wrong
(as per the documentation in strbuf.h) but safe (as per my understanding
of the implementation in strbuf.c). I'll follow up with a series that
fell out of that investigation.

>         return 0;
>  }

All in all, this conversion is correct and it doesn't leave the
use of `strbuf_attach()` any less correct than it already was.

Martin

Junio C Hamano April 18, 2020, 11:12 p.m. UTC | #2

Martin Ågren <martin.agren@gmail.com> writes:

> This is equivalent as long as `line->len` is equal to
> `strlen(line->buf)`, which it will be (should be) because it's a
> strbuf. Ok.

For the guarantee to hold true, line->buf[0..line->len] should not
have any '\0' byte in it.  

This helper has two callers, but in either case, it needs to be
prepared to work on output of decode_[bq]_segment().  Is there code
anywhere that guarntees that the decoding won't stuff '\0' in the
line?

Đoàn Trần Công Danh April 19, 2020, 2:48 a.m. UTC | #3

On 2020-04-18 16:12:06-0700, Junio C Hamano <gitster@pobox.com> wrote:
> Martin Ågren <martin.agren@gmail.com> writes:
> 
> > This is equivalent as long as `line->len` is equal to
> > `strlen(line->buf)`, which it will be (should be) because it's a
> > strbuf. Ok.
> 
> For the guarantee to hold true, line->buf[0..line->len] should not
> have any '\0' byte in it.  
> 
> This helper has two callers, but in either case, it needs to be
> prepared to work on output of decode_[bq]_segment().  Is there code
> anywhere that guarntees that the decoding won't stuff '\0' in the
> line?

- First caller: `decode_header`, we don't have such guarantee,
  The old code will discard everything after first NUL characters.
  I'm _not_ really familiar with email standard, does email allow
  UTF-16 (albeit in [bq]-encoded) in header? If yes, the current code
  is likely disallow it. The new code accidentally, accept
  [bq]-encoded-utf-16 header and reencode to utf-8 for commit.
  Yes, it's very likely only UTF-8, ISO-8859-1, some variant of
  ISO-2022 is used nowaday in email.
  *If* we would like to not exclude UTF-16, the new question is should
  we trust the length of newly converted utf-8 string.

- Second caller: `handle_commit_msg`: everything get interesting from
  here.

Get back to the question of trusting the length of newly converted
utf-8 string.

I _think_ we should,
because I _think_ there shouldn't be any discrimination against any
encoding (be it utf-8, ISO-8859-1, etc...),
the current code allows NUL character in utf-8 [bq]-encoded string
in this function (early return) and its caller,
and report an error later:

	error: a NUL byte in commit log message not allowed.

meanwhile, if the email was sent in other encoding, the current code 
discards everything after NUL in that line,
thus silently accept broken commit message.

Attached is the faulty patch in ISO-8859-1, which was used to
demonstrate my words.
The current code will make a commit with only "Áb" in the body,
while the new code rightly claims we have a faulty email.

Martin Ågren April 19, 2020, 4:34 a.m. UTC | #4

On Sun, 19 Apr 2020 at 04:48, Danh Doan <congdanhqx@gmail.com> wrote:
>
> On 2020-04-18 16:12:06-0700, Junio C Hamano <gitster@pobox.com> wrote:
> > Martin Ågren <martin.agren@gmail.com> writes:
> >
> > > This is equivalent as long as `line->len` is equal to
> > > `strlen(line->buf)`, which it will be (should be) because it's a
> > > strbuf. Ok.
> >
> > For the guarantee to hold true, line->buf[0..line->len] should not
> > have any '\0' byte in it.

Yes. I sort of assumed it shouldn't ("because strbuf"), but it's a good
question. Especially considering this is about various encodings..

This assumption that "strlen is length so the conversion is a no-op"
could potentially be broken both for input (line->len) and output
(out_len).

> > This helper has two callers, but in either case, it needs to be
> > prepared to work on output of decode_[bq]_segment().  Is there code
> > anywhere that guarntees that the decoding won't stuff '\0' in the
> > line?
>
> the current code allows NUL character in utf-8 [bq]-encoded string
> in this function (early return) and its caller,
> and report an error later:
>
>         error: a NUL byte in commit log message not allowed.
>
> meanwhile, if the email was sent in other encoding, the current code
> discards everything after NUL in that line,
> thus silently accept broken commit message.

My knee-jerk reaction to Junio's question was along the same line:
surely if we could have a NUL in there, the current `strlen()` would use
it as an excuse to silently truncate the string, either before
processing or afterwards. Thanks for looking into that more.

> Attached is the faulty patch in ISO-8859-1, which was used to
> demonstrate my words.
> The current code will make a commit with only "Áb" in the body,
> while the new code rightly claims we have a faulty email.

Good find.

Martin

Junio C Hamano April 19, 2020, 5:32 a.m. UTC | #5

Martin Ågren <martin.agren@gmail.com> writes:

> My knee-jerk reaction to Junio's question was along the same line:
> surely if we could have a NUL in there, the current `strlen()` would use
> it as an excuse to silently truncate the string, either before
> processing or afterwards. Thanks for looking into that more.

Exactly.

The change introduces a behaviour change, and we should describe it
in the log message to help future developers know that we did this
change, knowingly that we are changing the behaviour, and believing
that the new behaviour is a better one.

Thanks.

mailinfo.c::convert_to_utf8: reuse strlen info

Commit Message

Comments

Patch