[v2,1/1] diff.c: When appropriate, use utf8_strwidth()

Message ID	20220827085007.20030-1-tboegi@web.de (mailing list archive)
State	Superseded
Headers	Return-Path: <git-owner@kernel.org> From: tboegi@web.de To: git@vger.kernel.org, alexander.s.m@gmail.com Cc: =?utf-8?q?Torsten_B=C3=B6gershausen?= <tboegi@web.de> Subject: [PATCH v2 1/1] diff.c: When appropriate, use utf8_strwidth() Date: Sat, 27 Aug 2022 10:50:07 +0200 Message-Id: <20220827085007.20030-1-tboegi@web.de> In-Reply-To: <CA+VDVVVmi99i6ZY64tg8RkVXDc5gOzQP_SH12zhDKRkUnhWFgw@mail.gmail.com> References: <CA+VDVVVmi99i6ZY64tg8RkVXDc5gOzQP_SH12zhDKRkUnhWFgw@mail.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Precedence: bulk

Torsten Bögershausen Aug. 27, 2022, 8:50 a.m. UTC

From: Torsten Bögershausen <tboegi@web.de>

When Unicode filenames (encoded in UTF-8) are used, the visible width
on the screen is not the same as strlen(filename).

For example, `git log --stat` may produce an output like this:

$ git log --stat

[snip the header]

 Arger.txt  | 1 +
 Ärger.txt | 1 +
 2 files changed, 2 insertions(+)

A side note: the original report was about Cyrillic filenames.
After some investigation it turned out that
a) This is not a problem with "ambiguous characters" in Unicode
b) The same problem exist for all unicode code points (so we
  can use Latin based Umlauts for demonstrations below)

The 'Ä' takes the same space on the screen as the 'A',
but needs one more byte in memory, so the `git log --stat` output
for "Arger.txt" (!) gets mis-aligned:
The maximum length is derived from "Ärger.txt", which is 10 bytes in
memory but 9 positions on the screen. "Arger.txt" needs only 9 bytes in
memory, which is why it gets one extra ' ' for alignment.
If there was a file "Ö", it would be correctly aligned by chance,
but "Öhö" would not.

The solution is, of course, to use utf8_strwidth() instead of strlen()
when dealing with the width on screen.
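
A minimal sketch of the difference between the byte count and the column count. The helper columns_simple() is hypothetical and is not git's utf8_strwidth(): it only counts code points by skipping UTF-8 continuation bytes, which matches the screen width only under the assumption that every code point occupies exactly one column (true for the Latin umlauts used here, not for wide or combining characters).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for utf8_strwidth(): count code points by
 * skipping UTF-8 continuation bytes (bit pattern 10xxxxxx).
 * Assumption: every code point is exactly one column wide.
 */
static size_t columns_simple(const char *s)
{
	size_t n = 0;

	assert(s != NULL);
	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;	/* first byte of a code point */
	return n;
}
```

For "Ärger.txt" (the bytes 0xC3 0x84 followed by "rger.txt"), strlen() returns 10 while columns_simple() returns 9; for "Arger.txt" both return 9.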

And then there is another problem: code like this
strbuf_addf(&out, "%-*s", len, name);

(or using the underlying snprintf() function) pads the buffer to a
minimum of len bytes in memory, not len columns on the screen; the two
differ if name is UTF-8 encoded.

We could be tempted to wish that snprintf() was UTF-8 aware.
That doesn't seem to be the case anywhere (tested on Linux and Mac).
Probably snprintf() uses the "bytes in memory"/strlen() approach to be
compatible with older versions, and this will never change.
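
The byte-based padding of "%-*s" can be observed with a small sketch. The function name spaces_added_by_printf() is made up for this illustration; it only reports how many spaces snprintf() appended for a given field width.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format `name` left-aligned in a field of `field` units, followed by
 * '|', and return how many padding spaces snprintf() inserted.
 * For "%s" the field width is measured in bytes, not screen columns.
 */
static size_t spaces_added_by_printf(const char *name, int field)
{
	char buf[128];
	int n = snprintf(buf, sizeof(buf), "%-*s|", field, name);

	assert(n > 0 && (size_t)n < sizeof(buf));
	return (size_t)n - 1 - strlen(name);	/* minus the '|' and the name */
}
```

With a field width of 10, "Arger.txt" (9 bytes) receives one space, while "Ärger.txt" (10 bytes, 9 columns) receives none, so the '|' ends up one column further left on the second line.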

The choosen solution is to split code in diff.c like this

strbuf_addf(&out, "%-*s", len, name);

into something like this:

size_t num_padding_spaces = 0;
// [snip]
if (len > utf8_strwidth(name))
    num_padding_spaces = len - utf8_strwidth(name);
strbuf_addf(&out, "%s", name);
if (num_padding_spaces)
    strbuf_addchars(&out, ' ', num_padding_spaces);

Tests:
Two things need to be tested:
- The calculation of the maximum width
- The calculation of num_padding_spaces

The name "textfile" is changed into "textfilë"; both have a width of 8.
If strlen() was used to get the maximum width, the shorter "binfile" would
have been mis-aligned:
 binfile   |  [snip]
 textfilë | [snip]

If only "binfile" were renamed into "binfilë":
 binfilë |  [snip]
 textfile | [snip]

In order to verify that the width is calculated correctly everywhere,
"binfile" is renamed into "binfïlë", giving 2 bytes more in strlen(),
"textfile" is renamed into "textfilë", giving 1 byte more in strlen(),
and the updated t4012-diff-binary.sh checks the correct alignment:
 binfïlë  | [snip]
 textfilë | [snip]

Reported-by: Alexander Meshcheryakov <alexander.s.m@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 diff.c                 | 37 +++++++++++++++++++++++--------------
 t/t4012-diff-binary.sh | 14 +++++++-------
 2 files changed, 30 insertions(+), 21 deletions(-)

--
2.34.0

Torsten Bögershausen Aug. 27, 2022, 8:54 a.m. UTC | #1

On Sat, Aug 27, 2022 at 10:50:07AM +0200, tboegi@web.de wrote:
> From: Torsten Bögershausen <tboegi@web.de>
>


> b) The same problem exist for all unicode code points (so we

That should be "exists". Let's see if there are more comments,
before sending a new patch.

Eric Sunshine Aug. 27, 2022, 9:50 a.m. UTC | #2

On Sat, Aug 27, 2022 at 4:58 AM Torsten Bögershausen <tboegi@web.de> wrote:
> On Sat, Aug 27, 2022 at 10:50:07AM +0200, tboegi@web.de wrote:
> > From: Torsten Bögershausen <tboegi@web.de>
> > b) The same problem exist for all unicode code points (so we
>
> That should be "exists". Let's see if there are more comments,
> before sending a new patch.

Here's one:

> The choosen solution is to split code in diff.c like this

s/choosen/chosen/

Johannes Schindelin Aug. 29, 2022, 12:04 p.m. UTC | #3

Hi Torsten,

On Sat, 27 Aug 2022, tboegi@web.de wrote:

> From: Torsten Bögershausen <tboegi@web.de>
>
> When unicode filenames (encoded in UTF-8) are used, the visible width
> on the screen is not the same as strlen(filename).
>
> For example, `git log --stat` may produce an output like this:
>
> $ git log --stat
>
> [snip the header]
>
>  Arger.txt  | 1 +
>  Ärger.txt | 1 +
>  2 files changed, 2 insertions(+)
>
> A side note: the original report was about cyrillic filenames.
> After some investigations it turned out that
> a) This is not a problem with "ambiguous characters" in unicode
> b) The same problem exist for all unicode code points (so we
>   can use Latin based Umlauts for demonstrations below)
>
> The 'Ä' takes the same space on the screen as the 'A'.
> But needs one more byte in memory, so the the `git log --stat` output
> for "Arger.txt" (!) gets mis-aligned:
> The maximum length is derived from "Ärger.txt", 10 bytes in memory,
> 9 positions on the screen. That is why "Arger.txt" gets one extra ' '
> for aligment, it needs 9 bytes in memory.
> If there was a file "Ö", it would be correctly aligned by chance,
> but "Öhö" would not.
>
> The solution is of course, to use utf8_strwidth() instead of strlen()
> when dealing with the width on screen.
>
> And then there is another problem: code like this
> strbuf_addf(&out, "%-*s", len, name);
>
> (or using the underlying snprintf() function) does not align the
> buffer to a minimum of len measured in screen-width, but uses the
> memory count, if name is UTF-8 encoded.
>
> We could be tempted to wish that snprintf() was UTF-8 aware.
> That doesn't seem to be the case anywhere (tested on Linux and Mac),
> probably snprintf() uses the "bytes in memory"/strlen() approach to be
> compatible with older versions and this will never change.

An interesting read so far, but...

>
> The choosen solution is to split code in diff.c like this
>
> strbuf_addf(&out, "%-*s", len, name);
>
> into something like this:
>
> size_t num_padding_spaces = 0;
> // [snip]
> if (len > utf8_strwidth(name))
>     num_padding_spaces = len - utf8_strwidth(name);
> strbuf_addf(&out, "%s", name);
> if (num_padding_spaces)
>     strbuf_addchars(&out, ' ', num_padding_spaces);

... this sounds like it would benefit from being refactored into a
separate function, e.g. `strbuf_add_padded(buf, utf8string)`, both for
readability as well as for self-documentation.

Also, it is unclear to me why we have to evaluate `utf8_strwidth()`
_twice_ and why we do not assign the result to a variable called `width`
and then have a conditional like

	if (width < len) /* pad to `len` columns */
		strbuf_addchars(&out, ' ' , len - width);

instead. That would sound more logical to me.

Besides, since the simple change from `strlen()` to `utf8_strwidth()` is
so different from changing `strbuf_addf(...)`, I would prefer to see them
split into two patches.

>
> Tests:
> Two things need to be tested:
> - The calculation of the maximum width
> - The calculation of num_padding_spaces
>
> The name "textfile" is changed into "textfilë", both have a width of 8.
> If strlen() was used, to get the maximum width, the shorter "binfile" would
> have been mis-aligned:
>  binfile   |  [snip]
>  textfilë | [snip]
>
> If only "binfile" would be renamed into "binfilë":
>  binfilë |  [snip]
>  textfile | [snip]
>
> In order to verify that the width is calculated correctly everywhere,
> "binfile" is renamed into "binfïlë", giving 2 bytes more in strlen()
> "textfile" is renamed into "textfilë", 1 byte more in strlen(),
> and the updated t4012-diff-binary.sh checks the correct aligment:
>  binfïlë  | [snip]
>  textfilë | [snip]

I wonder whether you can change only _one_ name and still verify the
correctness. When you make two changes at the same time, it is always
possible for one change to "cancel out" the other one, and therefore it is
harder to reason about the correctness of your patch.

Better keep it simple and change only one instance (personally, I would
have changed two letters in the longer one).

>
> Reported-by: Alexander Meshcheryakov <alexander.s.m@gmail.com>
> Signed-off-by: Torsten Bögershausen <tboegi@web.de>
> ---
>  diff.c                 | 37 +++++++++++++++++++++++--------------
>  t/t4012-diff-binary.sh | 14 +++++++-------
>  2 files changed, 30 insertions(+), 21 deletions(-)
>
> diff --git a/diff.c b/diff.c
> index 974626a621..cf38e1dc88 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -2591,7 +2591,7 @@ void print_stat_summary(FILE *fp, int files,
>  static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  {
>  	int i, len, add, del, adds = 0, dels = 0;
> -	uintmax_t max_change = 0, max_len = 0;
> +	uintmax_t max_change = 0, max_width = 0;

Why rename `max_len`, but not `len`? I would have expected (and agreed
with seeing) `len` to be renamed to `width`, too.

>  	int total_files = data->nr, count;
>  	int width, name_width, graph_width, number_width = 0, bin_width = 0;
>  	const char *reset, *add_c, *del_c;
> @@ -2620,9 +2620,9 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  			continue;
>  		}
>  		fill_print_name(file);
> -		len = strlen(file->print_name);
> -		if (max_len < len)
> -			max_len = len;
> +		len = utf8_strwidth(file->print_name);
> +		if (max_width < len)
> +			max_width = len;
>
>  		if (file->is_unmerged) {
>  			/* "Unmerged" is 8 characters */
> @@ -2646,7 +2646,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>
>  	/*
>  	 * We have width = stat_width or term_columns() columns total.
> -	 * We want a maximum of min(max_len, stat_name_width) for the name part.
> +	 * We want a maximum of min(max_width, stat_name_width) for the name part.
>  	 * We want a maximum of min(max_change, stat_graph_width) for the +- part.
>  	 * We also need 1 for " " and 4 + decimal_width(max_change)
>  	 * for " | NNNN " and one the empty column at the end, altogether
> @@ -2701,8 +2701,8 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  		graph_width = options->stat_graph_width;
>
>  	name_width = (options->stat_name_width > 0 &&
> -		      options->stat_name_width < max_len) ?
> -		options->stat_name_width : max_len;
> +		      options->stat_name_width < max_width) ?
> +		options->stat_name_width : max_width;

It is a bit sad that the diff lines regarding the renamed variable drown
out the actual change (`strlen()` -> `utf8_strwidth()`). But the end
result is nicer.

Thank you for working on this!
Dscho

>
>  	/*
>  	 * Adjust adjustable widths not to exceed maximum width
> @@ -2734,6 +2734,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  		char *name = file->print_name;
>  		uintmax_t added = file->added;
>  		uintmax_t deleted = file->deleted;
> +		size_t num_padding_spaces = 0;
>  		int name_len;
>
>  		if (!file->is_interesting && (added + deleted == 0))
> @@ -2743,7 +2744,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  		 * "scale" the filename
>  		 */
>  		len = name_width;
> -		name_len = strlen(name);
> +		name_len = utf8_strwidth(name);
>  		if (name_width < name_len) {
>  			char *slash;
>  			prefix = "...";
> @@ -2753,10 +2754,14 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  			if (slash)
>  				name = slash;
>  		}
> +		if (len > utf8_strwidth(name))
> +			num_padding_spaces = len - utf8_strwidth(name);
>
>  		if (file->is_binary) {
> -			strbuf_addf(&out, " %s%-*s |", prefix, len, name);
> -			strbuf_addf(&out, " %*s", number_width, "Bin");
> +			strbuf_addf(&out, " %s%s ", prefix,  name);
> +			if (num_padding_spaces)
> +				strbuf_addchars(&out, ' ', num_padding_spaces);
> +			strbuf_addf(&out, "| %*s", number_width, "Bin");
>  			if (!added && !deleted) {
>  				strbuf_addch(&out, '\n');
>  				emit_diff_symbol(options, DIFF_SYMBOL_STATS_LINE,
> @@ -2776,8 +2781,10 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  			continue;
>  		}
>  		else if (file->is_unmerged) {
> -			strbuf_addf(&out, " %s%-*s |", prefix, len, name);
> -			strbuf_addstr(&out, " Unmerged\n");
> +			strbuf_addf(&out, " %s%s ", prefix,  name);
> +			if (num_padding_spaces)
> +				strbuf_addchars(&out, ' ', num_padding_spaces);
> +			strbuf_addstr(&out, "| Unmerged\n");
>  			emit_diff_symbol(options, DIFF_SYMBOL_STATS_LINE,
>  					 out.buf, out.len, 0);
>  			strbuf_reset(&out);
> @@ -2803,8 +2810,10 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  				add = total - del;
>  			}
>  		}
> -		strbuf_addf(&out, " %s%-*s |", prefix, len, name);
> -		strbuf_addf(&out, " %*"PRIuMAX"%s",
> +		strbuf_addf(&out, " %s%s ", prefix,  name);
> +		if (num_padding_spaces)
> +			strbuf_addchars(&out, ' ', num_padding_spaces);
> +		strbuf_addf(&out, "| %*"PRIuMAX"%s",
>  			number_width, added + deleted,
>  			added + deleted ? " " : "");
>  		show_graph(&out, '+', add, add_c, reset);
> diff --git a/t/t4012-diff-binary.sh b/t/t4012-diff-binary.sh
> index c509143c81..2d49de01c8 100755
> --- a/t/t4012-diff-binary.sh
> +++ b/t/t4012-diff-binary.sh
> @@ -113,20 +113,20 @@ test_expect_success 'diff --no-index with binary creation' '
>  '
>
>  cat >expect <<EOF
> - binfile  |   Bin 0 -> 1026 bytes
> - textfile | 10000 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> + binfïlë  |   Bin 0 -> 1026 bytes
> + textfilë | 10000 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  EOF
>
>  test_expect_success 'diff --stat with binary files and big change count' '
> -	printf "\01\00%1024d" 1 >binfile &&
> -	git add binfile &&
> +	printf "\01\00%1024d" 1 >binfïlë &&
> +	git add binfïlë &&
>  	i=0 &&
>  	while test $i -lt 10000; do
>  		echo $i &&
>  		i=$(($i + 1)) || return 1
> -	done >textfile &&
> -	git add textfile &&
> -	git diff --cached --stat binfile textfile >output &&
> +	done >textfilë &&
> +	git add textfilë &&
> +	git -c core.quotepath=false diff --cached --stat binfïlë textfilë >output &&
>  	grep " | " output >actual &&
>  	test_cmp expect actual
>  '
> --
> 2.34.0
>
>

Torsten Bögershausen Aug. 29, 2022, 5:54 p.m. UTC | #4

On Mon, Aug 29, 2022 at 02:04:42PM +0200, Johannes Schindelin wrote:
> Hi Torsten,
> >
> > The choosen solution is to split code in diff.c like this
> >
> > strbuf_addf(&out, "%-*s", len, name);
> >
> > into something like this:
> >
> > size_t num_padding_spaces = 0;
> > // [snip]
> > if (len > utf8_strwidth(name))
> >     num_padding_spaces = len - utf8_strwidth(name);
> > strbuf_addf(&out, "%s", name);
> > if (num_padding_spaces)
> >     strbuf_addchars(&out, ' ', num_padding_spaces);
>
> ... this sounds like it would benefit from beinv refactored into a
> separate function, e.g. `strbuf_add_padded(buf, utf8string)`, both for
> readability as well as for self-documentation.

Yes, but:
All (tm) strbuf() functions use an unsigned size_t, and are not
tolerant against passing 0 as "do nothing".
A nicer solution (for this patch) could be a change like this:
Instead of

void strbuf_addchars(struct strbuf *sb, int c, size_t n)
{
	strbuf_grow(sb, n);
	memset(sb->buf + sb->len, c, n);
	strbuf_setlen(sb, sb->len + n);
}

We would find:
void strbuf_addchars(struct strbuf *sb, int c, ssize_t n)
{
	if (n <= 0)
		return;
	strbuf_grow(sb, (size_t)n);
	memset(sb->buf + sb->len, c, (size_t)n);
	strbuf_setlen(sb, sb->len + (size_t)n);
}

I couldn't convince myself to do so.
Since it is mainly diff.c that needs this adjustment/padding of strings,
I couldn't convince myself to write another function in strbuf.c either.


>
> Also, it is unclear to me why we have to evaluate `utf8_strwidth()`
> _twice_ and why we do not assign the result to a variable called `width`
> and then have a conditional like
>
> 	if (width < len) /* pad to `len` columns */
> 		strbuf_addchars(&out, ' ' , len - width);
>
> instead. That would sound more logical to me.

This is caused by the logic in diff.c:
  /*
   * Find the longest filename and max number of changes
   */
   for (i = 0; (i < count) && (i < data->nr); i++) {
       struct diffstat_file *file = data->files[i];
       [snip]
       len = utf8_strwidth(file->print_name);
       if (max_width < len)
          max_width = len;
// and later
    /*
     * From here name_width is the width of the name area,
     * and graph_width is the width of the graph area.
     * max_change is used to scale graph properly.
     */
    for (i = 0; i < count; i++) {
    /*
     * "scale" the filename
     */
     // TB: Which means either shortening it with ...
     // Or padding it, if needed, and here we need
     // another
     name_len = utf8_strwidth(name);
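
The two-loop structure described above can be sketched outside of diff.c as follows. This is an illustration, not the code in show_stats(): compute_padding(), measure_fn and columns_simple() are hypothetical names, and columns_simple() stands in for utf8_strwidth() under the assumption that every code point is one column wide.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef size_t (*measure_fn)(const char *);

/* Stand-in for utf8_strwidth(); assumes one column per code point. */
static size_t columns_simple(const char *s)
{
	size_t n = 0;

	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}

/*
 * Pass 1 finds the widest name according to `measure`.
 * Pass 2 stores, for each name, the number of spaces needed to
 * reach that width. Returns the maximum width found.
 */
static size_t compute_padding(const char **names, size_t nr,
			      measure_fn measure, size_t *pad)
{
	size_t i, max_width = 0;

	assert(names != NULL && measure != NULL && pad != NULL);
	for (i = 0; i < nr; i++) {
		size_t w = measure(names[i]);
		if (max_width < w)
			max_width = w;
	}
	for (i = 0; i < nr; i++)
		pad[i] = max_width - measure(names[i]);
	return max_width;
}
```

Passing strlen as the measure for "Arger.txt" and "Ärger.txt" yields a maximum of 10 and paddings of 1 and 0, which is the mis-alignment from the report; passing columns_simple yields a maximum of 9 and no padding for either name.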

>
> Besides, since the simple change from `strlen()` to `utf8_strwidth()` is
> so different from changing `strbuf_addf(...)`, I would prefer to see them
> split into two patches.

Hm, that is a possibility. It seems to ease the burden for reviewers.

>
> >
> > Tests:
> > Two things need to be tested:
> > - The calculation of the maximum width
> > - The calculation of num_padding_spaces
> >
> > The name "textfile" is changed into "textfilë", both have a width of 8.
> > If strlen() was used, to get the maximum width, the shorter "binfile" would
> > have been mis-aligned:
> >  binfile   |  [snip]
> >  textfilë | [snip]
> >
> > If only "binfile" would be renamed into "binfilë":
> >  binfilë |  [snip]
> >  textfile | [snip]
> >
> > In order to verify that the width is calculated correctly everywhere,
> > "binfile" is renamed into "binfïlë", giving 2 bytes more in strlen()
> > "textfile" is renamed into "textfilë", 1 byte more in strlen(),
> > and the updated t4012-diff-binary.sh checks the correct aligment:
> >  binfïlë  | [snip]
> >  textfilë | [snip]
>
> I wonder whether you can change only _one_ name and still verify the
> correctness. When you make two changes at the same time, it is always
> possible for one change to "cancel out" the other one, and therefore it is
> harder to reason about the correctness of your patch.

No, I have a hard time seeing how a +/- 1 can "cancel out" a +/- 2.
But I may improve the commit message to make that clearer.

>
> Better keep it simple and change only one instance (personally,
> I would have changed two letters in the longer one).

That is certainly doable.


>
> >
> > Reported-by: Alexander Meshcheryakov <alexander.s.m@gmail.com>
> > Signed-off-by: Torsten Bögershausen <tboegi@web.de>
> > ---
> >  diff.c                 | 37 +++++++++++++++++++++++--------------
> >  t/t4012-diff-binary.sh | 14 +++++++-------
> >  2 files changed, 30 insertions(+), 21 deletions(-)
> >
> > diff --git a/diff.c b/diff.c
> > index 974626a621..cf38e1dc88 100644
> > --- a/diff.c
> > +++ b/diff.c
> > @@ -2591,7 +2591,7 @@ void print_stat_summary(FILE *fp, int files,
> >  static void show_stats(struct diffstat_t *data, struct diff_options *options)
> >  {
> >  	int i, len, add, del, adds = 0, dels = 0;
> > -	uintmax_t max_change = 0, max_len = 0;
> > +	uintmax_t max_change = 0, max_width = 0;
>
> Why rename `max_len`, but not `len`? I would have expected (and agreed
> with seeing) `len` to be renamed to `width`, too.

That is a valid point.
There is, however, already a variable called "width".
Should that one be renamed to something new as well?

>
> >  	int total_files = data->nr, count;
> >  	int width, name_width, graph_width, number_width = 0, bin_width = 0;
> >  	const char *reset, *add_c, *del_c;
> > @@ -2620,9 +2620,9 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
> >  			continue;
> >  		}
> >  		fill_print_name(file);
> > -		len = strlen(file->print_name);
> > -		if (max_len < len)
> > -			max_len = len;
> > +		len = utf8_strwidth(file->print_name);
> > +		if (max_width < len)
> > +			max_width = len;
> >
> >  		if (file->is_unmerged) {
> >  			/* "Unmerged" is 8 characters */
> > @@ -2646,7 +2646,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
> >
> >  	/*
> >  	 * We have width = stat_width or term_columns() columns total.
> > -	 * We want a maximum of min(max_len, stat_name_width) for the name part.
> > +	 * We want a maximum of min(max_width, stat_name_width) for the name part.
> >  	 * We want a maximum of min(max_change, stat_graph_width) for the +- part.
> >  	 * We also need 1 for " " and 4 + decimal_width(max_change)
> >  	 * for " | NNNN " and one the empty column at the end, altogether
> > @@ -2701,8 +2701,8 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
> >  		graph_width = options->stat_graph_width;
> >
> >  	name_width = (options->stat_name_width > 0 &&
> > -		      options->stat_name_width < max_len) ?
> > -		options->stat_name_width : max_len;
> > +		      options->stat_name_width < max_width) ?
> > +		options->stat_name_width : max_width;
>
> It is a bit sad that the diff lines regarding the renamed variable drown
> out the actual change (`strlen()` -> `utf8_strwidth()`). But the end
> result is nicer.
>
> Thank you for working on this!
> Dscho

Thanks so much for the review - let's see if I can make a better patch
in the next days (or rather, weeks).

Junio C Hamano Aug. 29, 2022, 6:37 p.m. UTC | #5

Torsten Bögershausen <tboegi@web.de> writes:

> This is caused by the logic in diff.c:
>   /*
>    * Find the longest filename and max number of changes
>    */
>    for (i = 0; (i < count) && (i < data->nr); i++) {
>        struct diffstat_file *file = data->files[i];
>        [snip]
>        len = utf8_strwidth(file->print_name);
>        if (max_width < len)
>           max_width = len;
> // and later
>     /*
>      * From here name_width is the width of the name area,
>      * and graph_width is the width of the graph area.
>      * max_change is used to scale graph properly.
>      */
>     for (i = 0; i < count; i++) {
>     /*
>      * "scale" the filename
>      */
>      // TB: Which means either shortening it with ...
>      // Or padding it, if needed, and here we need
>      // another
>      name_len = utf8_strwidth(name);
>
>>
>> Besides, since the simple change from `strlen()` to `utf8_strwidth()` is
>> so different from changing `strbuf_addf(...)`, I would prefer to see them
>> split into two patches.
>
> Hm, that is a possiblity. Seems to ease the burden for reviewers.

Another thing I remembered (this is a comment primarily on the
original I wrote based on the 'all world is ASCII' mindset that led to
the use of strlen() as a display-width indicator) in the code is
that we "abbreviate" an overly long pathname and transform renames
that originally are in the a/b/c -> a/B/c form into a/{b->B}/c form,
and IIRC they are all byte based.  The latter may be OK because the
transformation is limited to '/' boundaries, but the former may chomp
a single multi-byte letter in the middle, which would need to be
corrected as a part of this change.
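
To illustrate the hazard of cutting a name at a byte offset, here is a sketch of a tail cut that never lands inside a multi-byte sequence. It is not the abbreviation code in diff.c; tail_columns() is a hypothetical helper, and it assumes one column per code point.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return a pointer to the last `keep` code points of `s` (or to `s`
 * itself if it has fewer). Walks backwards and only stops on the
 * first byte of a code point, never on a continuation byte.
 * Assumption: one column per code point.
 */
static const char *tail_columns(const char *s, size_t keep)
{
	const char *p;

	assert(s != NULL);
	p = s + strlen(s);
	while (p > s && keep > 0) {
		p--;
		if (((unsigned char)*p & 0xC0) != 0x80)
			keep--;	/* reached the start of a code point */
	}
	return p;
}
```

For "Öhö" (5 bytes), a byte-based cut that keeps the last byte would start at the continuation byte 0xB6, whereas tail_columns(s, 1) starts at the two-byte sequence for 'ö'.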

Johannes Schindelin Sept. 2, 2022, 9:47 a.m. UTC | #6

Hi Torsten,

On Mon, 29 Aug 2022, Torsten Bögershausen wrote:

> On Mon, Aug 29, 2022 at 02:04:42PM +0200, Johannes Schindelin wrote:
> > >
> > > The choosen solution is to split code in diff.c like this
> > >
> > > strbuf_addf(&out, "%-*s", len, name);
> > >
> > > into something like this:
> > >
> > > size_t num_padding_spaces = 0;
> > > // [snip]
> > > if (len > utf8_strwidth(name))
> > >     num_padding_spaces = len - utf8_strwidth(name);
> > > strbuf_addf(&out, "%s", name);
> > > if (num_padding_spaces)
> > >     strbuf_addchars(&out, ' ', num_padding_spaces);
> >
> > ... this sounds like it would benefit from beinv refactored into a
> > separate function, e.g. `strbuf_add_padded(buf, utf8string)`, both for
> > readability as well as for self-documentation.
>
> Yes, but:
> All (tm) strbuf() functions use an unsigned size_t, and are not
> tolerant against passing 0 as "do nothing".

I must be missing something, as this does not seem to contradict the idea
of `strbuf_add_padded()`. Simply provide the desired width as a `size_t`,
compare it with the width of the string actually added, and if that is
shorter, pad with spaces. At no stage does this require a signed type, all
involved values are strictly non-negative.
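
A sketch of such a helper, written against a plain char buffer instead of struct strbuf so that it stands alone. The names append_padded() and columns_simple() are hypothetical, and columns_simple() replaces utf8_strwidth() under the assumption that every code point is one column wide. The width is measured once, and the subtraction is performed only after the comparison, so the size_t arithmetic cannot wrap around.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for utf8_strwidth(); assumes one column per code point. */
static size_t columns_simple(const char *s)
{
	size_t n = 0;

	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}

/*
 * Append `s` to the NUL-terminated string in `out` (capacity `cap`),
 * then pad with spaces up to `width` columns.
 * Returns the number of spaces added.
 */
static size_t append_padded(char *out, size_t cap, const char *s, size_t width)
{
	size_t used = strlen(out);
	size_t len = strlen(s);
	size_t cols = columns_simple(s);	/* measured once, reused below */
	size_t pad = cols < width ? width - cols : 0;

	assert(used + len + pad < cap);
	memcpy(out + used, s, len);
	memset(out + used + len, ' ', pad);
	out[used + len + pad] = '\0';
	return pad;
}
```

With a target width of 8, "binfïlë" (7 columns) gets one space and "textfilë" (8 columns) gets none; a name wider than the target gets no padding rather than a wrapped-around count.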

> >
> > Also, it is unclear to me why we have to evaluate `utf8_strwidth()`
> > _twice_ and why we do not assign the result to a variable called `width`
> > and then have a conditional like
> >
> > 	if (width < len) /* pad to `len` columns */
> > 		strbuf_addchars(&out, ' ' , len - width);
> >
> > instead. That would sound more logical to me.
>
> This is caused by the logic in diff.c:
>   /*
>    * Find the longest filename and max number of changes
>    */
>    for (i = 0; (i < count) && (i < data->nr); i++) {
>        struct diffstat_file *file = data->files[i];
>        [snip]
>        len = utf8_strwidth(file->print_name);
>        if (max_width < len)
>           max_width = len;
> // and later
>     /*
>      * From here name_width is the width of the name area,
>      * and graph_width is the width of the graph area.
>      * max_change is used to scale graph properly.
>      */
>     for (i = 0; i < count; i++) {
>     /*
>      * "scale" the filename
>      */
>      // TB: Which means either shortening it with ...
>      // Or padding it, if needed, and here we need
>      // another
>      name_len = utf8_strwidth(name);

I was referring to this part of the commit message:

	if (len > utf8_strwidth(name))
		num_padding_spaces = len - utf8_strwidth(name);

Here, we evaluate `utf8_strwidth(name)`, compare it to `len`, and if the
former was smaller, we evaluate the same function call _again_.

What my feedback intended to suggest was to store the result and reuse it:

	name_width = utf8_strwidth(name);
	if (name_width < len)
		num_padding_spaces = len - name_width;

Ciao,
Dscho
