[v5,1/1] diff.c: When appropriate, use utf8_strwidth()

Message ID	20220914151333.3309-1-tboegi@web.de (mailing list archive)
State	New, archived

Torsten Bögershausen Sept. 14, 2022, 3:13 p.m. UTC

From: Torsten Bögershausen <tboegi@web.de>

When unicode filenames (encoded in UTF-8) are used, the visible width
on the screen is not the same as strlen().

For example, `git log --stat` may produce an output like this:

[snip the header]

 Arger.txt  | 1 +
 Ärger.txt | 1 +
 2 files changed, 2 insertions(+)

A side note: the original report was about cyrillic filenames.
After some investigations it turned out that
a) This is not a problem with "ambiguous characters" in unicode
b) The same problem exists for all unicode code points (so we
  can use Latin based Umlauts for demonstrations below)

The 'Ä' takes the same space on the screen as the 'A',
but needs one more byte in memory, so the `git log --stat` output
for "Arger.txt" (!) gets mis-aligned:
The maximum length is derived from "Ärger.txt", 10 bytes in memory,
9 positions on the screen. That is why "Arger.txt" gets one extra ' '
for alignment: it needs only 9 bytes in memory.
If there was a file "Ö", it would be correctly aligned by chance,
but "Öhö" would not.

The solution is, of course, to use utf8_strwidth() instead of strlen()
when dealing with the width on screen.
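
To make the byte/column difference concrete, here is a minimal sketch.
approx_width() and missing_padding() are hypothetical helpers, not Git
code: approx_width() only counts UTF-8 code points, whereas the real
utf8_strwidth() also accounts for wide and zero-width characters.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical stand-in for utf8_strwidth(): one column per UTF-8
 * code point, found by skipping continuation bytes (10xxxxxx).
 * Assumption: no wide (CJK) or zero-width characters in the input.
 */
static int approx_width(const char *s)
{
	int width = 0;

	assert(s);
	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			width++;
	return width;
}

/*
 * Number of bytes that do not occupy a column of their own.  This is
 * how many spaces a strlen()-based "%-*s" comes up short for a name.
 */
static int missing_padding(const char *name)
{
	return (int)strlen(name) - approx_width(name);
}
```

With these helpers, "Arger.txt" has nothing missing, while "Ärger.txt"
(spelled "\xC3\x84" "rger.txt" in a C literal) is one space short.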

And then there is another problem, code like this:
strbuf_addf(&out, "%-*s", len, name);
(or using the underlying snprintf() function) does not align the
buffer to a minimum of len measured in screen-width, but uses the
memory count.

One could be tempted to wish that snprintf() was UTF-8 aware.
That doesn't seem to be the case anywhere (tested on Linux and Mac);
snprintf() probably uses the "bytes in memory"/strlen() approach to stay
compatible with older versions, and this will never change.

The basic idea is to change code in diff.c like this
strbuf_addf(&out, "%-*s", len, name);

into something like this:
int padding = len - utf8_strwidth(name);
if (padding < 0)
	padding = 0;
strbuf_addf(&out, " %s%*s", name, padding, "");

The real change is slightly bigger, as it also merges two calls
of strbuf_addf() into one.
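
The padding idea can be tried outside of Git with plain snprintf().
This is only a sketch: format_padded() and approx_width() are
hypothetical names, and approx_width() counts code points instead of
calling the real utf8_strwidth().

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for utf8_strwidth(): one column per code point. */
static int approx_width(const char *s)
{
	int width = 0;

	assert(s);
	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			width++;
	return width;
}

/*
 * Write " name" followed by enough spaces that the name occupies at
 * least `columns` display columns.  The padding is computed from the
 * display width, so "%*s" only ever pads an empty string.
 * Returns the number of bytes snprintf() wanted to write.
 */
static int format_padded(char *buf, size_t size, const char *name, int columns)
{
	int padding = columns - approx_width(name);

	if (padding < 0)
		padding = 0;
	return snprintf(buf, size, " %s%*s", name, padding, "");
}
```

Formatting "Arger.txt" and "Ärger.txt" with columns = 9 gives two
strings of the same display width (10), although the second one is one
byte longer in memory.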

Tests:
Two things need to be tested:
 - The calculation of the maximum width
 - The calculation of padding

The name "textfile" is changed into "tëxtfilë"; both have a width of 8.
If strlen() were used to get the maximum width, the shorter "binfile" would
have been mis-aligned:
 binfile    | [snip]
 tëxtfilë | [snip]

If only "binfile" were renamed into "binfilë":
 binfilë | [snip]
 textfile | [snip]

In order to verify that the width is calculated correctly everywhere,
"binfile" is renamed into "binfilë", giving 1 byte more in strlen(), and
"textfile" is renamed into "tëxtfilë", giving 2 bytes more in strlen().

The updated t4012-diff-binary.sh checks the correct alignment:
 binfilë  | [snip]
 tëxtfilë | [snip]

Reported-by: Alexander Meshcheryakov <alexander.s.m@gmail.com>
Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 diff.c                 | 27 ++++++++++++++++-----------
 t/t4012-diff-binary.sh | 14 +++++++-------
 2 files changed, 23 insertions(+), 18 deletions(-)

--
2.34.0

Junio C Hamano Sept. 14, 2022, 4:40 p.m. UTC | #1

> The basic idea is to change code in diff.c like this
> strbuf_addf(&out, "%-*s", len, name);
>
> into something like this:
> int padding = len - utf8_strwidth(name);
> if (padding < 0)
> 	padding = 0;
> strbuf_addf(&out, " %s%*s", name, padding, "");
> ...
> Reported-by: Alexander Meshcheryakov <alexander.s.m@gmail.com>
> Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
> Signed-off-by: Torsten Bögershausen <tboegi@web.de>
> ---
>  diff.c                 | 27 ++++++++++++++++-----------
>  t/t4012-diff-binary.sh | 14 +++++++-------
>  2 files changed, 23 insertions(+), 18 deletions(-)

> diff --git a/diff.c b/diff.c
> index 974626a621..35b9da90fe 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -2620,7 +2620,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  			continue;
>  		}
>  		fill_print_name(file);
> -		len = strlen(file->print_name);
> +		len = utf8_strwidth(file->print_name);
>  		if (max_len < len)
>  			max_len = len;

The changes in this patch are isolated to the show_stats() helper
function, and looking at use of "len", "max_len", and "name_len", it
may be a good clean-up to make them based on "width".  A bit of care
needs to be taken because the way existing variables are used is a
bit convoluted at times:

 - "width" already exists.  "len" and "max_len" are used in an early
   loop to eventually derive "name_width".

 - "len" is later used in the loop for each pathname to hold a copy
   of "name_width" that can locally be adjusted to accommodate "..."
   abbreviation/munging of the pathname.

 - "name_width" already exists in addition to "name_len".  The
   former holds how many display columns a pathname can occupy in the
   diffstat output, while the latter is used in a loop to hold the
   display columns of the pathname each iteration is looking at, to
   see if it is wider than "name_width" (in which case there is the
   "..." abbreviation that is NOT UTF-8 aware even after this patch)
   or narrower (in which case we'd do the padding).  As the existing
   "name_width" is how we want to name our variables (i.e. the width
   allocated for names), the "name_len", if we were to follow "len
   misleads us to think it is byte length, so use width instead",
   would need to become something like "this_name_width" (i.e. the
   width of the name of the pathname in this iteration of the loop).

But I am OK to do WITHOUT any such renaming, and I do not want to
see such renaming in the same patch ("preliminary clean-up" or
"clean-up after the dust settles" are good, though).  Counting
display columns correctly is more important.

I think I spotted two remaining "bugs" that are left unfixed with
this patch..

There is "stat_width is -1 (auto)" case, which reads like so:

	if (options->stat_width == -1)
		width = term_columns() - strlen(line_prefix);
	else
		width = options->stat_width ? options->stat_width : 80;

Here line_prefix eventually comes from the "git log --graph" and
shows the colored graph segments on the same output line as the
diffstat.

This patch is probably not making anything worse, but by leaving it
strlen(), it is likely overcounting the width of it.  We can
presumably use utf8_strnwidth() that can optionally be told to be
aware of the ANSI color sequence to count its width correctly to fix
it.
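
As an illustration of the overcounting, here is a self-contained
sketch.  width_skipping_ansi() is a hypothetical helper written for
this example; it is not utf8_strnwidth(), it only recognizes
"ESC [ ... <final byte>" sequences, and it counts one column per code
point.

```c
#include <assert.h>

/*
 * Hypothetical helper: display width of s, ignoring ANSI CSI
 * sequences (ESC '[', parameter/intermediate bytes 0x20..0x3F, then
 * one final byte).  Assumption: every code point is one column wide.
 */
static int width_skipping_ansi(const char *s)
{
	int width = 0;

	assert(s);
	while (*s) {
		if (s[0] == '\033' && s[1] == '[') {
			s += 2;
			while (*s >= 0x20 && *s <= 0x3F)
				s++;	/* parameters such as "31" or "1;32" */
			if (*s)
				s++;	/* final byte, e.g. 'm' */
			continue;
		}
		if (((unsigned char)*s & 0xC0) != 0x80)
			width++;	/* not a UTF-8 continuation byte */
		s++;
	}
	return width;
}
```

For a prefix such as "\033[31m|\033[m " (a red '|' followed by a
space), strlen() reports 10 bytes while only 2 columns are used, so the
available width would be computed 8 columns too narrow.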

> @@ -2743,7 +2743,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  		 * "scale" the filename
>  		 */
>  		len = name_width;
> -		name_len = strlen(name);
> +		name_len = utf8_strwidth(name);
>  		if (name_width < name_len) {
>  			char *slash;
>  			prefix = "...";

The code around here between this and the next hunk needs cleaning up.

		if (name_width < name_len) {
			char *slash;
			prefix = "...";
			len -= 3;
			name += name_len - len;
			slash = strchr(name, '/');
			if (slash)
				name = slash;
		}

We found the display columns of the current item "name_len" is wider
than what we allocated "name_width" for the names.  We are going to
chomp as many pathname components from the front as needed, at '/'
boundary, to turn "aaaa/bbbb/cccc/dddd.txt" into ".../cccc/dddd.txt"
to make the result fit.

But the way to ensure that '/' before "cccc" is the one we want (as
opposed to the one before "bbbb" or "dddd") is initially based on
columns, i.e. because we want "...", we first subtract 3 from len,
which is a local synonym for name_width, and then subtract that from
"name_len".  In other words, we ask: "how many display columns do we
have in the current pathname in excess of what we can afford to
allocate?"  The intention is to skip that many columns from the
beginning of "name" and start looking for '/' from there.

But we move "name" pointer by that many *bytes*!  We end up scanning
starting at a middle of a character.  What we look for is '/' and when
we find it we know the byte is a standalone character, so we do not
chomp a character in the middle, but it is very likely that we find
a slash that leaves the remaining string still too long, because
skipping say 2 columns may need skipping 4 bytes, but we only
skipped the same number of bytes as the number of columns we need to
skip.

This is the other remaining bug.
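
The effect can be reproduced in isolation.  The sketch below copies the
chomping logic quoted above into a hypothetical chomp_by_bytes()
function; approx_width() is again a code-point counter standing in for
utf8_strwidth(), and the pathname is made up for the example.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for utf8_strwidth(): one column per code point. */
static int approx_width(const char *s)
{
	int width = 0;

	assert(s);
	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			width++;
	return width;
}

/*
 * Same steps as the quoted code: the excess is measured in columns
 * but "name" is then advanced by that many *bytes*.
 */
static const char *chomp_by_bytes(const char *name, int name_width)
{
	int len = name_width;
	int name_len = approx_width(name);

	if (name_width < name_len) {
		const char *slash;

		len -= 3;			/* room for "..." */
		name += name_len - len;		/* columns used as bytes */
		slash = strchr(name, '/');
		if (slash)
			name = slash;
	}
	return name;
}
```

With name_width = 12, the ASCII path "aaaa/bbbb/cccc/dddd.txt" is cut
to "/dddd.txt" (9 columns, fits after "...").  The path
"ääää/öööö/cccc/dddd.txt" has the same 23 columns but 31 bytes;
skipping 14 bytes lands in the middle of an 'ö', and the next slash
yields "/cccc/dddd.txt", which is 14 columns and does not fit.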

I think this needs to become a loop that loops while the width of
the current suffix is still wider than we can afford, discarding one
leading pathname component at a time at '/', measuring the resulting
width, or something like that.  Something along the lines of this
not-even-compile-tested sketch:

	/* we assume strlen(prefix) == utf8_strwidth(prefix) */
	while (name_width < utf8_strwidth(name) + strlen(prefix)) {
		char *slash;
		if (name[0] == '/')
			name++;
		slash = strchr(name, '/');
		if (slash)
			name = slash;
		else
			break; /* Give Up */
		prefix = "...";
	}
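
A self-contained variant of that loop, for experimenting outside of
diff.c, could look like this.  It is a sketch under assumptions:
chomp_by_width() is a hypothetical name, approx_width() counts code
points instead of calling utf8_strwidth(), and the leading '/' is kept
when no further slash is found.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for utf8_strwidth(): one column per code point. */
static int approx_width(const char *s)
{
	int width = 0;

	assert(s);
	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			width++;
	return width;
}

/*
 * Drop leading pathname components, one '/' at a time, until the
 * prefix plus the remaining name fit into name_width columns.
 * *prefix is set to "..." once anything has been dropped.
 */
static const char *chomp_by_width(const char *name, int name_width,
				  const char **prefix)
{
	*prefix = "";
	while (name_width < approx_width(name) + (int)strlen(*prefix)) {
		/* search after a leading '/', so that we make progress */
		const char *slash = strchr(name[0] == '/' ? name + 1 : name, '/');

		if (!slash)
			break;	/* give up: nothing left to drop */
		name = slash;
		*prefix = "...";
	}
	return name;
}
```

For "ääää/öööö/cccc/dddd.txt" and name_width = 12 the loop ends at
"/dddd.txt" with prefix "...", 12 columns in total, because the width
is re-measured after every dropped component instead of being converted
into a byte offset once.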

> @@ -2753,10 +2753,14 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  			if (slash)
>  				name = slash;
>  		}
> +		padding = len - utf8_strwidth(name);
> +		if (padding < 0)
> +			padding = 0;

Here "len" cannot become "name_width" because the former has to be
narrower than the latter by the display width of "prefix"
(i.e. "..."), so while this looks "strange", it is correct.

I think the remainder of the patch I did not quote looked quite
straight-forward and correct.

Thanks for working on this topic.

Junio C Hamano Sept. 15, 2022, 2:57 a.m. UTC | #2

tboegi@web.de writes:

> From: Torsten Bögershausen <tboegi@web.de>
> Subject: Re: [PATCH v5 1/1] diff.c: When appropriate, use utf8_strwidth()

Let's retitle it to "diff.c: use utf8_strwidth() to count display width".

Torsten Bögershausen Sept. 26, 2022, 6:43 p.m. UTC | #3

On Wed, Sep 14, 2022 at 09:40:04AM -0700, Junio C Hamano wrote:

[]

> I think I spotted two remaining "bugs" that are left unfixed with
> this patch..
>
> There is "stat_width is -1 (auto)" case, which reads like so:
>
> 	if (options->stat_width == -1)
> 		width = term_columns() - strlen(line_prefix);
> 	else
> 		width = options->stat_width ? options->stat_width : 80;
>
> Here line_prefix eventually comes from the "git log --graph" and
> shows the colored graph segments on the same output line as the
> diffstat.
>
> This patch is probably not making anything worse, but by leaving it
> strlen(), it is likely overcounting the width of it.  We can
> presumably use utf8_strnwidth() that can optionally be told to be
> aware of the ANSI color sequence to count its width correctly to fix
> it.

[]
> This is the other remaining bug.

[]

> I think the remainder of the patch I did not quote looked quite
> straight-forward and correct.
>
> Thanks for working on this topic.

How should we proceed here ?
This patch fixes one, and only one, reported bug,
which is now verified by a test case using unicode instead of ASCII.
Fixing additional bugs in diff.c (or anywhere else) had never been
part of this.

Things that need more fixing and cleanups have been laid out as the
result of a review, that is good.

"git log --graph" was mentioned.
Do we have test cases, that test this ?
How easy are they converted into unicode instead of ASCII ?

I am not even sure, if I ever used "git log --graph" myself.
Digging further here, is somewhat out of my scope.
At least for the moment.

Junio C Hamano Oct. 10, 2022, 9:58 p.m. UTC | #4

Torsten Bögershausen <tboegi@web.de> writes:

> On Wed, Sep 14, 2022 at 09:40:04AM -0700, Junio C Hamano wrote:
>
> []
>
>> I think I spotted two remaining "bugs" that are left unfixed with
>> this patch..
>> ...
> How should we proceed here ?
> This patch fixes one, and only one, reported bug,

But then two more were reported in the message you are responding
to, and they stem from the same underlying logic bug where byte
count and display columns are mixed interchangeably.

> "git log --graph" was mentioned.
> Do we have test cases, that test this ?
> How easy are they converted into unicode instead of ASCII ?

The graph stuff pushes your "start of line" to the right, making the
available screen real estate narrower.  I do not think in the
current code we need to worry about unicode vs ascii (IIRC, we stick
to ASCII graphics while drawing lines), but we do need to take into
account the fact that ANSI COLOR escape sequences have non-zero byte
count while occupying zero display columns.

The other bug about the code that finds which / to use to abbreviate
a long pathname on diffstat lines does involve byte vs column that
comes from unicode.  From the bug description in the message you are
responding to, if we have a directory name whose display columns and
byte count are significantly different, the end result by chopping
with the current code would end up wider than it should be, which
sounds like a recipe to cook up a test case to me.

Torsten Bögershausen Oct. 20, 2022, 3:46 p.m. UTC | #5

On Mon, Oct 10, 2022 at 02:58:26PM -0700, Junio C Hamano wrote:
> Torsten Bögershausen <tboegi@web.de> writes:
>
> > On Wed, Sep 14, 2022 at 09:40:04AM -0700, Junio C Hamano wrote:
> >
> > []
> >
> >> I think I spotted two remaining "bugs" that are left unfixed with
> >> this patch..
> >> ...
> > How should we proceed here ?
> > This patch fixes one, and only one, reported bug,
>
> But then two more were reported in the message you are responding
> to, and they stem from the same underlying logic bug where byte
> count and display columns are mixed interchangeably.
>
> > "git log --graph" was mentioned.
> > Do we have test cases, that test this ?
> > How easy are they converted into unicode instead of ASCII ?
>
> The graph stuff pushes your "start of line" to the right, making the
> available screen real estate narrower.  I do not think in the
> current code we need to worry about unicode vs ascii (IIRC, we stick
> to ASCII graphics while drawing lines), but we do need to take into
> account the fact that ANSI COLOR escape sequences have non-zero byte
> count while occupying zero display columns.
>
> The other bug about the code that finds which / to use to abbreviate
> a long pathname on diffstat lines does involve byte vs column that
> comes from unicode.  From the bug description in the message you are
> responding to, if we have a directory name whose display columns and
> byte count are significantly different, the end result by chopping
> with the current code would end up wider than it should be, which
> sounds like a recipe to cook up a test case to me.
>


I couldn't find how to trigger this code path.
The `git log --graph` help says:
--graph
    Draw a text-based graphical representation of the commit history
    on the left hand side of the output.
    This may cause extra lines to be printed in between commits,
    in order for the graph history to be drawn properly.
    Cannot be combined with --no-walk.

There is no indication about filenames or diffs in the
resulting output.
If someone has time and knowledge to cook up a test case,
that would help.

For the moment, I don't have enough spare time to spend on digging
how to write this test case, that's the sad part of the story.
And that is probably a good start, or, to be more strict,
an absolute precondition, if I need to change another single line
in diff.c

I still haven't understood why the current patch can not move forward
on its own ?
There is a bug report, patch, a test case that verifies the fix.

What more is needed ?
To fix all other bugs/issues/limitations in diff.c ?
If yes, they need to go in separate commits anyway, or do I miss
something ?

Can we dampen the expectations a little bit ?

Junio C Hamano Oct. 20, 2022, 5:43 p.m. UTC | #6

Torsten Bögershausen <tboegi@web.de> writes:

> What more is needed ?
> To fix all other bugs/issues/limitations in diff.c ?
> If yes, they need to go in separate commits anyway, or do I miss
> something ?

At least leave some NEEDSWORK comment in the code that is known to
need more work, to remind others that the fix in the area of the
code is not done, perhaps.  Otherwise, much of the effort in the
review gets lost.

I offhand recall at least two (please go back to the original thread
to find the details of them).  One that measures the width of
long/path/name in bytes to determine where to start chomping in the
diffstat filename (because it still mixes display columns and
bytes), and the other one that measures the width of leading graph
segment in bytes without ignoring the ANSI color sequence, which
should be using utf8_strnwidth() but is using strlen().

https://lore.kernel.org/git/xmqqpmfx52qj.fsf@gitster.g/

Thanks.

Torsten Bögershausen Oct. 21, 2022, 3:19 p.m. UTC | #7

On Thu, Oct 20, 2022 at 10:43:07AM -0700, Junio C Hamano wrote:
> Torsten Bögershausen <tboegi@web.de> writes:
>
> > What more is needed ?
> > To fix all other bugs/issues/limitations in diff.c ?
> > If yes, they need to go in separate commits anyway, or do I miss
> > something ?
>
> At least leave some NEEDSWORK comment in the code that is known to
> need more work, to remind others that the fix in the area of the
> code is not done, perhaps.  Otherwise, much of the effort in the
> review gets lost.
>
> I offhand recall at least two (please go back to the original thread
> to find the details of them).  One that measures the width of
> long/path/name in bytes to determine where to start chomping in the
> diffstat filename (because it still mixes display columns and
> bytes), and the other one that measures the width of leading graph
> segment in bytes without ignoring the ANSI color sequence, which
> should be using utf8_strnwidth() but is using strlen().
>
> https://lore.kernel.org/git/xmqqpmfx52qj.fsf@gitster.g/
>
> Thanks.

Good, good, good.
For the moment I don't have any spare time to spend on Git.
All your comments are noted, and I hope to get time to address them later.
If you kick out the branch from seen and the whats cooking list,
that would be fine with me.

Junio C Hamano Oct. 21, 2022, 9:59 p.m. UTC | #8

Torsten Bögershausen <tboegi@web.de> writes:

> For the moment I don't have any spare time to spend on Git.
> All your comments are noted, and I hope to get time to address them later.
> If you kick out the branch from seen and the whats cooking list,
> that would be fine with me.

I'd rather not waste the efforts so far.  I am tempted to queue the
following on top or squash it in.

----- >8 --------- >8 --------- >8 --------- >8 --------- >8 -----
Subject: [PATCH] diff: leave NEEDSWORK notes in show_stats() function

The previous step made an attempt to correctly compute display
columns allocated and padded different parts of diffstat output.
There are at least two known codepaths in the function that still
mixes up display widths and byte length that need to be fixed.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 diff.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/diff.c b/diff.c
index 2751cae131..1d222d87b2 100644
--- a/diff.c
+++ b/diff.c
@@ -2675,6 +2675,11 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 	 * making the line longer than the maximum width.
 	 */
 
+	/*
+	 * NEEDSWORK: line_prefix is often used for "log --graph" output
+	 * and contains ANSI-colored string.  utf8_strnwidth() should be
+	 * used to correctly count the display width instead of strlen().
+	 */
 	if (options->stat_width == -1)
 		width = term_columns() - strlen(line_prefix);
 	else
@@ -2750,6 +2755,16 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			char *slash;
 			prefix = "...";
 			len -= 3;
+			/*
+			 * NEEDSWORK: (name_len - len) counts the display
+			 * width, which would be shorter than the byte
+			 * length of the corresponding substring.
+			 * Advancing "name" by that number of bytes does
+			 * *NOT* skip over that many columns, so it is
+			 * very likely that chomping the pathname at the
+			 * slash we will find starting from "name" will
+			 * leave the resulting string still too long.
+			 */
 			name += name_len - len;
 			slash = strchr(name, '/');
 			if (slash)

Torsten Bögershausen Oct. 23, 2022, 8:02 p.m. UTC | #9

On Fri, Oct 21, 2022 at 02:59:09PM -0700, Junio C Hamano wrote:
> Torsten Bögershausen <tboegi@web.de> writes:
>
> > For the moment I don't have any spare time to spend on Git.
> > All your comments are noted, and I hope to get time to address them later.
> > If you kick out the branch from seen and the whats cooking list,
> > that would be fine with me.
>
> I'd rather not waste the efforts so far.  I am tempted to queue the
> following on top or squash it in.
>
> ----- >8 --------- >8 --------- >8 --------- >8 --------- >8 -----
> Subject: [PATCH] diff: leave NEEDSWORK notes in show_stats() function
>
> The previous step made an attempt to correctly compute display
> columns allocated and padded different parts of diffstat output.
> There are at least two known codepaths in the function that still
> mixes up display widths and byte length that need to be fixed.
>
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> ---
>  diff.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/diff.c b/diff.c
> index 2751cae131..1d222d87b2 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -2675,6 +2675,11 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  	 * making the line longer than the maximum width.
>  	 */
>
> +	/*
> +	 * NEEDSWORK: line_prefix is often used for "log --graph" output
> +	 * and contains ANSI-colored string.  utf8_strnwidth() should be
> +	 * used to correctly count the display width instead of strlen().
> +	 */
>  	if (options->stat_width == -1)
>  		width = term_columns() - strlen(line_prefix);
>  	else
> @@ -2750,6 +2755,16 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  			char *slash;
>  			prefix = "...";
>  			len -= 3;
> +			/*
> +			 * NEEDSWORK: (name_len - len) counts the display
> +			 * width, which would be shorter than the byte
> +			 * length of the corresponding substring.
> +			 * Advancing "name" by that number of bytes does
> +			 * *NOT* skip over that many columns, so it is
> +			 * very likely that chomping the pathname at the
> +			 * slash we will find starting from "name" will
> +			 * leave the resulting string still too long.
> +			 */
>  			name += name_len - len;
>  			slash = strchr(name, '/');
>  			if (slash)


That looks good to me -
my preferred version would be a patch on its own on top.
