diff mbox series

[v3,1/2] diff.c: When appropriate, use utf8_strwidth(), part1

Message ID 20220902042133.13883-1-tboegi@web.de (mailing list archive)
State New, archived

Commit Message

Torsten Bögershausen Sept. 2, 2022, 4:21 a.m. UTC
From: Torsten Bögershausen <tboegi@web.de>

When Unicode filenames (encoded in UTF-8) are used, the visible width
on the screen is not the same as strlen(filename).

For example, `git log --stat` may produce an output like this:

[snip the header]

 Arger.txt  | 1 +
 Ärger.txt | 1 +
 2 files changed, 2 insertions(+)

A side note: the original report was about Cyrillic filenames.
After some investigation it turned out that
a) this is not a problem with "ambiguous characters" in Unicode
b) the same problem exists for all Unicode code points outside ASCII
  (so we can use Latin-based umlauts for the demonstrations below)

The 'Ä' takes the same space on the screen as the 'A',
but needs one more byte in memory, so the `git log --stat` output
for "Arger.txt" (!) gets misaligned:
The maximum length is derived from "Ärger.txt", which is 10 bytes in memory
but 9 positions on the screen. "Arger.txt" needs only 9 bytes in memory,
which is why it gets one extra ' ' for alignment.
If there were a file "Ö", it would be correctly aligned by chance,
but "Öhö" would not.
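
The byte-versus-column arithmetic above can be checked with a minimal C
sketch. approx_display_width() is a hypothetical helper, not git's
utf8_strwidth(): it assumes valid UTF-8 and one column per code point, so it
ignores the wide and zero-width characters that the real function accounts for.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for utf8_strwidth(), for illustration only.
 * Counts one column per code point by skipping UTF-8 continuation
 * bytes (bit pattern 10xxxxxx). Assumes valid UTF-8 and that every
 * code point occupies exactly one column.
 */
static size_t approx_display_width(const char *s)
{
	size_t width = 0;

	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			width++;
	return width;
}

/* "\xC3\x84" is the UTF-8 encoding of 'Ä' (U+00C4). */
static const char name_ascii[] = "Arger.txt";
static const char name_umlaut[] = "\xC3\x84" "rger.txt";
```

Under these assumptions, strlen(name_umlaut) is 10 while
approx_display_width(name_umlaut) is 9; for name_ascii both are 9.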

The solution is, of course, to use utf8_strwidth() instead of strlen()
when dealing with the width on screen.
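
As a rough sketch of what measuring in columns buys, the padding for a stat
line can be derived from the display width rather than from the byte count.
format_stat_line() and approx_display_width() are hypothetical names used for
illustration only, and the helper assumes valid UTF-8 with one column per code
point; the actual change in diff.c simply swaps strlen() for utf8_strwidth().

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for utf8_strwidth(): one column per code
 * point, found by skipping UTF-8 continuation bytes (10xxxxxx).
 */
static size_t approx_display_width(const char *s)
{
	size_t width = 0;

	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			width++;
	return width;
}

/*
 * Write " <name><padding> | <rest>" into buf. The padding is computed
 * from the display width of name, so the '|' lands in the same column
 * for every name whose width does not exceed max_width.
 */
static void format_stat_line(char *buf, size_t size, const char *name,
			     size_t max_width, const char *rest)
{
	size_t width = approx_display_width(name);
	int pad = width < max_width ? (int)(max_width - width) : 0;

	/* "%*s" with an empty string emits exactly pad spaces. */
	snprintf(buf, size, " %s%*s | %s", name, pad, "", rest);
}
```

With max_width set to 9, both "Arger.txt" and "Ärger.txt" receive no padding
and the '|' ends up in the same column. The padding is computed by hand here
because a printf() field width such as "%-*s" counts bytes, not columns.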

Side note 1:
The changes needed for this fix are split into 2 commits.
This commit only changes strlen() into utf8_strwidth() in diff.c.
The next commit will add tests and further needed changes.

Side note 2:
Junio C Hamano suspects that there is probably more work to be done
in a separate commit:
The code in diff.c::pprint_rename() "abbreviates" overly long pathnames
and "transforms" rename lines like
"a/b/c -> a/B/c" into the shorter
"a/{b->B}/c" form, and IIRC this is all byte based.

Reported-by: Alexander Meshcheryakov <alexander.s.m@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 diff.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--
2.34.0

Comments

Johannes Schindelin Sept. 2, 2022, 9:39 a.m. UTC | #1
Hi Torsten,

On Fri, 2 Sep 2022, tboegi@web.de wrote:

> diff --git a/diff.c b/diff.c
> index 974626a621..b5df464de5 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -2620,7 +2620,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  			continue;
>  		}
>  		fill_print_name(file);
> -		len = strlen(file->print_name);
> +		len = utf8_strwidth(file->print_name);

So this is no longer a length (in bytes) but a width (in columns).

In 2/2, a similar change incurs renaming `max_len` to `max_width`.

I would prefer for 1/2 and 2/2 to be on the same page here: either they
both rename variables that have `len` in their name but are actually about
a width (in columns), or neither of the patches renames these variables.

Thanks,
Dscho

>  		if (max_len < len)
>  			max_len = len;
>
> @@ -2743,7 +2743,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  		 * "scale" the filename
>  		 */
>  		len = name_width;
> -		name_len = strlen(name);
> +		name_len = utf8_strwidth(name);
>  		if (name_width < name_len) {
>  			char *slash;
>  			prefix = "...";
> --
> 2.34.0
>
>

Patch

diff --git a/diff.c b/diff.c
index 974626a621..b5df464de5 100644
--- a/diff.c
+++ b/diff.c
@@ -2620,7 +2620,7 @@  static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			continue;
 		}
 		fill_print_name(file);
-		len = strlen(file->print_name);
+		len = utf8_strwidth(file->print_name);
 		if (max_len < len)
 			max_len = len;

@@ -2743,7 +2743,7 @@  static void show_stats(struct diffstat_t *data, struct diff_options *options)
 		 * "scale" the filename
 		 */
 		len = name_width;
-		name_len = strlen(name);
+		name_len = utf8_strwidth(name);
 		if (name_width < name_len) {
 			char *slash;
 			prefix = "...";