Message ID | 20220903053931.15611-1-tboegi@web.de (mailing list archive) |
---|---|
State | New, archived |
Series | [v4,1/2] diff.c: When appropriate, use utf8_strwidth(), part1 |
tboegi@web.de writes:

> From: Torsten Bögershausen <tboegi@web.de>
> Subject: Re: [PATCH v4 1/2] diff.c: When appropriate, use utf8_strwidth(), part1

Given 2/2 does not share a similar title, "part1" sounds somewhat
strange. In any case, 'when appropriate,' is probably best unsaid,
as it is almost a given. We won't deliberately use something that
is not appropriate on purpose anyway. Even if we were to keep that
word, downcase "When".

> When unicode filenames (encoded in UTF-8) are used, the visible width
> on the screen is not the same as strlen(filename).
>
> For example, `git log --stat` may produce an output like this:
>
> [snip the header]
>
>  Arger.txt  | 1 +
>  Ärger.txt | 1 +
>  2 files changed, 2 insertions(+)
>
> A side note: the original report was about cyrillic filenames.
> After some investigations it turned out that
> a) This is not a problem with "ambiguous characters" in unicode
> b) The same problem exists for all unicode code points (so we
>    can use Latin based Umlauts for demonstrations below)
>
> The 'Ä' takes the same space on the screen as the 'A'.
> But needs one more byte in memory, so the `git log --stat` output
> for "Arger.txt" (!) gets mis-aligned:
> The maximum length is derived from "Ärger.txt", 10 bytes in memory,
> 9 positions on the screen. That is why "Arger.txt" gets one extra ' '
> for alignment, it needs 9 bytes in memory.
> If there was a file "Ö", it would be correctly aligned by chance,
> but "Öhö" would not.
>
> The solution is of course, to use utf8_strwidth() instead of strlen()
> when dealing with the width on screen.
>
> Side note 1:
> Needed changes for this fix are split into 2 commits:
> This commit only changes strlen() into utf8_strwidth() in diff.c:
> The next commit will add tests and further needed changes.

I am not sure if it makes sense to split them into two. It is hard
for us to demonstrate the need for this step if it does not come
with its own test.
> Side note 2:
> Junio C Hamano suspects that there is probably more work to be done,
> in a separate commit:
> Code in diff.c::pprint_rename() that "abbreviates" overly long pathnames
> and "transforms" renames lines like
> "a/b/c -> a/B/c" into the shorter
> "a/{b->B}/c" form, and IIRC this is all byte based.

I already said that I suspect {b->B} conversion is OK, so the side
note is probably more noise than being useful.

>
> Reported-by: Alexander Meshcheryakov <alexander.s.m@gmail.com>
> Signed-off-by: Torsten Bögershausen <tboegi@web.de>
> ---
>  diff.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/diff.c b/diff.c
> index 974626a621..b5df464de5 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -2620,7 +2620,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  			continue;
>  		}
>  		fill_print_name(file);
> -		len = strlen(file->print_name);
> +		len = utf8_strwidth(file->print_name);
>  		if (max_len < len)
>  			max_len = len;
>
> @@ -2743,7 +2743,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  		 * "scale" the filename
>  		 */
>  		len = name_width;
> -		name_len = strlen(name);
> +		name_len = utf8_strwidth(name);
>  		if (name_width < name_len) {
>  			char *slash;
>  			prefix = "...";
> --
> 2.34.0
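The byte-count versus column-count mismatch described in the commit message can be reproduced outside of Git with a small sketch. The helper names `approx_display_width()` and `padding_for()` below are hypothetical and are not Git functions; the width calculation is a deliberate simplification that counts code points (every byte that is not a UTF-8 continuation byte), whereas utf8_strwidth() presumably also accounts for wide and zero-width characters. The file names are written with hex escapes so the example does not depend on the source file's encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of Git: approximate the number of
 * screen columns of a UTF-8 string by counting code points, i.e.
 * every byte that is not a continuation byte (bit pattern 10xxxxxx).
 * Assumption: every code point occupies exactly one column, which is
 * not true for wide (e.g. CJK) or combining characters.
 */
static size_t approx_display_width(const char *s)
{
	size_t width = 0;

	assert(s != NULL);
	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			width++;
	return width;
}

/*
 * Number of spaces needed after `name` so that the '|' separator
 * lines up in a column that is `max_width` screen positions wide.
 */
static size_t padding_for(const char *name, size_t max_width)
{
	size_t w = approx_display_width(name);

	return w < max_width ? max_width - w : 0;
}
```

With "\xc3\x84rger.txt" ("Ärger.txt"), strlen() reports 10 while the approximate width is 9, so a maximum derived from strlen() pads "Arger.txt" with one space too many, which matches the misalignment shown in the `git log --stat` sample above.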
On Mon, Sep 05, 2022 at 01:46:57PM -0700, Junio C Hamano wrote:
> tboegi@web.de writes:
>
> > From: Torsten Bögershausen <tboegi@web.de>
> > Subject: Re: [PATCH v4 1/2] diff.c: When appropriate, use utf8_strwidth(), part1
>
> Given 2/2 does not share a similar title, "part1" sounds somewhat
> strange. In any case, 'when appropriate,' is probably best unsaid,
> as it is almost a given. We won't deliberately use something that
> is not appropriate on purpose anyway. Even if we were to keep that
> word, downcase "When".

Yes, agreed.

In short: I will make a new patch in the coming weeks, in one commit (again).
(That can take some days or weeks)

Thanks to Dscho for his patience with the strbuf() improvements.
I think that I tried a "%*s" version, but couldn't get that to work.

> > Side note 2:
> > Junio C Hamano suspects that there is probably more work to be done,
> > in a separate commit:
> > Code in diff.c::pprint_rename() that "abbreviates" overly long pathnames
> > and "transforms" renames lines like
> > "a/b/c -> a/B/c" into the shorter
> > "a/{b->B}/c" form, and IIRC this is all byte based.
>
> I already said that I suspect {b->B} conversion is OK, so the side
> note is probably more noise than being useful.

OK - the comment can be removed.

I didn't know how to read this comment:
>...but the former may chomp a single multi-byte letter in the middle,
> which would need to be corrected as a part of this change.

After diffing into the code some more times, I think that we don't
chomp a single byte out of a UTF-8 sequence.
Torsten Bögershausen <tboegi@web.de> writes:

>
> OK - the comment can be removed.
>
> I didn't know how to read this comment:
>>...but the former may chomp a single multi-byte letter in the middle,
>> which would need to be corrected as a part of this change.
>
> After diffing into the code some more times, I think that we don't
> chomp a single byte out of an UTF-8 sequence.

When turning a/b/c vs a/B/c into a/{b->B}/c, two steps are involved.
Taking the common prefix and suffix (in this case 'a' and 'c') and
turning 'b' vs 'B' into {b->B} is one step.

The other is what to do when the prefix and suffix are long. After
turning aaaaa/b/c vs aaaaa/B/c into aaaaa/{b->B}/c, if the result is
overly long, how do we shorten the prefix (i.e. aaaaa) and the suffix?

I knew the code that produces {b->B} honored the '/' boundary, but I
just did not remember offhand what diff.c::pprint_rename() did in its
latter half, specifically, if it just chomped pfx and sfx as a
sequence of bytes (which would have been wrong) or insisted that the
common sequence search honors the '/' boundary (which would be OK, as
the byte '/' will not appear in the middle of a single multi-byte
UTF-8 "letter").

I think it is doing the latter, so it should be fine.

Thanks.
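The distinction drawn above, between matching a common prefix as raw bytes and matching it only up to a '/' boundary, can be illustrated with a short sketch. This is not the code of diff.c::pprint_rename(); `common_prefix_bytes()` and `common_prefix_at_slash()` are hypothetical names introduced only for this illustration. It relies on one property of UTF-8: every byte of a multi-byte sequence has its high bit set, so the ASCII byte '/' (0x2F) can never occur inside one.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: length in bytes of the longest common prefix,
 * compared byte by byte. This can end in the middle of a multi-byte
 * UTF-8 character when two characters share their leading byte(s).
 */
static size_t common_prefix_bytes(const char *a, const char *b)
{
	size_t n = 0;

	assert(a != NULL && b != NULL);
	while (a[n] && a[n] == b[n])
		n++;
	return n;
}

/*
 * Hypothetical helper: length in bytes of the common prefix, cut back
 * to just after the last '/' that both paths share. Because '/' never
 * appears inside a multi-byte UTF-8 sequence, the returned offset is
 * always a character boundary.
 */
static size_t common_prefix_at_slash(const char *a, const char *b)
{
	size_t n = 0, last = 0;

	assert(a != NULL && b != NULL);
	while (a[n] && a[n] == b[n]) {
		if (a[n] == '/')
			last = n + 1;
		n++;
	}
	return last;
}
```

For "dir/\xc3\x84.txt" ("dir/Ä.txt") and "dir/\xc3\x85.txt" ("dir/Å.txt"), the byte-wise prefix is 5 bytes long and ends after the shared lead byte 0xC3, splitting both characters, while the '/'-bounded prefix is the 4 bytes of "dir/".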
diff --git a/diff.c b/diff.c
index 974626a621..b5df464de5 100644
--- a/diff.c
+++ b/diff.c
@@ -2620,7 +2620,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			continue;
 		}
 		fill_print_name(file);
-		len = strlen(file->print_name);
+		len = utf8_strwidth(file->print_name);
 		if (max_len < len)
 			max_len = len;
 
@@ -2743,7 +2743,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 		 * "scale" the filename
 		 */
 		len = name_width;
-		name_len = strlen(name);
+		name_len = utf8_strwidth(name);
 		if (name_width < name_len) {
 			char *slash;
 			prefix = "...";