From patchwork Wed Sep 14 15:13:33 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Torsten_B=C3=B6gershausen?= X-Patchwork-Id: 12976297 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4622DECAAD3 for ; Wed, 14 Sep 2022 15:14:07 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229706AbiINPOG (ORCPT ); Wed, 14 Sep 2022 11:14:06 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40702 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230265AbiINPNn (ORCPT ); Wed, 14 Sep 2022 11:13:43 -0400 Received: from mout.web.de (mout.web.de [212.227.17.12]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 820657C18F for ; Wed, 14 Sep 2022 08:13:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=web.de; s=s29768273; t=1663168419; bh=CXRKa2oJvGjGrgypdiXW3/0uCNCHIjMexlYQ/Ro1cbI=; h=X-UI-Sender-Class:From:To:Cc:Subject:Date:In-Reply-To:References; b=l0eg2OOUgZYHf32Zwsag6fRHOtUM0rvYx9xiACAFmjqAqnK/3r2XtL3OZ+maQxWCm mYeXTC3/AoG9JA/BSjQAPX6/IFSO2zvDGlF12TvhLaXHQFN2Kq5c8dQ8HWLdJrQXbY wPZtC88joQ/DtRFpjJ4+wiZ0KKTesHRIxK/6oWgzg+KuFq46lNlH3RKmCPBvWS1sdJ woX2BNSo68Jz5OCu2siAfDeTFbv62kS4R6W+F5No0y6hbfpaZM9jNJOsd4dWGjvNar 52Vj2SoWPKZnCcx56HO8Ir640aOoNEPt9DsmhCnqcdvotAG+iF+allwUt03ntNyUd7 7Nbf1qJN/gWgA== X-UI-Sender-Class: 814a7b36-bfc1-4dae-8640-3722d8ec6cd6 Received: from localhost.localdomain ([62.20.115.19]) by smtp.web.de (mrweb105 [213.165.67.124]) with ESMTPSA (Nemesis) id 1MLRYf-1oqHoX3KbL-00IORy; Wed, 14 Sep 2022 17:13:38 +0200 From: tboegi@web.de To: git@vger.kernel.org, alexander.s.m@gmail.com, Johannes.Schindelin@gmx.de Cc: =?utf-8?q?Torsten_B=C3=B6gershausen?= Subject: [PATCH v5 1/1] diff.c: When appropriate, use utf8_strwidth() Date: Wed, 14 Sep 2022 17:13:33 +0200 Message-Id: <20220914151333.3309-1-tboegi@web.de> X-Mailer: git-send-email 2.34.0 In-Reply-To: References: MIME-Version: 1.0 X-Provags-ID: V03:K1:FNzIpcsifz82dMRvtmblMPRrnj/yEBEj/i77JK/dhjI/8zG+cH9 B4qKaUtv01H3RsArK19wG0kDqVQTDC80wsDbXiqPtdeB6nTZ2nwc68n/HcDxmLi2XgSL9Fo 1Li06AEDzmoZEdJs2gz1JBOTB9k/4OEXB0B67IDtRfCpF/SDFsRbR3EiYiCPqU1RJ9Afjwd dMPB9AczGrAcv8m8ZhUEA== X-UI-Out-Filterresults: notjunk:1;V03:K0:vbUB+HVSzg0=:yTfq3I34tJIlp+5rF91/Ox 4ilCnsvuk0e5j8neDRYMh8tfVSBu7R5hlgJg9n5YQ5IyWW8O5FFlxK/OUQnbYPX9txvdxbVsi 8wAladsQjf7K4Yzfyc32cNpxAr4gDRINYMjXVBaDyEnngERY9FA0pgS9bteghGATOG0DHmn49 Mn5XuojuwRpNoru2zg4YiFo0C4ZyLFBJ38jTDwHbGFQBVvkbAx9mqh3cVdoO3j0lALiSqjpmr a7QfYdtypLtIS29jwnfm2FqJNbzx9UUtWi/5/4bYSRVbgAP2uCVAPWo7ZF+vsSJA6RLS9r6xg 2Z7p6u8skkxQ4AfK3RWTVeTaXsPCOxgb7EoladpB1sm7ngq3xQQH7GGRxl/VQn4+pupLlHi2q NmRtv30pkjcE6zb9loy8/bh6OnKVP7D60l4cF7+Oj1GwyhccFeo5Ip6sOWJed79GwqusS0Lua 54ODSLaRuo/Gr5TkV8DIui8aGk7CGdR5qFlKDxO+podVabjMd2BxBADHhK914ioxDkNkXoZDI JyzPpo5cWWTfkiEbnT0asMDcF4JMVbmrhdhNfZYsmBSNdVbPKDC9jfZ+FUa5uxIqPn1/W0BOP wH28u+kKoghBF7UPjUP7qf5DvlYDgsqus53yCqefRJqhwKoboBeFo6H+XqGsWcDpcW3AETWJW VwGNxLsN5kvyPw0daJeddsEcc+MBIQn0Y/YCjYqACOrYnxshyi0dyiPNNR862KIU2eix6o6zo TDShP6LKME96L2QBX82hGO9xKlGyFGSRZg8FSNLazEOlV88gJru6eiYqAywZ1cVzREbXqJFfk 2m1+po/dqurxDh5cl78hwSBFcHE5H2Wroc4ikDrzAkxG005HhYVdpt5gghmMLxtwBEBAE+NoS aLyXzo0nqcGB0lvCP33+00uMNW7Z4NCH0lZEUM8D4jRSRd5LU6gN9j8wysmtpkk73SYqq/8LI Q/aHq3WiBDqUcCKqZ5aF+cGsRfqntGkn081D14NZekCYu+qeqiemVXRa4TQwjC4ktCfPUAy2J ICk86dKhw980jFTW55vnW2/Q5vDeGGaEK/EIPQ+96r1ETcjso7HX+UjmEaQEIbe7BqJydyFsR pDR11C3Q7Igy6MMuzBietuxarU9SIHt4QRPB1gyrI0ApxRieift7SL/04OZTLmjmvA50nX00v g9sC4= Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Torsten Bögershausen When unicode filenames (encoded in UTF-8) are used, the visible width on the screen is not the same as strlen(). For example, `git log --stat` may produce an output like this: [snip the header] Arger.txt | 1 + Ärger.txt | 1 + 2 files changed, 2 insertions(+) A side note: the original report was about cyrillic filenames. After some investigations it turned out that a) This is not a problem with "ambiguous characters" in unicode b) The same problem exists for all unicode code points (so we can use Latin based Umlauts for demonstrations below) The 'Ä' takes the same space on the screen as the 'A'. But needs one more byte in memory, so the the `git log --stat` output for "Arger.txt" (!) gets mis-aligned: The maximum length is derived from "Ärger.txt", 10 bytes in memory, 9 positions on the screen. That is why "Arger.txt" gets one extra ' ' for aligment, it needs 9 bytes in memory. If there was a file "Ö", it would be correctly aligned by chance, but "Öhö" would not. The solution is of course, to use utf8_strwidth() instead of strlen() when dealing with the width on screen. And then there is another problem, code like this: strbuf_addf(&out, "%-*s", len, name); (or using the underlying snprintf() function) does not align the buffer to a minimum of len measured in screen-width, but uses the memory count. One could be tempted to wish that snprintf() was UTF-8 aware. That doesn't seem to be the case anywhere (tested on Linux and Mac), probably snprintf() uses the "bytes in memory"/strlen() approach to be compatible with older versions and this will never change. The basic idea is to change code in diff.c like this strbuf_addf(&out, "%-*s", len, name); into something like this: int padding = len - utf8_strwidth(name); if (padding < 0) padding = 0; strbuf_addf(&out, " %s%*s", name, padding, ""); The real change is slighty bigger, as it, as well, integrates two calls of strbuf_addf() into one. Tests: Two things need to be tested: - The calculation of the maximum width - The calculation of padding The name "textfile" is changed into "tëxtfilë", both have a width of 8. If strlen() was used, to get the maximum width, the shorter "binfile" would have been mis-aligned: binfile | [snip] tëxtfilë | [snip] If only "binfile" would be renamed into "binfilë": binfilë | [snip] textfile | [snip] In order to verify that the width is calculated correctly everywhere, "binfile" is renamed into "binfilë", giving 1 bytes more in strlen() "tëxtfile" is renamed into "tëxtfilë", 2 byte more in strlen(). The updated t4012-diff-binary.sh checks the correct aligment: binfilë | [snip] tëxtfilë | [snip] Reported-by: Alexander Meshcheryakov Helped-by: Johannes Schindelin Signed-off-by: Torsten Bögershausen Signed-off-by: Junio C Hamano --- diff.c | 27 ++++++++++++++++----------- t/t4012-diff-binary.sh | 14 +++++++------- 2 files changed, 23 insertions(+), 18 deletions(-) -- 2.34.0 diff --git a/diff.c b/diff.c index 974626a621..35b9da90fe 100644 --- a/diff.c +++ b/diff.c @@ -2620,7 +2620,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options) continue; } fill_print_name(file); - len = strlen(file->print_name); + len = utf8_strwidth(file->print_name); if (max_len < len) max_len = len; @@ -2734,7 +2734,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options) char *name = file->print_name; uintmax_t added = file->added; uintmax_t deleted = file->deleted; - int name_len; + int name_len, padding; if (!file->is_interesting && (added + deleted == 0)) continue; @@ -2743,7 +2743,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options) * "scale" the filename */ len = name_width; - name_len = strlen(name); + name_len = utf8_strwidth(name); if (name_width < name_len) { char *slash; prefix = "..."; @@ -2753,10 +2753,14 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options) if (slash) name = slash; } + padding = len - utf8_strwidth(name); + if (padding < 0) + padding = 0; if (file->is_binary) { - strbuf_addf(&out, " %s%-*s |", prefix, len, name); - strbuf_addf(&out, " %*s", number_width, "Bin"); + strbuf_addf(&out, " %s%s%*s | %*s", + prefix, name, padding, "", + number_width, "Bin"); if (!added && !deleted) { strbuf_addch(&out, '\n'); emit_diff_symbol(options, DIFF_SYMBOL_STATS_LINE, @@ -2776,8 +2780,9 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options) continue; } else if (file->is_unmerged) { - strbuf_addf(&out, " %s%-*s |", prefix, len, name); - strbuf_addstr(&out, " Unmerged\n"); + strbuf_addf(&out, " %s%s%*s | %*s", + prefix, name, padding, "", + number_width, "Unmerged"); emit_diff_symbol(options, DIFF_SYMBOL_STATS_LINE, out.buf, out.len, 0); strbuf_reset(&out); @@ -2803,10 +2808,10 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options) add = total - del; } } - strbuf_addf(&out, " %s%-*s |", prefix, len, name); - strbuf_addf(&out, " %*"PRIuMAX"%s", - number_width, added + deleted, - added + deleted ? " " : ""); + strbuf_addf(&out, " %s%s%*s | %*"PRIuMAX"%s", + prefix, name, padding, "", + number_width, added + deleted, + added + deleted ? " " : ""); show_graph(&out, '+', add, add_c, reset); show_graph(&out, '-', del, del_c, reset); strbuf_addch(&out, '\n'); diff --git a/t/t4012-diff-binary.sh b/t/t4012-diff-binary.sh index c509143c81..c64d9d2f40 100755 --- a/t/t4012-diff-binary.sh +++ b/t/t4012-diff-binary.sh @@ -113,20 +113,20 @@ test_expect_success 'diff --no-index with binary creation' ' ' cat >expect < 1026 bytes - textfile | 10000 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ + binfilë | Bin 0 -> 1026 bytes + tëxtfilë | 10000 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ EOF test_expect_success 'diff --stat with binary files and big change count' ' - printf "\01\00%1024d" 1 >binfile && - git add binfile && + printf "\01\00%1024d" 1 >binfilë && + git add binfilë && i=0 && while test $i -lt 10000; do echo $i && i=$(($i + 1)) || return 1 - done >textfile && - git add textfile && - git diff --cached --stat binfile textfile >output && + done >tëxtfilë && + git add tëxtfilë && + git -c core.quotepath=false diff --cached --stat binfilë tëxtfilë >output && grep " | " output >actual && test_cmp expect actual '