From patchwork Fri Sep 2 04:21:33 2022
X-Patchwork-Submitter: Torsten Bögershausen
X-Patchwork-Id: 12963623
From: tboegi@web.de
To: git@vger.kernel.org, alexander.s.m@gmail.com, Johannes.Schindelin@gmx.de
Cc: Torsten Bögershausen
Subject: [PATCH v3 1/2] diff.c: When appropriate, use utf8_strwidth(), part1
Date: Fri, 2 Sep 2022 06:21:33 +0200
Message-Id: <20220902042133.13883-1-tboegi@web.de>
X-Mailer: git-send-email 2.34.0

From: Torsten Bögershausen

When Unicode filenames (encoded in UTF-8) are used, the visible width
on the screen is not the same as strlen(filename).

For example, `git log --stat` may produce an output like this:

[snip the header]

 Arger.txt  | 1 +
 Ärger.txt | 1 +
 2 files changed, 2 insertions(+)

A side note: the original report was about Cyrillic filenames.
After some investigation it turned out that
a) This is not a problem with "ambiguous characters" in Unicode
b) The same problem exists for all non-ASCII Unicode code points (so we
   can use Latin based Umlauts for demonstrations below)

The 'Ä' takes the same space on the screen as the 'A', but needs one
more byte in memory, so the `git log --stat` output for "Arger.txt" (!)
gets mis-aligned:
The maximum length is derived from "Ärger.txt", 10 bytes in memory,
9 positions on the screen. That is why "Arger.txt", which needs only
9 bytes in memory, gets one extra ' ' for alignment.
If there were a file "Ö", it would be correctly aligned by chance,
but "Öhö" would not.

The solution is, of course, to use utf8_strwidth() instead of strlen()
when dealing with the width on screen.

Side note 1:
The changes needed for this fix are split into 2 commits.
This commit only changes strlen() into utf8_strwidth() in diff.c.
The next commit will add tests and further needed changes.

Side note 2:
Junio C Hamano suspects that there is probably more work to be done,
in a separate commit:
There is code in diff.c::pprint_rename() that "abbreviates" overly long
pathnames and "transforms" rename lines like "a/b/c -> a/B/c" into the
shorter "a/{b->B}/c" form, and IIRC this is all byte-based.

Reported-by: Alexander Meshcheryakov
Signed-off-by: Torsten Bögershausen
---
 diff.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/diff.c b/diff.c
index 974626a621..b5df464de5 100644
--- a/diff.c
+++ b/diff.c
@@ -2620,7 +2620,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			continue;
 		}
 		fill_print_name(file);
-		len = strlen(file->print_name);
+		len = utf8_strwidth(file->print_name);
 		if (max_len < len)
 			max_len = len;
 
@@ -2743,7 +2743,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 		 * "scale" the filename
 		 */
 		len = name_width;
-		name_len = strlen(name);
+		name_len = utf8_strwidth(name);
 		if (name_width < name_len) {
 			char *slash;
 			prefix = "...";
-- 
2.34.0
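
The byte-count versus screen-width distinction described in the commit
message can be illustrated with a small standalone sketch. It is not the
code from diff.c: sketch_width() and sketch_stat_line() are hypothetical
names, and sketch_width() is only a rough stand-in for git's
utf8_strwidth(). It assumes valid UTF-8 and one screen column per code
point, so unlike the real function it does not handle double-width
(e.g. CJK) or zero-width (combining) characters.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for utf8_strwidth(): count UTF-8 code points by
 * skipping continuation bytes (bit pattern 10xxxxxx).
 * Assumption: valid UTF-8, exactly one screen column per code point.
 */
static int sketch_width(const char *s)
{
	int width = 0;

	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			width++;
	return width;
}

/*
 * Format one diffstat-like line into buf, padding the name to max_width
 * screen columns. printf-style "%-*s" padding counts bytes, so the pad
 * is computed by hand from the display width instead.
 */
static void sketch_stat_line(char *buf, size_t size, const char *name,
			     int max_width)
{
	int pad = max_width - sketch_width(name);

	assert(buf && size > 0);
	if (pad < 0)
		pad = 0;
	snprintf(buf, size, " %s%*s | 1 +", name, pad, "");
}
```

For "Ärger.txt" (written in C as "\xc3\x84" "rger.txt"), strlen() returns
10 while sketch_width() returns 9. With max_width set to 9, both
"Arger.txt" and "Ärger.txt" are formatted with a single space before the
'|'. Deriving the maximum from strlen() gives 10 instead, which is where
the extra ' ' after "Arger.txt" comes from.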