[v2,5/5] mailmap: support hashed entries in mailmaps

Many people, through the course of their lives, will change either a
name or an email address.  For this reason, we have the mailmap, to map
from a user's former name or email address to their current, canonical
forms.  Normally, this works well as it is.

However, sometimes people change a name or an email address and wish to
wholly disassociate themselves from that former name or email address.
For example, a person may transition from one gender to another,
changing their name, or they may have changed their name to disassociate
themselves from an abusive family or partner.  In such a case, using the
former name or address in any way may be undesirable and the person may
wish to replace it as completely as possible.

For projects which wish to support this, introduce hashed forms into the
mailmap.  These forms, which start with "@sha256:" followed by a SHA-256
hash of the entry, can be used in place of the form used in the commit
field.  This form is intentionally designed to be unlikely to conflict
with legitimate use cases.  For example, this is not a valid email
address according to RFC 5322.  In the unlikely event that a user has
put such a form into the actual commit as their name, we will accept it.
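
As an illustration only (not part of the patch), a hashed form could be
produced with a few lines of Python.  The helper name hashed_form is
hypothetical, and the sketch assumes the hash is computed over the UTF-8
bytes of the exact name or address as it appears in the commit and is
written as lowercase hex; Documentation/mailmap.txt in this series is the
authority on the actual encoding.

```python
import hashlib

def hashed_form(value: str) -> str:
    """Return "@sha256:" plus the hex SHA-256 of value (assumed encoding)."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return "@sha256:" + digest

# Hypothetical former address; the printed string would stand in for it
# in the mailmap.
print(hashed_form("old@example.com"))
```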

While the form of the data is designed to accept multiple hash
algorithms, we intentionally do not support SHA-1.  There is little
reason to support such a weak algorithm in new use cases and no
backwards compatibility to consider.  Moreover, SHA-256 is faster than
the SHA1DC implementation we use, so omitting SHA-1 not only improves
performance but also simplifies the implementation somewhat.

Note that it is, of course, possible to perform a lookup on all commit
objects to determine the actual entry which matches the hashed form of
the data.  However, keeping the former name or address out of the
mailmap in plain text is still an improvement over the status quo.
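
To make this limitation concrete, the following sketch shows how anyone
with access to the history could match a hashed entry against the
identities found in commits.  It is an illustration, not code from the
patch: the function recover and the sample identities are hypothetical,
and it assumes the hashed form is "@sha256:" followed by the lowercase
hex SHA-256 of the UTF-8 value.

```python
import hashlib

def recover(hashed_entry, candidates):
    """Return the candidate whose hashed form equals hashed_entry, or None."""
    prefix = "@sha256:"
    if not hashed_entry.startswith(prefix):
        return None
    wanted = hashed_entry[len(prefix):]
    for value in candidates:
        if hashlib.sha256(value.encode("utf-8")).hexdigest() == wanted:
            return value
    return None

# Identities as they might be collected from author and committer fields.
seen_in_history = ["a@example.com", "old@example.com", "b@example.com"]
entry = "@sha256:" + hashlib.sha256(b"old@example.com").hexdigest()
print(recover(entry, seen_in_history))
```

The cost is one hash per distinct identity in the history, so hashing
keeps the former value out of plain sight rather than making it secret.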

The performance of this patch with no hashed entries is very similar to
the performance without this patch.  Consider a git log command that
looks up author and committer information on 981,680 commits in the
Linux kernel history, either with an unhashed mailmap or with a mailmap
in which all old values are hashed:

                                   Shortest  Longest  Average  Change
  Git 2.30                         7.876     8.297    8.143
  This patch, unhashed             7.923     8.484    8.237    + 1.15%
  This patch, hashed               14.510    14.783   14.672   +80.17%
  This patch, hashed, unoptimized  15.425    16.318   15.901   +95.27%

Thus, with an unhashed mailmap, the average performance after this patch
is within normal variation of the pre-patch performance.  Users are
unlikely to notice the difference in practice, even on much larger
repositories, unless they're using the new feature.

To minimize the performance impact of the hashing process, we maintain a
reference count for each mailmap entry, and when we encounter an entry
we must hash, we insert the same object under the unhashed key as well.
We also keep a count of the number of hashed entries.  This means we
must hash an object at most once and, once we've seen all the hashed
objects, we won't hash any more objects.  Times without this
optimization are listed above in the "hashed, unoptimized" row.
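
The caching strategy can be sketched as follows.  This is a simplified
Python model, not the C implementation in mailmap.c: the class HashedMap
and its fields are hypothetical, the hashed key is assumed to be
"@sha256:" plus the lowercase hex SHA-256 of the UTF-8 value, and
Python's garbage collector stands in for the explicit reference count
that sharing one entry under two keys requires in C.

```python
import hashlib

class HashedMap:
    """Toy lookup table illustrating the caching strategy described above."""

    def __init__(self, entries):
        # entries maps a key (plain or "@sha256:<hex>") to a canonical value.
        self.table = dict(entries)
        # Number of hashed keys not yet matched to a plain key.
        self.unresolved = sum(k.startswith("@sha256:") for k in self.table)
        self.hash_calls = 0  # instrumentation for the example only

    def lookup(self, key):
        if key in self.table:
            return self.table[key]
        if self.unresolved == 0:
            return None  # all hashed entries are cached under plain keys
        self.hash_calls += 1
        hashed = "@sha256:" + hashlib.sha256(key.encode("utf-8")).hexdigest()
        value = self.table.get(hashed)
        if value is not None:
            # Store the same value under the plain key so that this
            # identity never needs to be hashed again.
            self.table[key] = value
            self.unresolved -= 1
        return value

old = "old@example.com"
hashed_key = "@sha256:" + hashlib.sha256(old.encode("utf-8")).hexdigest()
m = HashedMap({hashed_key: "new@example.com"})
m.lookup(old)                  # hashes once, then caches under the plain key
m.lookup(old)                  # served from the cache
m.lookup("other@example.com")  # no hashing: nothing is left unresolved
print(m.hash_calls)
```

In this sketch a key that matches no entry is hashed again on every
lookup until the unresolved count reaches zero; misses are not cached.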

This has the potential to cause a performance problem as we insert items
into a sorted list, but changing the implementation to use a khash map
instead does not result in a significantly faster implementation,
despite the improved insertion speed.  Performance in the unhashed case
is slightly worse, so this approach was not adopted since it provides
few benefits.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Documentation/mailmap.txt | 28 +++++++++++
 mailmap.c                 | 99 ++++++++++++++++++++++++++++++++++-----
 mailmap.h                 |  2 +
 t/t4203-mailmap.sh        | 35 ++++++++++++++
 4 files changed, 152 insertions(+), 12 deletions(-)

Message ID	20210103211849.2691287-6-sandals@crustytoothpaste.net (mailing list archive)
State	New, archived
Series	Hashed mailmap
  [v2,0/5] Hashed mailmap
  [v2,1/5] mailmap: add a function to inspect the number of entries
  [v2,2/5] mailmap: switch to opaque struct
  [v2,3/5] t4203: add failing test for case-sensitive local-parts and names
  [v2,4/5] mailmap: use case-sensitive comparisons for local-parts and names
  [v2,5/5] mailmap: support hashed entries in mailmaps
