[v2,0/5] Hashed mailmap

Message ID: 20210103211849.2691287-1-sandals@crustytoothpaste.net

brian m. carlson Jan. 3, 2021, 9:18 p.m. UTC
Many people, through the course of their lives, will change either a
name or an email address.  For this reason, we have the mailmap, to map
from a user's former name or email address to their current, canonical
forms.  Normally, this works well as it is.

However, sometimes people change a name (or an email) and want to
completely cease use of the former name or email.  This could be because
a transgender person has transitioned, because a person has left an
abusive partner or broken ties with an abusive family member, or for any
number of other good and valuable reasons.  In these cases, placing the
former name in the .mailmap may be undesirable.

For those situations, let's introduce a hashed mailmap, where the user's
former name or email address can be in the form @sha256:<hash>.  This
obscures the former name or email.
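
As a rough illustration of the @sha256:<hash> form, here is a small
Python sketch that computes such a value for a former name or email.  It
assumes the hash is the lowercase hexadecimal SHA-256 of the UTF-8 bytes
of the exact string, with no trailing newline; the function name
hashed_mailmap_key is made up for this example, and the documentation
added in the final patch is the authority on the exact input.

```python
import hashlib


def hashed_mailmap_key(value: str) -> str:
    """Return `value` in the @sha256:<hex> form (hypothetical helper).

    Assumption: the digest covers the UTF-8 bytes of the string exactly as
    given, with no trailing newline and no case folding.
    """
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return "@sha256:" + digest


# Example: the value that would stand in for a former email address.
print(hashed_mailmap_key("old@example.com"))
```

Because the input is hashed byte for byte, a former name or email that
differs only in case or whitespace produces a completely different
value, so the hashed string must match what is recorded in history.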

In the course of experimenting with some solutions for v2, I noticed
that our mailmap support has a bunch of problems with case sensitivity.
Notably, it treats local-parts of email addresses case insensitively,
even though the RFC specifically says that they are case sensitive.  It
also treats names case insensitively, but only for ASCII characters.
Both of those have been fixed here, and the commit messages explain in
lurid detail why this is the correct behavior, even though it is
incompatible.
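
To make the case-sensitivity rule concrete, here is a minimal Python
sketch of the comparison.  It is not the code from mailmap.c: the helper
names are invented for illustration, and it assumes that the domain part
continues to be compared case insensitively while the local-part and
the name are compared exactly.

```python
from typing import Tuple


def split_email(email: str) -> Tuple[str, str]:
    """Split at the last '@' into (local-part, domain)."""
    local, sep, domain = email.rpartition("@")
    if not sep:
        # No '@' at all: treat the whole string as the local-part.
        return email, ""
    return local, domain


def emails_match(a: str, b: str) -> bool:
    """Local-parts must match exactly; domains match ignoring case.

    Assumption: domain handling stays case insensitive.
    """
    local_a, domain_a = split_email(a)
    local_b, domain_b = split_email(b)
    return local_a == local_b and domain_a.lower() == domain_b.lower()


def names_match(a: str, b: str) -> bool:
    """Names are compared exactly, with no case folding."""
    return a == b
```

Under these assumptions, "Alice@Example.COM" and "Alice@example.com"
refer to the same address, while "alice@example.com" and
"Alice@example.com" do not.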

I've also added some performance numbers and explained some alternate
solutions in the commit message for the final patch.  That's in addition
to the performance improvements I've made so that the feature is both
cheaper for users and nearly invisible for non-users.  That isn't quite
the same as adding a perf test, which I haven't done, but I think this
explains the situation quite well.  If folks are still dying for a perf
test, I can add one in v3.

I will point out that fully hashing a mailmap isn't necessarily cheap,
but how expensive it is depends on the weighting of current and former
members of the project.  As mentioned in the original thread, I think a
hash rather than an encoding is the right choice here.  It is likely
that in a few iterations of hardware, all users will have accelerated
SHA-256 and the cost will end up being a handful of cycles per name
overall.

Changes from v1:
* Fix case-sensitivity problems in the mailmap.
* Add documentation.
* Add explanation of how to compute the value.
* Add some optimizations to improve performance.
* Improve commit message to discuss performance numbers and explain
  rationale better.

brian m. carlson (5):
  mailmap: add a function to inspect the number of entries
  mailmap: switch to opaque struct
  t4203: add failing test for case-sensitive local-parts and names
  mailmap: use case-sensitive comparisons for local-parts and names
  mailmap: support hashed entries in mailmaps

 Documentation/mailmap.txt |  28 ++++++++
 builtin/blame.c           |   2 +-
 builtin/check-mailmap.c   |   4 +-
 builtin/commit.c          |   2 +-
 mailmap.c                 | 139 +++++++++++++++++++++++++++++++++-----
 mailmap.h                 |  15 ++--
 pretty.c                  |   4 +-
 pretty.h                  |   2 +-
 revision.c                |   2 +-
 revision.h                |   3 +-
 shortlog.h                |   3 +-
 t/t4203-mailmap.sh        |  64 +++++++++++++++++-
 12 files changed, 236 insertions(+), 32 deletions(-)