[v2,5/5] mailmap: support hashed entries in mailmaps

Message ID 20210103211849.2691287-6-sandals@crustytoothpaste.net (mailing list archive)
State New, archived
Series Hashed mailmap

Commit Message

brian m. carlson Jan. 3, 2021, 9:18 p.m. UTC
Many people, through the course of their lives, will change either a
name or an email address.  For this reason, we have the mailmap, to map
from a user's former name or email address to their current, canonical
forms.  Normally, this works well as it is.

However, sometimes people change a name or an email address and wish to
wholly disassociate themselves from that former name or email address.
For example, a person may transition from one gender to another,
changing their name, or they may have changed their name to disassociate
themselves from an abusive family or partner.  In such a case, using the
former name or address in any way may be undesirable and the person may
wish to replace it as completely as possible.

For projects which wish to support this, introduce hashed forms into the
mailmap.  These forms, which start with "@sha256:" followed by a SHA-256
hash of the entry, can be used in place of the form used in the commit
field.  This form is intentionally designed to be unlikely to conflict
with legitimate use cases.  For example, this is not a valid email
address according to RFC 5322.  In the unlikely event that a user has
put such a form into the actual commit as their name, we will accept it.
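
As a rough illustration of the hashed form described above, the following Python sketch computes an "@sha256:" value for an old name or email. It assumes the hash is taken over the raw UTF-8 bytes of the entry as it appears in the commit field; the helper name `hashed_form` is hypothetical, and the patch itself remains the authority on exactly which bytes are hashed.

```python
import hashlib

def hashed_form(entry: str) -> str:
    """Return "@sha256:" followed by the lowercase hex SHA-256 of the entry.

    Assumption: the entry is hashed as UTF-8 bytes, with no surrounding
    angle brackets or whitespace.
    """
    return "@sha256:" + hashlib.sha256(entry.encode("utf-8")).hexdigest()

# Hypothetical old address, borrowed from the examples later in the thread.
print(hashed_form("joe@developer.com"))
```

The prefix plus 64 lowercase hex digits contains two colons-free parts but no domain, which is in line with the claim that the form cannot be mistaken for an RFC 5322 address.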

While the form of the data is designed to accept multiple hash
algorithms, we intentionally do not support SHA-1.  There is little
reason to support such a weak algorithm in new use cases and no
backwards compatibility to consider.  Moreover, SHA-256 is faster than
the SHA1DC implementation we use, so this not only improves performance,
but simplifies the current implementation somewhat as well.

Note that it is, of course, possible to perform a lookup on all commit
objects to determine the actual entry which matches the hashed form of
the data.  However, this is an improvement over the status quo.
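
To make this caveat concrete, here is a minimal Python sketch of such a lookup. It assumes the candidate identities have already been collected from the history (for instance from `git log` output) and that entries are hashed as UTF-8 bytes; the function name `recover` is hypothetical.

```python
import hashlib

def recover(hashed: str, candidates):
    """Return the candidate whose SHA-256 matches the hashed form, or None."""
    prefix = "@sha256:"
    if not hashed.startswith(prefix):
        return None
    wanted = hashed[len(prefix):]
    for candidate in candidates:
        # One hash per identity seen in the history is all it takes.
        if hashlib.sha256(candidate.encode("utf-8")).hexdigest() == wanted:
            return candidate
    return None
```

The cost is linear in the number of distinct identities in the repository, so the hashing provides obscurity rather than secrecy.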

The performance of this patch with no hashed entries is very similar to
the performance without this patch.  Considering a git log command to
look up author and committer information on 981,680 commits in the Linux
kernel history, either with an unhashed mailmap or a mailmap with all
old values hashed:

                                   Shortest  Longest  Average  Change
  Git 2.30                         7.876     8.297    8.143
  This patch, unhashed             7.923     8.484    8.237    + 1.15%
  This patch, hashed               14.510    14.783   14.672   +80.17%
  This patch, hashed, unoptimized  15.425    16.318   15.901   +95.27%

Thus, the average performance after this patch is within normal
variation of the pre-patch performance.  It's unlikely that users will
notice the difference in practice, even on much larger
repositories, unless they're using the new feature.

To minimize the performance impact of the hashing process, we maintain a
reference count of each mailmap entry and when we encounter an entry we
must hash, we insert the same object under the unhashed key as well.  We
also keep a count of the number of hashed entries.  This means we must
hash an object at most once and once we've seen all the hashed objects,
we won't hash any more objects.  Times without this optimization are
listed above in the unoptimized entry.
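
The optimization can be sketched as follows. This is a Python model, not the C code in mailmap.c: it uses a dict where the patch uses a sorted list, and the class name, attribute names and the `hash_calls` counter are all hypothetical and exist only to make the behaviour observable.

```python
import hashlib

class HashedMailmap:
    def __init__(self, entries):
        # Maps an old identity (plain, or "@sha256:<hex>") to its canonical form.
        self.map = dict(entries)
        # Hashed entries that have not yet been matched to a real identity.
        self.unresolved = sum(1 for key in self.map if key.startswith("@sha256:"))
        self.hash_calls = 0  # instrumentation for this sketch only

    def lookup(self, old):
        if old in self.map:
            return self.map[old]          # plain entry, or already cached
        if self.unresolved == 0:
            return None                   # every hashed entry has been seen
        self.hash_calls += 1
        key = "@sha256:" + hashlib.sha256(old.encode("utf-8")).hexdigest()
        new = self.map.get(key)
        if new is not None:
            self.map[old] = new           # insert under the unhashed key too
            self.unresolved -= 1
        return new
```

In this model, an identity that matches a hashed entry is hashed only once, and hashing stops entirely as soon as the counter reaches zero. Identities that match nothing are still hashed on each lookup until that point.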

This has the potential to cause a performance problem as we insert items
into a sorted list, but changing the implementation to use a khash map
instead does not result in a significantly faster implementation,
despite the improved insertion speed.  Performance in the unhashed case
is slightly worse, so this approach was not adopted since it provides
few benefits.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Documentation/mailmap.txt | 28 +++++++++++
 mailmap.c                 | 99 ++++++++++++++++++++++++++++++++++-----
 mailmap.h                 |  2 +
 t/t4203-mailmap.sh        | 35 ++++++++++++++
 4 files changed, 152 insertions(+), 12 deletions(-)

Comments

Ævar Arnfjörð Bjarmason Jan. 5, 2021, 2:21 p.m. UTC | #1
On Sun, Jan 03 2021, brian m. carlson wrote:

I think it makes sense to split up 1-4/5 here from 5/5 in this series
since they're really unrelated changes, although due to the changes in
1-4 they'll conflict.

> Many people, through the course of their lives, will change either a
> name or an email address.  For this reason, we have the mailmap, to map
> from a user's former name or email address to their current, canonical
> forms.  Normally, this works well as it is.
>
> However, sometimes people change a name or an email address and wish to
> wholly disassociate themselves from that former name or email address.
> For example, a person may transition from one gender to another,
> changing their name, or they may have changed their name to disassociate
> themselves from an abusive family or partner.  In such a case, using the
> former name or address in any way may be undesirable and the person may
> wish to replace it as completely as possible.

The cover letter noted "As mentioned in the original thread, I think a
hash rather than an encoding is the right choice here.". Reading the v1
I think you're referring to
https://lore.kernel.org/git/X9wUGaR3IXcpV0nT@camp.crustytoothpaste.net/

In v1 I pointed out you needed to read some combination of the cover
letter & the patch to see what this was intended for (see [1]). I think
for v3 the commit itself should summarize the trade-offs & design
choices.

> For projects which wish to support this, introduce hashed forms into the
> mailmap.  These forms, which start with "@sha256:" followed by a SHA-256
> hash of the entry, can be used in place of the form used in the commit
> field.  This form is intentionally designed to be unlikely to conflict
> with legitimate use cases.  For example, this is not a valid email
> address according to RFC 5322.  In the unlikely event that a user has
> put such a form into the actual commit as their name, we will accept it.

We'll emit the commit author information as-is in that case under "git
show", or run the mapping and map it via mailmap? Anyway, it seems
there's a test for this. Probably better to just point to it.

> While the form of the data is designed to accept multiple hash
> algorithms, we intentionally do not support SHA-1.  There is little
> reason to support such a weak algorithm in new use cases and no
> backwards compatibility to consider.  Moreover, SHA-256 is faster than
> the SHA1DC implementation we use, so this not only improves performance,
> but simplifies the current implementation somewhat as well.

I agree with most of this aside from the "weak algorithm" part. That
seems like an irrelevant aside for this specific use of a hashing
algorithm, no? We could even use MD5 here, so SHA256-only is just
setting us up for not needing to deal with SHA1 forever in this one
place in some SHA256 future repo.

> Note that it is, of course, possible to perform a lookup on all commit
> objects to determine the actual entry which matches the hashed form of
> the data.  However, this is an improvement over the status quo.
>
> The performance of this patch with no hashed entries is very similar to
> the performance without this patch.  Considering a git log command to
> look up author and committer information on 981,680 commits in the Linux
> kernel history, either with an unhashed mailmap or a mailmap with all
> old values hashed:
>
>                                    Shortest  Longest  Average  Change
>   Git 2.30                         7.876     8.297    8.143
>   This patch, unhashed             7.923     8.484    8.237    + 1.15%
>   This patch, hashed               14.510    14.783   14.672   +80.17%
>   This patch, hashed, unoptimized  15.425    16.318   15.901   +95.27%
>
> Thus, the average performance after this patch is within normal
> variation of the pre-patch performance.  It's unlikely that users will
> notice the difference in practice, even on much larger
> repositories, unless they're using the new feature.

Am I reading this right that if there's a single hashed entry in
.mailmap anything using %aE or %aN is around 2x as slow?

Your v1 mentioned that a project might "insert entries for many
contributors in order to make discovery of "interesting" entries
significantly less convenient." which is gone in the v2 patch. As noted
in [1] I don't see how it helps the obscurity much, but if that's still
the intended use we'd expect to get more slowdowns in the wild if users
intend to convert their whole mailmap to this form if they want a single
entry to use the form.

Anyway, as you might have guessed I'm still not a fan of this direction.

But most of it is because I honestly don't get why this specific
approach is required to achieve the stated aims. There are a few of
them, so here's an attempt to break them down:

1. User changed their name and doesn't want themselves or others to see
  their old name

For the case where Joe Developer is now known as Jane Doe, in most cases
you don't need to put the old name into the .mailmap at all. E.g. for
git.git this patch to our .mailmap produces the same output for `log
--all --pretty="%h %an%ae%aN%aE"`:
    
     brian m. carlson <sandals@crustytoothpaste.net>
    -brian m. carlson <sandals@crustytoothpaste.net> <sandals@crustytoothpaste.ath.cx>
    -brian m. carlson <sandals@crustytoothpaste.net> <bk2204@github.com>
    +<sandals@crustytoothpaste.net> <sandals@crustytoothpaste.ath.cx>
    +<sandals@crustytoothpaste.net> <bk2204@github.com>

So the new->name/email mapping (as opposed to new->email) is really only
needed for some really obscure cases where two people shared an E-Mail
or something.

So we're talking about hiding the old E-Mail, presumably because it was
joe@ instead of jane@, so in that case we could just support URI
encoding:

    Jane Doe <jane@example.com>
    <jane@example.com> <%6A%6F%65@%64%65%76%65%6C%6F%70%65%72.%63%6F%6D>

Made via:

    $ perl -MURI::Escape=uri_escape -wE 'say uri_escape q[joe@developer.com], "^@."'
    %6A%6F%65@%64%65%76%65%6C%6F%70%65%72.%63%6F%6D

Which also has the nice attribute that people can make it obvious what
part they want to hide, since this is really a feature to enable social
politeness & consideration:

    Jane Doe <jane@example.com>
    # I don't want to be known by my old name, thanks
    <jane@example.com> <%6A%6F%65@developer.com>
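
For readers without Perl at hand, the same escaping can be sketched in Python. The function name `escape_except` is hypothetical; it percent-encodes every byte except the characters listed in `keep`, which mirrors the "^@." argument in the one-liner above.

```python
def escape_except(text: str, keep: str, encoding: str = "utf-8") -> str:
    """Percent-encode every byte of text except ASCII characters in keep."""
    out = []
    for byte in text.encode(encoding):
        ch = chr(byte)
        out.append(ch if byte < 0x80 and ch in keep else f"%{byte:02X}")
    return "".join(out)

# Hide the whole address, keeping only "@" and "." readable.
print(escape_except("joe@developer.com", "@."))
# Hide only the local part.
print(escape_except("joe", "") + "@developer.com")
```

The first call prints the fully escaped address shown above, and the second prints the variant that hides only "joe".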

2. Hiding from your enemies

For the other use-case of "abusive family or partner" I had the comment
in v1 of "but not so much that you'd still take the risk of submitting a
patch to .mailmap?".

Now that's obviously phrased in an off-the-cuff manner, but I'm
serious.

I think it is important that the non-security of this feature obviously
looks like some trivial encoding, because that's what it is. People get
lulled into a false sense of security with these things all the time
(e.g. thinking their "Authorization" HTTP header is safe to post on a
public pastebin). So we should as much as possible make this look like
the non-security it is.

3. Enabling people not to treat .mailmap as binary or a multi-encoding
file.

I mentioned this in my [1]. Your implementation doesn't do this, but
e.g. it would be very nice for a project that switched from latin-1 to
utf-8 to be able to do, in some cases:

    # Made with: perl -MURI::Escape=uri_escape -wE 'say uri_escape "@ARGV", "^a-z@. "' $(echo Ævar Arnfjörð Bjarmason | iconv -f utf-8 -t iso-8859-1)
    #

    Ævar Arnfjörð Bjarmason <avarab@gmail.com> %C6var %41rnfj%F6r%F0 %42jarmason <avarab@gmail.com>

Or some combination thereof, so e.g. previously Big5/latin1 projects
that migrated to UTF-8 don't need to have non-valid UTF-8 in .mailmap.
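
To illustrate the reading side of this idea, here is a small Python sketch of the decoding a mailmap reader would have to do. The escaped syntax is only a proposal in this message and is not implemented by the patch; the point is that the .mailmap line stays pure ASCII while still matching the latin-1 bytes stored in old commits.

```python
from urllib.parse import unquote_to_bytes

# The escaped old name as it would appear in .mailmap under this proposal.
escaped = "%C6var %41rnfj%F6r%F0 %42jarmason"

# The bytes an old latin-1 commit header would actually contain.
raw_in_commit = "Ævar Arnfjörð Bjarmason".encode("iso-8859-1")

decoded = unquote_to_bytes(escaped)
print(decoded == raw_in_commit)
```

The comparison is done on bytes, so no guess about the commit's encoding is needed at lookup time.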

4. Spam

You mentioned this in your [2] (but not as a use-case in the v2
re-rolled commit message):

    And we know that spammers and recruiters (which, in this case, are
    also spammers) do indeed scrape repositories via the repository web
    interfaces.

Surely these people are most interested in the current E-Mail addresses,
which if they're scraping the common web interfaces (e.g. Github,
GitLab) are easily accessible there. It doesn't seem very plausible that
someone would care enough to scrape .mailmap for old addresses but not
just update their scraper to clone & run "git log" for the purposes of
e.g. their recruitment E-Mails.

5. Interaction with other systems

Something I mentioned in the last 3 paragraphs of my [1]. I think you're
only considering the cases where git itself does the mailmap
translation, but we have 3rd party systems that make use of the format
in good ways (also doing the Joe->Jane mapping). Making it a hassle for
those systems makes it more likely that Jane doesn't get the mapping she
wants.


1. https://lore.kernel.org/git/87eejswql6.fsf@evledraar.gmail.com/
2. https://lore.kernel.org/git/X9wUGaR3IXcpV0nT@camp.crustytoothpaste.net/
Junio C Hamano Jan. 5, 2021, 8:05 p.m. UTC | #2
"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> For example, a person may transition from one gender to another,
> changing their name, or they may have changed their name to disassociate
> themselves from an abusive family or partner.  In such a case, using the
> former name or address in any way may be undesirable and the person may
> wish to replace it as completely as possible.

I am not sure if we want to even mention the "for example" here.

These are certainly all legitimate reasons to want this feature, but
after reading the "for example", lack of a corresponding negative
statement (e.g. sometimes people also change their name or address
to hide their bad behaviour in the past that is associated with
these names) needlessly stood out and made me wonder if we need to
somehow defend the feature with "...but we do not mean to abet
people in hiding their past bad behaviour with this mechanism".  I'd
prefer that we not be forced to defend the mechanism if we do not have to.

> Note that it is, of course, possible to perform a lookup on all commit
> objects to determine the actual entry which matches the hashed form of
> the data.  However, this is an improvement over the status quo.

There were suggestions to use reversible encoding, IIRC, just for
obscurity.  I do not have a strong preference either way myself, but
because such an approach would give the same improvement over the
status quo, would be simpler, more performant and most importantly,
it makes it clear that this is not serious security but casual
obscurity, I'd want to be convinced why we want to use a hash here
a bit more strongly.

> +In addition to specifying a former name or email literally, it is also possible
> +to specify it in a hashed form, which consists of the string `@sha256:`,
> +followed by an all-lowercase SHA-256 hash of the entry in hexadecimal.  For
> +example, to take the example above, instead of specifying the replacement for
> +"Some Dude" as such, you could specify one of these lines:
> ...

> +SHA-1 is not accepted as a hash algorithm in mailmaps.

Does this need to be said?  After all, we won't take @md5: or
@blake2: or anything other than @sha256: in this version (and
probably any foreseeable versions).  Unless we offer a way to plug-in
algos of projects' choice, that is, and at that point, "SHA-1 is not
accepted" is a statement too strong for us to make.
brian m. carlson Jan. 6, 2021, 12:24 a.m. UTC | #3
On 2021-01-05 at 14:21:40, Ævar Arnfjörð Bjarmason wrote:
> 
> On Sun, Jan 03 2021, brian m. carlson wrote:
> 
> I think it makes sense to split up 1-4/5 here from 5/5 in this series
> since they're really unrelated changes, although due to the changes in
> 1-4 they'll conflict.

Okay, I'll drop them.

> In v1 I pointed out you needed to read some combination of the cover
> letter & the patch to see what this was intended for (see [1]). I think
> for v3 the commit itself should summarize the trade-offs & design
> choices.

I can do that.  It's a very long commit message anyway, but if you think
it would be better in the commit message, I can add it.

> > For projects which wish to support this, introduce hashed forms into the
> > mailmap.  These forms, which start with "@sha256:" followed by a SHA-256
> > hash of the entry, can be used in place of the form used in the commit
> > field.  This form is intentionally designed to be unlikely to conflict
> > with legitimate use cases.  For example, this is not a valid email
> > address according to RFC 5322.  In the unlikely event that a user has
> > put such a form into the actual commit as their name, we will accept it.
> 
> We'll emit the commit author information as-is in that case under "git
> show", or run the mapping and map it via mailmap? Anyway, it seems
> there's a test for this. Probably better to just point to it.

It will be handled correctly via the mailmap code, in which case we'll
make a no-op transformation.  If the user is not using the mailmap, then
it will be handled trivially.

> > While the form of the data is designed to accept multiple hash
> > algorithms, we intentionally do not support SHA-1.  There is little
> > reason to support such a weak algorithm in new use cases and no
> > backwards compatibility to consider.  Moreover, SHA-256 is faster than
> > the SHA1DC implementation we use, so this not only improves performance,
> > but simplifies the current implementation somewhat as well.
> 
> I agree with most of this aside from the "weak algorithm" part. That
> seems like an irrelevant aside for this specific use of a hashing
> algorithm, no? We could even use MD5 here, so SHA256-only is just
> setting us up for not needing to deal with SHA1 forever in this one
> place in some SHA256 future repo.

One should avoid the use of weak algorithms when possible even if they
are not being used in a way that makes them weak because it incentivizes
others to use them, often in a way that is insecure.  I had a
conversation with a junior candidate during an interview who said they
used SHA-1 in a particular case "because Git uses it."  That's why I
mentioned it.

> > Note that it is, of course, possible to perform a lookup on all commit
> > objects to determine the actual entry which matches the hashed form of
> > the data.  However, this is an improvement over the status quo.
> >
> > The performance of this patch with no hashed entries is very similar to
> > the performance without this patch.  Considering a git log command to
> > look up author and committer information on 981,680 commits in the Linux
> > kernel history, either with an unhashed mailmap or a mailmap with all
> > old values hashed:
> >
> >                                    Shortest  Longest  Average  Change
> >   Git 2.30                         7.876     8.297    8.143
> >   This patch, unhashed             7.923     8.484    8.237    + 1.15%
> >   This patch, hashed               14.510    14.783   14.672   +80.17%
> >   This patch, hashed, unoptimized  15.425    16.318   15.901   +95.27%
> >
> > Thus, the average performance after this patch is within normal
> > variation of the pre-patch performance.  It's unlikely that users will
> > notice the difference in practice, even on much larger
> > repositories, unless they're using the new feature.
> 
> Am I reading this right that if there's a single hashed entry in
> .mailmap anything using %aE or %aN is around 2x as slow?

No, that's not the case.  As soon as we see every hashed entry, we will
stop hashing new entries.  Linux is not necessarily the best case for
this because it has a long history with many one-off contributors long
ago in the history.

I'll explain that further in the commit message and add some more
metrics.

> Your v1 mentioned that a project might "insert entries for many
> contributors in order to make discovery of "interesting" entries
> significantly less convenient." which is gone in the v2 patch. As noted
> in [1] I don't see how it helps the obscurity much, but if that's still
> the intended use we'd expect to get more slowdowns in the wild if users
> intend to convert their whole mailmap to this form if they want a single
> entry to use the form.

Peff objected to that text, so I removed it.

As mentioned above, it depends on who you put in the mailmap.  If
they're the most recent 50 contributors, it'll probably be pretty cheap.
If you put the oldest contributors in there and they've not sent any
recent commits, it will be more expensive.

> Anyway, as you might have guessed I'm still not a fan of this direction.

I've got that impression pretty strongly.

I do want to point out that generally I'm pretty willing to change
approaches and do things differently.  I've completely redone a decent
number of patches in the past in response to feedback on the list.

I'm not changing the approach here because, as mentioned below, I don't
think that just encoding meets the use cases I'm targeting here.  So I
have heard your suggestions and to be clear, I do value your input on
this (and on other topics), it's just that I disagree that such a change
is one I should make.

> So the new->name/email mapping (as opposed to new->email) is really only
> needed for some really obscure cases where two people shared an E-Mail
> or something.

That's unlikely, but it does happen.  That's why we have it.

> So we're talking about hiding the old E-Mail, presumably because it was
> joe@ instead of jane@, so in that case we could just support URI
> encoding:
> 
>     Jane Doe <jane@example.com>
>     <jane@example.com> <%6A%6F%65@%64%65%76%65%6C%6F%70%65%72.%63%6F%6D>
> 
> Made via:
> 
>     $ perl -MURI::Escape=uri_escape -wE 'say uri_escape q[joe@developer.com], "^@."'
>     %6A%6F%65@%64%65%76%65%6C%6F%70%65%72.%63%6F%6D
> 
> Which also has the nice attribute that people can make it obvious what
> part they want to hide, since this is really a feature to enable social
> politeness & consideration:
> 
>     Jane Doe <jane@example.com>
>     # I don't want to be known by my old name, thanks
>     <jane@example.com> <%6A%6F%65@developer.com>

I don't think this feature is going to get used if we just encode names
or email addresses.  In the United States, when someone transitions,
they get a court order to change their name.  I don't think a lot of
corporate environments are going to want to just encode an old name or
email address in a trivially invertible way given that.  This is
typically a topic handled with some sensitivity in most companies.

I will tell you that I would not just use an encoded version if I were
changing my name for any of the reasons I've mentioned.  That wouldn't
cut it for me, and I wouldn't use such a feature.  The feature I'm
implementing is a feature I've talked with trans folks about, and that's
why I'm implementing this as it is.  The response I got was essentially,
"It's not everything I want, but it's an improvement."

If the decision is that we want to go with encoding instead of hashing,
then I'll drop this patch.  I'm not going to put my name or sign-off on
that because I don't think it meets the need I'm addressing here.

The entire problem, of course, is that we bake a human's personal name
and email address immutably into a Merkle tree.  We know full well that
people do change their names and email addresses all the time (e.g.,
marriage, job changes), and yet we have this design.  In retrospect, we
should have done something different, but hindsight is 20/20 and I'm
just trying to do the best we can with what we've got.

> 2. Hiding from your enemies
> 
> For the other use-case of "abusive family or partner" I had the comment
> in v1 of "but not so much that you'd still take the risk of submitting a
> patch to .mailmap?".

No, my use case isn't "hiding from an abusive family or partner".  It's
"I'm finally free of that **** and I never want to hear their name
again."  (I've known people in this situation.)  Also, the similar use
case of, "my family member, with whom I share an uncommon name, murdered
someone, which I obviously found abhorrent, and I would like to not be
associated with them when my name is Googled."  And yes, I knew an
acquaintance many years ago whose family member murdered someone.

In other words, the person changed their name to disassociate
themselves, not to hide from their abuser.

> 4. Spam
> 
> You mentioned this in your [2] (but not as a use-case in the v2
> re-rolled commit message):
> 
>     And we know that spammers and recruiters (which, in this case, are
>     also spammers) do indeed scrape repositories via the repository web
>     interfaces.
> 
> Surely these people are most interested in the current E-Mail addresses,
> which if they're scraping the common web interfaces (e.g. Github,
> GitLab) are easily accessible there. It doesn't seem very plausible that
> someone would care enough to scrape .mailmap for old addresses but not
> just update their scraper to clone & run "git log" for the purposes of
> e.g. their recruitment E-Mails.

Unless the user is using the GitHub-provided noreply address or a
similar address, which is common.  This allows people to map all of
their old addresses to such an address, which, judging from
StackOverflow, is a thing people want to do.

I can tell you from dealing with abuse that raising the bar even the
tiniest bit is very significant to stopping it.  Most recruiters are not
developers and they and spammers don't have Git installed.  They're
going to rely on Googling or other public search functionality, and this
makes that harder.

Greylisting is exactly raising the bar the tiniest amount and it's
extraordinarily effective.

> 5. Interaction with other systems
> 
> Something I mentioned in the last 3 paragraphs of my [1]. I think you're
> only considering the cases where git itself does the mailmap
> translation, but we have 3rd party systems that make use of the format
> in good ways (also doing the Joe->Jane mapping). Making it a hassle for
> those systems makes it more likely that Jane doesn't get the mapping she
> wants.

This is an argument for never changing the format.  Sometimes things
change, and I don't want to avoid making a change because other
implementations haven't implemented it yet.  Under that approach, we'd
never have the SHA-256 work.
brian m. carlson Jan. 6, 2021, 12:28 a.m. UTC | #4
On 2021-01-05 at 20:05:22, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> 
> > For example, a person may transition from one gender to another,
> > changing their name, or they may have changed their name to disassociate
> > themselves from an abusive family or partner.  In such a case, using the
> > former name or address in any way may be undesirable and the person may
> > wish to replace it as completely as possible.
> 
> I am not sure if we want to even mention the "for example" here.
> 
> These are certainly all legitimate reasons to want this feature, but
> after reading the "for example", lack of a corresponding negative
> statement (e.g. sometimes people also change their name or address
> to hide their bad behaviour in the past that is associated with
> these names) needlessly stood out and made me wonder if we need to
> somehow defend the feature with "...but we do not mean to abet
> people in hiding their past bad behaviour with this mechanism".  I'd
> prefer that we not be forced to defend the mechanism if we do not have to.

I added it because I imagine the use cases for this feature aren't
immediately obvious to a lot of people and the general rule is that
commit messages explain why we would implement such a feature.  If you'd
prefer I drop it and leave it up to the imagination (or to the list
archives), I can do that.

> > +SHA-1 is not accepted as a hash algorithm in mailmaps.
> 
> Does this need to be said?  After all, we won't take @md5: or
> @blake2: or anything other than @sha256: in this version (and
> probably any foreseeable versions).  Unless we offer a way to plug-in
> algos of projects' choice, that is, and at that point, "SHA-1 is not
> accepted" is a statement too strong for us to make.

I'll drop that line.
Junio C Hamano Jan. 6, 2021, 1:50 a.m. UTC | #5
"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> I added it because I imagine the use cases for this feature aren't
> immediately obvious to a lot of people and the general rule is that
> commit messages explain why we would implement such a feature.

Yeah, I understand that.

>> > +SHA-1 is not accepted as a hash algorithm in mailmaps.
>>  ...
> I'll drop that line.

Thanks.
Ævar Arnfjörð Bjarmason Jan. 10, 2021, 7:24 p.m. UTC | #6
On Wed, Jan 06 2021, brian m. carlson wrote:

> On 2021-01-05 at 14:21:40, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Sun, Jan 03 2021, brian m. carlson wrote:
>> 
>> I think it makes sense to split up 1-4/5 here from 5/5 in this series
>> since they're really unrelated changes, although due to the changes in
>> 1-4 they'll conflict.
>
> Okay, I'll drop them.

Not replying to most of this E-Mail because I think there's nothing left
to add / you clarified things for me in those cases / we respectfully
disagree / any outstanding points we can pick up in your re-roll /
whatever :)

>> So we're talking about hiding the old E-Mail, presumably because it was
>> joe@ instead of jane@, so in that case we could just support URI
>> encoding:
>> 
>>     Jane Doe <jane@example.com>
>>     <jane@example.com> <%6A%6F%65@%64%65%76%65%6C%6F%70%65%72.%63%6F%6D>
>> 
>> Made via:
>> 
>>     $ perl -MURI::Escape=uri_escape -wE 'say uri_escape q[joe@developer.com], "^@."'
>>     %6A%6F%65@%64%65%76%65%6C%6F%70%65%72.%63%6F%6D
>> 
>> Which also has the nice attribute that people can make it obvious what
>> part they want to hide, since this is really a feature to enable social
>> politeness & consideration:
>> 
>>     Jane Doe <jane@example.com>
>>     # I don't want to be known by my old name, thanks
>>     <jane@example.com> <%6A%6F%65@developer.com>
>
> I don't think this feature is going to get used if we just encode names
> or email addresses.  In the United States, when someone transitions,
> they get a court order to change their name.  I don't think a lot of
> corporate environments are going to want to just encode an old name or
> email address in a trivially invertible way given that.  This is
> typically a topic handled with some sensitivity in most companies.
>
> I will tell you that I would not just use an encoded version if I were
> changing my name for any of the reasons I've mentioned.  That wouldn't
> cut it for me, and I wouldn't use such a feature.  The feature I'm
> implementing is a feature I've talked with trans folks about, and that's
> why I'm implementing this as it is.  The response I got was essentially,
> "It's not everything I want, but it's an improvement."
>
> If the decision is that we want to go with encoding instead of hashing,
> then I'll drop this patch.  I'm not going to put my name or sign-off on
> that because I don't think it meets the need I'm addressing here.
>
> The entire problem, of course, is that we bake a human's personal name
> and email address immutably into a Merkle tree.  We know full well that
> people do change their names and email addresses all the time (e.g.,
> marriage, job changes), and yet we have this design.  In retrospect, we
> should have done something different, but hindsight is 20/20 and I'm
> just trying to do the best we can with what we've got.

Doesn't the difference in some sense boil down to either an implicit
promise or an implicit assumption that the hashed version is forever
going to be protected by some security-through-obscurity/inconvenience
when it comes to git.git & its default tooling?

And would those users be as comfortable with the difference between
encoded vs. hashed if e.g. "git check-mailmap" learned to read the
.mailmap and search-replace all the hashed versions with their
materialized values, or if popular tools like Emacs learned to view a
Git .mailmap in a "needs translation" mode similar to *.gpg and *.gz?
How about if popular web views of Git served up that materialized
"check-mailmap" output by default?

I don't think it's implausible that we'll get any of these as follow-up
patches. I might even submit some at some point, not out of spite, but
just because I don't want to maintain out-of-tree code for an
out-of-tree program that understands a Git .mailmap today, but where I'd
need to search-replace the hashed versions.

Ditto it being very likely that popular editors or web viewers will gain
support for this, just because it's tedious to manually hash &
copy/paste & validate values.

In looking at some of the fsck code recently & having some
yet-unsubmitted patches, I thought of trying to combine it with
mailmap. I.e. it seems like a natural feature for fsck to gain to warn
you about unused mailmap entries, just like it can warn about
unreachable/dangling objects. After all, these are really just sort-of
pointers into our Merkle tree. Spewing out all the mappings seems like
an obvious addition to that, e.g. in spewing out an
"optimized/non-redundant" (plain or hashed) mailmap to re-commit.

That's the main reason I'm uncomfortable with this approach, because it
seems to me to implicitly rely on things that are tedious now, but which
the march of history all but inevitably should make trivial if we were
to integrate it. Unless we're *also* promising to forever intentionally
(and artificially) keep it inconvenient.

E.g. the example of how long it takes to clone & extract this info from
chromium.git in the v1 thread.

It seems like a fair assumption that we'll have some future version of
git where you can ask a remote server about that sort of thing in
milliseconds.

Not because of this hashed .mailmap thing in particular, just as an
emergent effect that it's happy to serve up things it knows about the
DAG from having walked & cached it in general. E.g. info from the
commit-graph, what hash is contained in what ref, or how one value (such
as a .mailmap entry) maps to another etc.
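
To make the concern above concrete, here is a minimal sketch of how little
code the "materialize the hashed values" step might take. It is not part of
the patch or of git; `sha256_hex` and `materialize_hashed` are hypothetical
helper names, and the sketch assumes either `sha256sum` or `shasum` is
installed.

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical and not part of git.

# Print the SHA-256 hex digest of stdin using whichever tool is available.
sha256_hex() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum | cut -d' ' -f1
    elif command -v shasum >/dev/null 2>&1; then
        shasum -a 256 | cut -d' ' -f1
    else
        echo "no SHA-256 tool found" >&2
        return 1
    fi
}

# Read candidate names or emails, one per line, on stdin.  For each
# candidate whose "@sha256:<hex>" form occurs in the mailmap file "$1",
# print the hashed form and the literal value, separated by a tab.
materialize_hashed() {
    while IFS= read -r candidate; do
        # printf '%s' hashes the literal value without a trailing newline.
        hashed="@sha256:$(printf '%s' "$candidate" | sha256_hex)"
        if grep -qF -- "$hashed" "$1"; then
            printf '%s\t%s\n' "$hashed" "$candidate"
        fi
    done
}
```

The candidates could come from the history itself, for example
`git log --all --format='%an%n%ae%n%cn%n%ce' | sort -u | materialize_hashed .mailmap`,
which is the "iterate through all authors and committers" step that the
patch's own documentation warns about.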
brian m. carlson Jan. 10, 2021, 9:26 p.m. UTC | #7
On 2021-01-10 at 19:24:34, Ævar Arnfjörð Bjarmason wrote:
> Doesn't the difference in some sense boil down to either an implicit
> promise or an implicit assumption that the hashed version is forever
> going to be protected by some security-through-obscurity/inconvenience
> when it comes to git.git & its default tooling?
> 
> And would those users be as comfortable with the difference between
> encoded vs. hashed if e.g. "git check-mailmap" learned to read the
> .mailmap and search-replace all the hashed versions with their
> materialized values, or if popular tools like Emacs learned to view a
> Git .mailmap in a "needs translation" mode similar to *.gpg and *.gz?
> How about if popular web views of Git served up that materialized
> "check-mailmap" output by default?
> 
> I don't think it's implausible that we'll get any of these as follow-up
> patches. I might even submit some at some point, not out of spite, but
> just because I don't want to maintain out-of-tree code for an
> out-of-tree program that understands a Git .mailmap today, but where I'd
> need to search-replace the hashed versions.

Yes, I think we do rely on this being inconvenient.  If you plan to
submit such a patch, I'm going to let this series drop.

Patch

diff --git a/Documentation/mailmap.txt b/Documentation/mailmap.txt
index 4a8c276529..b21194bf3e 100644
--- a/Documentation/mailmap.txt
+++ b/Documentation/mailmap.txt
@@ -73,3 +73,31 @@  Santa Claus <santa.claus@northpole.xx> <me@company.xx>
 
 Use hash '#' for comments that are either on their own line, or after
 the email address.
+
+In addition to specifying a former name or email literally, it is also possible
+to specify it in a hashed form, which consists of the string `@sha256:`,
+followed by an all-lowercase SHA-256 hash of the entry in hexadecimal.  For
+example, to take the example above, instead of specifying the replacement for
+"Some Dude" as such, you could specify one of these lines:
+
+------------
+Some Dude <some@dude.xx> nick1 <@sha256:bee4fdd8c5e2e85009c8ae231d5a395adb24d5a597f2b75489926460680b8ce1>
+Some Dude <some@dude.xx> @sha256:56030827e2765e8878c94c4cc43f5410b22f3b8c2b1ef8f631ac3953f8299279 <bugs@company.xx>
+Some Dude <some@dude.xx> @sha256:56030827e2765e8878c94c4cc43f5410b22f3b8c2b1ef8f631ac3953f8299279 <@sha256:bee4fdd8c5e2e85009c8ae231d5a395adb24d5a597f2b75489926460680b8ce1>
+------------
+
+The hash is computed over the literal name or email, without any trailing newline.
+For example, you can compute the values above like so, using the Perl `shasum`
+command (or a similar command of your choice):
+
+------------
+$ printf '%s' bugs@company.xx | shasum -a 256
+bee4fdd8c5e2e85009c8ae231d5a395adb24d5a597f2b75489926460680b8ce1  -
+------------
+
+SHA-1 is not accepted as a hash algorithm in mailmaps.
+
+Using the hashed form may be desirable to obscure one's former name or email,
+but be aware that it is just obfuscation: it's still possible for someone with
+access to the repository to iterate through all authors and committers and map
+the hashed values to unhashed ones.
diff --git a/mailmap.c b/mailmap.c
index 5c52dbb7e0..ed401bb1e4 100644
--- a/mailmap.c
+++ b/mailmap.c
@@ -18,6 +18,8 @@  const char *git_mailmap_blob;
 struct mailmap_info {
 	char *name;
 	char *email;
+
+	unsigned refcount;
 };
 
 struct mailmap_entry {
@@ -25,6 +27,10 @@  struct mailmap_entry {
 	char *name;
 	char *email;
 
+	unsigned refcount;
+	unsigned hashed_count;
+	unsigned hashed_seen;
+
 	/* name and email for the complex mail and name matching case */
 	struct string_list namemap;
 };
@@ -32,6 +38,9 @@  struct mailmap_entry {
 static void free_mailmap_info(void *p, const char *s)
 {
 	struct mailmap_info *mi = (struct mailmap_info *)p;
+	if (--mi->refcount)
+		return;
+
 	debug_mm("mailmap: -- complex: '%s' -> '%s' <%s>\n",
 		 s, debug_str(mi->name), debug_str(mi->email));
 	free(mi->name);
@@ -41,6 +50,9 @@  static void free_mailmap_info(void *p, const char *s)
 static void free_mailmap_entry(void *p, const char *s)
 {
 	struct mailmap_entry *me = (struct mailmap_entry *)p;
+	if (--me->refcount)
+		return;
+
 	debug_mm("mailmap: removing entries for <%s>, with %d sub-entries\n",
 		 s, me->namemap.nr);
 	debug_mm("mailmap: - simple: '%s' <%s>\n",
@@ -82,10 +94,17 @@  static char *lowercase_email(char *s)
 	return s;
 }
 
-static void add_mapping(struct string_list *map,
+static int is_hashed(const char *s)
+{
+	const char *prefix = "@sha256:";
+	return strncmp(s, prefix, strlen(prefix)) == 0;
+}
+
+static void add_mapping(struct mailmap *mailmap,
 			char *new_name, char *new_email,
 			char *old_name, char *old_email)
 {
+	struct string_list *map = mailmap->mailmap;
 	struct mailmap_entry *me;
 	struct string_list_item *item;
 
@@ -95,7 +114,10 @@  static void add_mapping(struct string_list *map,
 		old_email = new_email;
 		new_email = NULL;
 	} else {
-		lowercase_email(old_email);
+		if (is_hashed(old_email))
+			mailmap->hashed_count++;
+		else
+			lowercase_email(old_email);
 	}
 
 	item = string_list_insert(map, old_email);
@@ -105,6 +127,7 @@  static void add_mapping(struct string_list *map,
 		me = xcalloc(1, sizeof(struct mailmap_entry));
 		me->namemap.strdup_strings = 1;
 		me->namemap.cmp = namemap_cmp;
+		me->refcount = 1;
 		item->util = me;
 	}
 
@@ -125,6 +148,9 @@  static void add_mapping(struct string_list *map,
 		debug_mm("mailmap: adding (complex) entry for '%s'\n", old_email);
 		mi->name = xstrdup_or_null(new_name);
 		mi->email = xstrdup_or_null(new_email);
+		mi->refcount = 1;
+		if (is_hashed(old_name))
+			me->hashed_count++;
 		string_list_insert(&me->namemap, old_name)->util = mi;
 	}
 
@@ -162,7 +188,7 @@  static char *parse_name_and_email(char *buffer, char **name,
 	return (*right == '\0' ? NULL : right);
 }
 
-static void read_mailmap_line(struct string_list *map, char *buffer,
+static void read_mailmap_line(struct mailmap *map, char *buffer,
 			      char **repo_abbrev)
 {
 	char *name1 = NULL, *email1 = NULL, *name2 = NULL, *email2 = NULL;
@@ -194,7 +220,7 @@  static void read_mailmap_line(struct string_list *map, char *buffer,
 		add_mapping(map, name1, email1, name2, email2);
 }
 
-static int read_mailmap_file(struct string_list *map, const char *filename,
+static int read_mailmap_file(struct mailmap *map, const char *filename,
 			     char **repo_abbrev)
 {
 	char buffer[1024];
@@ -216,7 +242,7 @@  static int read_mailmap_file(struct string_list *map, const char *filename,
 	return 0;
 }
 
-static void read_mailmap_string(struct string_list *map, char *buf,
+static void read_mailmap_string(struct mailmap *map, char *buf,
 				char **repo_abbrev)
 {
 	while (*buf) {
@@ -230,7 +256,7 @@  static void read_mailmap_string(struct string_list *map, char *buf,
 	}
 }
 
-static int read_mailmap_blob(struct string_list *map,
+static int read_mailmap_blob(struct mailmap *map,
 			     const char *name,
 			     char **repo_abbrev)
 {
@@ -269,10 +295,10 @@  int read_mailmap(struct mailmap *mailmap, char **repo_abbrev)
 	if (!git_mailmap_blob && is_bare_repository())
 		git_mailmap_blob = "HEAD:.mailmap";
 
-	err |= read_mailmap_file(map, ".mailmap", repo_abbrev);
+	err |= read_mailmap_file(mailmap, ".mailmap", repo_abbrev);
 	if (startup_info->have_repository)
-		err |= read_mailmap_blob(map, git_mailmap_blob, repo_abbrev);
-	err |= read_mailmap_file(map, git_mailmap_file, repo_abbrev);
+		err |= read_mailmap_blob(mailmap, git_mailmap_blob, repo_abbrev);
+	err |= read_mailmap_file(mailmap, git_mailmap_file, repo_abbrev);
 	return err;
 }
 
@@ -282,7 +308,7 @@  void clear_mailmap(struct mailmap *mailmap)
 	debug_mm("mailmap: clearing %d entries...\n", map->nr);
 	map->strdup_strings = 1;
 	string_list_clear_func(map, free_mailmap_entry);
-	string_list_clear(map, 1);
+	string_list_clear(map, 0);
 	free(map);
 	debug_mm("mailmap: cleared\n");
 }
@@ -338,6 +364,55 @@  static struct string_list_item *lookup_prefix(struct string_list *map,
 	return NULL;
 }
 
+/*
+ * Convert an email or name into a hashed form for comparison.  The hashed form
+ * will be created in the form
+ * @sha256:c68b7a430ac8dee9676ec77a387194e23f234d024e03d844050cf6c01775c8f6,
+ * which would be the hashed form for "doe@example.com".
+ */
+static char *hashed_form(struct strbuf *buf, const struct git_hash_algo *algop, const char *key, size_t keylen)
+{
+	git_hash_ctx ctx;
+	unsigned char hashbuf[GIT_MAX_RAWSZ];
+	char hexbuf[GIT_MAX_HEXSZ + 1];
+
+	algop->init_fn(&ctx);
+	algop->update_fn(&ctx, key, keylen);
+	algop->final_fn(hashbuf, &ctx);
+	hash_to_hex_algop_r(hexbuf, hashbuf, algop);
+
+	strbuf_addf(buf, "@%s:%s", algop->name, hexbuf);
+	return buf->buf;
+}
+
+static struct string_list_item *lookup_one(struct string_list *map,
+					   const char *string, size_t len,
+					   unsigned hashed_count,
+					   unsigned *hashed_seen)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct string_list_item *item = lookup_prefix(map, string, len);
+	if (item || !hashed_count || hashed_count == *hashed_seen)
+		return item;
+
+	hashed_form(&buf, &hash_algos[GIT_HASH_SHA256], string, len);
+	item = lookup_prefix(map, buf.buf, buf.len);
+	if (item) {
+		struct mailmap_info *mi = (struct mailmap_info *)item->util;
+		char *s = xstrndup(string, len);
+		map->strdup_strings = 0;
+		item = string_list_insert(map, s);
+		map->strdup_strings = 1;
+		if (!item->util) {
+			item->util = mi;
+			mi->refcount++;
+			(*hashed_seen)++;
+		}
+	}
+	strbuf_release(&buf);
+	return item;
+}
+
 int map_user(struct mailmap *map,
 	     const char **email, size_t *emaillen,
 	     const char **name, size_t *namelen)
@@ -350,7 +425,7 @@  int map_user(struct mailmap *map,
 		 (int)*namelen, debug_str(*name),
 		 (int)*emaillen, debug_str(*email));
 
-	item = lookup_prefix(map->mailmap, searchable_email, *emaillen);
+	item = lookup_one(map->mailmap, searchable_email, *emaillen, map->hashed_count, &map->hashed_seen);
 	free(searchable_email);
 	if (item != NULL) {
 		me = (struct mailmap_entry *)item->util;
@@ -361,7 +436,7 @@  int map_user(struct mailmap *map,
 			 * simple entry.
 			 */
 			struct string_list_item *subitem;
-			subitem = lookup_prefix(&me->namemap, *name, *namelen);
+			subitem = lookup_one(&me->namemap, *name, *namelen, me->hashed_count, &me->hashed_seen);
 			if (subitem)
 				item = subitem;
 		}
diff --git a/mailmap.h b/mailmap.h
index 4cdce3b064..69f8be5705 100644
--- a/mailmap.h
+++ b/mailmap.h
@@ -5,6 +5,8 @@ 
 
 struct mailmap {
 	struct string_list *mailmap;
+	unsigned hashed_count;
+	unsigned hashed_seen;
 };
 
 int read_mailmap(struct mailmap *map, char **repo_abbrev);
diff --git a/t/t4203-mailmap.sh b/t/t4203-mailmap.sh
index df4a0e03cc..004b4a3d40 100755
--- a/t/t4203-mailmap.sh
+++ b/t/t4203-mailmap.sh
@@ -62,6 +62,41 @@  test_expect_success 'check-mailmap --stdin arguments' '
 	test_cmp expect actual
 '
 
+test_expect_success 'hashed mailmap' '
+	test_config mailmap.file ./hashed &&
+	hashed_author_name="@sha256:$(printf "%s" "$GIT_AUTHOR_NAME" | test-tool sha256)" &&
+	hashed_author_email="@sha256:$(printf "%s" "$GIT_AUTHOR_EMAIL" | test-tool sha256)" &&
+	cat >expect <<-EOF &&
+	$GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL>
+	EOF
+
+	cat >hashed <<-EOF &&
+	$GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $hashed_author_name <$GIT_AUTHOR_EMAIL>
+	EOF
+	git check-mailmap "$GIT_AUTHOR_NAME <$GIT_AUTHOR_EMAIL>" >actual &&
+	test_cmp expect actual &&
+
+	cat >hashed <<-EOF &&
+	Wrong <wrong@example.org> $GIT_AUTHOR_NAME <$GIT_AUTHOR_EMAIL>
+	$GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $hashed_author_name <$GIT_AUTHOR_EMAIL>
+	EOF
+	# Check that we prefer literal matches over hashed names.
+	git check-mailmap "$hashed_author_name <$GIT_AUTHOR_EMAIL>" >actual &&
+	test_cmp expect actual &&
+
+	cat >hashed <<-EOF &&
+	$GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $hashed_author_name <$hashed_author_email>
+	EOF
+	git check-mailmap "$GIT_AUTHOR_NAME <$GIT_AUTHOR_EMAIL>" >actual &&
+	test_cmp expect actual &&
+
+	cat >hashed <<-EOF &&
+	$GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> <$hashed_author_email>
+	EOF
+	git check-mailmap "$GIT_AUTHOR_NAME <$GIT_AUTHOR_EMAIL>" >actual &&
+	test_cmp expect actual
+'
+
 test_expect_success 'check-mailmap bogus contact' '
 	test_must_fail git check-mailmap bogus
 '
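
For readers who want to try the format described in the documentation hunk
above, the following is a minimal sketch of producing a hashed mailmap line
by hand. The helper names `sha256_hex`, `hashed_form`, and
`hashed_mailmap_line` are hypothetical and not part of git or of this patch;
the sketch assumes `sha256sum` or `shasum` is available.

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical and not part of git.

# Print the SHA-256 hex digest of stdin using whichever tool is available.
sha256_hex() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum | cut -d' ' -f1
    elif command -v shasum >/dev/null 2>&1; then
        shasum -a 256 | cut -d' ' -f1
    else
        echo "no SHA-256 tool found" >&2
        return 1
    fi
}

# Print the "@sha256:<hex>" form of a literal name or email.
# printf '%s' keeps a trailing newline out of the hashed input.
hashed_form() {
    printf '@sha256:%s\n' "$(printf '%s' "$1" | sha256_hex)"
}

# Print a mailmap line that maps a hashed old identity to the new one.
# Arguments: new name, new email, old name, old email.
hashed_mailmap_line() {
    printf '%s <%s> %s <%s>\n' \
        "$1" "$2" "$(hashed_form "$3")" "$(hashed_form "$4")"
}
```

Running `hashed_form bugs@company.xx` can be compared against the value
given in the documentation hunk, and
`hashed_mailmap_line "Some Dude" some@dude.xx nick1 bugs@company.xx`
prints a line in the shape of the third documentation example, with both
the old name and the old email hashed.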