[0/1] Hashed mailmap support

Message ID	20201213010539.544101-1-sandals@crustytoothpaste.net (mailing list archive)
Headers	Return-Path: <git-owner@kernel.org>
	From: "brian m. carlson" <sandals@crustytoothpaste.net>
	To: <git@vger.kernel.org>
	Subject: [PATCH 0/1] Hashed mailmap support
	Date: Sun, 13 Dec 2020 01:05:38 +0000
	Message-Id: <20201213010539.544101-1-sandals@crustytoothpaste.net>
	MIME-Version: 1.0
	Content-Transfer-Encoding: 8bit
	Precedence: bulk

brian m. carlson Dec. 13, 2020, 1:05 a.m. UTC

Many people, through the course of their lives, will change either a
name or an email address.  For this reason, we have the mailmap, to map
from a user's former name or email address to their current, canonical
forms.  Normally, this works well as it is.

However, sometimes people change a name (or an email) and want to
completely cease use of the former name or email.  This could be because
a transgender person has transitioned, because a person has left an
abusive partner or broken ties with an abusive family member, or for any
number of other good and valuable reasons.  In these cases, placing the
former name in the .mailmap may be undesirable.

For those situations, let's introduce a hashed mailmap, where the user's
former name or email address can be in the form @sha256:<hash>.  This
obscures the former name or email.
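
As a rough illustration of the proposed form, an entry could be derived
along these lines.  This is only a sketch: the cover letter gives the
@sha256:<hash> shape and nothing more, so the hex encoding, the
lowercasing of the input, and the helper name hashed_form are all
assumptions here and may not match what the patch does.

```python
import hashlib


def hashed_form(value: str) -> str:
    """Return `value` in the @sha256:<hash> form (hypothetical helper)."""
    # Assumption: the hash covers the UTF-8 bytes of the lowercased value
    # and is written as lowercase hex.
    digest = hashlib.sha256(value.lower().encode("utf-8")).hexdigest()
    return "@sha256:" + digest


# Example with a made-up address: the result replaces the former
# address wherever it would otherwise appear in .mailmap.
example_entry = hashed_form("old@example.com")
```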

Note that this is not perfect, because a user can simply look up all the
hashed values and find out the old values.  However, for projects which
wish to adopt the feature, it can be somewhat effective to hash all
existing mailmap entries and include some no-op entries from other
contributors as well, so as to make this process less convenient.

I've spoken to a variety of folks about this, and while we all agree
this design isn't perfect, it is an improvement over the status quo.  It
is obfuscation, not security, and in this case, I think that's fine.
I'm open to hearing ideas about how to improve this design if there are
any.

I welcome feedback on this patch, while encouraging people to be mindful
of our code of conduct.

brian m. carlson (1):
  mailmap: support hashed entries in mailmaps

 mailmap.c          | 39 +++++++++++++++++++++++++++++++++++++--
 t/t4203-mailmap.sh | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 72 insertions(+), 2 deletions(-)

Jeff King Dec. 15, 2020, 1:48 a.m. UTC | #1

On Sun, Dec 13, 2020 at 01:05:38AM +0000, brian m. carlson wrote:

> Note that this is not perfect, because a user can simply look up all the
> hashed values and find out the old values.  However, for projects which
> wish to adopt the feature, it can be somewhat effective to hash all
> existing mailmap entries and include some no-op entries from other
> contributors as well, so as to make this process less convenient.

I remain unconvinced of the value of any noop entries. Ultimately it's
easy to invert a one-way hash that comes from a small known set of
inputs. And that's true whether there are extra noops or not.
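
To make that concrete, here is a minimal sketch of the inversion,
assuming the candidate identities have already been collected (for
example from commit author lines).  The function names and the sample
addresses are made up for illustration, and the hash input format is an
assumption.

```python
import hashlib


def sha256_hex(value: str) -> str:
    # Assumption: entries are hashes of the UTF-8 bytes of the identity.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def invert(hashed_entries, known_idents):
    """Map each hash back to the known identity that produced it, if any."""
    table = {sha256_hex(ident): ident for ident in known_idents}
    return {h: table[h] for h in hashed_entries if h in table}


# Identities an attacker could read out of the history.
known = ["alice@example.com", "bob@example.com", "carol@example.com"]
# One real hashed entry plus one no-op entry that matches nothing known.
entries = [sha256_hex("bob@example.com"), sha256_hex("noop@example.org")]
recovered = invert(entries, known)
```

The no-op entry simply fails to match; it does not make the real entry
any harder to recover.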

The interesting argument IMHO is that somebody has to _bother_ to invert
the hash. So it means that the old and new identities do not show up
next to each other in a file indexed by search engines, etc. That drops
the low-hanging fruit.

And from that argument, I think the obvious question becomes: is it
worth using a real one-way function, as opposed to just obscuring the
raw bytes (which Ævar went into in more detail). I don't have a strong
opinion either way (the obvious one in favor is that it's less expensive
to do so; and something like "git log" will have to either compute a lot
of these hashes, or cache the hash computations internally).

I think somebody also mentioned that there's value in the social
signaling here, and I agree with that. But that is true even for a
reversible encoding, I think.

-Peff

Jeff King Dec. 15, 2020, 2:40 a.m. UTC | #2

On Mon, Dec 14, 2020 at 08:48:14PM -0500, Jeff King wrote:

> On Sun, Dec 13, 2020 at 01:05:38AM +0000, brian m. carlson wrote:
> 
> > Note that this is not perfect, because a user can simply look up all the
> > hashed values and find out the old values.  However, for projects which
> > wish to adopt the feature, it can be somewhat effective to hash all
> > existing mailmap entries and include some no-op entries from other
> > contributors as well, so as to make this process less convenient.
> 
> I remain unconvinced of the value of any noop entries. Ultimately it's
> easy to invert a one-way hash that comes from a small known set of
> inputs. And that's true whether there are extra noops or not.
> 
> The interesting argument IMHO is that somebody has to _bother_ to invert
> the hash. So it means that the old and new identities do not show up
> next to each other in a file indexed by search engines, etc. That drops
> the low-hanging fruit.
> 
> And from that argument, I think the obvious question becomes: is it
> worth using a real one-way function, as opposed to just obscuring the
> raw bytes (which Ævar went into in more detail). I don't have a strong
> opinion either way (the obvious one in favor is that it's less expensive
> to do so; and something like "git log" will have to either compute a lot
> of these hashes, or cache the hash computations internally).
> 
> I think somebody also mentioned that there's value in the social
> signaling here, and I agree with that. But that is true even for a
> reversible encoding, I think.

After re-reading what I wrote, I just wanted to make clear: overall the
feature makes sense to me. I am questioning only the argument for it,
and whether a one-way hash is the right tradeoff there.

-Peff

Phillip Wood Dec. 15, 2020, 11:15 a.m. UTC | #3

Hi Peff, Ævar, Brian

On 15/12/2020 01:48, Jeff King wrote:
> On Sun, Dec 13, 2020 at 01:05:38AM +0000, brian m. carlson wrote:
> 
>> Note that this is not perfect, because a user can simply look up all the
>> hashed values and find out the old values.  However, for projects which
>> wish to adopt the feature, it can be somewhat effective to hash all
>> existing mailmap entries and include some no-op entries from other
>> contributors as well, so as to make this process less convenient.
> 
> I remain unconvinced of the value of any noop entries. Ultimately it's
> easy to invert a one-way hash that comes from a small known set of
> inputs. And that's true whether there are extra noops or not.
> 
> The interesting argument IMHO is that somebody has to _bother_ to invert
> the hash. So it means that the old and new identities do not show up
> next to each other in a file indexed by search engines, etc. That drops
> the low-hanging fruit.
> 
> And from that argument, I think the obvious question becomes: is it
> worth using a real one-way function, as opposed to just obscuring the
> raw bytes (which Ævar went into in more detail). I don't have a strong
> opinion either way (the obvious one in favor is that it's less expensive
> to do so; and something like "git log" will have to either compute a lot
> of these hashes, or cache the hash computations internally).
> 
> I think somebody also mentioned that there's value in the social
> signaling here, and I agree with that. But that is true even for a
> reversible encoding, I think.

From an obscurity point of view one possible advantage of using a
one-way function as opposed to just obscuring the raw bytes with a
reversible encoding is that looking up an old identity requires someone
to have both the .mailmap and the repository, they cannot get the old
identities by just downloading the .mailmap file. (I think this is the
same argument as Ævar makes in favor of a reversible encoding as the
.mailmap file has other uses.)

Best Wishes

Phillip

> -Peff
>

brian m. carlson Dec. 18, 2020, 2:29 a.m. UTC | #4

On 2020-12-15 at 01:48:14, Jeff King wrote:
> On Sun, Dec 13, 2020 at 01:05:38AM +0000, brian m. carlson wrote:
> 
> > Note that this is not perfect, because a user can simply look up all the
> > hashed values and find out the old values.  However, for projects which
> > wish to adopt the feature, it can be somewhat effective to hash all
> > existing mailmap entries and include some no-op entries from other
> > contributors as well, so as to make this process less convenient.
> 
> I remain unconvinced of the value of any noop entries. Ultimately it's
> easy to invert a one-way hash that comes from a small known set of
> inputs. And that's true whether there are extra noops or not.
> 
> The interesting argument IMHO is that somebody has to _bother_ to invert
> the hash. So it means that the old and new identities do not show up
> next to each other in a file indexed by search engines, etc. That drops
> the low-hanging fruit.
> 
> And from that argument, I think the obvious question becomes: is it
> worth using a real one-way function, as opposed to just obscuring the
> raw bytes (which Ævar went into in more detail). I don't have a strong
> opinion either way (the obvious one in favor is that it's less expensive
> to do so; and something like "git log" will have to either compute a lot
> of these hashes, or cache the hash computations internally).

I don't disagree that it's easy to invert.  The question is, is somebody
going to look at a large set of (e.g., a couple hundred) hashed entries
and be able to easily find ones of people they'd like to make life
difficult for or into whose business they'd like to pry, or is it going
to be too inconvenient?  I think base64 makes the job too easy and if it
were me in that situation, I'd prefer a little more effort.

I think there's also the benefit, at least for email addresses, in that
people can map a "private" email address that they used accidentally
into one with more robust filtering without letting bad actors invert it
trivially.  That doesn't mean spammers can't run through the log, but it
does mean that they can't write a simple tool to invert base64 email
addresses they've harvested out of Git repositories.  And we know that
spammers and recruiters (which, in this case, are also spammers) do
indeed scrape repositories via the repository web interfaces.
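
The difference can be sketched as follows, using a made-up address.
Decoding base64 needs nothing beyond the encoded string, whereas a hash
can only be confirmed against a candidate obtained from somewhere else
(such as the log).  The helper name matches is hypothetical.

```python
import base64
import hashlib

addr = "private@example.com"  # made-up address for illustration

# Reversible encoding: the original comes straight back out.
encoded = base64.b64encode(addr.encode("utf-8")).decode("ascii")
decoded = base64.b64decode(encoded).decode("utf-8")

# One-way hash: there is no decode step, only comparison with a guess.
hashed = hashlib.sha256(addr.encode("utf-8")).hexdigest()


def matches(candidate: str) -> bool:
    """Check whether a candidate address produces the stored hash."""
    return hashlib.sha256(candidate.encode("utf-8")).hexdigest() == hashed
```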

And as someone who had to download all 21 GB of the Chromium repository
for testing purposes recently, I can tell you that absent a very
compelling use case, nobody's going to want to download that entire
repository just to extract some personal information, especially since
the git index-pack operation is essentially guaranteed to take at least
7 minutes at maximum speed.  So by hashing, we've guaranteed significant
inconvenience unless you have the repository, whereas that's not the
case for base64.  And making abuse even slightly harder can often deter
a surprising amount of it[0].

So I think I'm firmly in favor of hashing.  If that means my patch needs
to implement caching, then I'll reroll with that change.  I think by
switching to a hash table I may be able to actually improve total
performance overall, at least in some cases.

> I think somebody also mentioned that there's value in the social
> signaling here, and I agree with that. But that is true even for a
> reversible encoding, I think.

That's true, I agree.  And for many projects, that will be sufficient.
If I saw a hashed mailmap entry, I would assume that it was intended to
be private and would respect that.

[0] See, for example, greylisting.

Jeff King Dec. 18, 2020, 5:56 a.m. UTC | #5

On Fri, Dec 18, 2020 at 02:29:45AM +0000, brian m. carlson wrote:

> > And from that argument, I think the obvious question becomes: is it
> > worth using a real one-way function, as opposed to just obscuring the
> > raw bytes (which Ævar went into in more detail). I don't have a strong
> > opinion either way (the obvious one in favor is that it's less expensive
> > to do so; and something like "git log" will have to either compute a lot
> > of these hashes, or cache the hash computations internally).
> [...]
> So I think I'm firmly in favor of hashing.  If that means my patch needs
> to implement caching, then I'll reroll with that change.  I think by
> switching to a hash table I may be able to actually improve total
> performance overall, at least in some cases.

OK. I agree it raises the bar a little bit. Whether that matters or not
depends on your threat model (e.g., casual spammers versus dedicated
information seekers). I don't have a particularly strong opinion on
what's realistic, but I don't mind erring on the side of caution here.

It might be worth making a short argument along those lines in the
commit message.

As far as caching goes, my main concern is mostly that people who are
not using the feature do not pay a performance penalty. So:

  - if the feature is not used in the repository's mailmap, it should
    have zero cost (i.e., we do not bother hashing lookup entries if
    there are no hashed entries in the map)

  - as soon as there is one hashed entry, we need to hash the key for
    every lookup in the map. I'm not sure what the overhead is like. It
    might be negligible. But I think we should confirm that before
    proceeding.

> And as someone who had to download all 21 GB of the Chromium repository
> for testing purposes recently, I can tell you that absent a very
> compelling use case, nobody's going to want to download that entire
> repository just to extract some personal information, especially since
> the git index-pack operation is essentially guaranteed to take at least
> 7 minutes at maximum speed.  So by hashing, we've guaranteed significant
> inconvenience unless you have the repository, whereas that's not the
> case for base64.  And making abuse even slightly harder can often deter
> a surprising amount of it[0].

They just need the objects that have ident lines in them, so:

  $ time git clone --bare --filter=tree:0 https://github.com/chromium/chromium
  Cloning into bare repository 'chromium.git'...
  remote: Enumerating objects: 202, done.
  remote: Counting objects: 100% (202/202), done.
  remote: Compressing objects: 100% (161/161), done.
  remote: Total 1105453 (delta 49), reused 194 (delta 41), pack-reused 1105251
  Receiving objects: 100% (1105453/1105453), 462.14 MiB | 11.13 MiB/s, done.
  Resolving deltas: 100% (99790/99790), done.
  
  real	0m49.304s
  user	0m21.330s
  sys	0m4.727s

gets you there much quicker. I don't think that negates your point about
raising the bar, but my guess is that the threat model of "casual
spammer" would probably be deterred, but "troll who wants to annoy
specific person" would probably not be.

-Peff
