Message ID | 20210103211849.2691287-6-sandals@crustytoothpaste.net (mailing list archive) |
---|---|
State | New, archived |
Series | Hashed mailmap |
On Sun, Jan 03 2021, brian m. carlson wrote:

I think it makes sense to split up 1-4/5 here from 5/5 in this series since they're really unrelated changes, although due to the changes in 1-4 they'll conflict.

> Many people, through the course of their lives, will change either a
> name or an email address. For this reason, we have the mailmap, to map
> from a user's former name or email address to their current, canonical
> forms. Normally, this works well as it is.
>
> However, sometimes people change a name or an email address and wish to
> wholly disassociate themselves from that former name or email address.
> For example, a person may transition from one gender to another,
> changing their name, or they may have changed their name to disassociate
> themselves from an abusive family or partner. In such a case, using the
> former name or address in any way may be undesirable and the person may
> wish to replace it as completely as possible.

The cover letter noted "As mentioned in the original thread, I think a hash rather than an encoding is the right choice here.". Reading the v1 I think you're referring to https://lore.kernel.org/git/X9wUGaR3IXcpV0nT@camp.crustytoothpaste.net/

In v1 I pointed out you needed to read some combination of the cover letter & the patch to see what this was intended for (see [1]). I think for v3 the commit itself should summarize the trade-offs & design choices.

> For projects which wish to support this, introduce hashed forms into the
> mailmap. These forms, which start with "@sha256:" followed by a SHA-256
> hash of the entry, can be used in place of the form used in the commit
> field. This form is intentionally designed to be unlikely to conflict
> with legitimate use cases. For example, this is not a valid email
> address according to RFC 5322. In the unlikely event that a user has
> put such a form into the actual commit as their name, we will accept it.

We'll emit the commit author information as-is in that case under "git show", or run the mapping and map it via mailmap? Anyway, it seems there's a test for this. Probably better to just point to it.

> While the form of the data is designed to accept multiple hash
> algorithms, we intentionally do not support SHA-1. There is little
> reason to support such a weak algorithm in new use cases and no
> backwards compatibility to consider. Moreover, SHA-256 is faster than
> the SHA1DC implementation we use, so this not only improves performance,
> but simplifies the current implementation somewhat as well.

I agree with most of this aside from the "weak algorithm" part. That seems like an irrelevant aside for this specific use of a hashing algorithm, no? We could even use MD5 here, so SHA256-only is just setting us up for not needing to deal with SHA1 forever in this one place in some SHA256 future repo.

> Note that it is, of course, possible to perform a lookup on all commit
> objects to determine the actual entry which matches the hashed form of
> the data. However, this is an improvement over the status quo.
>
> The performance of this patch with no hashed entries is very similar to
> the performance without this patch. Considering a git log command to
> look up author and committer information on 981,680 commits in the Linux
> kernel history, either with an unhashed mailmap or a mailmap with all
> old values hashed:
>
>                                   Shortest  Longest  Average  Change
> Git 2.30                             7.876    8.297    8.143
> This patch, unhashed                 7.923    8.484    8.237  + 1.15%
> This patch, hashed                  14.510   14.783   14.672  +80.17%
> This patch, hashed, unoptimized     15.425   16.318   15.901  +95.27%
>
> Thus, the average performance after this patch is within normal
> variation of the pre-patch performance. It's unlikely that users will
> notice the difference in practice, even on much larger
> repositories, unless they're using the new feature.

Am I reading this right that if there's a single hashed entry in .mailmap anything using %aE or %aN is around 2x as slow?

Your v1 mentioned that a project might "insert entries for many contributors in order to make discovery of "interesting" entries significantly less convenient." which is gone in the v2 patch. As noted in [1] I don't see how it helps the obscurity much, but if that's still the intended use we'd expect to get more slowdowns in the wild if users intend to convert their whole mailmap to this form if they want a single entry to use the form.

Anyway, as you might have guessed I'm still not a fan of this direction. But most of it is because I honestly don't get why this specific approach is required to achieve the stated aims, there's a few of them, so here's an attempt to break them down:

1. User changed their name and doesn't want themselves or others to see their old name

For the case where Joe Developer is now known as Jane Doe in most cases you don't need to put the old name at all into the .mailmap. E.g. for git.git this patch to our .mailmap produces the same output for `log --all --pretty="%h %an%ae%aN%aE"`:

     brian m. carlson <sandals@crustytoothpaste.net>
    -brian m. carlson <sandals@crustytoothpaste.net> <sandals@crustytoothpaste.ath.cx>
    -brian m. carlson <sandals@crustytoothpaste.net> <bk2204@github.com>
    +<sandals@crustytoothpaste.net> <sandals@crustytoothpaste.ath.cx>
    +<sandals@crustytoothpaste.net> <bk2204@github.com>

So the new->name/email mapping (as opposed to new->email) is really only needed for some really obscure cases where two people shared an E-Mail or something.

So we're talking about hiding the old E-Mail, presumably because it was joe@ instead of jane@, so in that case we could just support URI encoding:

    Jane Doe <jane@example.com>
    <jane@example.com> <%6A%6F%65@%64%65%76%65%6C%6F%70%65%72.%63%6F%6D>

Made via:

    $ perl -MURI::Escape=uri_escape -wE 'say uri_escape q[joe@developer.com], "^@."'
    %6A%6F%65@%64%65%76%65%6C%6F%70%65%72.%63%6F%6D

Which also has the nice attribute that people can make it obvious what part they want to hide, since this is really a feature to enable social politeness & consideration:

    Jane Doe <jane@example.com>
    # I don't want to be known by my old name, thanks
    <jane@example.com> <%6A%6F%65@developer.com>

2. Hiding from your enemies

For the other use-case of "abusive family or partner" I had the comment in v1 of "but not so much that you'd still take the risk of submitting a patch to .mailmap?". Now that's obviously phrased in an off-the-cuff manner, but I'm serious.

I think it is important that the non-security of this feature obviously looks like some trivial encoding, because that's what it is. People get lulled into a false sense of security with these things all the time (e.g. thinking their "Authorization" HTTP header is safe to post on a public pastebin). So we should as much as possible make this look like the non-security it is.

3. Enabling people not to treat .mailmap as binary or a multi-encoding file.

I mentioned this in my [1]. Your implementation doesn't do this, but e.g. it would be very nice for a project that switched from latin-1 to utf-8 to be able to do, in some cases:

    # Made with: perl -MURI::Escape=uri_escape -wE 'say uri_escape "@ARGV", "^a-z@. "' $(echo Ævar Arnfjörð Bjarmason | iconv -f utf-8 -t iso-8859-1)
    # Ævar Arnfjörð Bjarmason <avarab@gmail.com>
    %C6var %41rnfj%F6r%F0 %42jarmason <avarab@gmail.com>

Or some combination thereof, so e.g. previously Big5/latin1 who migrated to UTF-8 don't need to have non-valid UTF-8 in .mailmap

4. Spam

You mentioned this in your [2] (but not as a use-case in the v2 re-rolled commit message):

    And we know that spammers and recruiters (which, in this case, are
    also spammers) do indeed scrape repositories via the repository web
    interfaces.

Surely these people are most interested in the current E-Mail addresses, which if they're scraping the common web interfaces (e.g. Github, GitLab) are easily accessible there. It doesn't seem very plausible that someone would care enough to scrape .mailmap for old addresses but not just update their scraper to clone & run "git log" for the purposes of e.g. their recruitment E-Mails.

5. Interaction with other systems

Something I mentioned in the last 3 paragraphs of my [1]. I think you're only considering the cases where git itself does the mailmap translation, but we have 3rd party systems that make use of the format in good ways (also doing the Joe->Jane mapping). Making it a hassle for those systems makes it more likely that Jane doesn't get the mapping she wants.

1. https://lore.kernel.org/git/87eejswql6.fsf@evledraar.gmail.com/
2. https://lore.kernel.org/git/X9wUGaR3IXcpV0nT@camp.crustytoothpaste.net/
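
For readers who want to try the percent-encoding idea from the message above without Perl, here is a minimal Python sketch. It is not part of the patch or of Git. The helper name "obscure" is hypothetical, and it mirrors the uri_escape(..., "^@.") call by escaping every byte except "@" and ".". One assumption differs from use case 3 above: this sketch encodes the UTF-8 bytes of the input, not latin-1.

```python
from urllib.parse import unquote


def obscure(value, keep="@."):
    """Percent-encode every byte of value except the characters in keep."""
    out = []
    for ch in value:
        if ch in keep:
            out.append(ch)
        else:
            # Assumption: encode as UTF-8 so non-ASCII names round-trip too.
            out.extend("%{:02X}".format(byte) for byte in ch.encode("utf-8"))
    return "".join(out)


encoded = obscure("joe@developer.com")
print(encoded)           # %6A%6F%65@%64%65%76%65%6C%6F%70%65%72.%63%6F%6D
print(unquote(encoded))  # joe@developer.com
```

The second print shows why this is only an encoding: any standard URL-decoding routine recovers the original address.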
"brian m. carlson" <sandals@crustytoothpaste.net> writes: > For example, a person may transition from one gender to another, > changing their name, or they may have changed their name to disassociate > themselves from an abusive family or partner. In such a case, using the > former name or address in any way may be undesirable and the person may > wish to replace it as completely as possible. I am not sure if we want to even mention the "for example" here. These are certainly all legitimate reasons to want this feature, but after reading the "for example", lack of a corresponding negative statement (e.g. sometimes people also change their name or address to hide their bad behaviour in the past that is associated with these names) needlessly stood out and made me wonder if we need to somehow defend the feature with "...but we do not mean to abet people in hiding their past bad behaviour with this mechanism". I'd prefer us not forced to defend the mechanism if we did not have to. > Note that it is, of course, possible to perform a lookup on all commit > objects to determine the actual entry which matches the hashed form of > the data. However, this is an improvement over the status quo. There were suggestions to use reversible encoding, IIRC, just for obscurity. I do not have a strong preference either way myself, but because such an approach would give the same improvement over the status quo, would be simpler, more performant and most importantly, it makes it clear that this is not serious security but casual obscurity, I'd want to be convinced why we want to use a hash here a bit more strongly. > +In addition to specifying a former name or email literally, it is also possible > +to specify it in a hashed form, which consists of the string `@sha256:`, > +followed by an all-lowercase SHA-256 hash of the entry in hexadecimal. 
For > +example, to take the example above, instead of specifying the replacement for > +"Some Dude" as such, you could specify one of these lines: > ... > +SHA-1 is not accepted as a hash algorithm in mailmaps. Is this needed to be said? After all, we won't take @md5: or @blake2: or anything other than @sha256: in this version (and probably any foreseeable versions). Unless we offer a way to plug-in algos of projects' choice, that is, and at that point, "SHA-1 is not accepted" is a statement too strong for us to make.
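
The hashed form under discussion can be reproduced outside of Git. The following is a minimal Python sketch, assuming (as the quoted documentation says) that the hash covers the literal name or email with no trailing newline, and additionally assuming the value is UTF-8 encoded. The function name "hashed_form" is hypothetical and nothing here is Git's own code.

```python
import hashlib


def hashed_form(value):
    """Return the "@sha256:<hex>" form of a literal name or email."""
    # hexdigest() already yields lowercase hexadecimal.
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return "@sha256:" + digest


# Same input as the documentation's example:
#   printf '%s' bugs@company.xx | shasum -a 256
print(hashed_form("bugs@company.xx"))
```

In a .mailmap line a hashed email would still go inside angle brackets, as in the patch's examples, while a hashed name is written bare.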
On 2021-01-05 at 14:21:40, Ævar Arnfjörð Bjarmason wrote: > > On Sun, Jan 03 2021, brian m. carlson wrote: > > I think it makes sense to split up 1-4/5 here from 5/5 in this series > since they're really unrelated changes, although due to the changes in > 1-4 they'll conflict. Okay, I'll drop them. > In v1 I pointed out you needed to read some combination of the cover > letter & the patch to see what this was intended for (see [1]). I think > for v3 the commit itself should summarize the trade-offs & design > choices. I can do that. It's a very long commit message anyway, but if you think it would be better in the commit message, I can add it. > > For projects which wish to support this, introduce hashed forms into the > > mailmap. These forms, which start with "@sha256:" followed by a SHA-256 > > hash of the entry, can be used in place of the form used in the commit > > field. This form is intentionally designed to be unlikely to conflict > > with legitimate use cases. For example, this is not a valid email > > address according to RFC 5322. In the unlikely event that a user has > > put such a form into the actual commit as their name, we will accept it. > > We'll emit the commit author information as-is in that case under "git > show", or run the mapping and map it via mailmap? Anyway, it seems > there's a test for this. Probably better to just point to it. It will be handled correctly via the mailmap code, in which case we'll make a no-op transformation. If the user is not using the mailmap, then it will be handled trivially. > > While the form of the data is designed to accept multiple hash > > algorithms, we intentionally do not support SHA-1. There is little > > reason to support such a weak algorithm in new use cases and no > > backwards compatibility to consider. Moreover, SHA-256 is faster than > > the SHA1DC implementation we use, so this not only improves performance, > > but simplifies the current implementation somewhat as well. 
> > I agree with most of this aside from the "weak algorithm" part. That > seems like an irrelevant aside for this specific use of a hashing > algorithm, no? We could even use MD5 here, so SHA256-only is just > setting is up for not needing to deal with SHA1 forever in this one > place in some SHA256 future repo. One should avoid the use of weak algorithms when possible even if they are not being used in a way that makes them weak because it incentivizes others to use them, often in a way that is insecure. I had a conversation with a junior candidate during an interview who said they used SHA-1 in a particular case "because Git uses it." That's why I mentioned it. > > Note that it is, of course, possible to perform a lookup on all commit > > objects to determine the actual entry which matches the hashed form of > > the data. However, this is an improvement over the status quo. > > > > The performance of this patch with no hashed entries is very similar to > > the performance without this patch. Considering a git log command to > > look up author and committer information on 981,680 commits in the Linux > > kernel history, either with an unhashed mailmap or a mailmap with all > > old values hashed: > > > > Shortest Longest Average Change > > Git 2.30 7.876 8.297 8.143 > > This patch, unhashed 7.923 8.484 8.237 + 1.15% > > This patch, hashed 14.510 14.783 14.672 +80.17% > > This patch, hashed, unoptimized 15.425 16.318 15.901 +95.27% > > > > Thus, the average performance after this patch is within normal > > variation of the pre-patch performance. It's unlikely that users will > > notice the difference in practice, even on much larger > > repositories, unless they're using the new feature. > > Am I reading this right that if there's a single hashed entry in > .mailmap anything using %aE or %aN is around 2x as slow? No, that's not the case. As soon as we see every hashed entry, we will stop hashing new entries. 
Linux is not necessarily the best case for this because it has a long history with many one-off contributors long ago in the history. I'll explain that further in the commit message and add some more metrics. > Your v1 mentioned that a project might "insert entries for many > contributors in order to make discovery of "interesting" entries > significantly less convenient." which is gone in the v2 patch. As noted > in [1] I don't see how it helps the obscurity much, but if that's still > the intended use we'd expect to get more slowdowns in the wild if users > intend to convert their whole mailmap to this form if they want a single > entry to use the form. Peff objected to that text, so I removed it. As mentioned above, it depends on who you put in the mailmap. If they're the most recent 50 contributors, it'll probably be pretty cheap. If you put the oldest contributors in there and they've not sent any recent commits, it will be more expensive. > Anyway, as you might have guessed I'm still not a fan of this direction. I've got that impression pretty strongly. I do want to point out that generally I'm pretty willing to change approaches and do things differently. I've completely redone a decent number of patches in the past in response to feedback on the list. I'm not changing the approach here because, as mentioned below, I don't think that just encoding meets the use cases I'm targeting here. So I have heard your suggestions and to be clear, I do value your input on this (and on other topics), it's just that I disagree that such a change is one I should make. > So the new->name/email mapping (as opposed to new->email) is really only > needed for some really obscure cases where two people shared an E-Mail > or something. That's unlikely, but it does happen. That's why we have it. 
> So we're talking about hiding the old E-Mail, presumably because it was > joe@ intsead of jane@, so in that case we could just support URI > encoding: > > Jane Doe <jane@example.com> > <jane@example.com> <%6A%6F%65@%64%65%76%65%6C%6F%70%65%72.%63%6F%6D> > > Made via: > > $ perl -MURI::Escape=uri_escape -wE 'say uri_escape q[joe@developer.com], "^@."' > %6A%6F%65@%64%65%76%65%6C%6F%70%65%72.%63%6F%6D > > Which also has the nice attribute that people can make it obvious what > part they want to hide, since this is really a feature to enable social > politeness & consideration: > > Jane Doe <jane@example.com> > # I don't want to be known by my old name, thanks > <jane@example.com> <%6A%6F%65@developer.com> I don't think this feature is going to get used if we just encode names or email addresses. In the United States, when someone transitions, they get a court order to change their name. I don't think a lot of corporate environments are going to want to just encode an old name or email address in a trivially invertible way given that. This is typically a topic handled with some sensitivity in most companies. I will tell you that I would not just use an encoded version if I were changing my name for any of the reasons I've mentioned. That wouldn't cut it for me, and I wouldn't use such a feature. The feature I'm implementing is a feature I've talked with trans folks about, and that's why I'm implementing this as it is. The response I got was essentially, "It's not everything I want, but it's an improvement." If the decision is that we want to go with encoding instead of hashing, then I'll drop this patch. I'm not going to put my name or sign-off on that because I don't think it meets the need I'm addressing here. The entire problem, of course, is that we bake a human's personal name and email address immutably into a Merkle tree. 
We know full well that people do change their names and email addresses all the time (e.g., marriage, job changes), and yet we have this design. In retrospect, we should have done something different, but hindsight is 20/20 and I'm just trying to do the best we can with what we've got. > 2. Hiding from your enemies > > For the other use-case of "abusive family or partner" I had the comment > in v1 of "but not so much that you'd still take the risk of submitting a > patch to .mailmap?". No, my use case isn't "hiding from an abusive family or partner". It's "I'm finally free of that **** and I never want to hear their name again." (I've known people in this situation.) Also, the similar use case of, "my family member, with whom I share an uncommon name, murdered someone, which I obviously found abhorrent, and I would like to not be associated with them when my name is Googled." And yes, I knew an acquaintance many years ago whose family member murdered someone. In other words, the person changed their name to disassociate themselves, not to hide from their abuser. > 4. Spam > > You mentioned this in your [2] (but not as a use-case in the v2 > re-rolled commit message): > > And we know that spammers and recruiters (which, in this case, are > also spammers) do indeed scrape repositories via the repository web > interfaces. > > Surely these people are most interested in the current E-Mail addresses, > which if they're scraping the common web interfaces (e.g. Github, > GitLab) are easily accessible there. It doesn't seem very plausible that > someone would care enough to scrape .mailmap for old addresses but not > just update their scraper to clone & run "git log" for the purposes of > e.g. their recruitment E-Mails. Unless the user is using the GitHub-provided noreply address or a similar address, which is common. This allows people to map all of their old addresses to such an address, which, judging from StackOverflow, is a thing people want to do. 
I can tell you from dealing with abuse that raising the bar even the tiniest bit is very significant to stopping it. Most recruiters are not developers and they and spammers don't have Git installed. They're going to rely on Googling or other public search functionality, and this makes that harder. Greylisting is exactly raising the bar the tiniest amount and it's extraordinarily effective. > 5. Interaction with other systems > > Something I mentioned in the last 3 paragraphs of my [1]. I think you're > only considering the cases where git itself does the mailmap > translation, but we have 3rd party systems that make use of the format > in good ways (also doing the Joe->Jane mapping). Making it a hassle for > those systems makes it more likely that Jane doesn't get the mapping she > wants. This is an argument for never changing the format. Sometimes things change, and I don't want to avoid making a change because other implementations haven't implemented it yet. Under that approach, we'd never have the SHA-256 work.
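
The early exit described earlier in this message ("As soon as we see every hashed entry, we will stop hashing new entries") can be illustrated with a toy model. This is a sketch under assumptions, not a port of mailmap.c: the class and attribute names are hypothetical, only emails are handled, and remembering the literal form after a hashed match is one reading of how later lookups avoid hashing.

```python
import hashlib

PREFIX = "@sha256:"


class ToyMailmap:
    """Toy lookup table sketching the early-exit idea (not Git's code)."""

    def __init__(self, entries):
        # entries maps an old email (literal or "@sha256:<hex>") to a new one.
        self.literal = {k: v for k, v in entries.items() if not k.startswith(PREFIX)}
        self.hashed = {k: v for k, v in entries.items() if k.startswith(PREFIX)}
        self.hashed_seen = 0  # hashed entries already matched to a literal
        self.hash_calls = 0   # instrumentation only

    def lookup(self, email):
        if email in self.literal:
            return self.literal[email]
        if self.hashed_seen == len(self.hashed):
            return email  # every hashed entry is resolved: skip hashing
        self.hash_calls += 1
        key = PREFIX + hashlib.sha256(email.encode("utf-8")).hexdigest()
        new = self.hashed.get(key)
        if new is None:
            return email
        # Assumption: remember the literal so later lookups need no hashing.
        self.literal[email] = new
        self.hashed_seen += 1
        return new


old = PREFIX + hashlib.sha256(b"joe@developer.com").hexdigest()
mm = ToyMailmap({old: "jane@example.com"})
print(mm.lookup("joe@developer.com"))     # jane@example.com
print(mm.lookup("someone@else.example"))  # someone@else.example
print(mm.hash_calls)                      # 1
```

In this model the cost depends on how many unmatched identities are looked up before the last hashed entry is found, which matches the remark above that hashing recent contributors is cheap and hashing old, inactive ones is more expensive.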
On 2021-01-05 at 20:05:22, Junio C Hamano wrote:

> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
>
> > For example, a person may transition from one gender to another,
> > changing their name, or they may have changed their name to disassociate
> > themselves from an abusive family or partner. In such a case, using the
> > former name or address in any way may be undesirable and the person may
> > wish to replace it as completely as possible.
>
> I am not sure if we want to even mention the "for example" here.
>
> These are certainly all legitimate reasons to want this feature, but
> after reading the "for example", lack of a corresponding negative
> statement (e.g. sometimes people also change their name or address
> to hide their bad behaviour in the past that is associated with
> these names) needlessly stood out and made me wonder if we need to
> somehow defend the feature with "...but we do not mean to abet
> people in hiding their past bad behaviour with this mechanism". I'd
> prefer us not forced to defend the mechanism if we did not have to.

I added it because I imagine the use cases for this feature aren't immediately obvious to a lot of people and the general rule is that commit messages explain why we would implement such a feature. If you'd prefer I drop it and leave it up to the imagination (or to the list archives), I can do that.

> > +SHA-1 is not accepted as a hash algorithm in mailmaps.
>
> Is this needed to be said? After all, we won't take @md5: or
> @blake2: or anything other than @sha256: in this version (and
> probably any foreseeable versions). Unless we offer a way to plug-in
> algos of projects' choice, that is, and at that point, "SHA-1 is not
> accepted" is a statement too strong for us to make.

I'll drop that line.
"brian m. carlson" <sandals@crustytoothpaste.net> writes: > I added it because I imagine the use cases for this feature aren't > immediately obvious to a lot of people and the general rule is that > commit messages explain why we would implement such a feature. Yeah, I understand that. >> > +SHA-1 is not accepted as a hash algorithm in mailmaps. >> ... > I'll drop that line. Thanks.
On Wed, Jan 06 2021, brian m. carlson wrote: > On 2021-01-05 at 14:21:40, Ævar Arnfjörð Bjarmason wrote: >> >> On Sun, Jan 03 2021, brian m. carlson wrote: >> >> I think it makes sense to split up 1-4/5 here from 5/5 in this series >> since they're really unrelated changes, although due to the changes in >> 1-4 they'll conflict. > > Okay, I'll drop them. Not replying to most of this E-Mail because I think there's nothing left to add / you clarified things for me in those cases / we respectfully disagree / any outstanding points we can pick up in your re-roll / whatever :) >> So we're talking about hiding the old E-Mail, presumably because it was >> joe@ intsead of jane@, so in that case we could just support URI >> encoding: >> >> Jane Doe <jane@example.com> >> <jane@example.com> <%6A%6F%65@%64%65%76%65%6C%6F%70%65%72.%63%6F%6D> >> >> Made via: >> >> $ perl -MURI::Escape=uri_escape -wE 'say uri_escape q[joe@developer.com], "^@."' >> %6A%6F%65@%64%65%76%65%6C%6F%70%65%72.%63%6F%6D >> >> Which also has the nice attribute that people can make it obvious what >> part they want to hide, since this is really a feature to enable social >> politeness & consideration: >> >> Jane Doe <jane@example.com> >> # I don't want to be known by my old name, thanks >> <jane@example.com> <%6A%6F%65@developer.com> > > I don't think this feature is going to get used if we just encode names > or email addresses. In the United States, when someone transitions, > they get a court order to change their name. I don't think a lot of > corporate environments are going to want to just encode an old name or > email address in a trivially invertible way given that. This is > typically a topic handled with some sensitivity in most companies. > > I will tell you that I would not just use an encoded version if I were > changing my name for any of the reasons I've mentioned. That wouldn't > cut it for me, and I wouldn't use such a feature. 
The feature I'm > implementing is a feature I've talked with trans folks about, and that's > why I'm implementing this as it is. The response I got was essentially, > "It's not everything I want, but it's an improvement." > > If the decision is that we want to go with encoding instead of hashing, > then I'll drop this patch. I'm not going to put my name or sign-off on > that because I don't think it meets the need I'm addressing here. > > The entire problem, of course, is that we bake a human's personal name > and email address immutably into a Merkle tree. We know full well that > people do change their names and email addresses all the time (e.g., > marriage, job changes), and yet we have this design. In retrospect, we > should have done something different, but hindsight is 20/20 and I'm > just trying to do the best we can with what we've got. Doesn't the difference in some sense boil down to either an implicit promise or an implicit assumption that the hashed version is forever going to be protected by some security-through-obscurity/inconvenience when it comes to git.git & its default tooling? And would those users be as comfortable with the difference between encoded v.s. hashed if e.g. "git check-mailmap" learned to read the .mailmap and search-replace all the hashed versions with their materialized values, or if popular tools like Emacs learned to via a Git .mailmap in a "need translation" similar to *.gpg and *.gz. How about if popular web views of Git served up that materialized "check-mailmap" output by default? None of which I think is implausible that we'll get as follow-up patches, I might even submit some at some point, not out of some spite. Just because I don't want to maintain out-of-tree code for an out-of-tree program that understands a Git .mailmap today, but where I'd need to search-replace the hashed versions. 
Ditto it being very likely that popular editors or web viewers will gain support for this, just because it's tedious to manually hash & copy/paste & validate values.

In looking at some of the fsck code recently & having some yet-unsubmitted patches I thought of trying to combine it with mailmap. I.e. it seems like a natural feature for fsck to gain to warn you about unused mailmap entries, just like it can warn about unreachable/dangling objects. After all these are really just sort-of pointers into our Merkle tree. Spewing out all the mappings seems like an obvious addition to that, e.g. in spewing out an "optimized/non-redundant" (plain or hashed) mailmap to re-commit.

That's the main reason I'm uncomfortable with this approach, because it seems to me to implicitly rely on things that are tedious now, but which the march of history all but inevitably should make trivial if we were to integrate it. Unless we're *also* promising to forever intentionally (and artificially) keep it inconvenient.

E.g. the example of how long it takes to clone & extract this info from chromium.git in the v1 thread. It seems like a fair assumption that we'll have some future version of git where you can ask a remote server about that sort of thing in milliseconds. Not because of this hashed .mailmap thing in particular, just as an emergent effect that it's happy to serve up things it knows about the DAG from having walked & cached it in general. E.g. info from the commit-graph, what hash is contained in what ref, or how one value (such as a .mailmap entry) maps to another etc.
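
To make concrete what "materializing" hashed entries would involve, here is a minimal Python sketch. It only restates the point already made in the patch's own documentation, that anyone with the repository can iterate over all authors and committers and match them against the hashed values. The function name "materialize" is hypothetical, the identities are supplied as an in-memory list, and UTF-8 encoding is an assumption. In practice the list might come from the output of something like "git log --format=%ae".

```python
import hashlib


def materialize(hashed_keys, identities):
    """Map each "@sha256:<hex>" key to the literal identity that produces it."""
    wanted = set(hashed_keys)
    found = {}
    for ident in identities:
        key = "@sha256:" + hashlib.sha256(ident.encode("utf-8")).hexdigest()
        if key in wanted:
            found[key] = ident
    return found


# Hypothetical data: one hashed mailmap key and the identities seen in history.
key = "@sha256:" + hashlib.sha256(b"joe@developer.com").hexdigest()
history = ["someone@else.example", "joe@developer.com"]
print(materialize([key], history) == {key: "joe@developer.com"})  # True
```

The work is one hash per distinct identity in the history, so the protection comes only from the inconvenience of obtaining that list, which is the concern raised above.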
On 2021-01-10 at 19:24:34, Ævar Arnfjörð Bjarmason wrote:

> Doesn't the difference in some sense boil down to either an implicit
> promise or an implicit assumption that the hashed version is forever
> going to be protected by some security-through-obscurity/inconvenience
> when it comes to git.git & its default tooling?
>
> And would those users be as comfortable with the difference between
> encoded v.s. hashed if e.g. "git check-mailmap" learned to read the
> .mailmap and search-replace all the hashed versions with their
> materialized values, or if popular tools like Emacs learned to via a Git
> .mailmap in a "need translation" similar to *.gpg and *.gz. How about if
> popular web views of Git served up that materialized "check-mailmap"
> output by default?
>
> None of which I think is implausible that we'll get as follow-up
> patches, I might even submit some at some point, not out of some spite.
> Just because I don't want to maintain out-of-tree code for an
> out-of-tree program that understands a Git .mailmap today, but where I'd
> need to search-replace the hashed versions.

Yes, I think we do rely on this being inconvenient. If you plan to submit such a patch, I'm going to let this series drop.
diff --git a/Documentation/mailmap.txt b/Documentation/mailmap.txt index 4a8c276529..b21194bf3e 100644 --- a/Documentation/mailmap.txt +++ b/Documentation/mailmap.txt @@ -73,3 +73,31 @@ Santa Claus <santa.claus@northpole.xx> <me@company.xx> Use hash '#' for comments that are either on their own line, or after the email address. + +In addition to specifying a former name or email literally, it is also possible +to specify it in a hashed form, which consists of the string `@sha256:`, +followed by an all-lowercase SHA-256 hash of the entry in hexadecimal. For +example, to take the example above, instead of specifying the replacement for +"Some Dude" as such, you could specify one of these lines: + +------------ +Some Dude <some@dude.xx> nick1 <@sha256:bee4fdd8c5e2e85009c8ae231d5a395adb24d5a597f2b75489926460680b8ce1> +Some Dude <some@dude.xx> @sha256:56030827e2765e8878c94c4cc43f5410b22f3b8c2b1ef8f631ac3953f8299279 <bugs@company.xx> +Some Dude <some@dude.xx> @sha256:56030827e2765e8878c94c4cc43f5410b22f3b8c2b1ef8f631ac3953f8299279 <@sha256:bee4fdd8c5e2e85009c8ae231d5a395adb24d5a597f2b75489926460680b8ce1> +------------ + +These hash is a hash of the literal name or email without any trailing newlines. +For example, you can compute the values above like so, using the Perl `shasum` +command (or a similar command of your choice): + +------------ +$ printf '%s' bugs@company.xx | shasum -a 256 +bee4fdd8c5e2e85009c8ae231d5a395adb24d5a597f2b75489926460680b8ce1 - +------------ + +SHA-1 is not accepted as a hash algorithm in mailmaps. + +Using the hashed form may be desirable to obscure one's former name or email, +but be aware that it is just obfuscation: it's still possible for someone with +access to the repository to iterate through all authors and committers and map +the hashed values to unhashed ones. 
diff --git a/mailmap.c b/mailmap.c
index 5c52dbb7e0..ed401bb1e4 100644
--- a/mailmap.c
+++ b/mailmap.c
@@ -18,6 +18,8 @@ const char *git_mailmap_blob;
 struct mailmap_info {
 	char *name;
 	char *email;
+
+	unsigned refcount;
 };
 
 struct mailmap_entry {
@@ -25,6 +27,10 @@ struct mailmap_entry {
 	char *name;
 	char *email;
 
+	unsigned refcount;
+	unsigned hashed_count;
+	unsigned hashed_seen;
+
 	/* name and email for the complex mail and name matching case */
 	struct string_list namemap;
 };
@@ -32,6 +38,9 @@ struct mailmap_entry {
 static void free_mailmap_info(void *p, const char *s)
 {
 	struct mailmap_info *mi = (struct mailmap_info *)p;
+	if (--mi->refcount)
+		return;
+
 	debug_mm("mailmap: -- complex: '%s' -> '%s' <%s>\n",
 		 s, debug_str(mi->name), debug_str(mi->email));
 	free(mi->name);
@@ -41,6 +50,9 @@ static void free_mailmap_info(void *p, const char *s)
 static void free_mailmap_entry(void *p, const char *s)
 {
 	struct mailmap_entry *me = (struct mailmap_entry *)p;
+	if (--me->refcount)
+		return;
+
 	debug_mm("mailmap: removing entries for <%s>, with %d sub-entries\n",
 		 s, me->namemap.nr);
 	debug_mm("mailmap: - simple: '%s' <%s>\n",
@@ -82,10 +94,17 @@ static char *lowercase_email(char *s)
 	return s;
 }
 
-static void add_mapping(struct string_list *map,
+static int is_hashed(const char *s)
+{
+	const char *prefix = "@sha256:";
+	return strncmp(s, prefix, strlen(prefix)) == 0;
+}
+
+static void add_mapping(struct mailmap *mailmap,
 			char *new_name, char *new_email,
 			char *old_name, char *old_email)
 {
+	struct string_list *map = mailmap->mailmap;
 	struct mailmap_entry *me;
 	struct string_list_item *item;
 
@@ -95,7 +114,10 @@ static void add_mapping(struct string_list *map,
 		old_email = new_email;
 		new_email = NULL;
 	} else {
-		lowercase_email(old_email);
+		if (is_hashed(old_email))
+			mailmap->hashed_count++;
+		else
+			lowercase_email(old_email);
 	}
 
 	item = string_list_insert(map, old_email);
@@ -105,6 +127,7 @@ static void add_mapping(struct string_list *map,
 		me = xcalloc(1, sizeof(struct mailmap_entry));
 		me->namemap.strdup_strings = 1;
 		me->namemap.cmp = namemap_cmp;
+		me->refcount = 1;
 		item->util = me;
 	}
 
@@ -125,6 +148,9 @@ static void add_mapping(struct string_list *map,
 		debug_mm("mailmap: adding (complex) entry for '%s'\n", old_email);
 		mi->name = xstrdup_or_null(new_name);
 		mi->email = xstrdup_or_null(new_email);
+		mi->refcount = 1;
+		if (is_hashed(old_name))
+			me->hashed_count++;
 		string_list_insert(&me->namemap, old_name)->util = mi;
 	}
 
@@ -162,7 +188,7 @@ static char *parse_name_and_email(char *buffer, char **name,
 	return (*right == '\0' ? NULL : right);
 }
 
-static void read_mailmap_line(struct string_list *map, char *buffer,
+static void read_mailmap_line(struct mailmap *map, char *buffer,
 			      char **repo_abbrev)
 {
 	char *name1 = NULL, *email1 = NULL, *name2 = NULL, *email2 = NULL;
@@ -194,7 +220,7 @@ static void read_mailmap_line(struct string_list *map, char *buffer,
 	add_mapping(map, name1, email1, name2, email2);
 }
 
-static int read_mailmap_file(struct string_list *map, const char *filename,
+static int read_mailmap_file(struct mailmap *map, const char *filename,
 			     char **repo_abbrev)
 {
 	char buffer[1024];
@@ -216,7 +242,7 @@ static int read_mailmap_file(struct string_list *map, const char *filename,
 	return 0;
 }
 
-static void read_mailmap_string(struct string_list *map, char *buf,
+static void read_mailmap_string(struct mailmap *map, char *buf,
 				char **repo_abbrev)
 {
 	while (*buf) {
@@ -230,7 +256,7 @@ static void read_mailmap_string(struct string_list *map, char *buf,
 	}
 }
 
-static int read_mailmap_blob(struct string_list *map,
+static int read_mailmap_blob(struct mailmap *map,
 			     const char *name,
 			     char **repo_abbrev)
 {
@@ -269,10 +295,10 @@ int read_mailmap(struct mailmap *mailmap, char **repo_abbrev)
 	if (!git_mailmap_blob && is_bare_repository())
 		git_mailmap_blob = "HEAD:.mailmap";
 
-	err |= read_mailmap_file(map, ".mailmap", repo_abbrev);
+	err |= read_mailmap_file(mailmap, ".mailmap", repo_abbrev);
 	if (startup_info->have_repository)
-		err |= read_mailmap_blob(map, git_mailmap_blob, repo_abbrev);
-	err |= read_mailmap_file(map, git_mailmap_file, repo_abbrev);
+		err |= read_mailmap_blob(mailmap, git_mailmap_blob, repo_abbrev);
+	err |= read_mailmap_file(mailmap, git_mailmap_file, repo_abbrev);
 	return err;
 }
 
@@ -282,7 +308,7 @@ void clear_mailmap(struct mailmap *mailmap)
 	debug_mm("mailmap: clearing %d entries...\n", map->nr);
 	map->strdup_strings = 1;
 	string_list_clear_func(map, free_mailmap_entry);
-	string_list_clear(map, 1);
+	string_list_clear(map, 0);
 	free(map);
 	debug_mm("mailmap: cleared\n");
 }
@@ -338,6 +364,55 @@ static struct string_list_item *lookup_prefix(struct string_list *map,
 	return NULL;
 }
 
+/*
+ * Convert an email or name into a hashed form for comparison.  The hashed form
+ * will be created in the form
+ * @sha256:c68b7a430ac8dee9676ec77a387194e23f234d024e03d844050cf6c01775c8f6,
+ * which would be the hashed form for "doe@example.com".
+ */
+static char *hashed_form(struct strbuf *buf, const struct git_hash_algo *algop, const char *key, size_t keylen)
+{
+	git_hash_ctx ctx;
+	unsigned char hashbuf[GIT_MAX_RAWSZ];
+	char hexbuf[GIT_MAX_HEXSZ + 1];
+
+	algop->init_fn(&ctx);
+	algop->update_fn(&ctx, key, keylen);
+	algop->final_fn(hashbuf, &ctx);
+	hash_to_hex_algop_r(hexbuf, hashbuf, algop);
+
+	strbuf_addf(buf, "@%s:%s", algop->name, hexbuf);
+	return buf->buf;
+}
+
+static struct string_list_item *lookup_one(struct string_list *map,
+					   const char *string, size_t len,
+					   unsigned hashed_count,
+					   unsigned *hashed_seen)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct string_list_item *item = lookup_prefix(map, string, len);
+	if (item || !hashed_count || hashed_count == *hashed_seen)
+		return item;
+
+	hashed_form(&buf, &hash_algos[GIT_HASH_SHA256], string, len);
+	item = lookup_prefix(map, buf.buf, buf.len);
+	if (item) {
+		struct mailmap_info *mi = (struct mailmap_info *)item->util;
+		char *s = xstrndup(string, len);
+		map->strdup_strings = 0;
+		item = string_list_insert(map, s);
+		map->strdup_strings = 1;
+		if (!item->util) {
+			item->util = mi;
+			mi->refcount++;
+			(*hashed_seen)++;
+		}
+	}
+	strbuf_release(&buf);
+	return item;
+}
+
 int map_user(struct mailmap *map,
 	     const char **email, size_t *emaillen,
 	     const char **name, size_t *namelen)
@@ -350,7 +425,7 @@ int map_user(struct mailmap *map,
 		 (int)*namelen, debug_str(*name),
 		 (int)*emaillen, debug_str(*email));
 
-	item = lookup_prefix(map->mailmap, searchable_email, *emaillen);
+	item = lookup_one(map->mailmap, searchable_email, *emaillen, map->hashed_count, &map->hashed_seen);
 	free(searchable_email);
 	if (item != NULL) {
 		me = (struct mailmap_entry *)item->util;
@@ -361,7 +436,7 @@ int map_user(struct mailmap *map,
 			 * simple entry.
 			 */
 			struct string_list_item *subitem;
-			subitem = lookup_prefix(&me->namemap, *name, *namelen);
+			subitem = lookup_one(&me->namemap, *name, *namelen, me->hashed_count, &me->hashed_seen);
 			if (subitem)
 				item = subitem;
 		}
diff --git a/mailmap.h b/mailmap.h
index 4cdce3b064..69f8be5705 100644
--- a/mailmap.h
+++ b/mailmap.h
@@ -5,6 +5,8 @@
 
 struct mailmap {
 	struct string_list *mailmap;
+	unsigned hashed_count;
+	unsigned hashed_seen;
 };
 
 int read_mailmap(struct mailmap *map, char **repo_abbrev);
diff --git a/t/t4203-mailmap.sh b/t/t4203-mailmap.sh
index df4a0e03cc..004b4a3d40 100755
--- a/t/t4203-mailmap.sh
+++ b/t/t4203-mailmap.sh
@@ -62,6 +62,41 @@ test_expect_success 'check-mailmap --stdin arguments' '
 	test_cmp expect actual
 '
 
+test_expect_success 'hashed mailmap' '
+	test_config mailmap.file ./hashed &&
+	hashed_author_name="@sha256:$(printf "$GIT_AUTHOR_NAME" | test-tool sha256)" &&
+	hashed_author_email="@sha256:$(printf "$GIT_AUTHOR_EMAIL" | test-tool sha256)" &&
+	cat >expect <<-EOF &&
+	$GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL>
+	EOF
+
+	cat >hashed <<-EOF &&
+	$GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $hashed_author_name <$GIT_AUTHOR_EMAIL>
+	EOF
+	git check-mailmap "$GIT_AUTHOR_NAME <$GIT_AUTHOR_EMAIL>" >actual &&
+	test_cmp expect actual &&
+
+	cat >hashed <<-EOF &&
+	Wrong <wrong@example.org> $GIT_AUTHOR_NAME <$GIT_AUTHOR_EMAIL>
+	$GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $hashed_author_name <$GIT_AUTHOR_EMAIL>
+	EOF
+	# Check that we prefer literal matches over hashed names.
+	git check-mailmap "$hashed_author_name <$GIT_AUTHOR_EMAIL>" >actual &&
+	test_cmp expect actual &&
+
+	cat >hashed <<-EOF &&
+	$GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $hashed_author_name <$hashed_author_email>
+	EOF
+	git check-mailmap "$GIT_AUTHOR_NAME <$GIT_AUTHOR_EMAIL>" >actual &&
+	test_cmp expect actual &&
+
+	cat >hashed <<-EOF &&
+	$GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> <$hashed_author_email>
+	EOF
+	git check-mailmap "$GIT_AUTHOR_NAME <$GIT_AUTHOR_EMAIL>" >actual &&
+	test_cmp expect actual
+'
+
 test_expect_success 'check-mailmap bogus contact' '
 	test_must_fail git check-mailmap bogus
 '
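
A note on the entry format: the documentation hunk describes a hashed entry as `@sha256:` followed by an all-lowercase hexadecimal SHA-256 hash, which is 64 hex digits, while the patch's is_hashed() tests only the prefix. Purely as an illustration of what a stricter check might look like, here is a minimal sketch. The name `is_well_formed_hashed` and both macros are hypothetical and do not appear in the patch, which deliberately accepts anything carrying the prefix.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HASHED_PREFIX "@sha256:"
#define SHA256_HEXSZ 64 /* 256 bits rendered as hexadecimal digits */

/*
 * Hypothetical helper, not part of the patch: return 1 only if s is the
 * prefix followed by exactly 64 lowercase hex digits, and 0 otherwise.
 */
static int is_well_formed_hashed(const char *s)
{
	size_t prefix_len = strlen(HASHED_PREFIX);
	size_t i;

	assert(s != NULL);
	if (strncmp(s, HASHED_PREFIX, prefix_len) != 0)
		return 0;
	s += prefix_len;
	for (i = 0; i < SHA256_HEXSZ; i++) {
		char c = s[i];
		int ok = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f');
		if (!ok)
			return 0; /* also catches an early NUL terminator */
	}
	/* Reject trailing garbage after the 64th digit. */
	return s[SHA256_HEXSZ] == '\0';
}
```

Whether rejecting malformed entries at parse time is worth the extra code is a design question for the series; with the prefix-only check, a malformed entry simply never matches anything.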
Many people, through the course of their lives, will change either a
name or an email address.  For this reason, we have the mailmap, to map
from a user's former name or email address to their current, canonical
forms.  Normally, this works well as it is.

However, sometimes people change a name or an email address and wish to
wholly disassociate themselves from that former name or email address.
For example, a person may transition from one gender to another,
changing their name, or they may have changed their name to disassociate
themselves from an abusive family or partner.  In such a case, using the
former name or address in any way may be undesirable and the person may
wish to replace it as completely as possible.

For projects which wish to support this, introduce hashed forms into the
mailmap.  These forms, which start with "@sha256:" followed by a SHA-256
hash of the entry, can be used in place of the form used in the commit
field.  This form is intentionally designed to be unlikely to conflict
with legitimate use cases.  For example, this is not a valid email
address according to RFC 5322.  In the unlikely event that a user has
put such a form into the actual commit as their name, we will accept it.

While the form of the data is designed to accept multiple hash
algorithms, we intentionally do not support SHA-1.  There is little
reason to support such a weak algorithm in new use cases and no
backwards compatibility to consider.  Moreover, SHA-256 is faster than
the SHA1DC implementation we use, so this not only improves performance,
but simplifies the current implementation somewhat as well.

Note that it is, of course, possible to perform a lookup on all commit
objects to determine the actual entry which matches the hashed form of
the data.  However, this is an improvement over the status quo.

The performance of this patch with no hashed entries is very similar to
the performance without this patch.  Considering a git log command to
look up author and committer information on 981,680 commits in the Linux
kernel history, either with an unhashed mailmap or a mailmap with all
old values hashed:

                                   Shortest  Longest  Average  Change
  Git 2.30                            7.876    8.297    8.143
  This patch, unhashed                7.923    8.484    8.237  + 1.15%
  This patch, hashed                 14.510   14.783   14.672  +80.17%
  This patch, hashed, unoptimized    15.425   16.318   15.901  +95.27%

Thus, the average performance after this patch is within normal
variation of the pre-patch performance.  It's unlikely that users will
notice the difference in practice, even on much larger repositories,
unless they're using the new feature.

To minimize the performance impact of the hashing process, we maintain a
reference count of each mailmap entry and when we encounter an entry we
must hash, we insert the same object under the unhashed key as well.  We
also keep a count of the number of hashed entries.  This means we must
hash an object at most once and once we've seen all the hashed objects,
we won't hash any more objects.  Times without this optimization are
listed above in the unoptimized entry.

This has the potential to cause a performance problem as we insert items
into a sorted list, but changing the implementation to use a khash map
instead does not result in a significantly faster implementation,
despite the improved insertion speed.  Performance in the unhashed case
is slightly worse, so this approach was not adopted since it provides
few benefits.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Documentation/mailmap.txt | 28 +++++++++++
 mailmap.c                 | 99 ++++++++++++++++++++++++++++++++++-----
 mailmap.h                 |  2 +
 t/t4203-mailmap.sh        | 35 ++++++++++++++
 4 files changed, 152 insertions(+), 12 deletions(-)
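
To make the optimization described in the commit message easier to follow, here is a small standalone sketch of the idea: try the literal key first, hash only while unresolved hashed keys remain, and alias the literal key to the same reference-counted object after a hashed match. Everything in it is an assumption made for illustration. The `struct info`, `struct slot` and `struct table` types, the linear `find()`, and the `@toy:` prefix are hypothetical, and `toy_hashed_form()` is a 32-bit FNV-1a stand-in so that the example stays short. It is not SHA-256 and not what the patch uses.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_SLOTS 16
#define KEY_LEN 80

/* Hypothetical stand-ins for the patch's structures; not Git's real types. */
struct info {
	const char *canonical; /* replacement identity */
	unsigned refcount;     /* number of keys that point at this object */
};

struct slot {
	char key[KEY_LEN];
	struct info *info;
};

struct table {
	struct slot slots[MAX_SLOTS];
	size_t nr;
	unsigned hashed_count; /* hashed keys read from the mailmap */
	unsigned hashed_seen;  /* hashed keys already aliased to a literal key */
	unsigned hash_calls;   /* instrumentation only: how often we hashed */
};

/* Toy FNV-1a digest standing in for SHA-256; it is NOT cryptographic. */
static void toy_hashed_form(char *out, size_t outlen, const char *s)
{
	unsigned long h = 2166136261UL;

	for (; *s; s++)
		h = ((h ^ (unsigned char)*s) * 16777619UL) & 0xffffffffUL;
	snprintf(out, outlen, "@toy:%08lx", h);
}

/* Linear search; the patch uses a sorted string_list instead. */
static struct slot *find(struct table *t, const char *key)
{
	for (size_t i = 0; i < t->nr; i++)
		if (!strcmp(t->slots[i].key, key))
			return &t->slots[i];
	return NULL;
}

/* Store info under key, taking one reference on it. */
static void add(struct table *t, const char *key, struct info *info)
{
	assert(t->nr < MAX_SLOTS && strlen(key) < KEY_LEN);
	strcpy(t->slots[t->nr].key, key);
	t->slots[t->nr].info = info;
	t->nr++;
	info->refcount++;
	if (!strncmp(key, "@toy:", 5))
		t->hashed_count++;
}

/* Literal lookup first; hash only while unresolved hashed keys remain. */
static struct info *lookup(struct table *t, const char *literal)
{
	char hashed[KEY_LEN];
	struct slot *s = find(t, literal);

	if (s)
		return s->info;
	if (!t->hashed_count || t->hashed_count == t->hashed_seen)
		return NULL;

	toy_hashed_form(hashed, sizeof(hashed), literal);
	t->hash_calls++;
	s = find(t, hashed);
	if (!s)
		return NULL;

	/* Alias the literal key to the same object so it is never rehashed. */
	struct info *info = s->info;
	add(t, literal, info);
	t->hashed_seen++;
	return info;
}
```

In this sketch a name that matches a hashed key is hashed once and found literally on later lookups, and once `hashed_seen` reaches `hashed_count` no lookup hashes at all. Names that match nothing are still hashed on every miss until that point, which may account for part of the cost in the fully hashed rows of the table above.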