Message ID | 20220919145231.48245-3-sandals@crustytoothpaste.net (mailing list archive) |
---|---|
State | New, archived |
Series | Opaque author and committer identifiers |
On Mon, Sep 19 2022, brian m. carlson wrote:

> The original design of Git embeds a personal name and email in every
> commit. This has lots of downsides, including the following.
>
> First, people do not want to bake an email into an immutable Merkle tree
> that they send everywhere. Spam, whether in general or by recruiters,
> is a problem, and even when it's not, people change companies or
> institutions and emails become invalid.
>
> Second, some people prefer to operate anonymously and don't want to
> specify personal details everywhere.
>
> Third, and most important, people change names. This happens for many
> reasons, but it comes up most saliently for transgender people, who
> frequently change their name as part of their transition. Referring to
> a transgender person's former name, their "deadname", is considered
> inappropriate.
>
> We have a solution that can map former personal names and emails into
> current ones, the mailmap. However, this last case poses a problem,
> because we don't really want to correlate the person's deadname (or
> their email, which may contain their deadname) right next to their
> current name.
>
> Several solutions have been proposed for this case, including hashing or
> encoding the old information, but these are all easily invertible.
> Instead, let's propose a new form of identifier which is opaque and some
> mailmap improvements to store the mailmap information outside of the
> main history.

With you so far...

> Propose that users use the fingerprint of a cryptographic key as part of
> a special-form email which is not valid according to RFC 1123, but is
> accepted by earlier versions of Git. Now that we have SSH signing and
> OpenSSH is available on all major platforms, creating a unique ID is as
> easy as running ssh-keygen. This approach results in an identifier
> which is unique, deterministic, and completely anonymous.

...but...
> Propose this new option instead of using a name and email, although
> users can continue to use those as before if they prefer. Continue to
> associate personal information with this opaque identifier using the
> mailmap, but in such a way that it lives in a special ref outside of the
> history and that ref is customarily kept squashed to a single commit.
> Create a special RFC 5322 header to associate a mailmap entry with the
> user's opaque identifier when sending a patch if desired.

...while it's technically neat, I really don't see why this whole
hashing mechanism is a necessary prerequisite to get to this point.

Wouldn't we get the same thing if *by convention* we just supported
authorship like this, (which we already support):

    UUID=$(get-some-uuid)
    git config user.name X
    git config user.email $UUID.uuid.git.example.org

So you'd end up with e.g.:

    X <98ab8d66-38d2-11ed-a261-0242ac120002.uuid.git.example.com>

Or whatever, we could bikeshed about the format, but the point is that
it's not codifying *how* that looks.

We'd then just support this refs/mailmap mechanism you're suggesting,
where we'd have a mapping like:

    Ævar Arnfjörð Bjarmason <avarab@gmail.com> X <98ab8d66-38d2-11ed-a261-0242ac120002.uuid.git.example.com>

Which could be force-pushed.

I can see why you'd *also* want to formalize the ID generation, but I
just don't see why we'd want to make that as one leaping change rather
than something more incremental.

I.e. even if you don't have opaque IDs in the first place this mechanism
would allow you to maintain a "mailmap" ref on the remote, which would
already be useful.

E.g. now if I use a hosting provider and have my .mailmap in various
repos I need to maintain them in each repo, but this would allow for a
magical ref which would keep it up-to-date in various repos...

> [...]If a user would like to preserve a history
> +for some reason, they can use `--use-mailmap=commit`.
> For maintainers, they can
> +then push this ref using the normal push refspecs, or explicitly with
> +`--mailmap`, which is equivalent to `+refs/mailmap:refs/mailmap`.

I obviously see why you want the "force push" aspect of this (the
deadnaming), but I still wonder if it's really a good trade-off for git
as an SCM to make that the default.

We've been going in the other direction for e.g. tags semi-recently with
my 0bc8d71b99e (fetch: stop clobbering existing tags without --force,
2018-08-31).

By having that force-push default we make it so that a plumbing command
(that makes use of mailmap) will give you one result today, but a
different one tomorrow, with no easy way to get back.

Maybe it's something we want in the end, but it's another thing that's
"changed while at it", i.e. not only are we introducing "mailmap" remote
refs, but also:

 * Changing the many-to-many mapping of history-mailmap to a
   many-to-one, i.e. the map is per-repo, not per-ref.

 * Changing it so that you can't track it as part of your history.

If we wanted to ease into just one of those we could have a "mailmap"
tag object, which we wouldn't clobber by default....
On 2022-09-20 at 10:51:39, Ævar Arnfjörð Bjarmason wrote:
> Wouldn't we get the same thing if *by convention* we just supported
> authorship like this, (which we already support):
>
>     UUID=$(get-some-uuid)
>     git config user.name X
>     git config user.email $UUID.uuid.git.example.org

You can indeed use a UUID if you want. However, it's not deterministic.
Using a key hash also means account linking is trivially implemented in
forges.

If we use a UUID, then there's no way to prove ownership of the
identifier, which means that people can claim other people's commits.
Signed commits don't help here because you can't embed arbitrary
non-emails in X.509 (or in OpenPGP, because nobody will certify such an
ID), so you have no way of linking the commit identity to the key and
therefore signed commits are worse than before. At least with an email
you can verify that the owner of the account owns the email address, but
you can't do that with a UUID.

I want a design that works whether or not you use a forge, but realizing
that most developers use forges these days, I want to make the workflow
as simple and straightforward as possible for those who do. I also want
a design which is going to be acceptable to forge implementers, and
working for one, I think this design is going to be easier to implement
and more likely to be accepted than an ID which requires extra work and
isn't verifiable.

For ease of use, I would be implementing tooling to make it easy to set
this from an existing user.signingkey or SSH key on the system. I
literally envision this being as simple as something like
`git id --set -f ~/.ssh/id_ed25519` or `git id --set --generate-ssh-key`.
(This is just an example; we can argue about the details later.)

> So you'd end up with e.g.:
>
>     X <98ab8d66-38d2-11ed-a261-0242ac120002.uuid.git.example.com>
>
> Or whatever, we could bikeshed about the format, but the point is that
> it's not codifying *how* that looks.
I do very much want to codify how this looks because people are
absolutely going to rely on it, whether we want them to or not. People
already parse GitHub's fake no-reply emails for information. Everything
that Git does people rely on, whether we like it or not.

Keeping it in the form of an email maximizes compatibility for existing
implementations.

> We'd then just support this refs/mailmap mechanism you're suggesting,
> where we'd have a mapping like:
>
>     Ævar Arnfjörð Bjarmason <avarab@gmail.com> X <98ab8d66-38d2-11ed-a261-0242ac120002.uuid.git.example.com>
>
> Which could be force-pushed.
>
> I can see why you'd *also* want to formalize the ID generation, but I
> just don't see why we'd want to make that as one leaping change rather
> than something more incremental.

We can make it as incremental as folks want. However, the longer we
have people embedding their real names and emails in an immutable Merkle
tree, the longer we're going to run into deadname problems. Thus,
encouraging this new form of ID sooner means that people will adopt it
sooner. If this is the only impediment, we can make it more gradual.

> I.e. even if you don't have opaque IDs in the first place this mechanism
> would allow you to maintain a "mailmap" ref on the remote, which would
> already be useful.
>
> E.g. now if I use a hosting provider and have my .mailmap in various
> repos I need to maintain them in each repo, but this would allow for a
> magical ref which would keep it up-to-date in various repos...

That's part of the goal.

> I obviously see why you want the "force push" aspect of this (the
> deadnaming), but I still wonder if it's really a good trade-off for git
> as an SCM to make that the default.
>
> We've been going in the other direction for e.g. tags semi-recently with
> my 0bc8d71b99e (fetch: stop clobbering existing tags without --force,
> 2018-08-31).
>
> By having that force-push default we make it so that a plumbing command
> (that makes use of mailmap) will give you one result today, but a
> different one tomorrow, with no easy way to get back.

I think force-pushing semantics has a nicer behaviour for my use case,
but it's not essential. If the mailmap is in a separate ref, then if I
work at $MEGACORP and need to update the mailmap because of a name
change, I can still just rewrite the history, and as long as we preserve
the force-fetch behaviour by default, then it will just work.

I _do_ think we should retain the force-fetch behaviour by default.
In general, I like this proposal. It seems like a good way forward.

It should be made very clear to the user that a commit authored by a
key-derived ID does not imply the commit is signed by that key or
provide any security guarantees; anyone can put anything in that field,
same as it is now. I could see someone seeing a commit authored by
<47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU@_.sha256.ssh.id.git-scm.com>
and thinking that implies the commit was signed by
`47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU`.

On 2022-09-19 14:52:31+0000, brian m. carlson wrote:
> +Anonymous IDs
> +-------------
> +
> +Git will implement a new form of email address which is acceptable to existing
> +implementations but is not valid according to RFC 1123. This takes the form of
> +an email address where the local-part contains the identifier and the domain
> +portion starts with `_.` and then a domain specifier which specifies an
> +authority and the meaning of the identifier.
> +
> +In such a case, Git will specify the username as a single U+2060 in UTF-8 (the
> +byte sequence 0xE2 0x81 0xA0), which is a zero width non-breaking space. This
> +is compatible with existing implementations.

Could you add a note here explaining why that character was chosen for
the name field? It seems like it would be easier to work with a single
printable character like `?` or `X`, but maybe that doesn't matter here.
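The two spellings in the message above differ only in their base64 alphabet, which is easy to miss. The following is a minimal sketch, not part of the patch, of the conversion between a SHA-256 SSH fingerprint and the proposed ID; the function names are hypothetical, and the domain is copied from the proposal. It also illustrates the caveat raised above: the ID is computed from public information only, so it proves nothing about who signed a commit.

```python
ID_DOMAIN = "_.sha256.ssh.id.git-scm.com"  # domain as written in the proposal


def ssh_fingerprint_to_id(fingerprint: str) -> str:
    """Turn a SHA-256 SSH fingerprint (standard base64) into the proposed ID."""
    # ssh-keygen prints fingerprints as "SHA256:<base64>"; accept both forms.
    if fingerprint.startswith("SHA256:"):
        fingerprint = fingerprint[len("SHA256:"):]
    # base64url swaps "+" for "-" and "/" for "_", and the padding is dropped.
    local_part = fingerprint.rstrip("=").translate(str.maketrans("+/", "-_"))
    return f"{local_part}@{ID_DOMAIN}"


def id_to_ssh_fingerprint(identifier: str) -> str:
    """Recover the standard-base64 fingerprint from an ID of the form above."""
    local_part, sep, domain = identifier.partition("@")
    if not sep or domain != ID_DOMAIN:
        raise ValueError(f"not an SSH SHA-256 ID: {identifier!r}")
    return local_part.translate(str.maketrans("-_", "+/"))
```

Nothing in either function touches a private key or a signature; anyone who knows a public key's fingerprint can produce the same string.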
On 2022-09-30 at 20:26:41, Gwyneth Morgan wrote:
> In general, I like this proposal. It seems like a good way forward.
>
> It should be made very clear to the user that a commit authored by a
> key-derived ID does not imply the commit is signed by that key or
> provide any security guarantees; anyone can put anything in that field,
> same as it is now. I could see someone seeing a commit authored by
> <47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU@_.sha256.ssh.id.git-scm.com>
> and thinking that implies the commit was signed by
> `47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU`.

Of course. I'll update that when I turn this into a real series.

> On 2022-09-19 14:52:31+0000, brian m. carlson wrote:
> > +Anonymous IDs
> > +-------------
> > +
> > +Git will implement a new form of email address which is acceptable to existing
> > +implementations but is not valid according to RFC 1123. This takes the form of
> > +an email address where the local-part contains the identifier and the domain
> > +portion starts with `_.` and then a domain specifier which specifies an
> > +authority and the meaning of the identifier.
> > +
> > +In such a case, Git will specify the username as a single U+2060 in UTF-8 (the
> > +byte sequence 0xE2 0x81 0xA0), which is a zero width non-breaking space. This
> > +is compatible with existing implementations.
>
> Could you add a note here explaining why that character was chosen for
> the name field? It seems like it would be easier to work with a single
> printable character like `?` or `X`, but maybe that doesn't matter here.

Sure, I'll include that there.

The author field cannot be empty for compatibility reasons. Since
there's nothing to put there until it's run through the mailmap, putting
a single zero-width non-breaking space produces the same rendering as
nothing, and it doesn't require special handling like "?" or "X".
(Also, it should be noted that not all languages use "?" as the question
mark.)
Note that if this is mapped in the mailmap, you don't need to actually put the personal name that exists in the commit. The mailmap rewrites based on the email address (or, in this case, the ID), so nobody ever has to write the U+2060 in the mailmap.
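To make the last two points concrete, here is a small illustrative sketch, not Git's implementation, of a placeholder name plus a lookup keyed only on the address. It handles just the simplest mailmap form, `Proper Name <commit-email>`, and matches addresses exactly, which is a simplification; the helper names and the name "A. Developer" are made up for the example.

```python
PLACEHOLDER_NAME = "\u2060"  # the single U+2060 the proposal puts in the name field


def parse_simple_mailmap(text: str) -> dict:
    """Parse only 'Proper Name <commit-email>' lines into {email: name}."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, rest = line.partition("<")
        if not sep or ">" not in rest:
            continue  # not of the simple form this sketch understands
        mapping[rest.split(">", 1)[0]] = name.strip()
    return mapping


def display_author(name: str, email: str, mapping: dict) -> str:
    """Use the mapped name when the address is known, else the committed name."""
    return f"{mapping.get(email, name)} <{email}>"
```

Because the lookup key is the address alone, the mailmap line never has to contain the U+2060 that was written into the commit.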
diff --git a/Documentation/technical/anonymous-id.txt b/Documentation/technical/anonymous-id.txt
new file mode 100644
index 0000000000..aeba5e68f2
--- /dev/null
+++ b/Documentation/technical/anonymous-id.txt
@@ -0,0 +1,143 @@
+Anonymous IDs
+=============
+
+Objective
+---------
+
+Provide a way for people to identify themselves without the need to associate a
+fixed personal name or email.
+
+Background
+----------
+
+People change their name and email many times over the course of their lives.
+For example, people may marry or change jobs. In many cases, these changes can
+be handled by the mailmap. However, for many transgender people, keeping the
+old name in the mailmap is often undesirable.
+
+This document proposes a new way to specify anonymous IDs based on an SSH key or
+GnuPG key instead along with a mailmap which is automatically downloaded from
+the remote which provides an automatic correspondence. In this approach, all
+users are expected to specify an anonymous ID and a mailmap entry.
+
+This does not solve the problem of previous commits, but it does solve the
+approach going forward if reasonably well adopted and avoids the problems of
+existing approaches of obscuring the mailmap which are defeated by simply
+enumerating all entries in all commits.
+
+Anonymous IDs
+-------------
+
+Git will implement a new form of email address which is acceptable to existing
+implementations but is not valid according to RFC 1123. This takes the form of
+an email address where the local-part contains the identifier and the domain
+portion starts with `_.` and then a domain specifier which specifies an
+authority and the meaning of the identifier.
+
+In such a case, Git will specify the username as a single U+2060 in UTF-8 (the
+byte sequence 0xE2 0x81 0xA0), which is a zero width non-breaking space. This
+is compatible with existing implementations.
+
+The Git project will specify a set of identifiers under the domain
+`id.git-scm.com`.
The next component is the type of key as specified by the
+`gpg.program` identifier, and then a component indicating the hash type or
+version number as specified below.
+
+This approach provides IDs which are simple and easy to create (almost all users
+will have an SSH implementation which can generate keys with a single command),
+opaque, completely deterministic, and not personally identifiable.
+
+Other authorities, such as hosting providers, may use different IDs. For
+example, the hosting provider example.com might issue the ID
+`1234@_.user.example.com` for user ID 1234. Authorities are encouraged to use
+database IDs or other unique IDs rather than usernames, since many usernames
+contain human names or corporate affiliations, which defeats the point of this
+feature.
+
+In conjunction with a single, constantly rewritten mailmap reference and
+`mailmap.blob`, this allows users to move their real IDs outside of the commit
+IDs into a mailmap which is constantly rewritten. If a user's real name or
+email changes, they can submit an update to the mailmap and the ID, which will
+be squashed into a single commit without history.
+
+Specifications
+~~~~~~~~~~~~~~
+
+OpenPGP Keys
+^^^^^^^^^^^^
+
+If a user possesses a v4 OpenPGP key, then they may use the domain
+`_.v4.openpgp.id.git-scm.com` using a lowercase hex form of the SHA-1
+fingerprint as the local-part. For example, the key with the fingerprint
+`da39a3ee5e6b4b0d3255bfef95601890afd80709` would have the email address
+`da39a3ee5e6b4b0d3255bfef95601890afd80709@_.v4.openpgp.id.git-scm.com`.
+
+Similarly, when RFC 4880 bis is implemented using v5 keys with SHA-256
+fingerprints, the domain `_.v5.openpgp.id.git-scm.com` may be used with a
+lowercase hex form of the SHA-256 fingerprint as the local-part.
+
+SSH Keys
+^^^^^^^^
+
+If a user possesses an SSH key, then they may use the domain
+`_.sha256.ssh.id.git-scm.com` using a base64url encoding (without padding) as
+the local-part.
This is the RFC 4648 Base64 encoding with URL and filename safe
+alphabet without the padding character. For example, a user whose SSH key
+fingerprint is `47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU` may use
+`47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU@_.sha256.ssh.id.git-scm.com`.
+
+It's intentional that no specification is provided for MD5 fingerprints. MD5 is
+obsolete and should not be used in new protocols such as this.
+
+X.509 Certificates
+^^^^^^^^^^^^^^^^^^
+
+If a user possesses an X.509 certificate, then they may use the domain
+`_.sha256.x509.id.git-scm.com` using a lowercase hex form of the SHA-256
+fingerprint of the certificate. For example, if the key fingerprint is
+`e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855`, then the ID
+would be
+`e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855@_.sha256.x509.id.git-scm.com`.
+
+Emission
+~~~~~~~~
+
+A user may specify, instead of `user.email`, a `user.signingkey` (or a suitable
+protocol-specific setting). If `user.idFormat` is set to `email`, then the
+user's email will be written into the commit; if it is instead set to `key`,
+then the ID corresponding to the key is extracted from the signing program and
+that is used instead. `id` can be used to specify the `user.id` value. An order
+of items to try can be specified with a colon-separated list. The default, which
+is subject to change, is `id:email:key`. This allows users to specify an
+ID which is independent of their email.
+
+For patches, a user may specify `format.id` as `as-is` to leave the data as is,
+or as `mailmap` to use the mailmap value to rewrite it to the value in the
+mailmap. If the user specifies `mailmap-metadata`, then an in-body `From:` line
+in the patch is written to contain the author ID using the ID as written in the
+commit, but a format-patch metadata header is written using the mailmap entry in
+the commit.
+
+Expected Mailmap Improvements
+-----------------------------
+
+Right now, the mailmap is included in a repository as part of a regular commit.
+This means it has a history, which is undesirable if the user would like to
+completely rewrite their identity.
+
+This can be easily solved with some mailmap improvements. `git clone` will
+learn an option, `--use-mailmap`, which will specifically fetch the ref
+`refs/mailmap` from the remote and keep it up to date using force updates if
+necessary. This option will also specify `mailmap.blob` to point to the
+`.mailmap` file in this ref, which allows the user to automatically keep it up
+to date with the remote.
+
+`git am` or `git apply` can then apply the mailmap entry from the patch to the
+appropriate ref with `--use-mailmap`. The default is `--use-mailmap=amend`,
+which amends the existing commit. If a user would like to preserve a history
+for some reason, they can use `--use-mailmap=commit`. For maintainers, they can
+then push this ref using the normal push refspecs, or explicitly with
+`--mailmap`, which is equivalent to `+refs/mailmap:refs/mailmap`.
+
+The goal of this is to make interacting with the mailmap refs automatic and
+transparent whenever other data is fetched or cloned from the remote.
diff --git a/Documentation/technical/format-patch-metadata.txt b/Documentation/technical/format-patch-metadata.txt
index 5448918da9..87e301b65e 100644
--- a/Documentation/technical/format-patch-metadata.txt
+++ b/Documentation/technical/format-patch-metadata.txt
@@ -40,6 +40,9 @@ gpgsig-sha1::
 gpgsig-sha256::
 	This specifies the base commit for this patch using the SHA-256 object ID, as
 	specified in the `gpgsig-sha256` header.
+mailmap-author::
+	This specifies the mailmap entry to associate with the email address or other
+	identifier in the `From:` header.
 
 Examples
 --------
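As a reading aid for the "Specifications" and "Emission" sections of the patch above, the sketch below shows one way the hex-based IDs and the `user.idFormat` fallback order could work. It is an illustration under assumptions, not the proposed implementation: the function names and the `openpgp-v4`/`x509` labels are hypothetical, configuration is modelled as a plain dictionary, and "try the next item when a value is unset" is my reading of "an order of items to try". The OpenPGP domain used is the `_.v4.openpgp.id.git-scm.com` form named in the specification text.

```python
# Domains and fingerprint lengths (in hex digits) taken from the specification text.
HEX_ID_KINDS = {
    "openpgp-v4": ("_.v4.openpgp.id.git-scm.com", 40),  # SHA-1 fingerprint
    "x509": ("_.sha256.x509.id.git-scm.com", 64),       # SHA-256 fingerprint
}


def hex_key_id(kind: str, fingerprint: str) -> str:
    """Build '<lowercase-hex-fingerprint>@<domain>' for a hex-style key type."""
    domain, length = HEX_ID_KINDS[kind]
    digits = fingerprint.replace(" ", "").replace(":", "").lower()
    if len(digits) != length or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError(f"unexpected {kind} fingerprint: {fingerprint!r}")
    return f"{digits}@{domain}"


def resolve_identity(config: dict, key_id=None) -> str:
    """Return the first available value in the order given by user.idFormat."""
    order = config.get("user.idFormat", "id:email:key").split(":")
    candidates = {
        "id": config.get("user.id"),
        "email": config.get("user.email"),
        "key": key_id,  # ID derived from the signing key, if one is configured
    }
    for item in order:
        if item not in candidates:
            raise ValueError(f"unknown user.idFormat item: {item!r}")
        if candidates[item]:
            return candidates[item]
    raise LookupError("no identity available for the configured order")
```

With the default order, a user who sets only `user.email` keeps today's behaviour, while setting `user.idFormat` to `key` switches the same user to the key-derived ID.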
The original design of Git embeds a personal name and email in every
commit. This has lots of downsides, including the following.

First, people do not want to bake an email into an immutable Merkle tree
that they send everywhere. Spam, whether in general or by recruiters,
is a problem, and even when it's not, people change companies or
institutions and emails become invalid.

Second, some people prefer to operate anonymously and don't want to
specify personal details everywhere.

Third, and most important, people change names. This happens for many
reasons, but it comes up most saliently for transgender people, who
frequently change their name as part of their transition. Referring to
a transgender person's former name, their "deadname", is considered
inappropriate.

We have a solution that can map former personal names and emails into
current ones, the mailmap. However, this last case poses a problem,
because we don't really want to correlate the person's deadname (or
their email, which may contain their deadname) right next to their
current name.

Several solutions have been proposed for this case, including hashing or
encoding the old information, but these are all easily invertible.
Instead, let's propose a new form of identifier which is opaque and some
mailmap improvements to store the mailmap information outside of the
main history.

Propose that users use the fingerprint of a cryptographic key as part of
a special-form email which is not valid according to RFC 1123, but is
accepted by earlier versions of Git. Now that we have SSH signing and
OpenSSH is available on all major platforms, creating a unique ID is as
easy as running ssh-keygen. This approach results in an identifier
which is unique, deterministic, and completely anonymous.

Propose this new option instead of using a name and email, although
users can continue to use those as before if they prefer.
Continue to associate personal information with this opaque identifier
using the mailmap, but in such a way that it lives in a special ref
outside of the history and that ref is customarily kept squashed to a
single commit. Create a special RFC 5322 header to associate a mailmap
entry with the user's opaque identifier when sending a patch if desired.

Because the mailmap now lives outside the history in a single squashed
commit, a user may simply update their name by sending a new patch with
the same opaque ID, or proposing a change to the mailmap independently.
A person's former name or email address is not retained in the history
(unless the project chooses to do that for the mailmap ref).

Since many people use forges for hosting their code and forges offer
commit verification and SSH access, it is extremely easy for a forge to
associate a commit with this new opaque identifier with a user, since
they probably already have this information. Thus, for projects which
use solely a forge-based development workflow, no mailmap entry need
even be created unless one is desired. If one is desired, it may be
able to be created and updated automatically as part of the forge's
normal infrastructure simply upon sending a patch.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Documentation/technical/anonymous-id.txt      | 143 ++++++++++++++++++
 .../technical/format-patch-metadata.txt       |   3 +
 2 files changed, 146 insertions(+)
 create mode 100644 Documentation/technical/anonymous-id.txt