Message ID | pull.1902.git.1744041163929.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [RFC] rebase -m: partial support for copying extra commit headers | expand |
On 2025-04-07 at 15:52:43, Phillip Wood via GitGitGadget wrote: > From: Phillip Wood <phillip.wood@dunelm.org.uk> > > [RFC] rebase -m: partial support for copying extra commit headers > > This patch is largely a response to > https://lore.kernel.org/git/Z-5rpWKAVPmz32jC@pks.im/ . I'm in two minds > about whether we should consider merging such partial support but if it > helps forges preserve extra commit headers then it may well be worth it. I'd like to see command-line options to control this and ideally a configuration option. Right now, we know nothing about these extra headers, including an expected format. If a future version of Git (say, 3.0) adds a new header and the user includes invalid data in this extra header (which happens all the time with author and committer information), then 2.50 will propagate it on rebase and it won't be fixed until the user uses a version of Git that understands the header and can fsck it correctly. That's not really great, since it means we can unknowingly spread corruption. I am pretty sure that at $DAYJOB we'll need to have a discussion about whether we want to propagate these headers during rebase and I'm personally leaning against it. Why, you ask? I've seen at least the following types of corruption: * Missing timezones * Timezones with less than four digits * Valid timezones padded to more than four digits with zeros * Timezones which don't exist and never have (e.g., +1700) * Timezones which are so absurdly large that they push the date to a year when nobody alive now will still be living * Date stamps that are larger than 2^64 * Date stamps which are smaller than 2^64 but beyond the expected life of the Sun * Extra angle brackets in the email field * Nothing in between the email brackets * Nothing before the email brackets (no name at all) * Names which are not UTF-8 but without an encoding header * Names which are not valid in the specified encoding * Emails which are not valid UTF-8[0] * Emails which don't meet the (ludicrously generous to the point of being nearly unparseable) RFC production * Encodings which are not valid IANA charsets * Messages with no body and no blank line (just the newline at the end of the final header) * gpgsig headers that include random non-ASCII bytes and control characters[1] Note that all of these must be parsed in some meaningful way because users don't want their forge to serve them a 500 despite them having sent wildly invalid data. I encountered these during part of our transition from Rugged to Git (reftable, SHA-256) and they definitely added a lot of interesting complications (plus the need for lots of tests). Considering that writing valid data should not be that hard[2] (and should definitely be a priority) but apparently is for many people, I'm very wary of us propagating headers we're not ready to fsck and I'd like to have an out for users and forges who would like to be a little more careful. With those constraints, I'm not totally opposed in principle to this feature. I see Patrick is CC'd here and I'm interested in his thoughts, as well as, of course, those of anyone else as well. [0] SMTPUTF8 (RFC 6531 et al.) specifies that mailbox names may now contain UTF-8. For instance, you can email
Hi brian On 08/04/2025 02:22, brian m. carlson wrote: > On 2025-04-07 at 15:52:43, Phillip Wood via GitGitGadget wrote: >> From: Phillip Wood <phillip.wood@dunelm.org.uk> >> >> [RFC] rebase -m: partial support for copying extra commit headers >> >> This patch is largely a response to >> https://lore.kernel.org/git/Z-5rpWKAVPmz32jC@pks.im/ . I'm in two minds >> about whether we should consider merging such partial support but if it >> helps forges preserve extra commit headers then it may well be worth it. > > I'd like to see command-line options to control this and ideally a > configuration option. Right now, we know nothing about these extra > headers, including an expected format. If a future version of Git (say, > 3.0) adds a new header and the user includes invalid data in this extra > header (which happens all the time with author and committer > information), then 2.50 will propagate it on rebase and it won't be > fixed until the user uses a version of Git that understands the header > and can fsck it correctly. That's not really great, since it means we > can unknowingly spread corruption. We could certainly add some way to make this opt-in if there is a desire for it and you make a good point about compatibility if we add a new commit header. I'm not sure I'd describe preserving these headers when rebasing as spreading corruption though as we're simply rewriting existing commits. If the user chose to merge rather than rebase we'd still have the same issues without creating any new commits. > I am pretty sure that at $DAYJOB we'll need to have a discussion about > whether we want to propagate these headers during rebase and I'm > personally leaning against it. My understanding is that GitHub has been using "git replay" for rebases and therefore copying extra commit headers since the middle of 2023 [1,2]. The message I linked to in my original mail suggests that the "change-id" header is preserved when rebasing on GitHub. > Why, you ask? I've seen at least the following types of corruption: > > * Missing timezones > * Timezones with less than four digits > * Valid timezones padded to more than four digits with zeros > * Timezones which don't exist and never have (e.g., +1700) > * Timezones which are so absurdly large that they push the date to a > year when nobody alive now will still be living > * Date stamps that are larger than 2^64 > * Date stamps which are smaller than 2^64 but beyond the expected life > of the Sun > * Extra angle brackets in the email field > * Nothing in between the email brackets > * Nothing before the email brackets (no name at all) > * Names which are not UTF-8 but without an encoding header > * Names which are not valid in the specified encoding > * Emails which are not valid UTF-8[0] > * Emails which don't meet the (ludicrously generous to the point of > being nearly unparseable) RFC production > * Encodings which are not valid IANA charsets > * Messages with no body and no blank line (just the newline at the end > of the final header) > * gpgsig headers that include random non-ASCII bytes and control > characters[1] Thanks for sharing that, it is an interesting list. On the subject of encoding I do think our documentation could be clearer that the encoding applies to all the headers as well as the commit message. As far as I can see it only mentions the commit message, not the author or committer identities but repo_logmsg_reencode() re-encodes the whole commit buffer. Out of interest do you think we could be doing a better job with fsck to pick up some of these problems earlier? I think "git rebase" only cares that the author identity can be parsed by split_ident() which is fairly lenient. > I see Patrick is CC'd here and I'm interested in his thoughts, as well > as, of course, those of anyone else as well. Yes me too Thanks for your thoughtful and intereting reply Phillip [1] https://github.blog/changelog/2023-06-28-rebase-commits-now-created-using-the-merge-ort-strategy/ [2] https://github.blog/engineering/infrastructure/scaling-merge-ort-across-github/
Phillip Wood <phillip.wood123@gmail.com> writes: > Thanks for sharing that, it is an interesting list. On the subject of > encoding I do think our documentation could be clearer that the > encoding applies to all the headers as well as the commit message. As > far as I can see it only mentions the commit message, not the author > or committer identities but repo_logmsg_reencode() re-encodes the > whole commit buffer. Out of interest do you think we could be doing a > better job with fsck to pick up some of these problems earlier? > > I think "git rebase" only cares that the author identity can be parsed > by split_ident() which is fairly lenient. "rebase" already knows that it has to be picky which header fields need to be propagated and which must not be, doesn't it? Can the same be said for arbitrary "extra" header fields? Information on some of the header fields are inherently destroyed when you refine an existing commit. The value on the 'parent' headers may need to be updated (unless "rebase" is fast-forwarding an earlier part of the changes on the same base), the 'author' information usually wants to be preserved, but when the scale of the change since the previous iteration is so large, you may give it a new authorship, the 'committer' information should record who created the new commit object that records the result of rebasing, the 'gpgsig' and 'gigsig-sha256' header fields would lose validity if you are creating a new object that is different from the original by even a single bit (if we are somehow recording which predecessor commit the new one replaces, it certainly is safe to drop these that have lost validity, as we can go back to the predecessor to see it has a valid signature, and the change in the new commit that lost the signature fields is the moral equivalent of the original. Otherwise, carrying a stale signature may serve as a reminder that the commit was rewritten in the past---I dunno). And so on. Now, one thing that worries me is this. If "extra" commit headers include truly extra fields with unknown semantics, the machinery cannot tell which ones are safe and benefitial to propagate. Thanks.
Hi Junio On 08/04/2025 15:44, Junio C Hamano wrote: > Phillip Wood <phillip.wood123@gmail.com> writes: > >> Thanks for sharing that, it is an interesting list. On the subject of >> encoding I do think our documentation could be clearer that the >> encoding applies to all the headers as well as the commit message. As >> far as I can see it only mentions the commit message, not the author >> or committer identities but repo_logmsg_reencode() re-encodes the >> whole commit buffer. Out of interest do you think we could be doing a >> better job with fsck to pick up some of these problems earlier? >> >> I think "git rebase" only cares that the author identity can be parsed >> by split_ident() which is fairly lenient. > > "rebase" already knows that it has to be picky which header fields > need to be propagated and which must not be, doesn't it? I'd say it's picky because of the way it is implemented - it calls "git commit" and there is no way to set "extra" header fields when doing that. > Can the > same be said for arbitrary "extra" header fields? > > Information on some of the header fields are inherently destroyed > when you refine an existing commit. The value on the 'parent' > headers may need to be updated (unless "rebase" is fast-forwarding > an earlier part of the changes on the same base), the 'author' > information usually wants to be preserved, but when the scale of the > change since the previous iteration is so large, you may give it a > new authorship, the 'committer' information should record who > created the new commit object that records the result of rebasing, > the 'gpgsig' and 'gigsig-sha256' header fields would lose validity > if you are creating a new object that is different from the original > by even a single bit (if we are somehow recording which predecessor > commit the new one replaces, it certainly is safe to drop these that > have lost validity, as we can go back to the predecessor to see it > has a valid signature, and the change in the new commit that lost > the signature fields is the moral equivalent of the original. > Otherwise, carrying a stale signature may serve as a reminder that > the commit was rewritten in the past---I dunno). And so on. > > Now, one thing that worries me is this. If "extra" commit headers > include truly extra fields with unknown semantics, the machinery > cannot tell which ones are safe and benefitial to propagate. That's true and we could have a config key to select which "extra" headers are propagated. We do however unconditionally propagate all "extra" headers when amending a commit with "git commit --amend" and when rewriting it with "git replay" which I think GitHub have been using to rebase branches for over 18 months. If we're worried about rebase unconditionally copying these headers we do something to stop "git replay" doing the same. On the other hand if "git replay" is being used in the wild without problems maybe we don't need to worry. Best Wishes Phillip
Phillip Wood <phillip.wood123@gmail.com> writes: > That's true and we could have a config key to select which "extra" > headers are propagated. No, please don't. No such config key should ever exist. If something has a defined semantics, which all projects that use Git can agree on, that is solid enough to deserve to be in the header part of the commit object (as opposed to one of the trailers with user-defined semantics that can vary from project to project), there should never be a way to give it different semantics by tweaking whether it is or it is not propagated when rewriting.
diff --git a/sequencer.c b/sequencer.c index ad0ab75c8d4..5b92f77b660 100644 --- a/sequencer.c +++ b/sequencer.c @@ -1527,6 +1527,9 @@ static int parse_head(struct repository *r, struct commit **head) return 0; } +/* Headers to exclude when copying extra commit headers */ +static const char *exclude_gpgsig[] = { "gpgsig", "gpgsig-sha256", NULL }; + /* * Try to commit without forking 'git commit'. In some cases we need * to run 'git commit' to display an error message @@ -1538,6 +1541,7 @@ static int parse_head(struct repository *r, struct commit **head) */ static int try_to_commit(struct repository *r, struct strbuf *msg, const char *author, + struct commit_extra_header *extra_header, struct replay_opts *opts, unsigned int flags, struct object_id *oid) { @@ -1545,7 +1549,7 @@ static int try_to_commit(struct repository *r, struct object_id tree; struct commit *current_head = NULL; struct commit_list *parents = NULL; - struct commit_extra_header *extra = NULL; + struct commit_extra_header *extra = extra_header; struct strbuf err = STRBUF_INIT; struct strbuf commit_msg = STRBUF_INIT; char *amend_author = NULL; @@ -1561,7 +1565,6 @@ static int try_to_commit(struct repository *r, return -1; if (flags & AMEND_MSG) { - const char *exclude_gpgsig[] = { "gpgsig", "gpgsig-sha256", NULL }; const char *out_enc = get_commit_output_encoding(); const char *message = repo_logmsg_reencode(r, current_head, NULL, out_enc); @@ -1714,7 +1717,8 @@ static int try_to_commit(struct repository *r, commit_post_rewrite(r, current_head, oid); out: - free_commit_extra_headers(extra); + if (extra != extra_header) + free_commit_extra_headers(extra); free_commit_list(parents); strbuf_release(&err); strbuf_release(&commit_msg); @@ -1734,6 +1738,7 @@ static int write_rebase_head(struct object_id *oid) static int do_commit(struct repository *r, const char *msg_file, const char *author, + struct commit_extra_header *extra_header, struct replay_opts *opts, unsigned int flags, struct object_id *oid) { @@ -1749,7 +1754,7 @@ static int do_commit(struct repository *r, msg_file); res = try_to_commit(r, msg_file ? &sb : NULL, - author, opts, flags, &oid); + author, extra_header, opts, flags, &oid); strbuf_release(&sb); if (!res) { refs_delete_ref(get_main_ref_store(r), "", @@ -2251,6 +2256,7 @@ static int do_pick_commit(struct repository *r, int res, unborn = 0, reword = 0, allow, drop_commit; enum todo_command command = item->command; struct commit *commit = item->commit; + struct commit_extra_header *extra_header = NULL; if (opts->no_commit) { /* @@ -2391,8 +2397,12 @@ static int do_pick_commit(struct repository *r, strbuf_addstr(&ctx->message, oid_to_hex(&commit->object.oid)); strbuf_addstr(&ctx->message, ")\n"); } - if (!is_fixup(command)) + if (!is_fixup(command)) { author = get_author(msg.message); + if (is_rebase_i(opts)) + extra_header = read_commit_extra_headers(commit, + exclude_gpgsig); + } } ctx->have_message = 1; @@ -2503,8 +2513,8 @@ static int do_pick_commit(struct repository *r, } /* else allow == 0 and there's nothing special to do */ if (!opts->no_commit && !drop_commit) { if (author || command == TODO_REVERT || (flags & AMEND_MSG)) - res = do_commit(r, msg_file, author, opts, flags, - commit? &commit->object.oid : NULL); + res = do_commit(r, msg_file, author, extra_header, opts, + flags, commit? &commit->object.oid : NULL); else res = error(_("unable to parse commit author")); *check_todo = !!(flags & EDIT_MSG); @@ -2535,6 +2545,7 @@ fast_forward_edit: leave: free_message(commit, &msg); free(author); + free_commit_extra_headers(extra_header); update_abort_safety_file(); return res; diff --git a/t/t3404-rebase-interactive.sh b/t/t3404-rebase-interactive.sh index 2aee9789a2f..200f55824c7 100755 --- a/t/t3404-rebase-interactive.sh +++ b/t/t3404-rebase-interactive.sh @@ -2297,6 +2297,54 @@ test_expect_success 'non-merge commands reject merge commits' ' test_cmp expect actual ' +test_expect_success 'unconflicted pick copies extra commit headers' ' + tree="$(git rev-parse C^{tree})" && + parent="$(git rev-parse C^{commit})" && + for i in 1 2 3 4 5 6 7 + do + cat >commit <<-EOF && + tree $tree + parent $parent + author $GIT_AUTHOR_NAME <$GIT_AUTHOR_EMAIL> $GIT_AUTHOR_DATE + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE + x-extra-header value $i + + An empty commit with an extra header $i + EOF + + parent="$(git hash-object -t commit -w commit)" && + eval "oid$i=\$parent" && + test_tick || return 1 + done && + + cat >todo <<-EOF && + pick $oid1 + pick $oid2 + fixup $oid3 + reword $oid4 + edit $oid5 + pick $oid6 + squash $oid7 + EOF + + ( + set_replace_editor todo && + FAKE_COMMIT_AMEND=EDITED git rebase -i --onto A $oid1^ $oid5 + ) && + echo changed >file2 && + git add file2 && + FAKE_COMMIT_AMEND=EDITED git rebase --continue && + j=4 && + for i in 1 2 4 5 6 + do + git cat-file commit HEAD~$j >actual && + sed -n -e /^\$/q -e /^x-extra/p actual >actual.extra-header && + echo "x-extra-header value $i" >expect && + test_cmp expect actual.extra-header && + j=$((j-1)) || return 1 + done +' + # This must be the last test in this file test_expect_success '$EDITOR and friends are unchanged' ' test_editor_unchanged