Message ID | 20240312091750.GP95609@coredump.intra.peff.net (mailing list archive) |
---|---|
State | Accepted |
Commit | 8b311478ad16b2fe9d2f5b5febec9f5e8f7fd52d |
Headers | show |
Series | allow multi-byte core.commentChar | expand |
Thanks for your work on this. Now I can use dingbats as my comment char. On Tue, Mar 12, 2024, at 10:17, Jeff King wrote: > diff --git a/Documentation/config/core.txt > b/Documentation/config/core.txt > index 0e8c2832bf..c86b8c8408 100644 > --- a/Documentation/config/core.txt > +++ b/Documentation/config/core.txt > @@ -523,7 +523,9 @@ core.commentChar:: > Commands such as `commit` and `tag` that let you edit > messages consider a line that begins with this character > commented, and removes them after the editor returns > - (default '#'). > + (default '#'). Note that this option can take values larger than > + a byte (whether a single multi-byte character, or you > + could even go wild with a multi-character sequence). I don’t know if this expanded description focuses a bit much on the history of the change[1] or if it is intentionally indirect about this char-is-really-a-string behavior as a sort of easter egg.[2] Maybe it could be more directly stated like: “ Note that this variable can in fact be a string like `foo`; it doesn’t have to be a single character. (Hopefully UTF-8 is implied by “foo”. Or else “føø”.) Terms like “a byte” and “multi-byte characters” seem a bit too technical in this context when you can just say “string”. † 1: (1) What’s a “char”, is it ASCII? (2) It’s ASCII but could in principle be made multi-byte (3) And also a multi-byte *string*, right? (4) … † 2: In five years: (1) How come this Git tutorial’s commit message template has `(commit)` as the ignore-these-lines marker? How did he abuse “comment char” to make a long string? (2) Actually… ❦ Please enter the email reply. Lines starting with '❦' will be ignored, ❦ and an empty message aborts the sendout.
"Kristoffer Haugsbakk" <code@khaugsbakk.name> writes: >> + (default '#'). Note that this option can take values larger than >> + a byte (whether a single multi-byte character, or you >> + could even go wild with a multi-character sequence). > > I don’t know if this expanded description focuses a bit much on the > history of the change[1] or if it is intentionally indirect about this > char-is-really-a-string behavior as a sort of easter egg.[2] > Maybe it could be more directly stated like: > > “ Note that this variable can in fact be a string like `foo`; it > doesn’t have to be a single character. > > (Hopefully UTF-8 is implied by “foo”. Or else “føø”.) That's definitely an improvement, but I would say that using a dingbat instad of "foo", and "single character" -> "single ASCII character" (or "single byte") would make it even clearer. Thanks.
On Wed, Mar 13, 2024 at 07:23:25PM +0100, Kristoffer Haugsbakk wrote: > Thanks for your work on this. Now I can use dingbats as my comment char. Truly we have entered a golden age of technology. ;) > > @@ -523,7 +523,9 @@ core.commentChar:: > > Commands such as `commit` and `tag` that let you edit > > messages consider a line that begins with this character > > commented, and removes them after the editor returns > > - (default '#'). > > + (default '#'). Note that this option can take values larger than > > + a byte (whether a single multi-byte character, or you > > + could even go wild with a multi-character sequence). > > I don’t know if this expanded description focuses a bit much on the > history of the change[1] or if it is intentionally indirect about this > char-is-really-a-string behavior as a sort of easter egg.[2] Mostly I was worried that people would take "char" in the name to assume it could only be a single byte (I had originally even started the new sentence with "Despite the word 'char' in the name, this option can..."). And that is not just history, but a name we are stuck with forever[1]. Certainly "char" is an ambiguous term, though. I didn't mean to leave char-is-a-string as an easter egg; that's what I meant by "multi-character sequence". Certainly "string" is a shorter way of saying that. ;) But I wasn't sure its meaning would be obvious without the word "multi-character". Giving an example as you suggested does help that. That said... > Maybe it could be more directly stated like: > > “ Note that this variable can in fact be a string like `foo`; it > doesn’t have to be a single character. I actually do think the "string" nature is mostly uninteresting, and I'd be OK leaving it as an easter egg. What your suggestion doesn't say is that multi-byte characters are OK. But if we think people will just assume that in a modern UTF-8 world, then maybe we don't need to say anything at all? > (Hopefully UTF-8 is implied by “foo”. Or else “føø”.) It actually does not have to be UTF-8. If you use an alternate encoding in your editor (and probably set i18n.commitEncoding to match), I think everything might just work. (Though to be clear, I think anybody using non-UTF8 in 2024 deserves our pity either for being crazy or for being stuck working on an antiquated system). -Peff
On Fri, Mar 15, 2024, at 06:59, Jeff King wrote: > On Wed, Mar 13, 2024 at 07:23:25PM +0100, Kristoffer Haugsbakk wrote: > >> Thanks for your work on this. Now I can use dingbats as my comment char. > > Truly we have entered a golden age of technology. ;) QoL features can in aggregate have a surprising impact :) > >> > @@ -523,7 +523,9 @@ core.commentChar:: >> > Commands such as `commit` and `tag` that let you edit >> > messages consider a line that begins with this character >> > commented, and removes them after the editor returns >> > - (default '#'). >> > + (default '#'). Note that this option can take values larger than >> > + a byte (whether a single multi-byte character, or you >> > + could even go wild with a multi-character sequence). >> >> I don’t know if this expanded description focuses a bit much on the >> history of the change[1] or if it is intentionally indirect about this >> char-is-really-a-string behavior as a sort of easter egg.[2] > > Mostly I was worried that people would take "char" in the name to assume > it could only be a single byte (I had originally even started the new > sentence with "Despite the word 'char' in the name, this option > can..."). And that is not just history, but a name we are stuck with > forever[1]. Missing footnote or referring to my footnote? My suggestion was to use a `core.commentString` alias. Which might matter for new answers to questions about its use. It might not matter if in practice most people get their config tips from 1500 point StackOverflow question about how git-commit(1) keeps swallowing their GitHub issue numbers (due to automatic linewrap) from 2011. > Certainly "char" is an ambiguous term, though. I didn't mean to leave > char-is-a-string as an easter egg; that's what I meant by > "multi-character sequence". Certainly "string" is a shorter way of > saying that. ;) But I wasn't sure its meaning would be obvious without > the word "multi-character". Giving an example as you suggested does > help that. > > That said... > >> Maybe it could be more directly stated like: >> >> “ Note that this variable can in fact be a string like `foo`; it >> doesn’t have to be a single character. > > I actually do think the "string" nature is mostly uninteresting, and I'd > be OK leaving it as an easter egg. To my mind a string subsumes a char (multi- or not). Like in programming languages: some might be used to single-char `#`, but I don’t think they do a double take when they see languages with `//` or `--`. > What your suggestion doesn't say is that multi-byte characters are > OK. But if we think people will just assume that in a modern UTF-8 > world, then maybe we don't need to say anything at all? Given that we’re mostly in the context of a commit message, an ASCII-only restriction would feel archaic. I guess it depends on what the *normal* is in the documentation at large. As a user I’m used to Git handling the text that I give it. > It actually does not have to be UTF-8. Good point. Unicode is more appropriate. > (Though to be clear, I think anybody using non-UTF8 in 2024 deserves > our pity either for being crazy or for being stuck working on an > antiquated system). I honestly feel blessed that I have to worry so little about text encoding.
On Fri, Mar 15, 2024 at 08:16:53AM +0100, Kristoffer Haugsbakk wrote: > > Mostly I was worried that people would take "char" in the name to assume > > it could only be a single byte (I had originally even started the new > > sentence with "Despite the word 'char' in the name, this option > > can..."). And that is not just history, but a name we are stuck with > > forever[1]. > > Missing footnote or referring to my footnote? > > My suggestion was to use a `core.commentString` alias. Which might > matter for new answers to questions about its use. It might not matter > if in practice most people get their config tips from 1500 point > StackOverflow question about how git-commit(1) keeps swallowing their > GitHub issue numbers (due to automatic linewrap) from 2011. Heh, missing footnote. I was going to say "we could introduce core.commentStr or similar", but after your comment I searched in the archive and see that you did indeed already suggest it. I'm not sure if it would make things more or less confusing to have two related values. One nice side effect is that the new variable would be ignored by older versions of Git (whereas by extending core.commentChar, you end up with config that causes older versions to barf). That probably doesn't matter that much for most users, but as somebody who works on Git I frequently run old versions for bug testing, bisection, and so forth. > > I actually do think the "string" nature is mostly uninteresting, and I'd > > be OK leaving it as an easter egg. > > To my mind a string subsumes a char (multi- or not). Like in programming > languages: some might be used to single-char `#`, but I don’t think they > do a double take when they see languages with `//` or `--`. Hmm, good point. I was mostly focused on UTF-8 characters, but "//" is quite a reasonable thing for people to try. It is probably a better example than "foo". > > What your suggestion doesn't say is that multi-byte characters are > > OK. But if we think people will just assume that in a modern UTF-8 > > world, then maybe we don't need to say anything at all? > > Given that we’re mostly in the context of a commit message, an > ASCII-only restriction would feel archaic. > > I guess it depends on what the *normal* is in the documentation at > large. As a user I’m used to Git handling the text that I give it. Right, that's what I was asking. To me "character" means an ASCII byte, but I think I might be archaic myself. ;) If most of our readers would just assume that multi-byte characters work, perhaps it is confusing things to even mention it. > > It actually does not have to be UTF-8. > > Good point. Unicode is more appropriate. I think other Unicode encodings are likely to have problems (because they embed NULs). Specifically I was thinking that you could probably get away with latin1 or other 8-bit encodings. But again, I really hope nobody is doing that anymore. So anyway, adapting your original suggestion based on discussion in the thread, maybe squash in (to the final patch): diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt index c86b8c8408..c5a8033df9 100644 --- a/Documentation/config/core.txt +++ b/Documentation/config/core.txt @@ -523,9 +523,8 @@ core.commentChar:: Commands such as `commit` and `tag` that let you edit messages consider a line that begins with this character commented, and removes them after the editor returns - (default '#'). Note that this option can take values larger than - a byte (whether a single multi-byte character, or you - could even go wild with a multi-character sequence). + (default '#'). Note that this variable can be a string like + `//` or `⁑⁕⁑`; it doesn't have to be a single ASCII character. + If set to "auto", `git-commit` would select a character that is not the beginning character of any line in existing commit messages. That's assuming we don't want to go the commentString route, which would require a bit more re-working of the patch. I'm also open to a more clever or pretty multi-byte example if we have one. ;) -Peff
> diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt > index c86b8c8408..c5a8033df9 100644 > --- a/Documentation/config/core.txt > +++ b/Documentation/config/core.txt > @@ -523,9 +523,8 @@ core.commentChar:: > Commands such as `commit` and `tag` that let you edit > messages consider a line that begins with this character > commented, and removes them after the editor returns > - (default '#'). Note that this option can take values larger than > - a byte (whether a single multi-byte character, or you > - could even go wild with a multi-character sequence). > + (default '#'). Note that this variable can be a string like > + `//` or `⁑⁕⁑`; it doesn't have to be a single ASCII character. This is perfect :)
Jeff King <peff@peff.net> writes: > + (default '#'). Note that this variable can be a string like > + `//` or `⁑⁕⁑`; it doesn't have to be a single ASCII character. Looking good. > That's assuming we don't want to go the commentString route, which would > require a bit more re-working of the patch. I'm also open to a more > clever or pretty multi-byte example if we have one. ;) Adding core.commentString can be done long after the dust settles and I would expect that most of the changes in the patch would not have to be updated. The parts that use comment_line_str variable do not have to change, the documentation needs "core.commentString is a synonym for core.commentChar, the latter of which is understood by older versions of Git (but they may use only the first byte of the string)" or something, but other than that, the existing text after this patch does not have to be updated. If we add a proper synonym support to the config machinery, that would be a sizable project, but otherwise it would be just another "if (!strcmp()) var = val". Stepping back a bit, one thing that we do need to mention in this round is what happens when you use multi-byte sequence and have it accessed by existing versions of Git. "use only the first byte" I wrote above came out of thin air without experimenting or reading the code, but something like that ought to be part of the "Note that" paragraph above. (default '#'). Note that this variable can be a string like `//` or `⁑⁕⁑`; it doesn't have to be a single ASCII character. Also note that older versions of Git used only the first byte (not necessarily a character) of the value of this variable, so you may want to be careful if you plan to use versions of Git older than 2.45. or something like that, perhaps.
On Fri, Mar 15, 2024 at 08:40:56AM -0700, Junio C Hamano wrote: > > That's assuming we don't want to go the commentString route, which would > > require a bit more re-working of the patch. I'm also open to a more > > clever or pretty multi-byte example if we have one. ;) > > Adding core.commentString can be done long after the dust settles > and I would expect that most of the changes in the patch would not > have to be updated. The parts that use comment_line_str variable do > not have to change, the documentation needs "core.commentString is a > synonym for core.commentChar, the latter of which is understood by > older versions of Git (but they may use only the first byte of the > string)" or something, but other than that, the existing text after > this patch does not have to be updated. If we add a proper synonym > support to the config machinery, that would be a sizable project, > but otherwise it would be just another "if (!strcmp()) var = val". Yeah, I agree we could add core.commentString on top of what's here, as long as we're OK with core.commentChar starting to accept strings in the meantime. Which is probably reasonable, and in which case the code portion of the patch really is just: diff --git a/config.c b/config.c index 92c752ed9f..13fb922bf5 100644 --- a/config.c +++ b/config.c @@ -1560,7 +1560,8 @@ static int git_default_core_config(const char *var, const char *value, if (!strcmp(var, "core.editor")) return git_config_string(&editor_program, var, value); - if (!strcmp(var, "core.commentchar")) { + if (!strcmp(var, "core.commentchar") || + !strcmp(var, "core.commentstring")) { if (!value) return config_error_nonbool(var); else if (!strcasecmp(value, "auto")) (the real work of course being in docs and tests). If we wanted to distinguish them more (say, core.commentChar remains as-is but core.commentString allows strings and takes precedence), then we'd need to do it now to avoid flip-flopping between versions. I don't see a huge benefit in restricting commentChar though. > Stepping back a bit, one thing that we do need to mention in this > round is what happens when you use multi-byte sequence and have it > accessed by existing versions of Git. "use only the first byte" I > wrote above came out of thin air without experimenting or reading > the code, but something like that ought to be part of the "Note > that" paragraph above. > > (default '#'). Note that this variable can be a string like > `//` or `⁑⁕⁑`; it doesn't have to be a single ASCII character. > Also note that older versions of Git used only the first byte > (not necessarily a character) of the value of this variable, > so you may want to be careful if you plan to use versions of > Git older than 2.45. The current code barfs for anything larger than a byte: $ git.v2.44.0 -c core.commentchar=foo stripspace -s error: core.commentChar should only be one ASCII character fatal: unable to parse 'core.commentchar' from command-line config I'm mixed on these sorts of version-specific notes in the documentation. For people who aren't mixing versions, the history is useless noise (whose value decreases as time goes on and 2.45 becomes "old" itself). For people who do use older versions, they'd quickly get an error like the one above. So I dunno. I'm not strictly opposed, but if this is something we think is worth warning about, then that implies to me that it is worth providing a more ergonomic solution like core.commentString. -Peff
Jeff King <peff@peff.net> writes: > So anyway, adapting your original suggestion based on discussion in the > thread, maybe squash in (to the final patch): > > diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt > index c86b8c8408..c5a8033df9 100644 > --- a/Documentation/config/core.txt > +++ b/Documentation/config/core.txt > @@ -523,9 +523,8 @@ core.commentChar:: > Commands such as `commit` and `tag` that let you edit > messages consider a line that begins with this character > commented, and removes them after the editor returns > - (default '#'). Note that this option can take values larger than > - a byte (whether a single multi-byte character, or you > - could even go wild with a multi-character sequence). > + (default '#'). Note that this variable can be a string like > + `//` or `⁑⁕⁑`; it doesn't have to be a single ASCII character. > + > If set to "auto", `git-commit` would select a character that is not > the beginning character of any line in existing commit messages. > > > That's assuming we don't want to go the commentString route, which would > require a bit more re-working of the patch. I'm also open to a more > clever or pretty multi-byte example if we have one. ;) It has been 10 days since this discussion petered out. My preference is to introduce core.commentString to avoid confusion coming from an older Git using the first-byte of a multi-byte string, or dying upon reading a configuration file meant for a newer Git, and then let core.commentString override core.commentChar, but I would prefer to see the discussion participants to raise their opinions and reach a conclusion. Thanks.
On Tue, Mar 26, 2024, at 23:10, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > >> So anyway, adapting your original suggestion based on discussion in the >> thread, maybe squash in (to the final patch): >> >> diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt >> index c86b8c8408..c5a8033df9 100644 >> --- a/Documentation/config/core.txt >> +++ b/Documentation/config/core.txt >> @@ -523,9 +523,8 @@ core.commentChar:: >> Commands such as `commit` and `tag` that let you edit >> messages consider a line that begins with this character >> commented, and removes them after the editor returns >> - (default '#'). Note that this option can take values larger than >> - a byte (whether a single multi-byte character, or you >> - could even go wild with a multi-character sequence). >> + (default '#'). Note that this variable can be a string like >> + `//` or `⁑⁕⁑`; it doesn't have to be a single ASCII character. >> + >> If set to "auto", `git-commit` would select a character that is not >> the beginning character of any line in existing commit messages. >> >> >> That's assuming we don't want to go the commentString route, which would >> require a bit more re-working of the patch. I'm also open to a more >> clever or pretty multi-byte example if we have one. ;) > > It has been 10 days since this discussion petered out. > > My preference is to introduce core.commentString to avoid confusion > coming from an older Git using the first-byte of a multi-byte > string, or dying upon reading a configuration file meant for a newer > Git, and then let core.commentString override core.commentChar, but > I would prefer to see the discussion participants to raise their > opinions and reach a conclusion. > > Thanks. Sounds good to me.
On Tue, Mar 26, 2024 at 03:10:23PM -0700, Junio C Hamano wrote: > It has been 10 days since this discussion petered out. I wrote the last message, so I was waiting for you to respond. ;) https://lore.kernel.org/git/20240316055013.GA32145@coredump.intra.peff.net/ > My preference is to introduce core.commentString to avoid confusion > coming from an older Git using the first-byte of a multi-byte > string, or dying upon reading a configuration file meant for a newer > Git, and then let core.commentString override core.commentChar, but > I would prefer to see the discussion participants to raise their > opinions and reach a conclusion. OK. I don't have a strong opinion. Are you OK with core.commentString as a strict synonym (so last-one-wins and either name overwrites previous)? Or do you want an override (i.e., commentString always overrides commentChar, regardless of order). I think it's mostly academic, and the strict synonym version is much easier to implement. -Peff
Jeff King <peff@peff.net> writes: > OK. I don't have a strong opinion. Are you OK with core.commentString as > a strict synonym (so last-one-wins and either name overwrites previous)? > Or do you want an override (i.e., commentString always overrides > commentChar, regardless of order). I think it's mostly academic, and the > strict synonym version is much easier to implement. When I wrote it, I meant "String is a successor of Char, if both exists that is used regardless of the order", but either is OK. Older versions of Git would not understand the "String" version, so it matters only to those who uses mixed versions of Git and they can control the last-one-wins in their configuration file, I would guess. Thanks.
diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt index 0e8c2832bf..c86b8c8408 100644 --- a/Documentation/config/core.txt +++ b/Documentation/config/core.txt @@ -523,7 +523,9 @@ core.commentChar:: Commands such as `commit` and `tag` that let you edit messages consider a line that begins with this character commented, and removes them after the editor returns - (default '#'). + (default '#'). Note that this option can take values larger than + a byte (whether a single multi-byte character, or you + could even go wild with a multi-character sequence). + If set to "auto", `git-commit` would select a character that is not the beginning character of any line in existing commit messages. diff --git a/config.c b/config.c index 7e5dbca4bd..92c752ed9f 100644 --- a/config.c +++ b/config.c @@ -1565,13 +1565,13 @@ static int git_default_core_config(const char *var, const char *value, return config_error_nonbool(var); else if (!strcasecmp(value, "auto")) auto_comment_line_char = 1; - else if (value[0] && !value[1]) { - if (value[0] == '\n') - return error(_("core.commentChar cannot be newline")); - comment_line_str = xstrfmt("%c", value[0]); + else if (value[0]) { + if (strchr(value, '\n')) + return error(_("core.commentChar cannot contain newline")); + comment_line_str = xstrdup(value); auto_comment_line_char = 0; } else - return error(_("core.commentChar should only be one ASCII character")); + return error(_("core.commentChar must have at least one character")); return 0; } diff --git a/t/t0030-stripspace.sh b/t/t0030-stripspace.sh index e399dd9189..a161faf702 100755 --- a/t/t0030-stripspace.sh +++ b/t/t0030-stripspace.sh @@ -403,7 +403,12 @@ test_expect_success 'strip comments with changed comment char' ' test_expect_success 'newline as commentchar is forbidden' ' test_must_fail git -c core.commentChar="$LF" stripspace -s 2>err && - grep "core.commentChar cannot be newline" err + grep "core.commentChar cannot contain newline" err +' + +test_expect_success 'empty commentchar is forbidden' ' + test_must_fail git -c core.commentchar= stripspace -s 2>err && + grep "core.commentChar must have at least one character" err ' test_expect_success '-c with single line' ' diff --git a/t/t7507-commit-verbose.sh b/t/t7507-commit-verbose.sh index c3281b192e..4c7db19ce7 100755 --- a/t/t7507-commit-verbose.sh +++ b/t/t7507-commit-verbose.sh @@ -101,6 +101,16 @@ test_expect_success 'verbose diff is stripped out with set core.commentChar' ' test_grep "Aborting commit due to empty commit message." err ' +test_expect_success 'verbose diff is stripped with multi-byte comment char' ' + ( + GIT_EDITOR=cat && + export GIT_EDITOR && + test_must_fail git -c core.commentchar="foo>" commit -a -v >out 2>err + ) && + grep "^foo> " out && + test_grep "Aborting commit due to empty commit message." err +' + test_expect_success 'status does not verbose without --verbose' ' git status >actual && ! grep "^diff --git" actual diff --git a/t/t7508-status.sh b/t/t7508-status.sh index a3c18a4fc2..10ed8b32bc 100755 --- a/t/t7508-status.sh +++ b/t/t7508-status.sh @@ -1403,7 +1403,9 @@ test_expect_success "status (core.commentchar with submodule summary)" ' test_expect_success "status (core.commentchar with two chars with submodule summary)" ' test_config core.commentchar ";;" && - test_must_fail git -c status.displayCommentPrefix=true status + sed "s/^/;/" <expect >expect.double && + git -c status.displayCommentPrefix=true status >output && + test_cmp expect.double output ' test_expect_success "--ignore-submodules=all suppresses submodule summary" '
Now that all of the code handles multi-byte comment characters, it's safe to allow users to set them. There is one special case I kept: we still will not allow an empty string for the commentChar. While it might make sense in some contexts (e.g., output where you don't want any comment prefix), there are plenty where it will behave badly (e.g., all of our starts_with() checks will indicate that every line is a comment!). It might be reasonable to assign some meaningful semantics, but it would probably involve checking how each site behaves. In the interim let's forbid it and we can loosen things later. Likewise, the "commentChar cannot be a newline" rule is now extended to "it cannot contain a newline" (for the same reason: it can confuse our parsing loops). Since comment_line_str is used in many parts of the code, it's hard to cover all possibilities with tests. We can convert the existing double-semicolon prefix test to show that "git status" works. And we'll give it a more challenging case in t7507, where we confirm that git-commit strips out the commit template along with any --verbose text when reading the edited commit message back in. That covers the basics, though it's possible there could be issues in more exotic spots (e.g., the sequencer todo list uses its own code). Signed-off-by: Jeff King <peff@peff.net> --- Documentation/config/core.txt | 4 +++- config.c | 10 +++++----- t/t0030-stripspace.sh | 7 ++++++- t/t7507-commit-verbose.sh | 10 ++++++++++ t/t7508-status.sh | 4 +++- 5 files changed, 27 insertions(+), 8 deletions(-)