Message ID | 1c670101fc29a9ccc71cf4d213545a564e14aa05.1710258538.git.dsimic@manjaro.org (mailing list archive) |
---|---|
State | New, archived |
Series | Improve the documentation and test coverage for whitespace and comments |
Dragan Simic <dsimic@manjaro.org> writes:

> Make it more clear what the whitespace characters are in the context of git
> configuration, improve the description of the trailing whitespace handling,
> and correct the description of the value-internal whitespace handling.
>
> Signed-off-by: Dragan Simic <dsimic@manjaro.org>
> ---
>  Documentation/config.txt | 19 +++++++++++-------
>  1 file changed, 11 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index 782c2bab906c..4480bb44203b 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -22,9 +22,10 @@ multivalued.
>  Syntax
>  ~~~~~~
>
> -The syntax is fairly flexible and permissive; whitespaces are mostly
> -ignored. The '#' and ';' characters begin comments to the end of line,
> -blank lines are ignored.
> +The syntax is fairly flexible and permissive. Whitespace characters,
> +which in this context are the space character (SP) and the horizontal
> +tabulation (HT), are mostly ignored. The '#' and ';' characters begin
> +comments to the end of line. Blank lines are ignored.

OK, except for "whitespace characters"---do we need to say
"whitespace characters", after we already listed HT and SP are the
ones, instead of just "whitespaces"?

> @@ -64,12 +65,14 @@ The variable names are case-insensitive, allow only alphanumeric characters
>  and `-`, and must start with an alphabetic character.
>
>  A line that defines a value can be continued to the next line by
> +ending it with a `\`; the backslash and the end-of-line are stripped.
> +Leading whitespace characters after 'name =', the remainder of the
>  line after the first comment character '#' or ';', and trailing
> +whitespace characters of the line are discarded unless they are enclosed
> +in double quotes. This discarding of the trailing whitespace characters
> +also applies after the remainder of the line after the comment character
> +is discarded.

"also" makes it sound as if we do it twice, once to remove trailing
whitespaces after the remainder of the line after '#', and then trim
the trailing whitespaces after we removed the comment.

I wonder if we can make it clearer by following the step-by-step
nature of the earlier part of the paragraph through. We already say
the folded line processing is done first, so break things down in
conceptual phases/steps, perhaps like

 * The backslash at the end-of-line is removed, together with the
   end-of-line, to form a single long line.

 * Anything that comes after the first unquoted comment character,
   either '#' or ';', is discarded.

 * The leading and trailing whitespaces around the value part
   (i.e. what follows 'name =') are discarded.

 * Remaining unquoted whitespaces inside the value part are munged.

> +Any number of internal whitespace characters found within
> +the value are converted to the same number of space (SP) characters.

The last one sounds like a bug to me, by the way.

At least the very original 17712991 (Add ".git/config" file parser,
2005-10-10) squashed a run of whitespace characters into a single
SP, which makes sense as a "clean-up".

But ebdaae37 (config: Keep inner whitespace verbatim, 2009-07-30),
while claiming to "Keep" inner whitespaces, broke it by replacing
any isspace() bytes that are not SP with SP, contradicting its
stated purpose.

As the latest change by the author of that change is from more than
10 years ago, I do not expect that he is still interested in this
part of the codebase, but thanks to a very clearly written log
message, we can read what the motivation behind that change was, and
seeing that what the code does contradicts the stated motivation,
we can safely declare that this is an ancient bug.

Fixing that bug can of course be left outside the series. For those
who are looking for microproject ideas who discovered this message
by searching for the #leftoverbits keyword, one possible fix would
be to revert ebdaae37, make sure a value with any whitespace in it
gets quoted, and document clearly that an unquoted run of
whitespaces is squashed into a single SP. Another way that is
milder is to finish what ebdaae37 wanted to do and retain the
whitespaces "verbatim".

Thanks.
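As a rough illustration of the conceptual phases proposed in the message above, the following Python sketch models them as four small functions applied in order. It is not Git's parser: all function names are hypothetical, quotes are left in place rather than removed, a backslash-newline pair is folded without checking whether the backslash is itself escaped, and the last phase models only the behaviour described in this thread (each unquoted HT becomes one SP).

```python
WHITESPACE = " \t"  # SP and HT, the two characters the patch names


def fold(raw: str) -> str:
    # Phase 1: remove each backslash + end-of-line pair to form one long line.
    # Simplification: an escaped backslash before the newline is not detected.
    return raw.replace("\\\n", "")


def strip_comment(line: str) -> str:
    # Phase 2: discard everything from the first unquoted '#' or ';'.
    in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            i += 2  # skip an escaped character such as \" or \\
            continue
        if ch == '"':
            in_quotes = not in_quotes
        elif ch in "#;" and not in_quotes:
            return line[:i]
        i += 1
    return line


def trim(value: str) -> str:
    # Phase 3: discard leading and trailing SP/HT around the value part.
    return value.strip(WHITESPACE)


def munge(value: str) -> str:
    # Phase 4 (status quo as described in the thread): every unquoted HT
    # inside the value is replaced by exactly one SP.
    out = []
    in_quotes = False
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            out.append(value[i:i + 2])  # keep escape sequences untouched
            i += 2
            continue
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == "\t" and not in_quotes:
            ch = " "
        out.append(ch)
        i += 1
    return "".join(out)


def parse_value(raw: str) -> str:
    # `raw` is the text that follows 'name =' in the configuration file.
    return munge(trim(strip_comment(fold(raw))))
```

For example, `parse_value(' hello\tworld  # comment')` yields `'hello world'`: the comment goes first, then the surrounding whitespace, and only then is the inner tab converted. Keeping the phases as separate functions mirrors the step-by-step wording suggested for the documentation.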
On 2024-03-14 02:18, Junio C Hamano wrote:
> Dragan Simic <dsimic@manjaro.org> writes:
>> -The syntax is fairly flexible and permissive; whitespaces are mostly
>> -ignored. The '#' and ';' characters begin comments to the end of line,
>> -blank lines are ignored.
>> +The syntax is fairly flexible and permissive. Whitespace characters,
>> +which in this context are the space character (SP) and the horizontal
>> +tabulation (HT), are mostly ignored. The '#' and ';' characters begin
>> +comments to the end of line. Blank lines are ignored.
>
> OK, except for "whitespace characters"---do we need to say
> "whitespace characters", after we already listed HT and SP are the
> ones, instead of just "whitespaces"?

I also spent some time thinking about that. To me, the plural form,
i.e. "whitespaces", simply doesn't sound very good, because "whitespace"
feels to me more like a mass noun, and I really haven't seen it used in
plural form in other projects.

>> A line that defines a value can be continued to the next line by
>> +ending it with a `\`; the backslash and the end-of-line are stripped.
>> +Leading whitespace characters after 'name =', the remainder of the
>> line after the first comment character '#' or ';', and trailing
>> +whitespace characters of the line are discarded unless they are enclosed
>> +in double quotes. This discarding of the trailing whitespace characters
>> +also applies after the remainder of the line after the comment character
>> +is discarded.
>
> "also" makes it sound as if we do it twice, once to remove trailing
> whitespaces after the remainder of the line after '#', and then trim
> the trailing whitespaces after we removed the comment.

Good point, I felt the same way, but went with that wording simply
because I thought it would be more understandable to the users, despite
being technically a bit incorrect.

> I wonder if we can make it clearer by following the step-by-step
> nature of the earlier part of the paragraph through. We already say
> the folded line processing is done first, so break things down in
> conceptual phases/steps, perhaps like
>
> * The backslash at the end-of-line is removed, together with the
>   end-of-line, to form a single long line.
>
> * Anything that comes after the first unquoted comment character,
>   either '#' or ';', is discarded.
>
> * The leading and trailing whitespaces around the value part
>   (i.e. what follows 'name =') are discarded.
>
> * Remaining unquoted whitespaces inside the value part are munged.

Hmm, I'm not really sure that such a description would be more clear to
the users, despite being technically more correct. I'll think a bit
more about it.

>> +Any number of internal whitespace characters found within
>> +the value are converted to the same number of space (SP) characters.
>
> The last one sounds like a bug to me, by the way.
>
> At least the very original 17712991 (Add ".git/config" file parser,
> 2005-10-10) squashed a run of whitespace characters into a single
> SP, which makes sense as a "clean-up".
>
> But ebdaae37 (config: Keep inner whitespace verbatim, 2009-07-30),
> while claiming to "Keep" inner whitespaces, broke it by replacing
> any isspace() bytes that are not SP with SP, contradicting its
> stated purpose.

Thank you for the investigation. The ebdaae37 commit certainly
introduced a bug to the value parsing, which presumably has remained
undetected because the included test passes.

The way I see it, fixing the bug may actually be a breaking change,
because some user configurations may actually rely on the current
(mis)behavior. This makes me somewhat afraid that fixing this bug,
which I already thought about, may actually do more harm than good.

However, fixing this bug seems to be the only right thing to do, which
I'll explain further below.

> As the latest change by the author of that change is from more than
> 10 years ago, I do not expect that he is still interested in this
> part of the codebase, but thanks to a very clearly written log
> message, we can read what the motivation behind that change was, and
> seeing that what the code does contradicts the stated motivation,
> we can safely declare that this is an ancient bug.

Agreed, the evidence is clear.

> Fixing that bug can of course be left outside the series. For those
> who are looking for microproject ideas who discovered this message
> by searching for the #leftoverbits keyword, one possible fix would
> be to revert ebdaae37, make sure a value with any whitespace in it
> gets quoted, and document clearly that an unquoted run of
> whitespaces is squashed into a single SP. Another way that is
> milder is to finish what ebdaae37 wanted to do and retain the
> whitespaces "verbatim".

I already thought about fixing the bug so the value parser actually
does what git-config(1) currently says, but as I already noted above,
I'm a bit afraid that fixing this bug may actually do more harm than
good.

Though, further investigation shows that setting a configuration value,
by invoking git-config(1), converts value-internal tabs into "\t" escape
sequences, which the value-parsing logic doesn't "squash" into spaces.
That's why the test included in the ebdaae37 commit passes.

On the other hand, value-internal literal tab characters, found in a
configuration file, do get "squashed" by the value-parsing logic, so
I'd say that the only right thing to do is to fix this bug by making
the value-internal whitespace characters preserved verbatim.

I'd be happy to include the bugfix into this series, if my
above-mentioned fears prove to be unnecessary.
Dragan Simic <dsimic@manjaro.org> writes:

> Though, further investigation shows that setting a configuration value,
> by invoking git-config(1), converts value-internal tabs into "\t" escape
> sequences, which the value-parsing logic doesn't "squash" into spaces.

Correct. It would have been nicer to just quote values that had
whitespaces in them, but replacing HT with SP while turning HT that
comes from our tool into "\t" would still let the value round-trip,
while breaking anything written manually in editors. If you stay
within Git without using any editor, what ebdaae37 (config: Keep
inner whitespace verbatim, 2009-07-30) left us is at least
internally consistent.

> I'd be happy to include the bugfix into this series, if my
> above-mentioned fears prove to be unnecessary.

Documenting status quo is a good place to stop for now. I do not
know if it is a good idea to add too many tests to etch the current
behaviour that we know is wrong and we'll need to update when we fix
the bug, though.

Thanks.
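The round-trip argument in the message above can be made concrete with a toy model. The two functions below are hypothetical stand-ins, not Git's code: `write_value` mimics a writer that emits HT as the two-character escape `\t`, and `read_value` mimics a reader that decodes that escape but turns a literal HT into SP, as the thread describes.

```python
def write_value(value: str) -> str:
    # Toy writer: escape backslashes first, then emit HT as backslash + 't'.
    return value.replace("\\", "\\\\").replace("\t", "\\t")


def read_value(text: str) -> str:
    # Toy reader: decode the escapes \t and \\, and convert a literal HT
    # found in the file into a single SP (the behaviour under discussion).
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            out.append({"t": "\t", "\\": "\\"}.get(nxt, nxt))
            i += 2
        else:
            out.append(" " if ch == "\t" else ch)
            i += 1
    return "".join(out)


original = "a\tb"

# Written by the tool: the tab survives because it travels as an escape.
tool_round_trip = read_value(write_value(original))

# Typed in an editor: the literal tab is silently turned into a space.
hand_edited = read_value(original)
```

Under these assumptions `tool_round_trip` equals the original value while `hand_edited` does not, which is the sense in which the current behaviour is internally consistent yet surprising for hand-edited files.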
On 2024-03-14 17:45, Junio C Hamano wrote:
> Dragan Simic <dsimic@manjaro.org> writes:
>
>> Though, further investigation shows that setting a configuration value,
>> by invoking git-config(1), converts value-internal tabs into "\t" escape
>> sequences, which the value-parsing logic doesn't "squash" into spaces.
>
> Correct. It would have been nicer to just quote values that had
> whitespaces in them, but replacing HT with SP while turning HT that
> comes from our tool into "\t" would still let the value round-trip,
> while breaking anything written manually in editors. If you stay
> within Git without using any editor, what ebdaae37 (config: Keep
> inner whitespace verbatim, 2009-07-30) left us is at least
> internally consistent.

Yes, but we already support unquoted values that contain whitespace
characters, and people use editors to configure variables. For example,
I never use git-config(1) to make changes to my ~/.gitconfig file.

>> I'd be happy to include the bugfix into this series, if my
>> above-mentioned fears prove to be unnecessary.
>
> Documenting status quo is a good place to stop for now. I do not
> know if it is a good idea to add too many tests to etch the current
> behaviour that we know is wrong and we'll need to update when we fix
> the bug, though.

But I already started to work on a bugfix? I'm pretty close to
completing the bugfix and doing some testing.
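To make the difference between the behaviours discussed in this thread easier to see, here is a small Python comparison. The three functions are illustrative models based only on the descriptions in the messages above (the function names are invented, and the commits' actual C code is not reproduced): squashing a run of whitespace into one SP, replacing each non-SP whitespace byte with SP, and keeping the value verbatim.

```python
import re


def squash_runs(value: str) -> str:
    # As described for 17712991: a run of whitespace becomes a single SP.
    return re.sub(r"[ \t]+", " ", value)


def replace_each(value: str) -> str:
    # As described for ebdaae37: every whitespace byte that is not SP
    # becomes SP, so the number of characters is unchanged.
    return re.sub(r"[\t\n\v\f\r]", " ", value)


def keep_verbatim(value: str) -> str:
    # The "milder" fix mentioned in the thread: inner whitespace untouched.
    return value


sample = "a\t\tb  c"
for behaviour in (squash_runs, replace_each, keep_verbatim):
    # repr() makes tabs and runs of spaces visible in the output.
    print(f"{behaviour.__name__:>13}: {behaviour(sample)!r}")
```

With this sample, only `keep_verbatim` returns the tabs unchanged; `replace_each` preserves the length of the value but not its bytes, which matches the wording "converted to the same number of space (SP) characters" in the patch.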
diff --git a/Documentation/config.txt b/Documentation/config.txt
index 782c2bab906c..4480bb44203b 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -22,9 +22,10 @@ multivalued.
 Syntax
 ~~~~~~
 
-The syntax is fairly flexible and permissive; whitespaces are mostly
-ignored. The '#' and ';' characters begin comments to the end of line,
-blank lines are ignored.
+The syntax is fairly flexible and permissive. Whitespace characters,
+which in this context are the space character (SP) and the horizontal
+tabulation (HT), are mostly ignored. The '#' and ';' characters begin
+comments to the end of line. Blank lines are ignored.
 
 The file consists of sections and variables. A section begins with
 the name of the section in square brackets and continues until the next
@@ -64,12 +65,14 @@ The variable names are case-insensitive, allow only alphanumeric characters
 and `-`, and must start with an alphabetic character.
 
 A line that defines a value can be continued to the next line by
-ending it with a `\`; the backslash and the end-of-line are
-stripped. Leading whitespaces after 'name =', the remainder of the
+ending it with a `\`; the backslash and the end-of-line are stripped.
+Leading whitespace characters after 'name =', the remainder of the
 line after the first comment character '#' or ';', and trailing
-whitespaces of the line are discarded unless they are enclosed in
-double quotes. Internal whitespaces within the value are retained
-verbatim.
+whitespace characters of the line are discarded unless they are enclosed
+in double quotes. This discarding of the trailing whitespace characters
+also applies after the remainder of the line after the comment character
+is discarded. Any number of internal whitespace characters found within
+the value are converted to the same number of space (SP) characters.
 
 Inside double quotes, double quote `"` and backslash `\` characters
 must be escaped: use `\"` for `"` and `\\` for `\`.
Make it more clear what the whitespace characters are in the context of git
configuration, improve the description of the trailing whitespace handling,
and correct the description of the value-internal whitespace handling.

Signed-off-by: Dragan Simic <dsimic@manjaro.org>
---
 Documentation/config.txt | 19 +++++++++++-------
 1 file changed, 11 insertions(+), 8 deletions(-)