Message ID | 20250211114611.9334-2-dhar61595@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | userdiff: add built-in pattern for shell scripts | expand |
Am 11.02.25 um 12:46 schrieb Moumita: > Introduced a built-in userdiff driver for shell scripts, enabling > accurate function name recognition in `git diff` hunk headers. > > Enhancements include: > - Function name detection for both POSIX and Bash/Ksh-style functions: > - `function_name() { ... }` > - `function function_name { ... }` > - Exclusion of shell keywords that can resemble function names, > preventing false matches (e.g., `if`, `for`, `while`, `return`, etc.). > - Improved tokenization support for: > - Identifiers (variable and function names) > - Numeric constants (integers and decimals) > - Shell variables (`$VAR`, `${VAR}`) > - Logical (`&&`, `||`, `==`, `!=`, `<=`, `>=`) and arithmetic operators > - Assignment and redirection operators > - Brackets and grouping symbols > > This update improves Git’s diff readability for shell scripts, > bringing it in line with existing built-in userdiff drivers. Please remove the marketing tone from the commit message. - "accurate function name recognition": Is not possible because the tools (regular expressions) don't have sufficient context to allow it. - "Enhancements include": Either list *all* enhancements, or focus on noteworthy ones, but don't make a long list that's still incomplete. - The word "improve" is never needed in the context that we see it here, because you certainly don't want to worsen the code base. Please study section "Describe your changes well" in Documentation/SubmittingPatches on how to write the commit message. (Describe the state before the change in present tense, and the changes in imperative mood.) An important point is *why* we want this change. In the case of a new userdiff driver, however, the answer is usally simply "because we can", and everybody knows it. Please don't write a long paragraph that dances around this fact. Just don't write it; but if there are other reason, please do say so. > > Signed-off-by: Moumita <dhar61595@gmail.com> The name given in Signed-off-by has legal meaning. If you have a given name and a surname, please use both names here and then by extension also as patch author. > --- > userdiff.c | 20 ++++++++++++++++++++ > 1 file changed, 20 insertions(+) > > diff --git a/userdiff.c b/userdiff.c > index 340c4eb4f7..a8c14807c6 100644 > --- a/userdiff.c > +++ b/userdiff.c > @@ -334,6 +334,26 @@ PATTERNS("scheme", > "\\|([^\\\\]*)\\|" > /* All other words should be delimited by spaces or parentheses */ > "|([^][)(}{[ \t])+"), > +PATTERNS("shell", Correctly sorted into the list. Good! HOWEVER! We already have a diffdriver for bash. It would be better to extend that one than to introduce a new driver. > + /* Negate shell keywords that can look like functions */ > + "!^[ \t]*(if|elif|else|fi|for|while|until|case|esac|then|do|done|return|break|continue)\\b\n" This list looks unnecessary. The userdiff driver can assume that it operates on syntactically valid text. if () { } isn't correct, and that's the case for all listed keywords. > + /* POSIX-style shell functions: function_name() { ... } */ > + "^[ \t]*([a-zA-Z_][a-zA-Z0-9_]*)[ \t]*\\(\\)[ \t]*\\{\n" Two nitpicks: - There can be whitespace between the parentheses. - The function body can also be in parentheses. foo ( ) ( echo ) would be a correct shell function definition. Its body always runs in a sub-shell. > + /* Bash/Ksh-style functions: function function_name { ... } */ > + "^[ \t]*function[ \t]+([a-zA-Z_][a-zA-Z0-9_]*)[ \t]*\\{\n", Both here and above, just let the regular expression end before the opening bracket. Then it would also recognize this oddity: function x \ { echo x; } > + /* -- */ > + /* Identifiers: variable and function names */ > + "[a-zA-Z_][a-zA-Z0-9_]*" > + /* Numeric constants: integers and decimals */ > + "|[-+]?[0-9]+(\\.[0-9]*)?" Typically, a - or + isn't part of the number, but a separate operator. I suggest to drop the leading '[-+]?'. But see below. > + /* Shell variables: $VAR and ${VAR} */ > + "|\\$[a-zA-Z_][a-zA-Z0-9_]*|\\$\\{[^}]+\\}" The uses of ${name} are rather exceptional. The cases where ${ is used are the complex cases, for example: base=${fname##*/} base=${base%."$ext"} In such cases, I would prefer that the constituents of the expression are their own tokens and not lumped into a single "variable name" token. > + /* Logical and comparison operators */ > + "|\\|\\||&&|<<|>>|==|!=|<=|>=" > + /* Assignment and arithmetic operators */ > + "|[-+*/%&|^!=<>]=?" Which makes me think: Text in shell commands has only very little restrictions. A typical case are command options starting with one or two dashes. Do we want to separate the dash from the option name? This regular expression would do so, and I would consider it a deficiency. > + /* Brackets and grouping symbols */ > + "|\\(|\\)|\\{|\\}|\\[|\\]"), > PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$", > "\\\\[a-zA-Z@]+|\\\\.|([a-zA-Z0-9]|[^\x01-\x7f])+"), > { .name = "default", .binary = -1 }, It would be nice to have some tests. Have a look at t/t4018/bash-* and t/t4034/* (no bash there, yet). -- Hannes
diff --git a/userdiff.c b/userdiff.c index 340c4eb4f7..a8c14807c6 100644 --- a/userdiff.c +++ b/userdiff.c @@ -334,6 +334,26 @@ PATTERNS("scheme", "\\|([^\\\\]*)\\|" /* All other words should be delimited by spaces or parentheses */ "|([^][)(}{[ \t])+"), +PATTERNS("shell", + /* Negate shell keywords that can look like functions */ + "!^[ \t]*(if|elif|else|fi|for|while|until|case|esac|then|do|done|return|break|continue)\\b\n" + /* POSIX-style shell functions: function_name() { ... } */ + "^[ \t]*([a-zA-Z_][a-zA-Z0-9_]*)[ \t]*\\(\\)[ \t]*\\{\n" + /* Bash/Ksh-style functions: function function_name { ... } */ + "^[ \t]*function[ \t]+([a-zA-Z_][a-zA-Z0-9_]*)[ \t]*\\{\n", + /* -- */ + /* Identifiers: variable and function names */ + "[a-zA-Z_][a-zA-Z0-9_]*" + /* Numeric constants: integers and decimals */ + "|[-+]?[0-9]+(\\.[0-9]*)?" + /* Shell variables: $VAR and ${VAR} */ + "|\\$[a-zA-Z_][a-zA-Z0-9_]*|\\$\\{[^}]+\\}" + /* Logical and comparison operators */ + "|\\|\\||&&|<<|>>|==|!=|<=|>=" + /* Assignment and arithmetic operators */ + "|[-+*/%&|^!=<>]=?" + /* Brackets and grouping symbols */ + "|\\(|\\)|\\{|\\}|\\[|\\]"), PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$", "\\\\[a-zA-Z@]+|\\\\.|([a-zA-Z0-9]|[^\x01-\x7f])+"), { .name = "default", .binary = -1 },
Introduced a built-in userdiff driver for shell scripts, enabling accurate function name recognition in `git diff` hunk headers. Enhancements include: - Function name detection for both POSIX and Bash/Ksh-style functions: - `function_name() { ... }` - `function function_name { ... }` - Exclusion of shell keywords that can resemble function names, preventing false matches (e.g., `if`, `for`, `while`, `return`, etc.). - Improved tokenization support for: - Identifiers (variable and function names) - Numeric constants (integers and decimals) - Shell variables (`$VAR`, `${VAR}`) - Logical (`&&`, `||`, `==`, `!=`, `<=`, `>=`) and arithmetic operators - Assignment and redirection operators - Brackets and grouping symbols This update improves Git’s diff readability for shell scripts, bringing it in line with existing built-in userdiff drivers. Signed-off-by: Moumita <dhar61595@gmail.com> --- userdiff.c | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+)