Message ID | pull.1682.v4.git.git.1711653257043.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | [v4] userdiff: better method/property matching for C# | expand |
"Steven Jeuris via GitGitGadget" <gitgitgadget@gmail.com> writes: > As before, I ran into many regex limitations (no possessive quantifiers, > no lookahead). It also seems different regex evaluators are used on > different test runs. Which one does git diff use? Maybe it is about time > to update this? E.g., if speed is a concern, possessive quantifiers can > speed up search. When you make regcomp(3) and regexec(3) calls, you'll be using the system library that supplies them. IOW, we use whatever is on the system.
Am 28.03.24 um 20:14 schrieb Steven Jeuris via GitGitGadget: > diff --git a/userdiff.c b/userdiff.c > index e399543823b..35e0e1183f7 100644 > --- a/userdiff.c > +++ b/userdiff.c > @@ -89,12 +89,46 @@ PATTERNS("cpp", > "|\\.[0-9][0-9]*([Ee][-+]?[0-9]+)?[fFlL]?" > "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*|<=>"), > PATTERNS("csharp", > - /* Keywords */ > - "!^[ \t]*(do|while|for|if|else|instanceof|new|return|switch|case|throw|catch|using)\n" > - /* Methods and constructors */ > - "^[ \t]*(((static|public|internal|private|protected|new|virtual|sealed|override|unsafe|async)[ \t]+)*[][<>@.~_[:alnum:]]+[ \t]+[<>@._[:alnum:]]+[ \t]*\\(.*\\))[ \t]*$\n" > - /* Properties */ > - "^[ \t]*(((static|public|internal|private|protected|new|virtual|sealed|override|unsafe)[ \t]+)*[][<>@.~_[:alnum:]]+[ \t]+[@._[:alnum:]]+)[ \t]*$\n" > + /* > + * Jump over reserved keywords which are illegal method names, but which > + * can be followed by parentheses without special characters in between, > + * making them look like methods. > + */ > + "!(^|[ \t]+)" /* Start of line or whitespace. */ > + "(do|while|for|foreach|if|else|new|default|return|switch|case|throw" > + "|catch|using|lock|fixed)" > + "([ \t(]+|$)\n" /* Whitespace, "(", or end of line. */ > + /* > + * Methods/constructors: > + * The strategy is to identify a minimum of two groups (any combination > + * of keywords/type/name) before the opening parenthesis, and without > + * final unexpected characters, normally only used in ordinary statements. > + */ > + "^[ \t]*" /* Remove leading whitespace. */ > + "(" /* Start chunk header capture. */ > + "(" /* Group of keywords/type/names. */ > + "([][[:alnum:]@_<>.]|, [ |\t]*)+" /* Space only allowed after ",". */ Mental note: This pattern means: a sequence of at least one of ( one character of a certain set OR a comma followed by SP optionally followed by whitespace or | ) > + "[ \t]+" /* One required space forces a minimum of two items. */ OK. > + "([][[:alnum:]@_<>.]|, [ |\t]*)+" Same here. Is it the case of t4018/csharp-method-generics? Example<int, string> Method<TA, TB>(TA RIGHT, TB b) Let me see if I make sense of this. The idea is to treat the ', ' sequence as a single "character" so that the SP in this sequence does not count as the word separator that we otherwise have between two plain words. I am unsure whether this is a workable solution. The set of symbols that can occur before a method definition overlaps too much with the symbols that can occur elsewhere. For example, if you have these lines of code: something(arg, meth, func(x), prop, y, more); you have a false positive in the lines in the middle. I have the feeling that it is impossible to distinguish method definitions from other code reliably. We can go back and forth a while, but at some point we have to stop and accept that there are false positives. Where that point is, I cannot tell, because I do not know what is common and what is uncommon in typical C# code. It is your call. > + "[ \t]*" /* Optional space before parameters start. */ > + ")+" > + "\\(" /* Start of method parameters. */ > + "[^;]*" /* Allow complex parameters, but exclude statements (;). */ > + ")$\n" /* Close chunk header capture. */ > + /* > + * Properties: > + * As with methods, expect a minimum of two groups. But, more trivial than > + * methods, the vast majority of properties long enough to be worth > + * showing a chunk header for don't include "=:;,()" on the line they are > + * defined, since they don't have a parameter list. > + */ > + "^[ \t]*(" > + "(" > + "([][[:alnum:]@_<>.]|, [ |\t]*)+[ \t]+" > + "([][[:alnum:]@_<>.]|, [ |\t]*)+[ \t]*" > + ")+" /* Up to here, same as methods regex. */ Is the intent to match pairs of words? Or is the intent to find at least two words? If the latter, then the preceding 4 lines are better written as "([][[:alnum:]@_<>.]|,[ \t]+)+" "([ \t]+([][[:alnum:]@_<>.]|,[ \t]+)+)+" If the former, then the space after the second word must not be optional, because it would match a bc d as two pairs (a b)(c d) BTW that there is an | symbol in the potential space after the comma is certainly an oversight, right? And, why is a tab after the comma not permitted? When you construct regular expressions, make sure that repeated parts always begin with a mandatory part and that this mandatory part cannot be an optional part at the end of the preceding or the repeated pattern. Otherwise, in a non-matching case you send the matcher into an expensive backtracking loop. > + "[^;=:,()]*" /* Compared to methods, no parameter list allowed. */ > + ")$\n" > /* Type definitions */ > "^[ \t]*(((static|public|internal|private|protected|new|unsafe|sealed|abstract|partial)[ \t]+)*(class|enum|interface|struct|record)[ \t]+.*)$\n" > /* Namespace */ > > base-commit: f41f85c9ec8d4d46de0fd5fded88db94d3ec8c11
diff --git a/t/t4018/csharp-exclude-assignments b/t/t4018/csharp-exclude-assignments new file mode 100644 index 00000000000..239f312963b --- /dev/null +++ b/t/t4018/csharp-exclude-assignments @@ -0,0 +1,20 @@ +class Example +{ + string Method(int RIGHT) + { + var constantAssignment = "test"; + var methodAssignment = MethodCall(); + var multiLineMethodAssignment = MethodCall( + ); + var multiLine = "first" + + MethodCall() + + + ( MethodCall() + ) + + MethodCall(); + + return "ChangeMe"; + } + + string MethodCall(int a = 0, int b = 0) => "test"; +} diff --git a/t/t4018/csharp-exclude-control-statements b/t/t4018/csharp-exclude-control-statements new file mode 100644 index 00000000000..3a0f404ee10 --- /dev/null +++ b/t/t4018/csharp-exclude-control-statements @@ -0,0 +1,34 @@ +class Example +{ + string Method(int RIGHT) + { + if (false) + { + return "out"; + } + else { } + if (true) MethodCall( + ); + else MethodCall( + ); + switch ("test") + { + case "one": + return MethodCall( + ); + case "two": + break; + } + (int, int) tuple = (1, 4); + switch (tuple) + { + case (1, 4): + MethodCall(); + break; + } + + return "ChangeMe"; + } + + string MethodCall(int a = 0, int b = 0) => "test"; +} diff --git a/t/t4018/csharp-exclude-exceptions b/t/t4018/csharp-exclude-exceptions new file mode 100644 index 00000000000..b1e64256cfe --- /dev/null +++ b/t/t4018/csharp-exclude-exceptions @@ -0,0 +1,29 @@ +using System; + +class Example +{ + string Method(int RIGHT) + { + try + { + throw new Exception("fail"); + } + catch (Exception) + { + } + finally + { + } + try { } catch (Exception) {} + try + { + throw GetException( + ); + } + catch (Exception) { } + + return "ChangeMe"; + } + + Exception GetException() => new Exception("fail"); +} diff --git a/t/t4018/csharp-exclude-generic-method-calls b/t/t4018/csharp-exclude-generic-method-calls new file mode 100644 index 00000000000..31af546665d --- /dev/null +++ b/t/t4018/csharp-exclude-generic-method-calls @@ -0,0 +1,12 @@ +class Example +{ + string Method(int RIGHT) + { + GenericMethodCall<int, int>( + ); + + return "ChangeMe"; + } + + string GenericMethodCall<T, T2>() => "test"; +} diff --git a/t/t4018/csharp-exclude-init-dispose b/t/t4018/csharp-exclude-init-dispose new file mode 100644 index 00000000000..2bc8e194e20 --- /dev/null +++ b/t/t4018/csharp-exclude-init-dispose @@ -0,0 +1,22 @@ +using System; + +class Example : IDisposable +{ + string Method(int RIGHT) + { + new Example(); + new Example( + ); + new Example { }; + using (this) + { + } + var def = + this is default( + Example); + + return "ChangeMe"; + } + + public void Dispose() {} +} diff --git a/t/t4018/csharp-exclude-iterations b/t/t4018/csharp-exclude-iterations new file mode 100644 index 00000000000..960aa182ae2 --- /dev/null +++ b/t/t4018/csharp-exclude-iterations @@ -0,0 +1,26 @@ +using System.Linq; + +class Example +{ + string Method(int RIGHT) + { + do { } while (true); + do MethodCall( + ); while (true); + while (true); + while (true) { + break; + } + for (int i = 0; i < 10; ++i) + { + } + foreach (int i in Enumerable.Range(0, 10)) + { + } + int[] numbers = [5, 4, 1, 3, 9, 8, 6, 7, 2, 0]; + + return "ChangeMe"; + } + + string MethodCall(int a = 0, int b = 0) => "test"; +} diff --git a/t/t4018/csharp-exclude-method-calls b/t/t4018/csharp-exclude-method-calls new file mode 100644 index 00000000000..8afe2a54638 --- /dev/null +++ b/t/t4018/csharp-exclude-method-calls @@ -0,0 +1,15 @@ +class Example +{ + string Method(int RIGHT) + { + MethodCall(); + MethodCall(1, 2); + MethodCall( + 1, 2); + + return "ChangeMe"; + } + + string MethodCall(int a = 0, int b = 0) => "test"; + string GenericMethodCall<T, T2>() => "test"; +} diff --git a/t/t4018/csharp-exclude-other b/t/t4018/csharp-exclude-other new file mode 100644 index 00000000000..4d5581cf3e1 --- /dev/null +++ b/t/t4018/csharp-exclude-other @@ -0,0 +1,18 @@ +class Example +{ + string Method(int RIGHT) + { + lock (this) + { + } + unsafe + { + byte[] bytes = [1, 2, 3]; + fixed (byte* pointerToFirst = bytes) + { + } + } + + return "ChangeMe"; + } +} diff --git a/t/t4018/csharp-method b/t/t4018/csharp-method new file mode 100644 index 00000000000..16b367aca2b --- /dev/null +++ b/t/t4018/csharp-method @@ -0,0 +1,10 @@ +class Example +{ + string Method(int RIGHT) + { + // Filler + // Filler + + return "ChangeMe"; + } +} diff --git a/t/t4018/csharp-method-array b/t/t4018/csharp-method-array new file mode 100644 index 00000000000..1126de8201d --- /dev/null +++ b/t/t4018/csharp-method-array @@ -0,0 +1,10 @@ +class Example +{ + string[] Method(int RIGHT) + { + // Filler + // Filler + + return ["ChangeMe"]; + } +} diff --git a/t/t4018/csharp-method-explicit b/t/t4018/csharp-method-explicit new file mode 100644 index 00000000000..5a710116cc4 --- /dev/null +++ b/t/t4018/csharp-method-explicit @@ -0,0 +1,12 @@ +using System; + +class Example : IDisposable +{ + void IDisposable.Dispose() // RIGHT + { + // Filler + // Filler + + // ChangeMe + } +} diff --git a/t/t4018/csharp-method-generics b/t/t4018/csharp-method-generics new file mode 100644 index 00000000000..b3216bfb2a7 --- /dev/null +++ b/t/t4018/csharp-method-generics @@ -0,0 +1,11 @@ +class Example<T1, T2> +{ + Example<int, string> Method<TA, TB>(TA RIGHT, TB b) + { + // Filler + // Filler + + // ChangeMe + return null; + } +} diff --git a/t/t4018/csharp-method-modifiers b/t/t4018/csharp-method-modifiers new file mode 100644 index 00000000000..caefa8ee99c --- /dev/null +++ b/t/t4018/csharp-method-modifiers @@ -0,0 +1,13 @@ +using System.Threading.Tasks; + +class Example +{ + static internal async Task Method(int RIGHT) + { + // Filler + // Filler + + // ChangeMe + await Task.Delay(1); + } +} diff --git a/t/t4018/csharp-method-multiline b/t/t4018/csharp-method-multiline new file mode 100644 index 00000000000..3983ff42f51 --- /dev/null +++ b/t/t4018/csharp-method-multiline @@ -0,0 +1,10 @@ +class Example +{ + string Method_RIGHT( + int a, + int b, + int c) + { + return "ChangeMe"; + } +} diff --git a/t/t4018/csharp-method-params b/t/t4018/csharp-method-params new file mode 100644 index 00000000000..3f00410ba1f --- /dev/null +++ b/t/t4018/csharp-method-params @@ -0,0 +1,10 @@ +class Example +{ + string Method(int RIGHT, int b, int c = 42) + { + // Filler + // Filler + + return "ChangeMe"; + } +} diff --git a/t/t4018/csharp-method-special-chars b/t/t4018/csharp-method-special-chars new file mode 100644 index 00000000000..e6c7bc01a18 --- /dev/null +++ b/t/t4018/csharp-method-special-chars @@ -0,0 +1,11 @@ +class @Some_Type +{ + @Some_Type @Method_With_Underscore(int RIGHT) + { + // Filler + // Filler + + // ChangeMe + return new @Some_Type(); + } +} diff --git a/t/t4018/csharp-method-with-spacing b/t/t4018/csharp-method-with-spacing new file mode 100644 index 00000000000..233bb976cc2 --- /dev/null +++ b/t/t4018/csharp-method-with-spacing @@ -0,0 +1,10 @@ +class Example +{ + string Method ( int RIGHT ) + { + // Filler + // Filler + + return "ChangeMe"; + } +} diff --git a/t/t4018/csharp-property b/t/t4018/csharp-property new file mode 100644 index 00000000000..e56dfce34c1 --- /dev/null +++ b/t/t4018/csharp-property @@ -0,0 +1,11 @@ +class Example +{ + public bool RIGHT + { + get { return true; } + set + { + // ChangeMe + } + } +} diff --git a/t/t4018/csharp-property-braces-same-line b/t/t4018/csharp-property-braces-same-line new file mode 100644 index 00000000000..608131d3d31 --- /dev/null +++ b/t/t4018/csharp-property-braces-same-line @@ -0,0 +1,10 @@ +class Example +{ + public bool RIGHT { + get { return true; } + set + { + // ChangeMe + } + } +} diff --git a/userdiff.c b/userdiff.c index e399543823b..35e0e1183f7 100644 --- a/userdiff.c +++ b/userdiff.c @@ -89,12 +89,46 @@ PATTERNS("cpp", "|\\.[0-9][0-9]*([Ee][-+]?[0-9]+)?[fFlL]?" "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*|<=>"), PATTERNS("csharp", - /* Keywords */ - "!^[ \t]*(do|while|for|if|else|instanceof|new|return|switch|case|throw|catch|using)\n" - /* Methods and constructors */ - "^[ \t]*(((static|public|internal|private|protected|new|virtual|sealed|override|unsafe|async)[ \t]+)*[][<>@.~_[:alnum:]]+[ \t]+[<>@._[:alnum:]]+[ \t]*\\(.*\\))[ \t]*$\n" - /* Properties */ - "^[ \t]*(((static|public|internal|private|protected|new|virtual|sealed|override|unsafe)[ \t]+)*[][<>@.~_[:alnum:]]+[ \t]+[@._[:alnum:]]+)[ \t]*$\n" + /* + * Jump over reserved keywords which are illegal method names, but which + * can be followed by parentheses without special characters in between, + * making them look like methods. + */ + "!(^|[ \t]+)" /* Start of line or whitespace. */ + "(do|while|for|foreach|if|else|new|default|return|switch|case|throw" + "|catch|using|lock|fixed)" + "([ \t(]+|$)\n" /* Whitespace, "(", or end of line. */ + /* + * Methods/constructors: + * The strategy is to identify a minimum of two groups (any combination + * of keywords/type/name) before the opening parenthesis, and without + * final unexpected characters, normally only used in ordinary statements. + */ + "^[ \t]*" /* Remove leading whitespace. */ + "(" /* Start chunk header capture. */ + "(" /* Group of keywords/type/names. */ + "([][[:alnum:]@_<>.]|, [ |\t]*)+" /* Space only allowed after ",". */ + "[ \t]+" /* One required space forces a minimum of two items. */ + "([][[:alnum:]@_<>.]|, [ |\t]*)+" + "[ \t]*" /* Optional space before parameters start. */ + ")+" + "\\(" /* Start of method parameters. */ + "[^;]*" /* Allow complex parameters, but exclude statements (;). */ + ")$\n" /* Close chunk header capture. */ + /* + * Properties: + * As with methods, expect a minimum of two groups. But, more trivial than + * methods, the vast majority of properties long enough to be worth + * showing a chunk header for don't include "=:;,()" on the line they are + * defined, since they don't have a parameter list. + */ + "^[ \t]*(" + "(" + "([][[:alnum:]@_<>.]|, [ |\t]*)+[ \t]+" + "([][[:alnum:]@_<>.]|, [ |\t]*)+[ \t]*" + ")+" /* Up to here, same as methods regex. */ + "[^;=:,()]*" /* Compared to methods, no parameter list allowed. */ + ")$\n" /* Type definitions */ "^[ \t]*(((static|public|internal|private|protected|new|unsafe|sealed|abstract|partial)[ \t]+)*(class|enum|interface|struct|record)[ \t]+.*)$\n" /* Namespace */