diff mbox series

[v2,2/2] diff: teach diff to read gitattribute diff-algorithm

Message ID cb0305631496eb4c2d51e5b586ac0ca8580c7dc1.1676410819.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Series Teach diff to honor diff algorithms set through git attributes

Commit Message

John Cai Feb. 14, 2023, 9:40 p.m. UTC
From: John Cai <johncai86@gmail.com>

It can be useful to specify diff algorithms per file type. For example,
one may want to use the minimal diff algorithm for .json files, another
for .c files, etc.

Teach the diff machinery to check attributes for a diff driver. Also
teach the diff driver parser a new type "algorithm" to look for in the
config, which will be used if a driver has been specified through the
attributes.

Enforce precedence of diff algorithm by favoring the command line option,
then looking at the driver attributes & config combination, then finally
the diff.algorithm config.

To enforce precedence order, use the `xdl_opts_command_line` member
during options parsing to indicate the diff algorithm was set via command
line args.

Signed-off-by: John Cai <johncai86@gmail.com>
---
 Documentation/gitattributes.txt | 41 ++++++++++++++++++++++++++++++++-
 diff.c                          | 25 +++++++++++++-------
 diff.h                          |  2 ++
 t/lib-diff-alternative.sh       | 38 +++++++++++++++++++++++++++++-
 userdiff.c                      |  4 +++-
 userdiff.h                      |  1 +
 6 files changed, 100 insertions(+), 11 deletions(-)
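
As a rough illustration of the precedence described in the commit message
(not code from this series), the selection can be sketched as a small
standalone C helper. The names `algo_sources` and `pick_diff_algorithm`
are hypothetical and do not exist in Git; the only fact assumed about Git
itself is that `myers` is the built-in default algorithm.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the three places an algorithm can come from. */
struct algo_sources {
	const char *command_line; /* --diff-algorithm=<x>, or NULL if not given */
	const char *driver;       /* diff.<name>.algorithm selected via attributes, or NULL */
	const char *config;       /* diff.algorithm, or NULL */
};

/* Return the name that wins: command line, then driver, then config. */
static const char *pick_diff_algorithm(const struct algo_sources *src)
{
	if (src->command_line)
		return src->command_line;
	if (src->driver)
		return src->driver;
	if (src->config)
		return src->config;
	return "myers"; /* fall back to the built-in default */
}
```

In the patch itself the same ordering is not computed in one place; it falls
out of the command line parser setting a flag that later stops the driver
from overriding the option.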

Comments

Junio C Hamano Feb. 15, 2023, 2:56 a.m. UTC | #1
"John Cai via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: John Cai <johncai86@gmail.com>
>
> It can be useful to specify diff algorithms per file type. For example,
> one may want to use the minimal diff algorithm for .json files, another
> for .c files, etc.
>
> Teach the diff machinery to check attributes for a diff driver. Also
> teach the diff driver parser a new type "algorithm" to look for in the
> config, which will be used if a driver has been specified through the
> attributes.
>
> Enforce precedence of diff algorithm by favoring the command line option,
> then looking at the driver attributes & config combination, then finally
> the diff.algorithm config.
>
> To enforce precedence order, use the `xdl_opts_command_line` member
> during options parsing to indicate the diff algorithm was set via command
> line args.
>
> Signed-off-by: John Cai <johncai86@gmail.com>
> ---
>  Documentation/gitattributes.txt | 41 ++++++++++++++++++++++++++++++++-
>  diff.c                          | 25 +++++++++++++-------
>  diff.h                          |  2 ++
>  t/lib-diff-alternative.sh       | 38 +++++++++++++++++++++++++++++-
>  userdiff.c                      |  4 +++-
>  userdiff.h                      |  1 +
>  6 files changed, 100 insertions(+), 11 deletions(-)
>
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index c19e64ea0ef..7e69f509d0a 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -736,7 +736,6 @@ String::
>  	by the configuration variables in the "diff.foo" section of the
>  	Git config file.
>  
> -
>  Defining an external diff driver
>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Unrelated change?  The original has a wider gap between two sections
than between paragraphs inside a single section, and I think that is
a reasonable thing to keep.

> @@ -758,6 +757,46 @@ with the above configuration, i.e. `j-c-diff`, with 7
>  parameters, just like `GIT_EXTERNAL_DIFF` program is called.
>  See linkgit:git[1] for details.

In other words, this new section wants another blank line before to match.

>  
> +Setting the internal diff algorithm
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +The diff algorithm can be set through the `diff.algorithm` config key, but
> +sometimes it may be helpful to set the diff algorithm by path. For example, one
> +might wish to set a diff algorithm automatically for all `.json` files such that
> +the user would not need to pass in a separate command line `--diff-algorithm` flag each
> +time.

That's an overly wide paragraph.

> +
> +First, in `.gitattributes`, you would assign the `diff` attribute for paths.
> +
> +*Git attributes*

Discard this line (mimic an existing section, like "Defining a
custom hunk-header").

> +------------------------
> +*.json diff=<name>
> +------------------------
> +
> +Then, you would define a "diff.<name>.algorithm" configuration to specify the
> +diff algorithm, choosing from `myers`, `patience`, `minimal`, and `histogram`.
> +
> +*Git config*

Likewise, discard this line (I won't repeat but the next hunk has
the same issue).

> diff --git a/diff.c b/diff.c
> index 92a0eab942e..24da439e56f 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -4456,15 +4456,11 @@ static void run_diff_cmd(const char *pgm,
>  	const char *xfrm_msg = NULL;
>  	int complete_rewrite = (p->status == DIFF_STATUS_MODIFIED) && p->score;
>  	int must_show_header = 0;
> +	struct userdiff_driver *drv = userdiff_find_by_path(o->repo->index, attr_path);

Do we run this look-up unconditionally, even when .allow_external
bit is not set?  Why?

> -
> -	if (o->flags.allow_external) {
> -		struct userdiff_driver *drv;
> -
> -		drv = userdiff_find_by_path(o->repo->index, attr_path);
> +	if (o->flags.allow_external)
>  		if (drv && drv->external)
>  			pgm = drv->external;
> -	}
>  
>  	if (msg) {
>  		/*
> @@ -4481,12 +4477,17 @@ static void run_diff_cmd(const char *pgm,
>  		run_external_diff(pgm, name, other, one, two, xfrm_msg, o);
>  		return;
>  	}
> -	if (one && two)
> +	if (one && two) {
> +		if (!o->xdl_opts_command_line)
> +			if (drv && drv->algorithm)
> +				set_diff_algorithm(o, drv->algorithm);

The idea here seems to be "if there is no explicit instruction, and
if the diff driver specifies an algorithm, then use that one", which
is very straightforward and sensible.  Can we reliably tell if we
had an explicit instruction to override the driver?  That should
probably appear in other parts of the code, I guess.

>  		builtin_diff(name, other ? other : name,
>  			     one, two, xfrm_msg, must_show_header,
>  			     o, complete_rewrite);
> -	else
> +	} else {
>  		fprintf(o->file, "* Unmerged path %s\n", name);
> +	}
>  }



> @@ -4583,6 +4584,10 @@ static void run_diffstat(struct diff_filepair *p, struct diff_options *o,
>  	const char *name;
>  	const char *other;
>  
> +	struct userdiff_driver *drv = userdiff_find_by_path(o->repo->index, p->one->path);
> +	if (drv && drv->algorithm)
> +		set_diff_algorithm(o, drv->algorithm);

Interesting.  Does external diff play a role, like in run_diff_cmd()
we saw earlier?

> @@ -5130,6 +5135,8 @@ static int diff_opt_diff_algorithm(const struct option *opt,
>  		return error(_("option diff-algorithm accepts \"myers\", "
>  			       "\"minimal\", \"patience\" and \"histogram\""));
>  
> +	options->xdl_opts_command_line = 1;

OK, calling this member "xdl_" anything is highly misleading, as it
has nothing to do with the xdiff machinery.  How about calling it
after what it does, i.e. allowing the attribute driven diff driver
to specify the algorithm?  options.ignore_driver_algorithm or
something?  The options coming _from_ the command line may happen to
be the condition to trigger this behaviour in this current
implementation, but it does not have to stay that way forever.
Losing "command line" from the name of the flag would make it
clearer what is essential (i.e. this controls if the diff driver is
allowed to affect the choice of the algorithm) and what is not (i.e.
we happen to let it be decided based on the presence or absence of
a command line choice).

Thanks.
Junio C Hamano Feb. 15, 2023, 3:20 a.m. UTC | #2
Junio C Hamano <gitster@pobox.com> writes:

>> diff --git a/diff.c b/diff.c
>> index 92a0eab942e..24da439e56f 100644
>> --- a/diff.c
>> +++ b/diff.c
>> @@ -4456,15 +4456,11 @@ static void run_diff_cmd(const char *pgm,
>>  	const char *xfrm_msg = NULL;
>>  	int complete_rewrite = (p->status == DIFF_STATUS_MODIFIED) && p->score;
>>  	int must_show_header = 0;
>> +	struct userdiff_driver *drv = userdiff_find_by_path(o->repo->index, attr_path);
>
> Do we run this look-up unconditionally, even when .allow_external
> bit is not set?  Why?

Ah, this is perfectly fine.  It used to be that this codepath could
tell that there was no need to check the diff driver when it was told
never to use any external diff driver.  Now, even when it is computing
the diff internally, it needs to check the diff driver to find out
the favoured algorithm for the path.

Strictly speaking, if we are told NOT to use external diff driver,
and if we are told NOT to pay attention to algorithm given by the
diff driver, then we know we can skip the overhead of attribute
look-up.  I.e. we could do this to avoid attribute look-up:

	struct userdiff_driver *drv = NULL;

	if (o->flags.allow_external || !o->ignore_driver_algorithm)
		drv = userdiff_find_by_path(...);

	if (drv && o->flags.allow_external && drv->external)
		pgm = drv->external;
	...
	if (pgm)
		... do the external diff thing ...
	if (one && two) {
		if (drv && !o->ignore_driver_algorithm && drv->algorithm)
			set_diff_algorithm(...)

I was not sure if it would be worth it before writing the above
down, but the resulting flow does not look _too_ bad.

>> @@ -4583,6 +4584,10 @@ static void run_diffstat(struct diff_filepair *p, struct diff_options *o,
>>  	const char *name;
>>  	const char *other;
>>  
>> +	struct userdiff_driver *drv = userdiff_find_by_path(o->repo->index, p->one->path);
>> +	if (drv && drv->algorithm)
>> +		set_diff_algorithm(o, drv->algorithm);
>
> Interesting.  Does external diff play a role, like in run_diff_cmd()
> we saw earlier?

As whoever wrote "diffstat" did not think of counting output from
external diff driver, of course in this codepath external diff would
not appear.  So what we see is very much expected.

Just move the blank line we see before these new lines one line
down, so that the variable decls are grouped together, with a blank
line before the first executable statement.  I.e.

	const char *name;
	const char *other;
+	struct userdiff_driver *drv;
+
+	drv = userdiff_find_by_path(...);
+	if (drv && drv->algorithm)
+		set_diff_algorithm(o, drv->algorithm);

Shouldn't this function refrain from setting algorithm from the
driver when the algorithm was given elsewhere?  E.g.

	$ git show --histogram --stat
	
or something?  IOW, shouldn't it also pay attention to
o->ignore_driver_algorithm bit, just like run_diff_cmd() did?
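
A minimal sketch of the guard discussed above, assuming the flag ends up
being called `ignore_driver_algorithm` as suggested in #1 (the patch under
review still calls it `xdl_opts_command_line`). The structs are simplified
stand-ins for the real ones in diff.h and userdiff.h, and the helpers
`need_driver_lookup()` and `driver_algorithm()` are hypothetical names used
only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; the real structs live in userdiff.h and diff.h. */
struct userdiff_driver {
	const char *external;
	const char *algorithm;
};

struct diff_options {
	struct { unsigned allow_external : 1; } flags;
	int ignore_driver_algorithm; /* name proposed in review, not in this patch */
};

/* The attribute look-up is only useful if some driver setting may be honored. */
static int need_driver_lookup(const struct diff_options *o)
{
	return o->flags.allow_external || !o->ignore_driver_algorithm;
}

/* Algorithm the driver may impose, or NULL to leave the options untouched. */
static const char *driver_algorithm(const struct diff_options *o,
				    const struct userdiff_driver *drv)
{
	if (!drv || o->ignore_driver_algorithm)
		return NULL;
	return drv->algorithm;
}
```

Both run_diff_cmd() and run_diffstat() could consult the same kind of check,
so that something like `git show --histogram --stat` is not overridden by
the driver.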
John Cai Feb. 16, 2023, 8:37 p.m. UTC | #3
On 23/02/14 07:20PM, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> 
> >> diff --git a/diff.c b/diff.c
> >> index 92a0eab942e..24da439e56f 100644
> >> --- a/diff.c
> >> +++ b/diff.c
> >> @@ -4456,15 +4456,11 @@ static void run_diff_cmd(const char *pgm,
> >>  	const char *xfrm_msg = NULL;
> >>  	int complete_rewrite = (p->status == DIFF_STATUS_MODIFIED) && p->score;
> >>  	int must_show_header = 0;
> >> +	struct userdiff_driver *drv = userdiff_find_by_path(o->repo->index, attr_path);
> >
> > Do we run this look-up unconditionally, even when .allow_external
> > bit is not set?  Why?
> 
> Ah, this is perfectly fine.  It used to be that this codepath could
> tell that there was no need to check the diff driver when it was told
> never to use any external diff driver.  Now, even when it is computing
> the diff internally, it needs to check the diff driver to find out
> the favoured algorithm for the path.
> 
> Strictly speaking, if we are told NOT to use external diff driver,
> and if we are told NOT to pay attention to algorithm given by the
> diff driver, then we know we can skip the overhead of attribute
> look-up.  I.e. we could do this to avoid attribute look-up:
> 
> 	struct userdiff_driver *drv = NULL;
> 
> 	if (o->flags.allow_external || !o->ignore_driver_algorithm)
> 		drv = userdiff_find_by_path(...);
> 
> 	if (drv && o->flags.allow_external && drv->external)
> 		pgm = drv->external;
> 	...
> 	if (pgm)
> 		... do the external diff thing ...
> 	if (one && two) {
> 		if (drv && !o->ignore_driver_algorithm && drv->algorithm)
> 			set_diff_algorithm(...)
> 
> I was not sure if it would be worth it before writing the above
> down, but the resulting flow does not look _too_ bad.

Yes, I think it's worth it to skip the attribute look-up when we know we
will use neither the external diff driver nor the driver's algorithm.

> 
> >> @@ -4583,6 +4584,10 @@ static void run_diffstat(struct diff_filepair *p, struct diff_options *o,
> >>  	const char *name;
> >>  	const char *other;
> >>  
> >> +	struct userdiff_driver *drv = userdiff_find_by_path(o->repo->index, p->one->path);
> >> +	if (drv && drv->algorithm)
> >> +		set_diff_algorithm(o, drv->algorithm);
> >
> > Interesting.  Does external diff play a role, like in run_diff_cmd()
> > we saw earlier?
> 
> As whoever wrote "diffstat" did not think of counting output from
> external diff driver, of course in this codepath external diff would
> not appear.  So what we see is very much expected.
> 
> Just move the blank line we see before these new lines one line
> down, so that the variable decls are grouped together, with a blank
> line before the first executable statement.  I.e.
> 
> 	const char *name;
> 	const char *other;
> +	struct userdiff_driver *drv;
> +
> +	drv = userdiff_find_by_path(...);
> +	if (drv && drv->algorithm)
> +		set_diff_algorithm(o, drv->algorithm);

Makes sense, thanks.

> 
> Shouldn't this function refrain from setting algorithm from the
> driver when the algorithm was given elsewhere?  E.g.
> 
> 	$ git show --histogram --stat
> 	
> or something?  IOW, shouldn't it also pay attention to
> o->ignore_driver_algorithm bit, just like run_diff_cmd() did?

Yes, we should add the same guard as above.


Patch

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index c19e64ea0ef..7e69f509d0a 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -736,7 +736,6 @@  String::
 	by the configuration variables in the "diff.foo" section of the
 	Git config file.
 
-
 Defining an external diff driver
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -758,6 +757,46 @@  with the above configuration, i.e. `j-c-diff`, with 7
 parameters, just like `GIT_EXTERNAL_DIFF` program is called.
 See linkgit:git[1] for details.
 
+Setting the internal diff algorithm
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The diff algorithm can be set through the `diff.algorithm` config key, but
+sometimes it may be helpful to set the diff algorithm by path. For example, one
+might wish to set a diff algorithm automatically for all `.json` files such that
+the user would not need to pass in a separate command line `--diff-algorithm` flag each
+time.
+
+First, in `.gitattributes`, you would assign the `diff` attribute for paths.
+
+*Git attributes*
+------------------------
+*.json diff=<name>
+------------------------
+
+Then, you would define a "diff.<name>.algorithm" configuration to specify the
+diff algorithm, choosing from `myers`, `patience`, `minimal`, and `histogram`.
+
+*Git config*
+
+----------------------------------------------------------------
+[diff "<name>"]
+  algorithm = histogram
+----------------------------------------------------------------
+
+This diff algorithm applies to git-diff(1), including the `--stat` output.
+
+NOTE: If the `command` key also exists, then Git will treat this as an external
+diff and attempt to use the value set for `command` as an external program. For
+instance, the following config, combined with the above `.gitattributes` file,
+will result in `command` favored over `algorithm`.
+
+*Git config*
+
+----------------------------------------------------------------
+[diff "<name>"]
+  command = j-c-diff
+  algorithm = histogram
+----------------------------------------------------------------
 
 Defining a custom hunk-header
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/diff.c b/diff.c
index 92a0eab942e..24da439e56f 100644
--- a/diff.c
+++ b/diff.c
@@ -4456,15 +4456,11 @@  static void run_diff_cmd(const char *pgm,
 	const char *xfrm_msg = NULL;
 	int complete_rewrite = (p->status == DIFF_STATUS_MODIFIED) && p->score;
 	int must_show_header = 0;
+	struct userdiff_driver *drv = userdiff_find_by_path(o->repo->index, attr_path);
 
-
-	if (o->flags.allow_external) {
-		struct userdiff_driver *drv;
-
-		drv = userdiff_find_by_path(o->repo->index, attr_path);
+	if (o->flags.allow_external)
 		if (drv && drv->external)
 			pgm = drv->external;
-	}
 
 	if (msg) {
 		/*
@@ -4481,12 +4477,17 @@  static void run_diff_cmd(const char *pgm,
 		run_external_diff(pgm, name, other, one, two, xfrm_msg, o);
 		return;
 	}
-	if (one && two)
+	if (one && two) {
+		if (!o->xdl_opts_command_line)
+			if (drv && drv->algorithm)
+				set_diff_algorithm(o, drv->algorithm);
+
 		builtin_diff(name, other ? other : name,
 			     one, two, xfrm_msg, must_show_header,
 			     o, complete_rewrite);
-	else
+	} else {
 		fprintf(o->file, "* Unmerged path %s\n", name);
+	}
 }
 
 static void diff_fill_oid_info(struct diff_filespec *one, struct index_state *istate)
@@ -4583,6 +4584,10 @@  static void run_diffstat(struct diff_filepair *p, struct diff_options *o,
 	const char *name;
 	const char *other;
 
+	struct userdiff_driver *drv = userdiff_find_by_path(o->repo->index, p->one->path);
+	if (drv && drv->algorithm)
+		set_diff_algorithm(o, drv->algorithm);
+
 	if (DIFF_PAIR_UNMERGED(p)) {
 		/* unmerged */
 		builtin_diffstat(p->one->path, NULL, NULL, NULL,
@@ -5130,6 +5135,8 @@  static int diff_opt_diff_algorithm(const struct option *opt,
 		return error(_("option diff-algorithm accepts \"myers\", "
 			       "\"minimal\", \"patience\" and \"histogram\""));
 
+	options->xdl_opts_command_line = 1;
+
 	return 0;
 }
 
@@ -5157,6 +5164,8 @@  static int diff_opt_diff_algorithm_no_arg(const struct option *opt,
 		BUG("available diff algorithms include \"myers\", "
 			       "\"minimal\", \"patience\" and \"histogram\"");
 
+	options->xdl_opts_command_line = 1;
+
 	return 0;
 }
 
diff --git a/diff.h b/diff.h
index 41eb2c3d428..46b565abfd4 100644
--- a/diff.h
+++ b/diff.h
@@ -333,6 +333,8 @@  struct diff_options {
 	int prefix_length;
 	const char *stat_sep;
 	int xdl_opts;
+	/* If xdl_opts has been set via the command line. */
+	int xdl_opts_command_line;
 
 	/* see Documentation/diff-options.txt */
 	char **anchors;
diff --git a/t/lib-diff-alternative.sh b/t/lib-diff-alternative.sh
index 8d1e408bb58..2dc02bca873 100644
--- a/t/lib-diff-alternative.sh
+++ b/t/lib-diff-alternative.sh
@@ -105,10 +105,46 @@  index $file1..$file2 100644
  }
 EOF
 
+	cat >expect_diffstat <<EOF
+ file1 => file2 | 21 ++++++++++-----------
+ 1 file changed, 10 insertions(+), 11 deletions(-)
+EOF
+
 	STRATEGY=$1
 
+	test_expect_success "$STRATEGY diff from attributes" '
+		echo "file* diff=driver" >.gitattributes &&
+		git config diff.driver.algorithm "$STRATEGY" &&
+		test_must_fail git diff --no-index file1 file2 > output &&
+		cat expect &&
+		cat output &&
+		test_cmp expect output
+	'
+
+	test_expect_success "$STRATEGY diff from attributes has valid diffstat" '
+		echo "file* diff=driver" >.gitattributes &&
+		git config diff.driver.algorithm "$STRATEGY" &&
+		test_must_fail git diff --stat --no-index file1 file2 > output &&
+		test_cmp expect_diffstat output
+	'
+
 	test_expect_success "$STRATEGY diff" '
-		test_must_fail git diff --no-index "--$STRATEGY" file1 file2 > output &&
+		test_must_fail git diff --no-index "--diff-algorithm=$STRATEGY" file1 file2 > output &&
+		test_cmp expect output
+	'
+
+	test_expect_success "$STRATEGY diff command line precedence before attributes" '
+		echo "file* diff=driver" >.gitattributes &&
+		git config diff.driver.algorithm myers &&
+		test_must_fail git diff --no-index "--diff-algorithm=$STRATEGY" file1 file2 > output &&
+		test_cmp expect output
+	'
+
+	test_expect_success "$STRATEGY diff attributes precedence before config" '
+		git config diff.algorithm default &&
+		echo "file* diff=driver" >.gitattributes &&
+		git config diff.driver.algorithm "$STRATEGY" &&
+		test_must_fail git diff --no-index file1 file2 > output &&
 		test_cmp expect output
 	'
 
diff --git a/userdiff.c b/userdiff.c
index d71b82feb74..ff25cfc4b4c 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -293,7 +293,7 @@  PATTERNS("scheme",
 	 "|([^][)(}{[ \t])+"),
 PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$",
 	 "\\\\[a-zA-Z@]+|\\\\.|[a-zA-Z0-9\x80-\xff]+"),
-{ "default", NULL, -1, { NULL, 0 } },
+{ "default", NULL, NULL, -1, { NULL, 0 } },
 };
 #undef PATTERNS
 #undef IPATTERN
@@ -394,6 +394,8 @@  int userdiff_config(const char *k, const char *v)
 		return parse_bool(&drv->textconv_want_cache, k, v);
 	if (!strcmp(type, "wordregex"))
 		return git_config_string(&drv->word_regex, k, v);
+	if (!strcmp(type, "algorithm"))
+		return git_config_string(&drv->algorithm, k, v);
 
 	return 0;
 }
diff --git a/userdiff.h b/userdiff.h
index aee91bc77e6..24419db6973 100644
--- a/userdiff.h
+++ b/userdiff.h
@@ -14,6 +14,7 @@  struct userdiff_funcname {
 struct userdiff_driver {
 	const char *name;
 	const char *external;
+	const char *algorithm;
 	int binary;
 	struct userdiff_funcname funcname;
 	const char *word_regex;