
[03/11] merge-ort: implement detect_regular_renames()

Message ID ba30bc8686ef11115b369d351b4447c75a7bb9b5.1607542887.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Series merge-ort: add basic rename detection

Commit Message

Elijah Newren Dec. 9, 2020, 7:41 p.m. UTC
From: Elijah Newren <newren@gmail.com>

Based heavily on merge-recursive's get_diffpairs() function.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

Comments

Derrick Stolee Dec. 11, 2020, 2:54 a.m. UTC | #1
On 12/9/2020 2:41 PM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
> 
> Based heavily on merge-recursive's get_diffpairs() function.

(You're not kidding, and I should have looked here before making
some comments below.)

> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
>  merge-ort.c | 32 +++++++++++++++++++++++++++++++-
>  1 file changed, 31 insertions(+), 1 deletion(-)
> 
> diff --git a/merge-ort.c b/merge-ort.c
> index 92b765dd3f0..1ff637e57af 100644
> --- a/merge-ort.c
> +++ b/merge-ort.c
> @@ -634,7 +634,33 @@ static void detect_regular_renames(struct merge_options *opt,
>  				   struct tree *side,
>  				   unsigned side_index)
>  {
> -	die("Not yet implemented.");
> +	struct diff_options diff_opts;
> +	struct rename_info *renames = opt->priv->renames;
> +
> +	repo_diff_setup(opt->repo, &diff_opts);
> +	diff_opts.flags.recursive = 1;
> +	diff_opts.flags.rename_empty = 0;
> +	diff_opts.detect_rename = DIFF_DETECT_RENAME;
> +	diff_opts.rename_limit = opt->rename_limit;

I assume that opt->rename_limit has been initialized properly
against merge.renameLimit/diff.renameLimit in another location...

> +	if (opt->rename_limit <= 0)
> +		diff_opts.rename_limit = 1000;

(I made the following comments before thinking to look at
get_diffpairs() which behaves in an equivalent way with this
"1000" constant limit. I'm not sure if there is a reason why
this limit is different from the _other_ limits I discovered,
but it might still be good to reduce magic literal ints by
grouping this "1000" into a const or macro.)

...and this just assigns the default again. Why is this done
here instead of inside the diff machinery? Also, wouldn't a
diff.renameLimit = 0 imply no renames, not "use default"?

I notice that the docs don't make this explicit:

diff.renameLimit::
	The number of files to consider when performing the copy/rename
	detection; equivalent to the 'git diff' option `-l`. This setting
	has no effect if rename detection is turned off.

but also too_many_rename_candidates() has this strange
default check:

	/*
	 * This basically does a test for the rename matrix not
	 * growing larger than a "rename_limit" square matrix, ie:
	 *
	 *    num_create * num_src > rename_limit * rename_limit
	 */
	if (rename_limit <= 0)
		rename_limit = 32767;

this is... a much larger limit than I would think is reasonable.

Of course, diff_rename_limit_default is set to 400 inside diff.c.
Should that be extracted as a constant so we can repeat it here?

> +	diff_opts.rename_score = opt->rename_score;
> +	diff_opts.show_rename_progress = opt->show_rename_progress;
> +	diff_opts.output_format = DIFF_FORMAT_NO_OUTPUT;
> +	diff_setup_done(&diff_opts);
> +	diff_tree_oid(&merge_base->object.oid, &side->object.oid, "",
> +		      &diff_opts);
> +	diffcore_std(&diff_opts);
> +
> +	if (diff_opts.needed_rename_limit > opt->priv->renames->needed_limit)
> +		opt->priv->renames->needed_limit = diff_opts.needed_rename_limit;
> +
> +	renames->pairs[side_index] = diff_queued_diff;
> +
> +	diff_opts.output_format = DIFF_FORMAT_NO_OUTPUT;
> +	diff_queued_diff.nr = 0;
> +	diff_queued_diff.queue = NULL;
> +	diff_flush(&diff_opts);
>  }
>  
>  /*
> @@ -1379,6 +1405,10 @@ void merge_switch_to_result(struct merge_options *opt,
>  			printf("%s", sb->buf);
>  		}
>  		string_list_clear(&olist, 0);
> +
> +		/* Also include needed rename limit adjustment now */
> +		diff_warn_rename_limit("merge.renamelimit",
> +				       opti->renames->needed_limit, 0);

I suppose this new call is appropriate in this patch, since you assign
the value inside detect_regular_renames(), but it might be good to
describe its presence in the commit message.

Thanks,
-Stolee
Elijah Newren Dec. 11, 2020, 5:38 p.m. UTC | #2
On Thu, Dec 10, 2020 at 6:54 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 12/9/2020 2:41 PM, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
> >
> > Based heavily on merge-recursive's get_diffpairs() function.
>
> (You're not kidding, and I should have looked here before making
> some comments below.)

I can provide some extra background on all the crazy magic numbers and
nonsensical treatment of tiny values, though.  And since you were so
curious about these, I have an excuse to dump more info on you than
you probably were bargaining for...  :-)

> > Signed-off-by: Elijah Newren <newren@gmail.com>
> > ---
> >  merge-ort.c | 32 +++++++++++++++++++++++++++++++-
> >  1 file changed, 31 insertions(+), 1 deletion(-)
> >
> > diff --git a/merge-ort.c b/merge-ort.c
> > index 92b765dd3f0..1ff637e57af 100644
> > --- a/merge-ort.c
> > +++ b/merge-ort.c
> > @@ -634,7 +634,33 @@ static void detect_regular_renames(struct merge_options *opt,
> >                                  struct tree *side,
> >                                  unsigned side_index)
> >  {
> > -     die("Not yet implemented.");
> > +     struct diff_options diff_opts;
> > +     struct rename_info *renames = opt->priv->renames;
> > +
> > +     repo_diff_setup(opt->repo, &diff_opts);
> > +     diff_opts.flags.recursive = 1;
> > +     diff_opts.flags.rename_empty = 0;
> > +     diff_opts.detect_rename = DIFF_DETECT_RENAME;
> > +     diff_opts.rename_limit = opt->rename_limit;
>
> I assume that opt->rename_limit has been initialized properly
> against merge.renameLimit/diff.renameLimit in another location...

Yes, see init_merge_options() and merge_recursive_config() in
merge-recursive.c.  People using merge-ort will nevertheless be using
some functions out of merge-recursive.c...for now.

> > +     if (opt->rename_limit <= 0)
> > +             diff_opts.rename_limit = 1000;
>
> (I made the following comments before thinking to look at
> get_diffpairs() which behaves in an equivalent way with this
> "1000" constant limit. I'm not sure if there is a reason why
> this limit is different from the _other_ limits I discovered,
> but it might still be good to reduce magic literal ints by
> grouping this "1000" into a const or macro.)

I'll discuss the value of 1000 later...

> ...and this just assigns the default again. Why is this done
> here instead of inside the diff machinery? Also, wouldn't a
> diff.renameLimit = 0 imply no renames, not "use default"?

Yes, I totally agree that would make more sense, but backward
compatibility sometimes requires violating common sense.  See commit
89973554b5 ("diffcore-rename: make diff-tree -l0 mean -l<large>",
2017-11-29).  For the same reasons discussed in that commit, I'm
hesitant to change what is used here; it's a backward compatibility
concern now.

One reason opt->rename_limit could be 0 is if some caller does the following:

   struct merge_options opt;
   memset(&opt, 0, sizeof(opt));
   opt.ancestor = ....;
   /* forget to set opt.rename_limit */
   merge_incore_nonrecursive(&opt, ...);

The most likely reason for a negative value is probably that
init_merge_options() in merge-recursive.c set opt->rename_limit to -1.
Having init_merge_options() set the value to the actual default
probably would have made more sense, but the
assign-it-to-negative-one-and-deal-with-it-later approach goes back to the
introduction of init_merge_options() in 2008.  Actually, if you ignore
init_merge_options() the same thing was being done before back in 2007
as soon as any limit handling was introduced to the code.

Since init_merge_options() is shared for now between merge-recursive.c
and merge-ort.c, any updates I make here would necessitate similar
code updates to merge-recursive.c.

Also, it's not just internal code callers.  Someone could set
merge.renameLimit or diff.renameLimit in their repository (or their
global .gitconfig) to a non-positive value and get this behavior of
treat-non-positive-as-whatever-the-default-is.
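
To make the fallback concrete, here is a minimal stand-alone sketch of
the rule described above.  It is not code from merge-ort.c: the names
MERGE_DEFAULT_RENAME_LIMIT and effective_rename_limit() are
hypothetical, and only the "<= 0 means use 1000" behavior is taken from
the patch.

```c
/* Hypothetical stand-alone model; these names do not exist in git. */
enum { MERGE_DEFAULT_RENAME_LIMIT = 1000 };

/*
 * Mirrors the check in the patch: any non-positive request, whether it
 * came from a zeroed struct, from the -1 set at init time, or from a
 * non-positive config value, falls back to the merge default.
 */
static int effective_rename_limit(int requested)
{
	return requested > 0 ? requested : MERGE_DEFAULT_RENAME_LIMIT;
}
```

All three sources of a non-positive value end up on the same path, which
is why the check sits at the point of use rather than in each caller.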

> I notice that the docs don't make this explicit:
>
> diff.renameLimit::
>         The number of files to consider when performing the copy/rename
>         detection; equivalent to the 'git diff' option `-l`. This setting
>         has no effect if rename detection is turned off.

See also https://lore.kernel.org/git/20180426162339.db6b4855fedb5e5244ba7dd1@google.com/
where we talked about documenting the special value of 0 (in that case
for diff -l, though merge.renameLimit should have one too), but we
obviously never got around to it.  Yet.  (I did at least put it on my
projects list, though things sometimes languish there for years.)

> but also too_many_rename_candidates() has this strange
> default check:
>
>         /*
>          * This basically does a test for the rename matrix not
>          * growing larger than a "rename_limit" square matrix, ie:
>          *
>          *    num_create * num_src > rename_limit * rename_limit
>          */
>         if (rename_limit <= 0)
>                 rename_limit = 32767;
>
> this is... a much larger limit than I would think is reasonable.

The value of 32767 came from backward compatibility and in particular
from the exact same commit referenced above -- 89973554b5
("diffcore-rename: make diff-tree -l0 mean -l<large>", 2017-11-29).

Also, perhaps this value is *smaller* than reasonable -- I've used
values like 48941 before on real world repositories.  (And I'm not
picking a random large value to report; *that* exact value came up
enough times that I remember that particular one.)  If 0 (or negative)
is supposed to mean "large", then shouldn't it handle values people
use on real world repositories?  (Not that I care that much, because I
think the usage of 0 to mean "large" is kind of illogical, so I'll
avoid it and discourage others from using it.)

I do know where the 32767 came from, though.  Once upon a time, 32767
was "the biggest supported value possible" and in fact any other
number was silently capped to 32767.  This of course led to a number
of issues.  See commit 9f7e4bfa3b ("diff: remove silent clamp of
renameLimit", 2017-11-13) and perhaps also commits b520abf1c8
("sequencer: warn when internal merge may be suboptimal due to
renameLimit", 2017-11-13) and d6861d0258 ("progress: fix progress
meters when dealing with lots of work", 2017-11-13).
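
As a rough illustration of why 32767 can be too small, here is a
stand-alone sketch of the square-matrix test from the quoted comment.
It is a model only, not the actual too_many_rename_candidates() code:
the function name exceeds_rename_limit() is made up, and the 64-bit
arithmetic is my assumption for avoiding overflow in the products.

```c
#include <stdint.h>

/*
 * Hypothetical model of the check in the quoted comment:
 *   num_create * num_src > rename_limit * rename_limit
 * with a non-positive limit treated as 32767.
 */
static int exceeds_rename_limit(int num_create, int num_src, int rename_limit)
{
	if (rename_limit <= 0)
		rename_limit = 32767;
	/* Widen before multiplying: 48941 * 48941 does not fit in 32 bits. */
	return (uint64_t)num_create * (uint64_t)num_src >
	       (uint64_t)rename_limit * (uint64_t)rename_limit;
}
```

Under this model a 48941 x 48941 matrix (2395221481 cells) is over the
"large" limit implied by 0 (32767 * 32767 = 1073676289 cells), so a
repository needing 48941 would still be told the limit is too low.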

> Of course, diff_rename_limit_default is set to 400 inside diff.c.
> Should that be extracted as a constant so we can repeat it here?

I think it makes sense to have merge have a higher default rename
limit than diffs.  I can see folks just doing a "git log -p" and not
wanting individual commits to take a long time, especially since it's
not at all clear that most of the commits are going to be of interest to
the user.  In contrast, when merging, the commits are definitely of
interest to the user, and spending a little more time on a few commits
provides a nice payoff.

Also, merges provide progress meters on rename detection; I don't
think that log -p does.  I think that the presence of progress meters
makes it easier to deal with larger values as well.

It may also be worth noting that both of these numbers were modified
in the same commit in the past and retained distinct values; see
commit 92c57e5c1d ("bump rename limit defaults (again)", 2011-02-19).

After all my rename optimizations, all those cases that used to
require limits in the 20k ~ 50k range can now all complete with a
limit under 1000, and quite rapidly.  (It was really hard to get one
of them under 1000, though.  It stubbornly required a value of 1002
until I figured out another optimization allowing me to avoid
detecting more renames without any change in behavior.)  It's nice
that it's fast, and it's also nice that rename detection just works
instead of having the merge throw a warning that the limit was too
low, doing the merge all wrong, and expecting the user to undo the
merge, set the limit higher, and redo it.

400 definitely isn't high enough.  I'm actually tempted to double the
1000 to buy more room.  The last bump was about a decade ago, and it
noted that processors had gotten faster since the bump before it, so
perhaps it is time to bump it again.

All that said, it could possibly make sense to define 1000 as a
special constant near the top of the file and then use it via whatever
macro/constant/variable name we give it.  Such a change would make it
harder to compare this patch to get_diffpairs() in merge-recursive.c,
though...

> > +     diff_opts.rename_score = opt->rename_score;
> > +     diff_opts.show_rename_progress = opt->show_rename_progress;
> > +     diff_opts.output_format = DIFF_FORMAT_NO_OUTPUT;
> > +     diff_setup_done(&diff_opts);
> > +     diff_tree_oid(&merge_base->object.oid, &side->object.oid, "",
> > +                   &diff_opts);
> > +     diffcore_std(&diff_opts);
> > +
> > +     if (diff_opts.needed_rename_limit > opt->priv->renames->needed_limit)
> > +             opt->priv->renames->needed_limit = diff_opts.needed_rename_limit;
> > +
> > +     renames->pairs[side_index] = diff_queued_diff;
> > +
> > +     diff_opts.output_format = DIFF_FORMAT_NO_OUTPUT;
> > +     diff_queued_diff.nr = 0;
> > +     diff_queued_diff.queue = NULL;
> > +     diff_flush(&diff_opts);
> >  }
> >
> >  /*
> > @@ -1379,6 +1405,10 @@ void merge_switch_to_result(struct merge_options *opt,
> >                       printf("%s", sb->buf);
> >               }
> >               string_list_clear(&olist, 0);
> > +
> > +             /* Also include needed rename limit adjustment now */
> > +             diff_warn_rename_limit("merge.renamelimit",
> > +                                    opti->renames->needed_limit, 0);
>
> I suppose this new call is appropriate in this patch, since you assign
> the value inside detect_regular_renames(), but it might be good to
> describe its presence in the commit message.

Sure, I can add a note.
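
For readers following the needed_limit bookkeeping, here is a small
stand-alone sketch of the pattern: each side's rename detection reports
the limit it would have needed, the largest value is kept, and a single
warning decision is made at the end.  The struct and function names are
hypothetical, and treating "nonzero" as the condition for warning is an
assumption about diff_warn_rename_limit(), not something shown in this
patch.

```c
/* Hypothetical stand-in for the needed_limit field in the patch. */
struct rename_tracker {
	int needed_limit;
};

/* Keep the largest limit any side would have needed. */
static void note_needed_limit(struct rename_tracker *t, int needed_by_side)
{
	if (needed_by_side > t->needed_limit)
		t->needed_limit = needed_by_side;
}

/* Assumption: a warning is only worth emitting for a nonzero value. */
static int should_warn_about_limit(const struct rename_tracker *t)
{
	return t->needed_limit > 0;
}
```

Recording the maximum in detect_regular_renames() and warning once in
merge_switch_to_result() means the user sees one suggested value rather
than one message per side.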

Patch

diff --git a/merge-ort.c b/merge-ort.c
index 92b765dd3f0..1ff637e57af 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -634,7 +634,33 @@  static void detect_regular_renames(struct merge_options *opt,
 				   struct tree *side,
 				   unsigned side_index)
 {
-	die("Not yet implemented.");
+	struct diff_options diff_opts;
+	struct rename_info *renames = opt->priv->renames;
+
+	repo_diff_setup(opt->repo, &diff_opts);
+	diff_opts.flags.recursive = 1;
+	diff_opts.flags.rename_empty = 0;
+	diff_opts.detect_rename = DIFF_DETECT_RENAME;
+	diff_opts.rename_limit = opt->rename_limit;
+	if (opt->rename_limit <= 0)
+		diff_opts.rename_limit = 1000;
+	diff_opts.rename_score = opt->rename_score;
+	diff_opts.show_rename_progress = opt->show_rename_progress;
+	diff_opts.output_format = DIFF_FORMAT_NO_OUTPUT;
+	diff_setup_done(&diff_opts);
+	diff_tree_oid(&merge_base->object.oid, &side->object.oid, "",
+		      &diff_opts);
+	diffcore_std(&diff_opts);
+
+	if (diff_opts.needed_rename_limit > opt->priv->renames->needed_limit)
+		opt->priv->renames->needed_limit = diff_opts.needed_rename_limit;
+
+	renames->pairs[side_index] = diff_queued_diff;
+
+	diff_opts.output_format = DIFF_FORMAT_NO_OUTPUT;
+	diff_queued_diff.nr = 0;
+	diff_queued_diff.queue = NULL;
+	diff_flush(&diff_opts);
 }
 
 /*
@@ -1379,6 +1405,10 @@  void merge_switch_to_result(struct merge_options *opt,
 			printf("%s", sb->buf);
 		}
 		string_list_clear(&olist, 0);
+
+		/* Also include needed rename limit adjustment now */
+		diff_warn_rename_limit("merge.renamelimit",
+				       opti->renames->needed_limit, 0);
 	}
 
 	merge_finalize(opt, result);