diff mbox series

combine-diff: handle --find-object in multitree code path

Message ID 20200930115240.GA1899467@coredump.intra.peff.net (mailing list archive)
State Accepted
Commit 637072574f05596cff629653dc946f0902129ad6
Headers show
Series combine-diff: handle --find-object in multitree code path | expand

Commit Message

Jeff King Sept. 30, 2020, 11:52 a.m. UTC
When doing combined diffs, we have two possible code paths:

  - a slower one which independently diffs against each parent, applies
    any filters, and then intersects the resulting paths

  - a faster one which walks all trees simultaneously

When the diff options specify that we must do certain filters, like
pickaxe, then we always use the slow path, since the pickaxe code only
knows how to handle filepairs, not the n-parent entries generated for
combined diffs.

But there are two problems with the slow path:

 1. It's slow. Running:

      git rev-list HEAD | git diff-tree --stdin -r -c

    in git.git takes ~3s on my machine. But adding "--find-object" to
    that increases it to ~6s, even though find-object itself should
    incur only a few extra oid comparisons. On linux.git, it's even
    worse: 35s versus 215s.

 2. It doesn't catch all cases where a particular path is interesting.
    Consider a merge with parent blobs X and Y for a particular path,
    and end result Z. That should be interesting according to "-c",
    because the result doesn't match either parent. And it should be
    interesting even with "--find-object=X", because "X" went away in
    the merge.

    But because we perform each pairwise diff independently, this
    confuses the intersection code. The change from X to Z is still
    interesting according to --find-object. But in the other parent we
    went from Y to Z, so the diff appears empty! That causes the
    intersection code to think that parent didn't change the path, and
    thus it's not interesting for "-c".

This patch fixes both by implementing --find-object for the multitree
code. It's a bit unfortunate that we have to duplicate some logic from
diffcore-pickaxe, but this is the best we can do for now. In an ideal
world, all of the diffcore code would stop thinking about filepairs and
start thinking about n-parent sets, and we could use the multitree walk
with all of it.

Until then, there are some leftover warts:

  - other pickaxe operations, like -S or -G, still suffer from both
    problems. These would be hard to adapt because they rely on having
    a diff_filespec() for each path to look at content. And we'd need to
    define what an n-way "change" means in each case (probably easy for
    "-S", which can compare counts, but not so clear for -G, which is
    about grepping diffs).

  - other options besides --find-object may cause us to use the slow
    pairwise path, in which case we'll go back to producing a different
    (wrong) answer for the X/Y/Z case above.

We may be able to hack around these, but I think the ultimate solution
will be a larger rewrite of the diffcore code. For now, this patch
improves one specific case but leaves the rest.

Signed-off-by: Jeff King <peff@peff.net>
---
I'm a little nervous that the second "wart" may actually be making
things worse, because now we sometimes produce a wrong answer and
sometime a right one, and it can be difficult to know which options
cause which (e.g., rename detection puts us onto the slow path). Is it
worse to sometimes be right and sometimes wrong, or to always be
consistently and predictably wrong? I suppose one could even argue that
the current semantics aren't "wrong", but just what we happen to
produce. But IMHO they are so un-useful as to be considered wrong.

 combine-diff.c          | 43 ++++++++++++++++++++++++++++++--
 t/t4064-diff-oidfind.sh | 55 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 96 insertions(+), 2 deletions(-)

Comments

Chris Torek Sept. 30, 2020, 8:06 p.m. UTC | #1
On Wed, Sep 30, 2020 at 4:54 AM Jeff King <peff@peff.net> wrote:
> I'm a little nervous that the second "wart" may actually be making
> things worse, because now we sometimes produce a wrong answer and
> sometime a right one, and it can be difficult to know which options
> cause which (e.g., rename detection puts us onto the slow path). Is it
> worse to sometimes be right and sometimes wrong, or to always be
> consistently and predictably wrong? I suppose one could even argue that
> the current semantics aren't "wrong", but just what we happen to
> produce. But IMHO they are so un-useful as to be considered wrong.

"Predictably wrong" *is* actually useful while "unpredictably wrong"
is, um, "less useful".  Perhaps just documenting exactly which options
use which path?  Basically turn some of this into documentation.

Chris
Junio C Hamano Sept. 30, 2020, 10:07 p.m. UTC | #2
Jeff King <peff@peff.net> writes:

>  2. It doesn't catch all cases where a particular path is interesting.
>     Consider a merge with parent blobs X and Y for a particular path,
>     and end result Z. That should be interesting according to "-c",
>     because the result doesn't match either parent. And it should be
>     interesting even with "--find-object=X", because "X" went away in
>     the merge.
>
>     But because we perform each pairwise diff independently, this
>     confuses the intersection code. The change from X to Z is still
>     interesting according to --find-object. But in the other parent we
>     went from Y to Z, so the diff appears empty! That causes the
>     intersection code to think that parent didn't change the path, and
>     thus it's not interesting for "-c".

Hmmmm....

> +test_expect_success 'do not detect merge that does not touch blob' '
> +	git checkout -B merge interesting &&
> +	git merge -m "untouched blob" base &&
> +	git diff-tree --format=%s --find-object=$blob -c --name-status HEAD >actual &&

You learn new things every day ;-)

I've always thought that for --find-object to do a good job, you'd
need "--full-history" and perhaps "-m".  Especially, I didn't expect
"-c" or "--cc" to make a difference.
Jeff King Sept. 30, 2020, 10:46 p.m. UTC | #3
On Wed, Sep 30, 2020 at 01:06:02PM -0700, Chris Torek wrote:

> On Wed, Sep 30, 2020 at 4:54 AM Jeff King <peff@peff.net> wrote:
> > I'm a little nervous that the second "wart" may actually be making
> > things worse, because now we sometimes produce a wrong answer and
> > sometime a right one, and it can be difficult to know which options
> > cause which (e.g., rename detection puts us onto the slow path). Is it
> > worse to sometimes be right and sometimes wrong, or to always be
> > consistently and predictably wrong? I suppose one could even argue that
> > the current semantics aren't "wrong", but just what we happen to
> > produce. But IMHO they are so un-useful as to be considered wrong.
> 
> "Predictably wrong" *is* actually useful while "unpredictably wrong"
> is, um, "less useful".  Perhaps just documenting exactly which options
> use which path?  Basically turn some of this into documentation.

Perhaps. The thing is that I do have a use case for which I need that
answer, and I need it to be reliable and fast. So right now
--find-object is not useful at all, and this makes it useful for at
least my narrow case.

My big concern is just making the overall system more confusing. I agree
that documenting _might_ help, but we're getting pretty far into
internals here. I'm not sure I'd want to expose the user to a list of
"these options trigger this code path, which behaves like this". It's
likely to generate more confusion than it solves, and would likely
bitrot anyway.

Maybe something like this would help:

diff --git a/Documentation/diff-generate-patch.txt b/Documentation/diff-generate-patch.txt
index b10ff4caa6..f61af6af18 100644
--- a/Documentation/diff-generate-patch.txt
+++ b/Documentation/diff-generate-patch.txt
@@ -200,3 +200,7 @@ parents).  When shown by `git diff-files -c`, it compares the
 two unresolved merge parents with the working tree file
 (i.e. file1 is stage 2 aka "our version", file2 is stage 3 aka
 "their version").
+
+Note that the results of using `-c` or `--cc` with diff-filtering
+options such as `-S` or `--find-object` are currently poorly defined,
+and are subject to change.

I'd perhaps even put that in a BUGS section, but that can't be done from
an include file like this.

-Peff
Jeff King Sept. 30, 2020, 10:54 p.m. UTC | #4
On Wed, Sep 30, 2020 at 03:07:15PM -0700, Junio C Hamano wrote:

> > +test_expect_success 'do not detect merge that does not touch blob' '
> > +	git checkout -B merge interesting &&
> > +	git merge -m "untouched blob" base &&
> > +	git diff-tree --format=%s --find-object=$blob -c --name-status HEAD >actual &&
> 
> You learn new things every day ;-)
> 
> I've always thought that for --find-object to do a good job, you'd
> need "--full-history" and perhaps "-m".  Especially, I didn't expect
> "-c" or "--cc" to make a difference.

You don't need --full-history, since there's no history simplification
going on here. And "-m" isn't very helpful with --find-object. It's
going to find every merge that crosses a boundary where the object was
introduced, since it will be introduced in the diff against the _other_
parent. E.g., in:

  A -- B
   \    \
    C -- M

If B introduces object X, then:

  git log --find-object=X -m M

is going to see the diff of C to M as introducing X. But M didn't do
anything interesting there; it just picked it up from the branch with B.

My concrete use case here, btw, is reporting to users which commit
introduced a blob (and at which path). It mostly works if you ignore
merges, but misses out on any evil merges which introduce the object.
Adding "-c" fixes that (but before this patch is slow and misses the
case where the merge removes the object).

-Peff
diff mbox series

Patch

diff --git a/combine-diff.c b/combine-diff.c
index 002e0e5438..23bca28746 100644
--- a/combine-diff.c
+++ b/combine-diff.c
@@ -1451,6 +1451,42 @@  static struct combine_diff_path *find_paths_multitree(
 	return paths_head.next;
 }
 
+static int match_objfind(struct combine_diff_path *path,
+			 int num_parent,
+			 const struct oidset *set)
+{
+	int i;
+	if (oidset_contains(set, &path->oid))
+		return 1;
+	for (i = 0; i < num_parent; i++) {
+		if (oidset_contains(set, &path->parent[i].oid))
+			return 1;
+	}
+	return 0;
+}
+
+static struct combine_diff_path *combined_objfind(struct diff_options *opt,
+						  struct combine_diff_path *paths,
+						  int num_parent)
+{
+	struct combine_diff_path *ret = NULL, **tail = &ret;
+	struct combine_diff_path *p = paths;
+
+	while (p) {
+		struct combine_diff_path *next = p->next;
+
+		if (match_objfind(p, num_parent, opt->objfind)) {
+			p->next = NULL;
+			*tail = p;
+			tail = &p->next;
+		} else {
+			free(p);
+		}
+		p = next;
+	}
+
+	return ret;
+}
 
 void diff_tree_combined(const struct object_id *oid,
 			const struct oid_array *parents,
@@ -1506,10 +1542,10 @@  void diff_tree_combined(const struct object_id *oid,
 			opt->flags.follow_renames	||
 			opt->break_opt != -1	||
 			opt->detect_rename	||
-			(opt->pickaxe_opts & DIFF_PICKAXE_KINDS_MASK)	||
+			(opt->pickaxe_opts &
+			 (DIFF_PICKAXE_KINDS_MASK & ~DIFF_PICKAXE_KIND_OBJFIND)) ||
 			opt->filter;
 
-
 	if (need_generic_pathscan) {
 		/*
 		 * NOTE generic case also handles --stat, as it computes
@@ -1523,6 +1559,9 @@  void diff_tree_combined(const struct object_id *oid,
 		int stat_opt;
 		paths = find_paths_multitree(oid, parents, &diffopts);
 
+		if (opt->pickaxe_opts & DIFF_PICKAXE_KIND_OBJFIND)
+			paths = combined_objfind(opt, paths, num_parent);
+
 		/*
 		 * show stat against the first parent even
 		 * when doing combined diff.
diff --git a/t/t4064-diff-oidfind.sh b/t/t4064-diff-oidfind.sh
index 3bdf317af8..6d8c8986fc 100755
--- a/t/t4064-diff-oidfind.sh
+++ b/t/t4064-diff-oidfind.sh
@@ -65,4 +65,59 @@  test_expect_success 'find a submodule' '
 	test_cmp expect actual
 '
 
+test_expect_success 'set up merge tests' '
+	test_commit base &&
+
+	git checkout -b boring base^ &&
+	echo boring >file &&
+	git add file &&
+	git commit -m boring &&
+
+	git checkout -b interesting base^ &&
+	echo interesting >file &&
+	git add file &&
+	git commit -m interesting &&
+
+	blob=$(git rev-parse interesting:file)
+'
+
+test_expect_success 'detect merge which introduces blob' '
+	git checkout -B merge base &&
+	git merge --no-commit boring &&
+	echo interesting >file &&
+	git commit -am "introduce blob" &&
+	git diff-tree --format=%s --find-object=$blob -c --name-status HEAD >actual &&
+	cat >expect <<-\EOF &&
+	introduce blob
+
+	AM	file
+	EOF
+	test_cmp expect actual
+'
+
+test_expect_success 'detect merge which removes blob' '
+	git checkout -B merge interesting &&
+	git merge --no-commit base &&
+	echo boring >file &&
+	git commit -am "remove blob" &&
+	git diff-tree --format=%s --find-object=$blob -c --name-status HEAD >actual &&
+	cat >expect <<-\EOF &&
+	remove blob
+
+	MA	file
+	EOF
+	test_cmp expect actual
+'
+
+test_expect_success 'do not detect merge that does not touch blob' '
+	git checkout -B merge interesting &&
+	git merge -m "untouched blob" base &&
+	git diff-tree --format=%s --find-object=$blob -c --name-status HEAD >actual &&
+	cat >expect <<-\EOF &&
+	untouched blob
+
+	EOF
+	test_cmp expect actual
+'
+
 test_done