
Histogram/patience diff matching lines with different counts

Message ID CANiSa6jtwizbR4K-DqdKjVeZqAkbswnPXCBZZrrfNy2CKBEQVg@mail.gmail.com (mailing list archive)
State New
Series Histogram/patience diff matching lines with different counts

Commit Message

Martin von Zweigbergk Jan. 9, 2025, 9:31 p.m. UTC
Hi,

Let's say you have a file with this content:
```
a
b
c
d
e
f
```

Then you change it to this:
```
a
b2
c
d2
c
e2
f
```

Note that most lines changed; `c` is unchanged but now appears twice.

Now `git diff --diff-algorithm=histogram` will show the diff reproduced
under "Patch" at the end of this message.

I'm surprised the first "c" line is considered unchanged. I thought
histogram diff was supposed to first match up unique lines between the
two sides and then gradually try higher and higher counts if there
were no unique lines. In this case, only "a" and "f" have count 1
(i.e. are unique) on both sides, so they would be matched up first.
After that, "c" is unique on the left side but has a different count
(namely 2) on the right side, so I would have thought that it should
not be considered matching. Does anyone know if it's implemented this
way on purpose? Actually, I think I remember reading that Git falls
back to Myers in some cases, so maybe that's what's going on here?
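
To make the counts in the paragraph above concrete, here is a minimal Python sketch that tallies how often each shared line occurs on each side. It is only an illustration, not Git's code; the names `left`, `right`, and `occurrence_counts` are made up for this example.

```python
from collections import Counter

# The two versions of the file from the example above.
left = ["a", "b", "c", "d", "e", "f"]
right = ["a", "b2", "c", "d2", "c", "e2", "f"]


def occurrence_counts(left_lines, right_lines):
    """Map each line present on both sides to (count on left, count on right)."""
    left_counts = Counter(left_lines)
    right_counts = Counter(right_lines)
    return {
        line: (left_counts[line], right_counts[line])
        for line in left_counts
        if line in right_counts
    }


counts = occurrence_counts(left, right)
for line, (n_left, n_right) in sorted(counts.items()):
    print(f"{line!r}: left={n_left} right={n_right}")
```

Only `a`, `c`, and `f` occur on both sides, and `c` is the one whose counts differ (1 on the left, 2 on the right).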

As some of you know, I work on the Jujutsu/jj VCS
(https://github.com/jj-vcs/jj). We also use histogram diff (and only
histogram diff) and actually allowed matching up lines with different
counts a while ago, but I thought it seemed too arbitrary to line up
the first matches if there were different counts, so we changed that.
Then we got a report from a user that Git behaves differently. See
https://github.com/jj-vcs/jj/issues/761#issuecomment-2581219294 for
more details.
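
The difference between the two behaviors can be sketched in Python. This is a deliberately simplified, hypothetical model of anchor selection, not the actual implementation in Git or jj: the function `anchors` and its `require_equal_counts` flag are invented for illustration, and the sketch ignores recursion into the gaps between anchors and does not check that anchors stay in increasing order.

```python
from collections import Counter

left = ["a", "b", "c", "d", "e", "f"]
right = ["a", "b2", "c", "d2", "c", "e2", "f"]


def anchors(left_lines, right_lines, require_equal_counts):
    """Return (left_index, right_index) pairs for lines used as anchors.

    Hypothetical simplification: a line is a candidate if it occurs exactly
    once on the left and at least once on the right. With
    require_equal_counts=True it must also occur exactly once on the right.
    When the counts differ, the first occurrence on the right is picked.
    """
    left_counts = Counter(left_lines)
    right_counts = Counter(right_lines)
    pairs = []
    for i, line in enumerate(left_lines):
        if left_counts[line] != 1 or right_counts[line] == 0:
            continue
        if require_equal_counts and right_counts[line] != 1:
            continue
        # list.index returns the first occurrence on the right side.
        pairs.append((i, right_lines.index(line)))
    return pairs


strict = anchors(left, right, require_equal_counts=True)
loose = anchors(left, right, require_equal_counts=False)
print("strict:", strict)
print("loose: ", loose)
```

Under the strict policy only `a` and `f` are anchored. Under the loose policy the first `c` on the right is anchored as well, which matches the context lines in the diff Git produces for this example.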

Thanks

Patch

diff --git a/file b/file
index 0fdf397..7cfb042 100644
--- a/file
+++ b/file
@@ -1,6 +1,7 @@ 
 a
-b
+b2
 c
-d
-e
+d2
+c
+e2
 f