Message ID | pull.1637.git.1705006074626.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | diffcore-delta: avoid ignoring final 'line' of file | expand |
On Thu, Jan 11, 2024 at 08:47:54PM +0000, Elijah Newren via GitGitGadget wrote: > --- > diffcore-delta.c | 4 ++++ > t/t4001-diff-rename.sh | 19 +++++++++++++++++++ > 2 files changed, 23 insertions(+) Nice find and fix. This patch LGTM. Thanks, Taylor
"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes: > diff --git a/diffcore-delta.c b/diffcore-delta.c > index c30b56e983b..7136c3dd203 100644 > --- a/diffcore-delta.c > +++ b/diffcore-delta.c > @@ -159,6 +159,10 @@ static struct spanhash_top *hash_chars(struct repository *r, > n = 0; > accum1 = accum2 = 0; > } > + if (n > 0) { > + hashval = (accum1 + accum2 * 0x61) % HASHBASE; > + hash = add_spanhash(hash, hashval, n); > + } OK, so we were ignoring the final short bit that is not terminated with LF due to the "continue". Nicely found. > diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh > index 85be1367de6..29299acbce7 100755 > --- a/t/t4001-diff-rename.sh > +++ b/t/t4001-diff-rename.sh > @@ -286,4 +286,23 @@ test_expect_success 'basename similarity vs best similarity' ' > test_cmp expected actual > ' > > +test_expect_success 'last line matters too' ' > + test_write_lines a 0 1 2 3 4 5 6 7 8 9 >nonewline && > + printf "git ignores final up to 63 characters if not newline terminated" >>nonewline && > + git add nonewline && > + git commit -m "original version of file with no final newline" && I found it misleading that the file whose name is nonewline has bunch of LF including on its last line ;-). > + # Change ONLY the first character of the whole file > + test_write_lines b 0 1 2 3 4 5 6 7 8 9 >nonewline && Same here, but it is too much to bother rewriting it ... { test_write_lines ... printf ... } >incomplete ... like so ("incomplete" stands for "file ending with an incomplete line"), so I'll let it pass. > + printf "git ignores final up to 63 characters if not newline terminated" >>nonewline && > + git add nonewline && > + git mv nonewline still-no-newline && > + git commit -a -m "rename nonewline -> still-no-newline" && > + git diff-tree -r -M01 --name-status HEAD^ HEAD >actual && > + cat >expected <<-\EOF && > + R097 nonewline still-no-newline I am not very happy with the hardcoded 97. You are already using the non-standard 10% threshold. If the delta detection that forgets about the last line is so broken as your proposed log message noted, shouldn't you be able to construct a sample pair of preimage and postimage for which the broken version gives so low similarity to be judged not worth treating as a rename, while the fixed version gives reasonable similarity to be made into a rename, by the default threshold? That way, the test only needs to see if we got a rename (with any similarity) or a delete and an add.
On Thu, Jan 11, 2024 at 3:00 PM Junio C Hamano <gitster@pobox.com> wrote: > > "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > diff --git a/diffcore-delta.c b/diffcore-delta.c > > index c30b56e983b..7136c3dd203 100644 > > --- a/diffcore-delta.c > > +++ b/diffcore-delta.c > > @@ -159,6 +159,10 @@ static struct spanhash_top *hash_chars(struct repository *r, > > n = 0; > > accum1 = accum2 = 0; > > } > > + if (n > 0) { > > + hashval = (accum1 + accum2 * 0x61) % HASHBASE; > > + hash = add_spanhash(hash, hashval, n); > > + } > > OK, so we were ignoring the final short bit that is not terminated > with LF due to the "continue". Nicely found. > > > diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh > > index 85be1367de6..29299acbce7 100755 > > --- a/t/t4001-diff-rename.sh > > +++ b/t/t4001-diff-rename.sh > > @@ -286,4 +286,23 @@ test_expect_success 'basename similarity vs best similarity' ' > > test_cmp expected actual > > ' > > > > +test_expect_success 'last line matters too' ' > > + test_write_lines a 0 1 2 3 4 5 6 7 8 9 >nonewline && > > + printf "git ignores final up to 63 characters if not newline terminated" >>nonewline && > > + git add nonewline && > > + git commit -m "original version of file with no final newline" && > > I found it misleading that the file whose name is nonewline has > bunch of LF including on its last line ;-). Yeah, sorry. It's been a while, but I think I started with a file with only one line that had no LF, but then realized for inexact rename detection to kick in I needed some other lines, at least one of which didn't match...but I forgot to update the filename after adding the other lines... > > + # Change ONLY the first character of the whole file > > + test_write_lines b 0 1 2 3 4 5 6 7 8 9 >nonewline && > > Same here, but it is too much to bother rewriting it ... > > { > test_write_lines ... > printf ... > } >incomplete > > ... like so ("incomplete" stands for "file ending with an incomplete line"), > so I'll let it pass. > > > + printf "git ignores final up to 63 characters if not newline terminated" >>nonewline && > > > > + git add nonewline && > > + git mv nonewline still-no-newline && > > + git commit -a -m "rename nonewline -> still-no-newline" && > > + git diff-tree -r -M01 --name-status HEAD^ HEAD >actual && > > + cat >expected <<-\EOF && > > + R097 nonewline still-no-newline > > I am not very happy with the hardcoded 97. You are already using > the non-standard 10% threshold. If the delta detection that > forgets about the last line is so broken as your proposed log > message noted, shouldn't you be able to construct a sample pair of > preimage and postimage for which the broken version gives so low > similarity to be judged not worth treating as a rename, while the > fixed version gives reasonable similarity to be made into a rename, > by the default threshold? That way, the test only needs to see if > we got a rename (with any similarity) or a delete and an add. Oops, the threshold is entirely unnecessary here; not sure why I didn't remember to take it out (originally used the threshold while testing without the fix to just how low of a similarity git thought these nearly identical files had). Since you don't like the threshold, and since we don't seem to have a summary format that reports on the rename without the percentage, I guess I need to munge the output with sed: sed -e "s/^R[0-9]* /R /" actual >actual.munged && and then compare the expected output to that. I'll send in a patch doing so and fix up the filenames and drop the rename threshold while at it.
Elijah Newren <newren@gmail.com> writes: >> I am not very happy with the hardcoded 97. You are already using >> the non-standard 10% threshold. If the delta detection that >> forgets about the last line is so broken as your proposed log >> message noted, shouldn't you be able to construct a sample pair of >> preimage and postimage for which the broken version gives so low >> similarity to be judged not worth treating as a rename, while the >> fixed version gives reasonable similarity to be made into a rename, >> by the default threshold? That way, the test only needs to see if >> we got a rename (with any similarity) or a delete and an add. > > Oops, the threshold is entirely unnecessary here; not sure why I > didn't remember to take it out (originally used the threshold while > testing without the fix to just how low of a similarity git thought > these nearly identical files had). > > Since you don't like the threshold, and since we don't seem to have a > summary format that reports on the rename without the percentage, I > guess I need to munge the output with sed: > > sed -e "s/^R[0-9]* /R /" actual >actual.munged && Heh, I was hoping that we should be able to use "diff --name-only". $ git mv Makefile Breakfile $ git diff --name-only -M HEAD Breakfile $ git reset --hard $ git rm Makefile $ >Breakfile && git add Breakfile $ git diff --name-only -M HEAD Breakfile Makefile $ git reset --hard
On Fri, Jan 12, 2024 at 10:22 PM Junio C Hamano <gitster@pobox.com> wrote: > > Elijah Newren <newren@gmail.com> writes: > > >> I am not very happy with the hardcoded 97. You are already using > >> the non-standard 10% threshold. If the delta detection that > >> forgets about the last line is so broken as your proposed log > >> message noted, shouldn't you be able to construct a sample pair of > >> preimage and postimage for which the broken version gives so low > >> similarity to be judged not worth treating as a rename, while the > >> fixed version gives reasonable similarity to be made into a rename, > >> by the default threshold? That way, the test only needs to see if > >> we got a rename (with any similarity) or a delete and an add. > > > > Oops, the threshold is entirely unnecessary here; not sure why I > > didn't remember to take it out (originally used the threshold while > > testing without the fix to just how low of a similarity git thought > > these nearly identical files had). > > > > Since you don't like the threshold, and since we don't seem to have a > > summary format that reports on the rename without the percentage, I > > guess I need to munge the output with sed: > > > > sed -e "s/^R[0-9]* /R /" actual >actual.munged && > > Heh, I was hoping that we should be able to use "diff --name-only". > > $ git mv Makefile Breakfile > $ git diff --name-only -M HEAD > Breakfile > $ git reset --hard > $ git rm Makefile > $ >Breakfile && git add Breakfile > $ git diff --name-only -M HEAD > Breakfile > Makefile > $ git reset --hard I guess we could, since the only thing in this repository is a file which is being renamed, and we can then deduce based on the setup that the change we expected must have happened. However, I didn't like the deduce bit; since --name-only only lists one of the two filenames and doesn't provide any hint that the change involved is a rename, it felt to me like using --name-only would make the test not really be a rename test.
Elijah Newren <newren@gmail.com> writes: >> Heh, I was hoping that we should be able to use "diff --name-only". >> >> $ git mv Makefile Breakfile >> $ git diff --name-only -M HEAD >> Breakfile >> $ git reset --hard >> $ git rm Makefile >> $ >Breakfile && git add Breakfile >> $ git diff --name-only -M HEAD >> Breakfile >> Makefile >> $ git reset --hard > > I guess we could, since the only thing in this repository is a file > which is being renamed, and we can then deduce based on the setup that > the change we expected must have happened. > > However, I didn't like the deduce bit; since --name-only only lists > one of the two filenames and doesn't provide any hint that the change > involved is a rename, it felt to me like using --name-only would make > the test not really be a rename test. Hmph, I am not quite seeing what you are saying. If the "mv" from Makefile to Breakfile in the above example is between preimage and postimage that are similar enough, then we will see "one" paths, i.e. the file in the "after" side of the diff. But if the "mv" from Makefile to Breakfile involves too large a content change (like, say, going from 3800+ lines to an empty file, in the second example above), then because such a change from Makefile to Breakfile is too dissimilar, we do not judge it as "renamed" and show "two" paths. I do not quite see where we need to "deduce".
On Thu, Jan 18, 2024 at 7:06 PM Junio C Hamano <gitster@pobox.com> wrote: > > Elijah Newren <newren@gmail.com> writes: > > >> Heh, I was hoping that we should be able to use "diff --name-only". > >> > >> $ git mv Makefile Breakfile > >> $ git diff --name-only -M HEAD > >> Breakfile > >> $ git reset --hard > >> $ git rm Makefile > >> $ >Breakfile && git add Breakfile > >> $ git diff --name-only -M HEAD > >> Breakfile > >> Makefile > >> $ git reset --hard > > > > I guess we could, since the only thing in this repository is a file > > which is being renamed, and we can then deduce based on the setup that > > the change we expected must have happened. > > > > However, I didn't like the deduce bit; since --name-only only lists > > one of the two filenames and doesn't provide any hint that the change > > involved is a rename, it felt to me like using --name-only would make > > the test not really be a rename test. > > Hmph, I am not quite seeing what you are saying. If the "mv" from > Makefile to Breakfile in the above example is between preimage and > postimage that are similar enough, then we will see "one" paths, > i.e. the file in the "after" side of the diff. But if the "mv" from > Makefile to Breakfile involves too large a content change (like, > say, going from 3800+ lines to an empty file, in the second example > above), then because such a change from Makefile to Breakfile is too > dissimilar, we do not judge it as "renamed" and show "two" paths. I > do not quite see where we need to "deduce". You just wrote a very well worded paragraph going through the reasoning involved to prove that a rename is involved. You reasoned, or deduced, the necessary conclusion quite well. Sure, it might be a simple deduction given the knowledge of the test setup, but it's still a deduction. But perhaps I can put it another way: You can't just look at the output of `diff --name-only` and say a rename was involved -- unless you know the test setup and the previous operations. In fact, how about a possibly contrived alternate scenario: What if `git mv $1 $2` had a bug where, after doing the expected work, it did the equivalent of running `git checkout HEAD -- $1` at the end of its operation. Then we'd see: $ <tweak Makefile slightly> $ git mv Makefile Breakfile $ git diff --name-only -M HEAD Breakfile i.e. we get the same output as without the git-mv bug (note that Makefile will not be listed since it is unmodified), but with the bug in git-mv there definitely is no rename involved (since there's no delete to pair with the addition of Breakfile). As such, we'd still pass the test despite there being no rename. Sure, the example is somewhat contrived, but I'm just saying that the --name-only output doesn't actually test or prove that a rename occurred. And since this test, and all other tests in this particular testfile, are specifically about renames, the fact that we aren't specifically testing for something being renamed feels odd to me. If you still like `diff --name-only` better anyway, that's fine and I'll switch it. I'm just stating why it seems suboptimal to me.
Elijah Newren <newren@gmail.com> writes: > But perhaps I can put it another way: You can't just look at the > output of `diff --name-only` and say a rename was involved -- unless > you know the test setup and the previous operations. That is true. But that is exactly what a test is about. You have this and that file, and you do this operation, now what should happen? Does the observation match the expectation? That is what our tests are done. And your argument should not have to rely on a bug in "git mv". After all, you should be able to do the same with "mv A B && git add B && git add -u" (or "git rm -f A") and you won't be affected by such a bug. > If you still like `diff --name-only` better anyway, that's fine and > I'll switch it. I'm just stating why it seems suboptimal to me. I'd prefer to omit "sed" involved, but I'd even more prefer not waste more time on the test. As long as we can make a robust test (which an extra process running sed would certainly give us), I'd be fine. Thanks.
diff --git a/diffcore-delta.c b/diffcore-delta.c index c30b56e983b..7136c3dd203 100644 --- a/diffcore-delta.c +++ b/diffcore-delta.c @@ -159,6 +159,10 @@ static struct spanhash_top *hash_chars(struct repository *r, n = 0; accum1 = accum2 = 0; } + if (n > 0) { + hashval = (accum1 + accum2 * 0x61) % HASHBASE; + hash = add_spanhash(hash, hashval, n); + } QSORT(hash->data, (size_t)1ul << hash->alloc_log2, spanhash_cmp); return hash; } diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh index 85be1367de6..29299acbce7 100755 --- a/t/t4001-diff-rename.sh +++ b/t/t4001-diff-rename.sh @@ -286,4 +286,23 @@ test_expect_success 'basename similarity vs best similarity' ' test_cmp expected actual ' +test_expect_success 'last line matters too' ' + test_write_lines a 0 1 2 3 4 5 6 7 8 9 >nonewline && + printf "git ignores final up to 63 characters if not newline terminated" >>nonewline && + git add nonewline && + git commit -m "original version of file with no final newline" && + + # Change ONLY the first character of the whole file + test_write_lines b 0 1 2 3 4 5 6 7 8 9 >nonewline && + printf "git ignores final up to 63 characters if not newline terminated" >>nonewline && + git add nonewline && + git mv nonewline still-no-newline && + git commit -a -m "rename nonewline -> still-no-newline" && + git diff-tree -r -M01 --name-status HEAD^ HEAD >actual && + cat >expected <<-\EOF && + R097 nonewline still-no-newline + EOF + test_cmp expected actual +' + test_done