diff mbox series

[1/2] diff: Fix modified lines stats with --stat and --numstat

Message ID 20200918113256.8699-2-tguyot@gmail.com (mailing list archive)
State Superseded
Headers show
Series [1/2] diff: Fix modified lines stats with --stat and --numstat | expand

Commit Message

Thomas Guyot Sept. 18, 2020, 11:32 a.m. UTC
In builtin_diffstat(), when both files are coming from "stdin" (which
could be better described as the file's content being written directly
into the file object), oideq() compares two null hashes and ignores the
actual differences for the statistics.

This patch checks if is_stdin flag is set on both sides and compare
contents directly.

Signed-off-by: Thomas Guyot-Sionnest <tguyot@gmail.com>
---
 diff.c                | 5 ++++-
 t/t3206-range-diff.sh | 8 ++++----
 2 files changed, 8 insertions(+), 5 deletions(-)

Comments

Taylor Blau Sept. 18, 2020, 2:46 p.m. UTC | #1
On Fri, Sep 18, 2020 at 07:32:55AM -0400, Thomas Guyot-Sionnest wrote:
> In builtin_diffstat(), when both files are coming from "stdin" (which
> could be better described as the file's content being written directly
> into the file object), oideq() compares two null hashes and ignores the
> actual differences for the statistics.
>
> This patch checks if is_stdin flag is set on both sides and compare
> contents directly.
>
> Signed-off-by: Thomas Guyot-Sionnest <tguyot@gmail.com>
> ---
>  diff.c                | 5 ++++-
>  t/t3206-range-diff.sh | 8 ++++----
>  2 files changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/diff.c b/diff.c
> index a5114fa864..2995527896 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -3681,7 +3681,10 @@ static void builtin_diffstat(const char *name_a, const char *name_b,
>  		return;
>  	}
>
> -	same_contents = oideq(&one->oid, &two->oid);
> +	if (one->is_stdin && two->is_stdin)
> +		same_contents = !strcmp(one->data, two->data);

Hmm. A couple of thoughts here:

  - strcmp seems like a slow-down here, since we'll have to go through
    at worst the smaller of one->data and two->data to compare each of
    them.

  - strcmp is likely not the right way to do that, since we could be
    diffing binary content, in which case we'd want to continue over
    NULs and instead stop at a fixed length (the minimum of the length
    of one->data and two->data, specifically). I'd have expected memcmp
    here instead.

  - Why do we have to do this at all all the way up in
    'builtin_diffstat'? I would expect these to contain the right
    OIDs by the time they are given back to us from the call to
    'diff_fill_oid_info' in 'run_diffstat'.

So, my last point is the most important of the three. I'd expect
something more along the lines of:

  1. diff_fill_oid_info resolve the link to the pipe, and
  2. index_path handles the resolved fd.

...but it looks like that is already what it's doing? I'm confused why
this doesn't work as-is.

> +	else
> +		same_contents = oideq(&one->oid, &two->oid);
>
>  	if (diff_filespec_is_binary(o->repo, one) ||
>  	    diff_filespec_is_binary(o->repo, two)) {
> diff --git a/t/t3206-range-diff.sh b/t/t3206-range-diff.sh
> index e024cff65c..4715e75b68 100755
> --- a/t/t3206-range-diff.sh
> +++ b/t/t3206-range-diff.sh
> @@ -258,11 +258,11 @@ test_expect_success 'changed commit with --stat diff option' '
>  	     a => b | 0
>  	     1 file changed, 0 insertions(+), 0 deletions(-)
>  	3:  $(test_oid t3) ! 3:  $(test_oid c3) s/11/B/
> -	     a => b | 0
> -	     1 file changed, 0 insertions(+), 0 deletions(-)
> +	     a => b | 2 +-
> +	     1 file changed, 1 insertion(+), 1 deletion(-)
>  	4:  $(test_oid t4) ! 4:  $(test_oid c4) s/12/B/
> -	     a => b | 0
> -	     1 file changed, 0 insertions(+), 0 deletions(-)
> +	     a => b | 2 +-
> +	     1 file changed, 1 insertion(+), 1 deletion(-)

The tests definitely demonstrate that the old behavior was wrong,
though...

Thanks,
Taylor
Thomas Guyot Sept. 18, 2020, 3:10 p.m. UTC | #2
On Fri, 18 Sep 2020 at 10:46, Taylor Blau <me@ttaylorr.com> wrote:
>
> On Fri, Sep 18, 2020 at 07:32:55AM -0400, Thomas Guyot-Sionnest wrote:
> > -     same_contents = oideq(&one->oid, &two->oid);
> > +     if (one->is_stdin && two->is_stdin)
> > +             same_contents = !strcmp(one->data, two->data);
>
> Hmm. A couple of thoughts here:
>
>   - strcmp seems like a slow-down here, since we'll have to go through
>     at worst the smaller of one->data and two->data to compare each of
>     them.
>
>   - strcmp is likely not the right way to do that, since we could be
>     diffing binary content, in which case we'd want to continue over
>     NULs and instead stop at a fixed length (the minimum of the length
>     of one->data and two->data, specifically). I'd have expected memcmp
>     here instead.
>

You're absolutely right - this is a bug I managed to figure out last
night - first real incursion into git C code - and I definitely didn't
think this through. TBH so far I coded mostly with tools dealing in
plaintext and C strings.

>   - Why do we have to do this at all all the way up in
>     'builtin_diffstat'? I would expect these to contain the right
>     OIDs by the time they are given back to us from the call to
>     'diff_fill_oid_info' in 'run_diffstat'.
>
> So, my last point is the most important of the three. I'd expect
> something more along the lines of:
>
>   1. diff_fill_oid_info resolve the link to the pipe, and
>   2. index_path handles the resolved fd.
>
> ...but it looks like that is already what it's doing? I'm confused why
> this doesn't work as-is.

So the idea is to checksum the data and write a valid oid. I'll see if
I can figure that out. Thanks for the hint though else I would likely
have gone with a buffer and memcmp. Your solution seems cleaner, and
there is a few other uses of oideq's that look dubious at best with
the case of null oids / buffered data so it's definitely a better
approach.

> > +     else
> > +             same_contents = oideq(&one->oid, &two->oid);
> >
> >       if (diff_filespec_is_binary(o->repo, one) ||
> >           diff_filespec_is_binary(o->repo, two)) {
> > diff --git a/t/t3206-range-diff.sh b/t/t3206-range-diff.sh
> > index e024cff65c..4715e75b68 100755
> > --- a/t/t3206-range-diff.sh
> > +++ b/t/t3206-range-diff.sh
> > @@ -258,11 +258,11 @@ test_expect_success 'changed commit with --stat diff option' '
> >            a => b | 0
> >            1 file changed, 0 insertions(+), 0 deletions(-)
> >       3:  $(test_oid t3) ! 3:  $(test_oid c3) s/11/B/
> > -          a => b | 0
> > -          1 file changed, 0 insertions(+), 0 deletions(-)
> > +          a => b | 2 +-
> > +          1 file changed, 1 insertion(+), 1 deletion(-)
> >       4:  $(test_oid t4) ! 4:  $(test_oid c4) s/12/B/
> > -          a => b | 0
> > -          1 file changed, 0 insertions(+), 0 deletions(-)
> > +          a => b | 2 +-
> > +          1 file changed, 1 insertion(+), 1 deletion(-)
>
> The tests definitely demonstrate that the old behavior was wrong,
> though...
>

For the records I verified the actual diffs (I think they're even
hardcoded in the earlier tests) and the remaining 0-add/del's are also
valid.

Regards,

Thomas
Jeff King Sept. 18, 2020, 5:27 p.m. UTC | #3
On Fri, Sep 18, 2020 at 07:32:55AM -0400, Thomas Guyot-Sionnest wrote:

> In builtin_diffstat(), when both files are coming from "stdin" (which
> could be better described as the file's content being written directly
> into the file object), oideq() compares two null hashes and ignores the
> actual differences for the statistics.
> 
> This patch checks if is_stdin flag is set on both sides and compare
> contents directly.

I'm somewhat puzzled how we could have two filespecs that came from
stdin, since we'd generally read to EOF. But looking at the test, it
seems this is a weird range-diff hack to set is_stdin.

Looking at your patch:

> diff --git a/diff.c b/diff.c
> index a5114fa864..2995527896 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -3681,7 +3681,10 @@ static void builtin_diffstat(const char *name_a, const char *name_b,
>  		return;
>  	}
>  
> -	same_contents = oideq(&one->oid, &two->oid);
> +	if (one->is_stdin && two->is_stdin)
> +		same_contents = !strcmp(one->data, two->data);
> +	else
> +		same_contents = oideq(&one->oid, &two->oid);

...should this actually be checking the oid_valid flag in each filespec?
That would presumably cover the is_stdin case, too. I also wonder
whether range-diff ought to be using that flag instead of is_stdin.

-Peff
Jeff King Sept. 18, 2020, 5:37 p.m. UTC | #4
On Fri, Sep 18, 2020 at 11:10:45AM -0400, Thomas Guyot-Sionnest wrote:

> > So, my last point is the most important of the three. I'd expect
> > something more along the lines of:
> >
> >   1. diff_fill_oid_info resolve the link to the pipe, and
> >   2. index_path handles the resolved fd.
> >
> > ...but it looks like that is already what it's doing? I'm confused why
> > this doesn't work as-is.
> 
> So the idea is to checksum the data and write a valid oid. I'll see if
> I can figure that out. Thanks for the hint though else I would likely
> have gone with a buffer and memcmp. Your solution seems cleaner, and
> there is a few other uses of oideq's that look dubious at best with
> the case of null oids / buffered data so it's definitely a better
> approach.

You're generally better off not to compute the oid just to compare two
buffers:

  - a byte-by-byte comparison can quit early as soon as it sees a
    difference, whereas computing two hashes has to cover each byte

  - even in the worst case that the byte comparison has to go all the
    way to the end, it's way faster than computing a sha1

So generally in the diff code we compare oids if we got them for free
(from the index, or from a tree), but otherwise it's OK to have
filespecs without the oid_valid flag set, and compare them byte-wise
when necessary. And there something like:

  if (one->size == two->size &&
      !memcmp(one->data, two->data, one->size))

is what you'd want.

Note that filespecs may not have their data or size loaded yet, though.
Looking at that part of builtin_diffstat(), I'm pretty sure that is
possible here (see how later code calls diff_populate_filespec() to make
sure it has data). OTOH, I guess if they're from stdin we'd always have
the data (since we'd have no oid to load from), so it might be OK under
that conditional.

-Peff
Thomas Guyot Sept. 18, 2020, 5:52 p.m. UTC | #5
On Fri, 18 Sep 2020 at 13:27, Jeff King <peff@peff.net> wrote:
> On Fri, Sep 18, 2020 at 07:32:55AM -0400, Thomas Guyot-Sionnest wrote:
> > This patch checks if is_stdin flag is set on both sides and compare
> > contents directly.
>
> I'm somewhat puzzled how we could have two filespecs that came from
> stdin, since we'd generally read to EOF. But looking at the test, it
> seems this is a weird range-diff hack to set is_stdin.

"is_stdin" is actually set manually by a function that copies stdin to
diff_filespec->data. We can get an arbitrary number of pipes from
command line arguments - only difference with stdin is that we have to
open them before read.

The flag seems to have been leveraged by diff-range - the first patch
fixes that tool alone, and 2nd adds support for multiple pipes in
--no-index. Both are independent but you would not be able to --stat
two pipes without the first patch.

> > diff --git a/diff.c b/diff.c
> > index a5114fa864..2995527896 100644
> > --- a/diff.c
> > +++ b/diff.c
> > @@ -3681,7 +3681,10 @@ static void builtin_diffstat(const char *name_a, const char *name_b,
> >               return;
> >       }
> >
> > -     same_contents = oideq(&one->oid, &two->oid);
> > +     if (one->is_stdin && two->is_stdin)
> > +             same_contents = !strcmp(one->data, two->data);
> > +     else
> > +             same_contents = oideq(&one->oid, &two->oid);
>
> ...should this actually be checking the oid_valid flag in each filespec?
> That would presumably cover the is_stdin case, too. I also wonder
> whether range-diff ought to be using that flag instead of is_stdin.

I considered that, but IIRC when run under a debugger oid_valid was
set to 0 - it seemed to be used for something different that i'm not
familiar with, maybe it's an indication the object is in git datastore
(whereas with --no-index outside files will only be hashed for
comparison).

I think is_stdin is a misnomer, but if we want to refactor that i'd
rather do it after.

Regards,

Thomas
Thomas Guyot Sept. 18, 2020, 6 p.m. UTC | #6
On Fri, 18 Sep 2020 at 13:37, Jeff King <peff@peff.net> wrote:
> On Fri, Sep 18, 2020 at 11:10:45AM -0400, Thomas Guyot-Sionnest wrote:
>
>   if (one->size == two->size &&
>       !memcmp(one->data, two->data, one->size))
>
> is what you'd want.
>

I think the other approach has its merits too - AFAIK if you run this
from a git repo and one of the files is tracked by it (or even both if
you compare two files within the repo) the oid will be readily
available and usable if the file hasn't been modified. If there is no
big objection I could stick with the hybrid approach, using memcmp of
course. - it's also the easiest fix.

--
Thomas
Junio C Hamano Sept. 18, 2020, 6:06 p.m. UTC | #7
Thomas Guyot-Sionnest <tguyot@gmail.com> writes:

>> > -     same_contents = oideq(&one->oid, &two->oid);
>> > +     if (one->is_stdin && two->is_stdin)
>> > +             same_contents = !strcmp(one->data, two->data);
>> > +     else
>> > +             same_contents = oideq(&one->oid, &two->oid);
>>
>> ...should this actually be checking the oid_valid flag in each filespec?
>> That would presumably cover the is_stdin case, too. I also wonder
>> whether range-diff ought to be using that flag instead of is_stdin.
>
> I considered that, but IIRC when run under a debugger oid_valid was
> set to 0 - it seemed to be used for something different that i'm not
> familiar with, maybe it's an indication the object is in git datastore
> (whereas with --no-index outside files will only be hashed for
> comparison).

If it says !oid_valid, I think you are getting what you do want.
The contents from the outside world, be it what was read from the
standard input or a pipe, a regular file that is not up-to-date with
the index, may not have a usable oid computed for it, and oid_valid
being false signals you that you need byte-for-byte comparison.  As
suggested by Peff in another message, you can take that signal and
compare the size and then the contents with memcmp() to see if they
are the same.
Thomas Guyot Sept. 20, 2020, 4:53 a.m. UTC | #8
Hi... Added Jeff as he got involved later and comments below are
relevant to his questions.

On 2020-09-18 11:10, Thomas Guyot-Sionnest wrote:
> On Fri, 18 Sep 2020 at 10:46, Taylor Blau <me@ttaylorr.com> wrote:
>>
>>   - Why do we have to do this at all all the way up in
>>     'builtin_diffstat'? I would expect these to contain the right
>>     OIDs by the time they are given back to us from the call to
>>     'diff_fill_oid_info' in 'run_diffstat'.
>>
>> So, my last point is the most important of the three. I'd expect
>> something more along the lines of:
>>
>>   1. diff_fill_oid_info resolve the link to the pipe, and
>>   2. index_path handles the resolved fd.
>>
>> ...but it looks like that is already what it's doing? I'm confused why
>> this doesn't work as-is.
> 
> So the idea is to checksum the data and write a valid oid. I'll see if
> I can figure that out. Thanks for the hint though else I would likely
> have gone with a buffer and memcmp. Your solution seems cleaner, and
> there is a few other uses of oideq's that look dubious at best with
> the case of null oids / buffered data so it's definitely a better
> approach.
> 

After looking further at the code I understand your point, although
pipes can only ever be read once, so even if we do that we would have to
buffer on first read. It appears the files are first read by
diff_populate_filespec() - builtin_diffstat isn't even called if the
files match (even for two pipes).

Jeff, on your suggestion to compare size, the size is set even if data
is null. Files in-tree appears to be mmapped on demand for reads.

diff_fill_oid_info explicitly resets oids for is_stdin and return, and
if the file's been read already and it's a pipe, we would *have* to have
buffered the data already so I don't really see what else we can do
besides memcmp() (technically we should be able to tell if the files
have been modified at this point but apparently that information isn't
transmitted to builtin_diffstat - it's assumed and I won't make complex
change for that odd case of diffing two pipes. That's what I have now:

    /* What is_stdin really means is that the file's content is only
     * in the filespec's buffer and its oid is zero. We can't compare
     * oid's if both are null and we can just diff the buffers */
    if (one->is_stdin && two->is_stdin)
        same_contents = (one->size == two->size ?
            !memcmp(one->data, two->data, one->size) : 0);
    else
        same_contents = oideq(&one->oid, &two->oid);


Even when we implement the --literally switch, considering we can't
guarantee a single read per file, for now I'd keep using the is_stdin
flag as an indication of in-memory data, and we'll have to read in all
pipes we diff (like earlier patch). It could be a concern if we
--literally diff a whole subtree of large pipes. In that case the only
fix I can see is to reorder the operations to generate the stats on each
file diff (or at least keep the diffs around for the stats pass).


Regards,

Thomas
Johannes Schindelin Sept. 23, 2020, 3:05 p.m. UTC | #9
Hi Peff,

On Fri, 18 Sep 2020, Jeff King wrote:

> I also wonder whether range-diff ought to be using that flag
> [`oid_valid`] instead of is_stdin.

From `diffcore.h`:

        unsigned oid_valid : 1;  /* if true, use oid and trust mode;
                                  * if false, use the name and read from
                                  * the filesystem.
                                  */

That description leads me to believe that `oid_valid` cannot be used here:
we do _not_ want to read any data from the file system in `range-diff.c`'s
`get_filespec()` function; Instead, we want to use the data provided via
the function parameter `p`.

Ciao,
Dscho
Johannes Schindelin Sept. 23, 2020, 7:16 p.m. UTC | #10
Hi,

On Fri, 18 Sep 2020, Junio C Hamano wrote:

> Thomas Guyot-Sionnest <tguyot@gmail.com> writes:
>
> >> > -     same_contents = oideq(&one->oid, &two->oid);
> >> > +     if (one->is_stdin && two->is_stdin)
> >> > +             same_contents = !strcmp(one->data, two->data);
> >> > +     else
> >> > +             same_contents = oideq(&one->oid, &two->oid);
> >>
> >> ...should this actually be checking the oid_valid flag in each filespec?
> >> That would presumably cover the is_stdin case, too. I also wonder
> >> whether range-diff ought to be using that flag instead of is_stdin.
> >
> > I considered that, but IIRC when run under a debugger oid_valid was
> > set to 0 - it seemed to be used for something different that i'm not
> > familiar with, maybe it's an indication the object is in git datastore
> > (whereas with --no-index outside files will only be hashed for
> > comparison).
>
> If it says !oid_valid, I think you are getting what you do want.

I suspect the same.

> The contents from the outside world, be it what was read from the
> standard input or a pipe, a regular file that is not up-to-date with
> the index, may not have a usable oid computed for it, and oid_valid
> being false signals you that you need byte-for-byte comparison.  As
> suggested by Peff in another message, you can take that signal and
> compare the size and then the contents with memcmp() to see if they
> are the same.

To complete the information: `struct diff_filespec`'s first attribute is
`oid`, the object ID of the data. If it is left uninitialized (as is the
case in `range-diff`'s case), `oid_valid` has to be 0 to prevent it from
being used.

I believe that that is exactly the reason why we want this:

-	same_contents = oideq(&one->oid, &two->oid);
+	same_contents = one->oid_valid && two->oid_valid ?
		oideq(&one->oid, &two->oid) : !strcmp(one->data, two->data);

Ciao,
Dscho
Junio C Hamano Sept. 23, 2020, 7:23 p.m. UTC | #11
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> I believe that that is exactly the reason why we want this:
>
> -	same_contents = oideq(&one->oid, &two->oid);
> +	same_contents = one->oid_valid && two->oid_valid ?
> 		oideq(&one->oid, &two->oid) : !strcmp(one->data, two->data);

Not quite.  The other side should either be

	one->size == two->size && !memcmp(...)

or just left to false, as the downstream code must be prepared for
same_contents being false even when one and two turns out to be
not-byte-for-byte-same but equivalent anyway.
Johannes Schindelin Sept. 23, 2020, 8:44 p.m. UTC | #12
Hi Junio,

On Wed, 23 Sep 2020, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> > I believe that that is exactly the reason why we want this:
> >
> > -	same_contents = oideq(&one->oid, &two->oid);
> > +	same_contents = one->oid_valid && two->oid_valid ?
> > 		oideq(&one->oid, &two->oid) : !strcmp(one->data, two->data);
>
> Not quite.  The other side should either be
>
> 	one->size == two->size && !memcmp(...)

Right!

Thank you for correcting my mistake,
Dscho

>
> or just left to false, as the downstream code must be prepared for
> same_contents being false even when one and two turns out to be
> not-byte-for-byte-same but equivalent anyway.
Thomas Guyot Sept. 24, 2020, 4:49 a.m. UTC | #13
Hi,

On 2020-09-23 16:44, Johannes Schindelin wrote:
> Hi Junio,
> 
> On Wed, 23 Sep 2020, Junio C Hamano wrote:
> 
>> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>>
>>> I believe that that is exactly the reason why we want this:
>>>
>>> -	same_contents = oideq(&one->oid, &two->oid);
>>> +	same_contents = one->oid_valid && two->oid_valid ?
>>> 		oideq(&one->oid, &two->oid) : !strcmp(one->data, two->data);
>>
>> Not quite.  The other side should either be
>>
>> 	one->size == two->size && !memcmp(...)

Thanks for all the feedback, that was enlightening. Although I have been
silent the past few days I watched this thread with interest.


So as Junio pointed out this is merely an optimization - the range-diff
test that I corrected also showed two 0-line diffs and I realized
there's a block further down that should explicitly removes them, under

    else if (!same_contents) {

We can even remove same_contents entirely and everything work just fine
after adjusting the range-diff test - the logic is correct and
underlying functions already DTRT.


My next patch simplifies the test down to:

    same_contents = one->oid_valid && two->oid_valid &&
        oideq(&one->oid, &two->oid);

My understanding is that oid_valid will never be true for a modified
(even mode change) or out of tree file so it's a valid assumption.

I'll also rename that variable to "same_oid" - the current name is
misleading both ways (true doesn't means there will be diffs, false
doesn't mean contents differs).

Regards,

Thomas
Junio C Hamano Sept. 24, 2020, 6:40 a.m. UTC | #14
Thomas Guyot <tguyot@gmail.com> writes:

> My next patch simplifies the test down to:
>
>     same_contents = one->oid_valid && two->oid_valid &&
>         oideq(&one->oid, &two->oid);
>
> My understanding is that oid_valid will never be true for a modified
> (even mode change) or out of tree file so it's a valid assumption.
>
> I'll also rename that variable to "same_oid" - the current name is
> misleading both ways (true doesn't means there will be diffs, false
> doesn't mean contents differs).

It is not "both ways", I think.  The idea is that when this variable
is true, we know with certainty that these two are the same, but
even when the variable is false, they still can be the same.  So
true does mean there will not be diff.  False indeed is fuzzy.

And as long as one side gives a 100% correct answer cheaply, we can
use it as an optimization (and 'true' being that side in this case).

I have a mild suspicion that the name same_anything conveys a wrong
impression, no matter what word you use for <anything>.  It does not
capture that we are saying the "true" side has no false positive.

And that is why I alluded to "may_differ" earlier (with opposite
polarity).  The flow would become:

    may_differ = !one->oid_valid || !two->oid_valid || !oideq(...);

    if (binary) {
        if (!may_differ) {
            added = deleted = 0;
            ...
        } else {
            ... count added and deleted ...
        }
    } else if (rewrite) {
	...
    } else if (may_differ) {
	... use xdl ...
    }

and it would become quite straight-forward to follow.  When there is
no chance that they may be different, we short-cut and otherwise we
compute without cheating.  Only when they can be different, we do
the expensive xdl thing.

Thanks.
Thomas Guyot Sept. 24, 2020, 7:13 a.m. UTC | #15
Hi Junio,

On 2020-09-24 02:40, Junio C Hamano wrote:
> Thomas Guyot <tguyot@gmail.com> writes:
> 
> It is not "both ways", I think.  The idea is that when this variable
> is true, we know with certainty that these two are the same, but
> even when the variable is false, they still can be the same.  So
> true does mean there will not be diff.  False indeed is fuzzy.

I meant to say the old behavior "lied" in both directions.

> And as long as one side gives a 100% correct answer cheaply, we can
> use it as an optimization (and 'true' being that side in this case).
> 
> I have a mild suspicion that the name same_anything conveys a wrong
> impression, no matter what word you use for <anything>.  It does not
> capture that we are saying the "true" side has no false positive.
> 
> And that is why I alluded to "may_differ" earlier (with opposite
> polarity).  The flow would become:
> 
>     may_differ = !one->oid_valid || !two->oid_valid || !oideq(...);
> 
>     if (binary) {
>         if (!may_differ) {
>             added = deleted = 0;
>             ...
>         } else {
>             ... count added and deleted ...
>         }
>     } else if (rewrite) {
> 	...
>     } else if (may_differ) {
> 	... use xdl ...
>     }
> 
> and it would become quite straight-forward to follow.  When there is
> no chance that they may be different, we short-cut and otherwise we
> compute without cheating.  Only when they can be different, we do
> the expensive xdl thing.

I toyed a bit on the binary side... I never sent my 2nd reply as I still
needed to dig up; testing with diff_filespec_is_binary() { return 1; } I
would get (as expected) the same false-positive modified binary files I
used to get in range-diff test.

What I didn't get is applying the same logic
(free_diffstat_file(file);diffstat->nr--;) didn't have any effect. I'll
have to find what differs here to make binary files how up regardless.

Regards,

Thomas
Junio C Hamano Sept. 24, 2020, 5:19 p.m. UTC | #16
Thomas Guyot <tguyot@gmail.com> writes:

> Hi Junio,
>
> On 2020-09-24 02:40, Junio C Hamano wrote:
>> Thomas Guyot <tguyot@gmail.com> writes:
>> 
>> It is not "both ways", I think.  The idea is that when this variable
>> is true, we know with certainty that these two are the same, but
>> even when the variable is false, they still can be the same.  So
>> true does mean there will not be diff.  False indeed is fuzzy.
>
> I meant to say the old behavior "lied" in both directions.

It depends on the perspective ;-)

The old one didn't expect/realize fill_oid_info() can leave the oid
field to "unknown".  It was OK because is_stdin happens to be the
only such case [*1*] and we never saw both one->oid and two->oid
being the null_oid at the same time, so it wasn't an issue that
their validity weren't checked there.  As long as one side was
valid, when the comparison said they were equal, they indeed were
equal.  So in that sense, the true side did not lie.

If you add new case where the oid of both sides can legitimately be
null_oid, that will of course break the code.  I think that is the
reason why we are having this discussion to prepare for such a
future (that happens in 2/2???).


[Footnote]

*1* The missing side of addition and deletion will also get null_oid,
but we don't compare that with stdin in such a case.
Junio C Hamano Sept. 24, 2020, 5:38 p.m. UTC | #17
Junio C Hamano <gitster@pobox.com> writes:

> If you add new case where the oid of both sides can legitimately be
> null_oid, that will of course break the code.

Ahh, I forgot that we already had such an iffy caller that broke the
code.  I should have re-read the test part of the patch.

Thanks.
diff mbox series

Patch

diff --git a/diff.c b/diff.c
index a5114fa864..2995527896 100644
--- a/diff.c
+++ b/diff.c
@@ -3681,7 +3681,10 @@  static void builtin_diffstat(const char *name_a, const char *name_b,
 		return;
 	}
 
-	same_contents = oideq(&one->oid, &two->oid);
+	if (one->is_stdin && two->is_stdin)
+		same_contents = !strcmp(one->data, two->data);
+	else
+		same_contents = oideq(&one->oid, &two->oid);
 
 	if (diff_filespec_is_binary(o->repo, one) ||
 	    diff_filespec_is_binary(o->repo, two)) {
diff --git a/t/t3206-range-diff.sh b/t/t3206-range-diff.sh
index e024cff65c..4715e75b68 100755
--- a/t/t3206-range-diff.sh
+++ b/t/t3206-range-diff.sh
@@ -258,11 +258,11 @@  test_expect_success 'changed commit with --stat diff option' '
 	     a => b | 0
 	     1 file changed, 0 insertions(+), 0 deletions(-)
 	3:  $(test_oid t3) ! 3:  $(test_oid c3) s/11/B/
-	     a => b | 0
-	     1 file changed, 0 insertions(+), 0 deletions(-)
+	     a => b | 2 +-
+	     1 file changed, 1 insertion(+), 1 deletion(-)
 	4:  $(test_oid t4) ! 4:  $(test_oid c4) s/12/B/
-	     a => b | 0
-	     1 file changed, 0 insertions(+), 0 deletions(-)
+	     a => b | 2 +-
+	     1 file changed, 1 insertion(+), 1 deletion(-)
 	EOF
 	test_cmp expect actual
 '