mbox series

[v6,0/6] blame: add the ability to ignore commits

Message ID 20190410162409.117264-1-brho@google.com (mailing list archive)
Headers show
Series blame: add the ability to ignore commits | expand

Message

Barret Rhoden April 10, 2019, 4:24 p.m. UTC
This patch set adds the ability to ignore a set of commits and their
changes when blaming.  This can be used to ignore a commit deemed 'not
interesting,' such as reformatting.

The last patch in the series changes the heuristic by which ignored
lines are attributed to specific lines in the parent commit.  This
increases the likelihood of blaming the 'right' commit, where 'right' is
subjective.  Perhaps we want another algorithm.  I'm using a relatively
simple one that uses the basics of Michael's fingerprinting code, but he
has another algorithm.

v5 -> v6
v5: https://public-inbox.org/git/20190403160207.149174-1-brho@google.com/
- The "guess" heuristic can now look anywhere in the parent file for a
  matching line, instead of just looking in the parent chunk.  The
  chunks passed to blame_chunk() are smaller than you'd expect: they are
  just adjacent '-' and '+' sections.  Any diff 'context' is a chunk
  boundary.
- Fixed the parent_len calculation.  I had been basing it off of
  e->num_lines, and treating the blame entry as if it was the target
  chunk, but the individual blame entries are subsets of the chunk.  I
  just pass the parent chunk info all the way through now.
- Use Michael's newest fingerprinting code, which is a large speedup.
- Made a config option to zero the hash for an ignored line when the
  heuristic could not find a line in the parent to blame.  Previously,
  this was always 'on'.
- Moved the for loop variable declarations out of the for ().
- Rebased on master.

v4 -> v5
v4: https://public-inbox.org/git/20190226170648.211847-1-brho@google.com/
- Changed the handling of blame_entries from ignored commits so that you
  can use any algorithm you want to map lines from the diff chunk to
  different parts of the parent commit.
- fill_origin_blob() optionally can track the offsets of the start of
  every line, similar to what we do in the scoreboard for the final
  file.  This can be used by the matching algorithm.  It has no effect
  if you are not ignoring commits.
- RFC of a fuzzy/fingerprinting heuristic, based on Michael Platings RFC
  at https://public-inbox.org/git/20190324235020.49706-2-michael@platin.gs/
- Made the tests that detect unblamable entries more resilient to
  different heuristics.
- Fixed a few bugs:
	- tests were not grepping the line number from --line-porcelain
	  correctly.
	- In the old version, when I passed the "upper" part of the
	  blame entry to the target and marked unblamable, the suspect
	  was incorrectly marked as the parent.  The s_lno was also in
	  the parent's address space.

v3 -> v4
v3: https://public-inbox.org/git/20190212222722.240676-1-brho@google.com/
- Cleaned up the tests, especially removing usage of sed -i.
- Squashed the 'tests' commit into the other blame commits.  Let me know
  if you'd like further squashing.

v2 -> v3
v2: https://public-inbox.org/git/20190117202919.157326-1-brho@google.com/
- SHA-1 -> "object name", and fixed other comments
- Changed error string for oidset_parse_file()
- Adjusted existing fsck tests to handle those string changes
- Return hash of all zeros for lines we know we cannot identify
- Allow repeated options for blame.ignoreRevsFile and
  --ignore-revs-file.  An empty file name resets the list.  Config
  options are parsed before the command line options.
- Rebased to master
- Added regression tests

v1 -> v2
v1: https://public-inbox.org/git/20190107213013.231514-1-brho@google.com/
- extracted the skiplist from fsck to avoid duplicating code
- overhauled the interface and options
- split out markIgnoredFiles
- handled merges


Barret Rhoden (6):
  Move init_skiplist() outside of fsck
  blame: use a helper function in blame_chunk()
  blame: add the ability to ignore commits and their changes
  blame: add config options to handle output for ignored lines
  blame: optionally track line fingerprints during fill_blame_origin()
  blame: use a fingerprint heuristic to match ignored lines

 Documentation/blame-options.txt |  18 ++
 Documentation/config/blame.txt  |  16 ++
 Documentation/git-blame.txt     |   1 +
 blame.c                         | 446 ++++++++++++++++++++++++++++----
 blame.h                         |   6 +
 builtin/blame.c                 |  56 ++++
 fsck.c                          |  37 +--
 oidset.c                        |  35 +++
 oidset.h                        |   8 +
 t/t5504-fetch-receive-strict.sh |  14 +-
 t/t8013-blame-ignore-revs.sh    | 202 +++++++++++++++
 11 files changed, 740 insertions(+), 99 deletions(-)
 create mode 100755 t/t8013-blame-ignore-revs.sh

Comments

Michael Platings April 14, 2019, 9:10 p.m. UTC | #1
Hi Barret,

This works pretty well for the typical reformatting use case now. I've
run it over every commit of every .c file in the git project root,
both forwards and backwards with every combination of -w/-M/-C and
can't get it to crash so I think it's good in that respect.

However, it can still attribute lines to the wrong parent line. See
https://pypi.org/project/autopep8/#usage for an example reformatting
that it gets a bit confused on. The patch I submitted handles this
case correctly because it uses information about the more similar
lines to decide how more ambiguous lines should be matched.

You also gave an example of:

        commit-a 11) void new_func_1(void *x, void *y);
        commit-b 12) void new_func_2(void *x, void *y);

Being reformatted to:

        commit-a 11) void new_func_1(void *x,
        commit-b 12)                 void *y);
        commit-b 13) void new_func_2(void *x,
        commit-b 14)                 void *y);

The patch I submitted handles this case correctly, assigning line 12
to commit-a because it scales the parent line numbers according to the
relative diff chunk sizes instead of assuming a 1-1 mapping.

So I do ask that you incorporate more of my patch, including the test
code. It is more complex but I hope this demonstrates that there are
reasons for that. Happy to provide more examples or explanation if it
would help. On the other hand if you have examples where it falls
short then I'd be interested to know.

The other major use case that I'm interested in is renaming. In this
case, the git-hyper-blame approach of mapping line numbers 1-1 works
perfectly. Here's an example. Before:

        commit-a 11) Position MyClass::location(Offset O) {
        commit-b 12)    return P + O;
        commit-c 13) }

After:

        commit-a 11) Position MyClass::location(Offset offset) {
        commit-a 12)    return position + offset;
        commit-c 13) }

With the fuzzy matching, line 12 gets incorrectly matched to parent
line 11 because the similarity of "position" and "offset" outweighs
the similarity of "return". I'm considering adding even more
complexity to my patch such that parts of a line that have already
been matched can't be matched again by other lines.

But the other possibility is that we let the user choose the
heuristic. For a commit where they know that line numbers haven't
changed they could choose 1-1 matching, while for a reformatting
commit they could use fuzzy matching. I welcome your thoughts.

-Michael
Barret Rhoden April 15, 2019, 1:23 p.m. UTC | #2
Hi Michael -

On 4/14/19 5:10 PM, Michael Platings wrote:
> Hi Barret,
> 
> This works pretty well for the typical reformatting use case now. I've
> run it over every commit of every .c file in the git project root,
> both forwards and backwards with every combination of -w/-M/-C and
> can't get it to crash so I think it's good in that respect.
> 
> However, it can still attribute lines to the wrong parent line. See
> https://pypi.org/project/autopep8/#usage for an example reformatting
> that it gets a bit confused on. The patch I submitted handles this
> case correctly because it uses information about the more similar
> lines to decide how more ambiguous lines should be matched.

Yeah - I ran your tests against it and noticed a few cases weren't handled.

> You also gave an example of:
> 
>          commit-a 11) void new_func_1(void *x, void *y);
>          commit-b 12) void new_func_2(void *x, void *y);
> 
> Being reformatted to:
> 
>          commit-a 11) void new_func_1(void *x,
>          commit-b 12)                 void *y);
>          commit-b 13) void new_func_2(void *x,
>          commit-b 14)                 void *y);
> 
> The patch I submitted handles this case correctly, assigning line 12
> to commit-a because it scales the parent line numbers according to the
> relative diff chunk sizes instead of assuming a 1-1 mapping.
> 
> So I do ask that you incorporate more of my patch, including the test
> code. It is more complex but I hope this demonstrates that there are
> reasons for that. Happy to provide more examples or explanation if it
> would help. On the other hand if you have examples where it falls
> short then I'd be interested to know.

My main concerns:
- Can your version reach outside of a diff chunk?  such as in my "header 
moved" case.  That was a simplified version of something that pops up in 
a major file reformatting of mine, where a "return 0;" was matched as 
context and broke a diff chunk up into two blame_chunk() calls.  I tend 
to think of this as the "split diff chunk."

- Complexity and possibly performance.  The recursive stuff made me 
wonder about it a bit.  It's no reason not to use it, just need to check 
it more closely.

Is the latest version of your stuff still the one you posted last week 
or so?  If we had a patch applied onto this one with something like an 
ifdef or a dirt-simple toggle, we can play with both of them in the same 
codebase.

Similarly, do you think the "two pass" approach I have (check the chunk, 
then check the parent file) would work with your recursive partitioning 
style?  That might make yours able to handle the "split diff chunk" case.

> The other major use case that I'm interested in is renaming. In this
> case, the git-hyper-blame approach of mapping line numbers 1-1 works
> perfectly. Here's an example. Before:
> 
>          commit-a 11) Position MyClass::location(Offset O) {
>          commit-b 12)    return P + O;
>          commit-c 13) }
> 
> After:
> 
>          commit-a 11) Position MyClass::location(Offset offset) {
>          commit-a 12)    return position + offset;
>          commit-c 13) }
> 
> With the fuzzy matching, line 12 gets incorrectly matched to parent
> line 11 because the similarity of "position" and "offset" outweighs
> the similarity of "return". I'm considering adding even more
> complexity to my patch such that parts of a line that have already
> been matched can't be matched again by other lines.
> 
> But the other possibility is that we let the user choose the
> heuristic. For a commit where they know that line numbers haven't
> changed they could choose 1-1 matching, while for a reformatting
> commit they could use fuzzy matching. I welcome your thoughts.

No algorithm will work for all cases.  The one you just gave had the 
simple heuristic working better than a complex one.  We could make it 
more complex, but then another example may be worse.  I can live with 
some inaccuracy in exchange for simplicity.

I ran into something similar with the THRESHOLD #defines.  You want it 
to be able to match certain things, but not other things.  How similar 
does something have to be?  Should it depend on how far away the 
matching line is from the source line?  I went with a "close enough is 
good enough" approach, since we're marking with a '*' or something 
anyways, so the user should know to not trust it 100%.

Thanks,

Barret
Michael Platings April 15, 2019, 9:54 p.m. UTC | #3
> My main concerns:
> - Can your version reach outside of a diff chunk?

Currently no. It's optimised for reformatting and renaming, both of
which preserve ordering. I could look into allowing disordered matches
where the similarity is high, while still being biased towards ordered
matches. If you can post more examples that would be helpful.

> - Complexity and possibly performance.  The recursive stuff made me
> wonder about it a bit.  It's no reason not to use it, just need to check
> it more closely.

Complexity I can't deny, I can only mitigate it with
documentation/comments. I optimised the code pretty heavily and tested
on some contrived worst-case scenarios and the performance was still
good so I'm not worried about that.

> Is the latest version of your stuff still the one you posted last week
> or so?

Yes. But reaching outside the chunk might lead to a significantly
different API in the next version...

> Similarly, do you think the "two pass" approach I have (check the chunk,
> then check the parent file) would work with your recursive partitioning
> style?  That might make yours able to handle the "split diff chunk" case.

Yes, should do. I'll see what I can come up with this week.

> No algorithm will work for all cases.  The one you just gave had the
> simple heuristic working better than a complex one.  We could make it
> more complex, but then another example may be worse.  I can live with
> some inaccuracy in exchange for simplicity.

Exactly, no algorithm will work for all cases. So what I'm suggesting
is that it might be best to let the user choose which heuristic is
appropriate for a given commit. If they know that the simple heuristic
works best then perhaps we should let them choose that rather than
only offering a one-size-fits-all option. But if we do want to go for
one-size-fits-all then I'm very keen to make sure it at least solves
the specific cases that we know about.