[RFC,0/1] Fuzzy blame

Message ID	20190324235020.49706-1-michael@platin.gs (mailing list archive)
Headers	show Return-Path: <git-owner@kernel.org> From: michael@platin.gs To: git@vger.kernel.org Cc: Jeff King <peff@peff.net>, Stefan Beller <stefanbeller@gmail.com>, Jeff Smith <whydoubt@gmail.com>, Junio C Hamano <gitster@pobox.com>, =?utf-8?q?Ren=C3=A9_Scharfe?= <l.s.r@web.de>, Michael Platings <michael@platin.gs> Subject: [RFC PATCH 0/1] Fuzzy blame Date: Sun, 24 Mar 2019 23:50:19 +0000 Message-Id: <20190324235020.49706-1-michael@platin.gs> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: git-owner@vger.kernel.org Precedence: bulk
Series	Fuzzy blame \| expand [RFC,0/1] Fuzzy blame [RFC,1/1] Fuzzy blame

Michael Platings March 24, 2019, 11:50 p.m. UTC

From: Michael Platings <michael@platin.gs>

Hi Git devs,

Some of you may be familiar with the git-hyper-blame tool [1]. It's "useful if
you have a commit that makes sweeping changes that are unlikely to be what you
are looking for in a blame, such as mass reformatting or renaming."

git-hyper-blame is useful but (a) it's not convenient to install; (b) it's
missing functionality available in regular git blame; (c) it's method of
matching lines between chunks is too simplistic for many use cases; and
(d) it's not Git so it doesn't integrate well with tools that expect Git
e.g. vim plugins. Therefore I'm hoping to add similar and hopefully superior
functionality to Git itself. I have a very rough patch so I'd like to get your
thoughts on the general approach, particularly in terms of its user-visible
behaviour.

My initial idea was to lift the design directly from git-hyper-blame. However
the approach of picking single revisions to somehow ignore doesn't sit well
with the -w, -M & -C options, which have a similar intent but apply to all
revisions.

I'd like to get your thoughts on whether we could allow applying the -M or -w
options to specific revisions. For example, imagine it was agreed that all
the #includes in a project should be reordered. In that case, it would be useful
to be able to specify that the -M option should be used for blames on that
revision specifically, so that in future when someone wants to know why
a #include was added they don't have to run git blame twice to find out.

Options that are specific to a particular revision could be stored in a
".gitrevisions" file or similar.

If the principle of allowing blame options to be applied per-revision is
agreeable then I'd like to add a -F/--fuzzy option, to sit alongside -w, -M & -C.

I've implemented a prototype "fuzzy" option, patch attached.
The option operates at the level of diff chunks. For each line in the "after"
half of the chunk it uses a heuristic to choose which line in the "before" half
of the chunk matches best. The heuristic I'm using at the moment is of matching
"bigrams" as described in [2]. The initial pass typically gives reasonable
results, but can jumble up the lines. As in the reformatting/renaming use case
the content should stay in the same order, it's worth going to extra effort to
avoid jumbling lines. Therefore, after the initial pass, the line that can be
matched with the most confidence is used to partition the chunk into halves
before and after it. The process is then repeated recursively on the halves
above and below the partition line.
I feel like a similar algorithm has probably already been invented in a better
form - if anyone knows of such a thing then please let me know!

I look forward to hearing your thoughts.
Thanks,
-Michael


[1] https://commondatastorage.googleapis.com/chrome-infra-docs/flat/depot_tools/docs/html/git-hyper-blame.html
[2] https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient

Michael Platings (1):
  Add git blame --fuzzy option.

 blame.c                | 352 +++++++++++++++++++++++++++++++++++++++++++++++--
 blame.h                |   1 +
 builtin/blame.c        |   3 +
 t/t8020-blame-fuzzy.sh | 264 +++++++++++++++++++++++++++++++++++++
 4 files changed, 609 insertions(+), 11 deletions(-)
 create mode 100755 t/t8020-blame-fuzzy.sh

Junio C Hamano March 25, 2019, 2:39 a.m. UTC | #1

michael@platin.gs writes:

> From: Michael Platings <michael@platin.gs>
>
> Hi Git devs,
>
> Some of you may be familiar with the git-hyper-blame tool [1]. It's "useful if
> you have a commit that makes sweeping changes that are unlikely to be what you
> are looking for in a blame, such as mass reformatting or renaming."

I recall a month or so ago brho@google (CC'ed) sent a "let's allow
blame to ignore some uninteresting commit" topic, which was
unfortunately not well reviewed (what I mean is *not* that it was
reviewed thoroughly and found to be bad---not many reviewers found
time or inclination to review it well).  The topic is queued as
br/blame-ignore and its tip is at 43a290e3 ("SQUASH???", 2019-02-13)
as of this writing.

Perhaps you two can join forces?

P.S. I expect to be offline for most of the week (packing, moving
and unpacking.  Even though the places packing and unpacking happens
are within 1 kilometer radius, that does not make it less hassle
X-<).  See you guys next month.

Michael Platings March 25, 2019, 9:32 a.m. UTC | #2

(resending in plain text mode, sorry for the noise)

Thanks Junio, that's super helpful!

A month or two ago I contacted the author of git-hyper-blame, Matt
Giuca, asking whether anyone had looked into adding the feature to git
blame. I didn't receive a response but maybe that prompted Barret
Rhoden's patch? Or maybe just a weird coincidence!

@Barret I see your patches are a nice translation of git-hyper-blame.
However could you give me your thoughts on the approach in my patch? A
comment in the git-hyper-blame source [1] says:
# This doesn't work that well if there are a lot of line changes within the
# hunk (demonstrated by GitHyperBlameLineMotionTest.testIntraHunkLineMotion).
# A fuzzy heuristic that takes the text of the new line and tries to find a
# deleted line within the hunk that mostly matches the new line could help.

My patch aims to implement this "fuzzy heuristic" so I'd love to get
your take on it.

Many thanks,
-Michael

On Mon, 25 Mar 2019 at 02:39, Junio C Hamano <gitster@pobox.com> wrote:
>
> michael@platin.gs writes:
>
> > From: Michael Platings <michael@platin.gs>
> >
> > Hi Git devs,
> >
> > Some of you may be familiar with the git-hyper-blame tool [1]. It's "useful if
> > you have a commit that makes sweeping changes that are unlikely to be what you
> > are looking for in a blame, such as mass reformatting or renaming."
>
> I recall a month or so ago brho@google (CC'ed) sent a "let's allow
> blame to ignore some uninteresting commit" topic, which was
> unfortunately not well reviewed (what I mean is *not* that it was
> reviewed thoroughly and found to be bad---not many reviewers found
> time or inclination to review it well).  The topic is queued as
> br/blame-ignore and its tip is at 43a290e3 ("SQUASH???", 2019-02-13)
> as of this writing.
>
> Perhaps you two can join forces?
>
> P.S. I expect to be offline for most of the week (packing, moving
> and unpacking.  Even though the places packing and unpacking happens
> are within 1 kilometer radius, that does not make it less hassle
> X-<).  See you guys next month.

Barret Rhoden March 25, 2019, 4:04 p.m. UTC | #3

Hi -

On 3/25/19 5:32 AM, Michael Platings wrote:
> (resending in plain text mode, sorry for the noise)
> 
> Thanks Junio, that's super helpful!
> 
> A month or two ago I contacted the author of git-hyper-blame, Matt
> Giuca, asking whether anyone had looked into adding the feature to git
> blame. I didn't receive a response but maybe that prompted Barret
> Rhoden's patch? Or maybe just a weird coincidence!

Weird coincidence.  It's a big company.  =)

I work on a project that needs a major reformatting, and one thing 
delaying me was the lack of an ability to ignore commits during blame. 
hyper-blame does that, but I never liked the fact that it wasn't 
directly in git.

> 
> @Barret I see your patches are a nice translation of git-hyper-blame.

Not sure if you've seen my latest version, updated to v4 (2019-02-26) here:

https://public-inbox.org/git/20190226170648.211847-1-brho@google.com/

The one Junio has (br/blame-ignore) hasn't been updated - not sure if 
that's automated, or if it just fell through the cracks.

> However could you give me your thoughts on the approach in my patch? A
> comment in the git-hyper-blame source [1] says:
> # This doesn't work that well if there are a lot of line changes within the
> # hunk (demonstrated by GitHyperBlameLineMotionTest.testIntraHunkLineMotion).
> # A fuzzy heuristic that takes the text of the new line and tries to find a
> # deleted line within the hunk that mostly matches the new line could help.
> 
> My patch aims to implement this "fuzzy heuristic" so I'd love to get
> your take on it.

This is an interesting idea, and it sounds like it might be 
complimentary to the blame-ignore work.  Both have the flavor of "user 
knows this commit is special and wants special processing."

In my patch, I didn't try to find the likely original version of a line 
in a diff hunk.  What I did amounted to finding blame based on the 
parent's image of the file.  Example in this message:

https://public-inbox.org/git/20190226170648.211847-4-brho@google.com/

Of note, line 12 was blamed on commit b, when it really came from commit a.

For any lines added beyond the size of the parent's image (e.g. the 
ignored commit added more lines than it removed), those lines were 
removed from blame contention - marked with all 0s.

In essence, my method did the following:

	for all suspect/child lines 'i' in a diff hunk
		if i <= parent's hunk size
			assign to parent line i (+ offset)
		else
			mark umblamable

Due to the two cases being contiguous, each hunk would be broken up into 
at most two blame entries.  (I actually short-circuit that for loop and 
just split at i == parent_size immediately).

Having a smart/fuzzy matcher sounds nicer.  My patch passes blame to the 
parent.  Yours finds the right *part* of the parent to blame, which 
means we have a better chance of finding the right original commit to blame.

I think you're doing this:

	for all suspect/child lines 'i' in a diff hunk
		if i matches parent line (say, x)
			assign to parent line x
		else
			assign to child line i

 From glancing at your code, it looks like you break up every blame 
entry into N entries, one for each line, which you need since each 
parent line X might not be adjacent to the matching lines in the child.

One question I have is under which circumstances do you find that you 
cannot match a suspect/child line to a parent?  One obvious case is a 
commit that only adds lines - the parent's line set is the empty set.  I 
think you catch that earlier in your code (parent_chunk_length == 0), 
though I guess there are other cases where the parent has lines, but 
they are below a certain match threshold?

For those cases where you can't find a match, I could imagine marking 
them as unblamable.  The purpose of 'unblamable' in my patchset was to 
signal to not bother looking up further commit info for a final blame 
entry.  It was largely so that the user (me!) wouldn't see a commit 
blamed when I explicitly told git to not tell me about that commit. 
Both approaches sound fine though.

Another question was whether or not you wanted this to apply per commit 
or for an entire blame session.  It looks like your current code applies 
it overall, and not for a specific commit.  I'd much prefer it to be 
per-commit, though maybe the -F is just for showing it working in the RFC.

The first thing that comes to mind for me is to plug your fuzzy logic 
into my patch set.  Basically, in my commit near "These go to the 
parent", we do your line-by-line matching.  We might not need my 'delta' 
check anymore, which was basically the 'if' part of my for loop (in text 
above).  But maybe we need more info for your function.

That way, we'd get the per-commit control of when we ignore a commit 
from my patch, and we'd get the fuzzy-blaming brains of your patch, such 
that we try to find the right *part* of the parent to blame.

Anyway, let me know what your thoughts are.  We could try to change that 
part of my code so that it just calls some function that tells it where 
in the parent to blame, and otherwise marks unblamable.  Then your patch 
can replace the logic?

Barret

Michael Platings March 25, 2019, 11:21 p.m. UTC | #4

Hi Barret,

> I work on a project that needs a major reformatting, and one thing
> delaying me was the lack of an ability to ignore commits during blame.

I think we understand each other well then - I'm working on a plan to
change the variable naming rule in LLVM, and naturally other
developers aren't keen on making git blame less useful.

> One question I have is under which circumstances do you find that you
> cannot match a suspect/child line to a parent?  One obvious case is a
> commit that only adds lines - the parent's line set is the empty set.  I
> think you catch that earlier in your code (parent_chunk_length == 0),
> though I guess there are other cases where the parent has lines, but
> they are below a certain match threshold?

Yes, exactly. The threshold is currently 0 i.e. a single matching
bigram is all that's required for two lines to be considered matching,
but in future the threshold could be configurable in the same manner
as -M & -C options.
In the t8020-blame-fuzzy.sh test script in my patch, where it's
expected that a line will be attributed to the "ignored" commit you'll
see "Final". So far this is just "}" lines.

> Another question was whether or not you wanted this to apply per commit
> or for an entire blame session.  It looks like your current code applies
> it overall, and not for a specific commit.

This is a really interesting question for this feature. Initially I
just wanted to be able to say "Hey, Git, ignore this revision please."
But then Git says "OK, but how exactly? I can ignore whitespace and I
can detect moves & copies so do you want me to do those?" And then I'm
thinking, actually yes -M10 would be great because I know that this
revision also reordered a bunch of #includes and I still want people
to be able to see where they came from. However other sets of options
might work better for other changes.

On looking at the problem this way it seems that fuzzy matching
belongs in the same class as -w, -M & -C. As these options apply for
an entire blame session, it would be consistent to allow applying the
fuzzy matching likewise. As a bonus, having the ability to apply the
-F option for the entire blame session seems quite nice for other use
cases.

> I'd much prefer it to be per-commit

Yes, we definitely need a way to say "fuzzy match this commit only"
otherwise you lose the ability to detect small but significant changes
in other commits.
I haven't explored this fully, but I'm thinking that the revision
options file might look something like this:

# Just use the defaults, whatever they may be
6e8063eee1d30bc80c7802e94ed0caa8949c6323
# This commit changed loads of tabs to spaces
35ee755a8c43bcb3c2786522d423f006c23d32df -w
# This commit reordered lots of short lines
c5b679e14b761a7bfd6ae93cfffbf66e3c4e25a5 -M5
# This commit copied some stuff and changed CamelCase to snake_case
58b9cd43da695ee339b7679cf0c9f31e1f8ef67f -w -C15 -F

For the command-line, potentially we could make -w/-M/-C/-F specified
after --ignore-rev apply to only that revision e.g.:
git blame myfile --ignore-rev 35ee755a8c43bcb3c2786522d423f006c23d32df -M -F

But as I say, I haven't explored this fully.

> For those cases where you can't find a match, I could imagine marking
> them as unblamable.  The purpose of 'unblamable' in my patchset was to
> signal to not bother looking up further commit info for a final blame
> entry.  It was largely so that the user (me!) wouldn't see a commit
> blamed when I explicitly told git to not tell me about that commit.

I can see how the unblameable concept makes sense for the heuristic
you have right now. However, once you have a heuristic that can tell
you with a high degree of certainty that a line really did come from a
commit that you're merely not interested in, then I suggest that it's
better to just point at that commit.

> Both approaches sound fine though.

:)

> The first thing that comes to mind for me is to plug your fuzzy logic
> into my patch set.

Please do! It should be easy to pluck fuzzy_find_matching_lines() and
its dependencies out. Just to set your expectations, I have not yet
optimised it and it is highly wasteful right now both in terms of time
and memory.

> But maybe we need more info for your function.

The extra info needed is the parent & child file content, plus the
indices in the strings of where new lines start. This information is
already calculated in the course of doing the diff but it looks like a
fair amount of plumbing work will be needed to make the information
available to the chunk callback. That was my reason for initially
plonking the fuzzy matching in a separate pass.

Thanks,
-Michael

Jeff King March 25, 2019, 11:35 p.m. UTC | #5

On Mon, Mar 25, 2019 at 11:21:19PM +0000, Michael Platings wrote:

> > I work on a project that needs a major reformatting, and one thing
> > delaying me was the lack of an ability to ignore commits during blame.
> 
> I think we understand each other well then - I'm working on a plan to
> change the variable naming rule in LLVM, and naturally other
> developers aren't keen on making git blame less useful.

This is sort of a tangent to the thread, but have you looked into tools
that provide an interactive "re-blame from the parent" operation? I use
tig for this.  Quite often my blame turns up on some boring line
(whitespace fixing, minor tweaking of a function interface, etc), and
then I want to keep digging on the "same" line, as counted by line count
(but it's OK if it's off by one or two lines, since I'm looking at a
blame of the whole file).

Obviously this isn't as automated as saying "ignore commit X, it's just
variable renaming". But it also eliminates the need to a priori figure
out all such X that affect the lines you care about. You get an answer,
your human mind says "nope, that's not interesting", and you press a
button to dig further.

I think there's room for both solutions to co-exist, but just suggesting
you to try out the one that's already been implemented if you haven't. ;)

-Peff

Jacob Keller March 26, 2019, 3:07 a.m. UTC | #6

On Mon, Mar 25, 2019 at 4:37 PM Jeff King <peff@peff.net> wrote:
>
> On Mon, Mar 25, 2019 at 11:21:19PM +0000, Michael Platings wrote:
>
> > > I work on a project that needs a major reformatting, and one thing
> > > delaying me was the lack of an ability to ignore commits during blame.
> >
> > I think we understand each other well then - I'm working on a plan to
> > change the variable naming rule in LLVM, and naturally other
> > developers aren't keen on making git blame less useful.
>
> This is sort of a tangent to the thread, but have you looked into tools
> that provide an interactive "re-blame from the parent" operation? I use
> tig for this.  Quite often my blame turns up on some boring line
> (whitespace fixing, minor tweaking of a function interface, etc), and
> then I want to keep digging on the "same" line, as counted by line count
> (but it's OK if it's off by one or two lines, since I'm looking at a
> blame of the whole file).
>

+1 for the usefulness of this approach. It really helps figure things
out in a way that doesn't require me to track all "uninteresting"
commits, and also works when I *am* trying to find that uninteresting
commit too.

> Obviously this isn't as automated as saying "ignore commit X, it's just
> variable renaming". But it also eliminates the need to a priori figure
> out all such X that affect the lines you care about. You get an answer,
> your human mind says "nope, that's not interesting", and you press a
> button to dig further.
>
> I think there's room for both solutions to co-exist, but just suggesting
> you to try out the one that's already been implemented if you haven't. ;)
>
> -Peff

That's also my sentiment.

Thanks,
Jake

Michael Platings March 26, 2019, 8:26 p.m. UTC | #7

> Obviously this isn't as automated as saying "ignore commit X, it's just
> variable renaming". But it also eliminates the need to a priori figure
> out all such X that affect the lines you care about. You get an answer,
> your human mind says "nope, that's not interesting", and you press a
> button to dig further.

Hi Peff, for the use case you describe of someone stumbling across a
renaming commit, your approach is clearly better. However the use case
Barret & I are facing is of deliberately choosing to make a large
refactoring/renaming commit, and not wanting everyone else working on
the project to have to press that extra button every time they run git
blame.

I think it's really important that we make this dead easy for everyone
to use. The ultimate in ease of use would be for git blame to
automatically pick up ignore settings without the user having to even
know that it's happening. But that breaks the principle of least
astonishment. The next simplest thing I can think of is to add a
configuration option blame.ignoreRevs which would have the same
effect, except the user has to opt in.
Barret has implemented blame.ignoreRevsFile, but I think the world
will be a more consistent and happier place if we dictate the location
that the revisions are loaded from, in the same way as .gitignore.
Deciding what that location should be is one of those bikeshed
arguments which is perhaps why Barret dodged it :)

-Michael

Duy Nguyen March 27, 2019, 6:36 a.m. UTC | #8

On Wed, Mar 27, 2019 at 3:27 AM Michael Platings <michael@platin.gs> wrote:
> I think it's really important that we make this dead easy for everyone
> to use. The ultimate in ease of use would be for git blame to
> automatically pick up ignore settings without the user having to even
> know that it's happening. But that breaks the principle of least
> astonishment. The next simplest thing I can think of is to add a
> configuration option blame.ignoreRevs which would have the same
> effect, except the user has to opt in.
> Barret has implemented blame.ignoreRevsFile, but I think the world
> will be a more consistent and happier place if we dictate the location
> that the revisions are loaded from, in the same way as .gitignore.
> Deciding what that location should be is one of those bikeshed
> arguments which is perhaps why Barret dodged it :)

And bikeshedding. Another good place to keep these revs is git-notes,
which probably could result in faster lookups too and can be made
visible in git-log. But that's in addition to --ignoreRevsFile, not
replacing it.

Michael Platings March 27, 2019, 8:26 a.m. UTC | #9

> Another good place to keep these revs is git-notes,
> which probably could result in faster lookups too and can be made
> visible in git-log.

Oh wow, I really like this. A major concern I had about the revisions
file was that you don't know what a revision ID will be until it's
upstream. If you can specify *in the commit message itself* what
options should apply to git blame for that revision then that problem
is solved. And if you change your mind later, or want to ignore a
pre-existing revision then git-notes solves that problem.

So I'm thinking you just have a commit message like this:
"
Make all function names snake_case
git-blame-ignore: fuzzy
"
And users who have blame.ignoreRevs set will have the -F/--fuzzy
option applied to that commit.

> But that's in addition to --ignoreRevsFile, not replacing it.

I disagree. ignoreRevsFile has the major problem that the file will
need updating every time you rebase a commit to be ignored, and you'll
need to remember to edit it for cherry picks. Let's not have that
option as I think it will add unhelpful complexity.

-Michael

Duy Nguyen March 27, 2019, 9:02 a.m. UTC | #10

On Wed, Mar 27, 2019 at 3:26 PM Michael Platings <michael@platin.gs> wrote:
>
> > Another good place to keep these revs is git-notes,
> > which probably could result in faster lookups too and can be made
> > visible in git-log.
>
> Oh wow, I really like this. A major concern I had about the revisions
> file was that you don't know what a revision ID will be until it's
> upstream. If you can specify *in the commit message itself* what
> options should apply to git blame for that revision then that problem
> is solved. And if you change your mind later, or want to ignore a
> pre-existing revision then git-notes solves that problem.
>
> So I'm thinking you just have a commit message like this:
> "
> Make all function names snake_case
> git-blame-ignore: fuzzy
> "
> And users who have blame.ignoreRevs set will have the -F/--fuzzy
> option applied to that commit.

Yeah some trailer in the commit itself is also good if you know in
advance it should be treated differently. I think we have
git-interpret-trailers to help extract these info.

> > But that's in addition to --ignoreRevsFile, not replacing it.
>
> I disagree. ignoreRevsFile has the major problem that the file will
> need updating every time you rebase a commit to be ignored, and you'll
> need to remember to edit it for cherry picks. Let's not have that
> option as I think it will add unhelpful complexity.

OK I was just trying to say I did not object any current suggestions
(because I didn't know much in the first place). I'll just leave this
for other people to discuss :)

Barret Rhoden April 3, 2019, 3:25 p.m. UTC | #11

Hi -

On 3/25/19 7:21 PM, Michael Platings wrote:
>> The first thing that comes to mind for me is to plug your fuzzy logic
>> into my patch set.
> Please do! It should be easy to pluck fuzzy_find_matching_lines() and
> its dependencies out. Just to set your expectations, I have not yet
> optimised it and it is highly wasteful right now both in terms of time
> and memory.

I edited my patch set to allow changing the heuristic.  I also made a 
commit that uses your fingerprinting code to match target lines to 
parent lines.  I'll send it all out in another email and CC you.  The 
last commit is still a work in progress.

Regarding stuff like the name of the ignore file, in my first version, I 
went with whatever git hyper-blame does.  That was shot down, rightly 
so, I think.  With a git-config setting, you can name the file whatever 
you want, or add multiple files.  With my current patchset, you can 
disable the file too with --ignore-revs-file="".

As far as using notes or per-commit info, that might be nice, though 
it's not a huge burden to have a separate commit - you can wait til 
after things get merged (so we have the final object name (hash)) and 
it's not hugely burdensome.  But I get that's just my opinion.  =)

Barret

Michael Platings April 3, 2019, 9:49 p.m. UTC | #12

Thanks Barret.
I've cooled off on the git-notes idea since learning that notes
branches have to be pulled explicitly. And different people may have
different ideas about which types of commits they want to ignore, so
not predefining the name of the ignore file(s) does seem like the best
option, even if it's not perfect. And maybe having --fuzzy as a
separate option would be nice, maybe not - either way it can wait. In
short, I've come full circle and wish to do what I can to assist your
proposal (+ fuzzy matching).
I've rewritten the fuzzy matching function so it's now much faster and
more modest in its memory use. I hope to share the patch tomorrow.
Following that, I'll do what I can to assist with reviewing your
patches.
Cheers,
-Michael

[RFC,0/1] Fuzzy blame

Message

Comments