Message ID | 20190324235020.49706-1-michael@platin.gs (mailing list archive) |
---|---|
Headers | show |
Series | Fuzzy blame | expand |
michael@platin.gs writes: > From: Michael Platings <michael@platin.gs> > > Hi Git devs, > > Some of you may be familiar with the git-hyper-blame tool [1]. It's "useful if > you have a commit that makes sweeping changes that are unlikely to be what you > are looking for in a blame, such as mass reformatting or renaming." I recall a month or so ago brho@google (CC'ed) sent a "let's allow blame to ignore some uninteresting commit" topic, which was unfortunately not well reviewed (what I mean is *not* that it was reviewed thoroughly and found to be bad---not many reviewers found time or inclination to review it well). The topic is queued as br/blame-ignore and its tip is at 43a290e3 ("SQUASH???", 2019-02-13) as of this writing. Perhaps you two can join forces? P.S. I expect to be offline for most of the week (packing, moving and unpacking. Even though the places packing and unpacking happens are within 1 kilometer radius, that does not make it less hassle X-<). See you guys next month.
(resending in plain text mode, sorry for the noise) Thanks Junio, that's super helpful! A month or two ago I contacted the author of git-hyper-blame, Matt Giuca, asking whether anyone had looked into adding the feature to git blame. I didn't receive a response but maybe that prompted Barret Rhoden's patch? Or maybe just a weird coincidence! @Barret I see your patches are a nice translation of git-hyper-blame. However could you give me your thoughts on the approach in my patch? A comment in the git-hyper-blame source [1] says: # This doesn't work that well if there are a lot of line changes within the # hunk (demonstrated by GitHyperBlameLineMotionTest.testIntraHunkLineMotion). # A fuzzy heuristic that takes the text of the new line and tries to find a # deleted line within the hunk that mostly matches the new line could help. My patch aims to implement this "fuzzy heuristic" so I'd love to get your take on it. Many thanks, -Michael On Mon, 25 Mar 2019 at 02:39, Junio C Hamano <gitster@pobox.com> wrote: > > michael@platin.gs writes: > > > From: Michael Platings <michael@platin.gs> > > > > Hi Git devs, > > > > Some of you may be familiar with the git-hyper-blame tool [1]. It's "useful if > > you have a commit that makes sweeping changes that are unlikely to be what you > > are looking for in a blame, such as mass reformatting or renaming." > > I recall a month or so ago brho@google (CC'ed) sent a "let's allow > blame to ignore some uninteresting commit" topic, which was > unfortunately not well reviewed (what I mean is *not* that it was > reviewed thoroughly and found to be bad---not many reviewers found > time or inclination to review it well). The topic is queued as > br/blame-ignore and its tip is at 43a290e3 ("SQUASH???", 2019-02-13) > as of this writing. > > Perhaps you two can join forces? > > P.S. I expect to be offline for most of the week (packing, moving > and unpacking. Even though the places packing and unpacking happens > are within 1 kilometer radius, that does not make it less hassle > X-<). See you guys next month.
Hi - On 3/25/19 5:32 AM, Michael Platings wrote: > (resending in plain text mode, sorry for the noise) > > Thanks Junio, that's super helpful! > > A month or two ago I contacted the author of git-hyper-blame, Matt > Giuca, asking whether anyone had looked into adding the feature to git > blame. I didn't receive a response but maybe that prompted Barret > Rhoden's patch? Or maybe just a weird coincidence! Weird coincidence. It's a big company. =) I work on a project that needs a major reformatting, and one thing delaying me was the lack of an ability to ignore commits during blame. hyper-blame does that, but I never liked the fact that it wasn't directly in git. > > @Barret I see your patches are a nice translation of git-hyper-blame. Not sure if you've seen my latest version, updated to v4 (2019-02-26) here: https://public-inbox.org/git/20190226170648.211847-1-brho@google.com/ The one Junio has (br/blame-ignore) hasn't been updated - not sure if that's automated, or if it just fell through the cracks. > However could you give me your thoughts on the approach in my patch? A > comment in the git-hyper-blame source [1] says: > # This doesn't work that well if there are a lot of line changes within the > # hunk (demonstrated by GitHyperBlameLineMotionTest.testIntraHunkLineMotion). > # A fuzzy heuristic that takes the text of the new line and tries to find a > # deleted line within the hunk that mostly matches the new line could help. > > My patch aims to implement this "fuzzy heuristic" so I'd love to get > your take on it. This is an interesting idea, and it sounds like it might be complimentary to the blame-ignore work. Both have the flavor of "user knows this commit is special and wants special processing." In my patch, I didn't try to find the likely original version of a line in a diff hunk. What I did amounted to finding blame based on the parent's image of the file. Example in this message: https://public-inbox.org/git/20190226170648.211847-4-brho@google.com/ Of note, line 12 was blamed on commit b, when it really came from commit a. For any lines added beyond the size of the parent's image (e.g. the ignored commit added more lines than it removed), those lines were removed from blame contention - marked with all 0s. In essence, my method did the following: for all suspect/child lines 'i' in a diff hunk if i <= parent's hunk size assign to parent line i (+ offset) else mark umblamable Due to the two cases being contiguous, each hunk would be broken up into at most two blame entries. (I actually short-circuit that for loop and just split at i == parent_size immediately). Having a smart/fuzzy matcher sounds nicer. My patch passes blame to the parent. Yours finds the right *part* of the parent to blame, which means we have a better chance of finding the right original commit to blame. I think you're doing this: for all suspect/child lines 'i' in a diff hunk if i matches parent line (say, x) assign to parent line x else assign to child line i From glancing at your code, it looks like you break up every blame entry into N entries, one for each line, which you need since each parent line X might not be adjacent to the matching lines in the child. One question I have is under which circumstances do you find that you cannot match a suspect/child line to a parent? One obvious case is a commit that only adds lines - the parent's line set is the empty set. I think you catch that earlier in your code (parent_chunk_length == 0), though I guess there are other cases where the parent has lines, but they are below a certain match threshold? For those cases where you can't find a match, I could imagine marking them as unblamable. The purpose of 'unblamable' in my patchset was to signal to not bother looking up further commit info for a final blame entry. It was largely so that the user (me!) wouldn't see a commit blamed when I explicitly told git to not tell me about that commit. Both approaches sound fine though. Another question was whether or not you wanted this to apply per commit or for an entire blame session. It looks like your current code applies it overall, and not for a specific commit. I'd much prefer it to be per-commit, though maybe the -F is just for showing it working in the RFC. The first thing that comes to mind for me is to plug your fuzzy logic into my patch set. Basically, in my commit near "These go to the parent", we do your line-by-line matching. We might not need my 'delta' check anymore, which was basically the 'if' part of my for loop (in text above). But maybe we need more info for your function. That way, we'd get the per-commit control of when we ignore a commit from my patch, and we'd get the fuzzy-blaming brains of your patch, such that we try to find the right *part* of the parent to blame. Anyway, let me know what your thoughts are. We could try to change that part of my code so that it just calls some function that tells it where in the parent to blame, and otherwise marks unblamable. Then your patch can replace the logic? Barret
Hi Barret, > I work on a project that needs a major reformatting, and one thing > delaying me was the lack of an ability to ignore commits during blame. I think we understand each other well then - I'm working on a plan to change the variable naming rule in LLVM, and naturally other developers aren't keen on making git blame less useful. > One question I have is under which circumstances do you find that you > cannot match a suspect/child line to a parent? One obvious case is a > commit that only adds lines - the parent's line set is the empty set. I > think you catch that earlier in your code (parent_chunk_length == 0), > though I guess there are other cases where the parent has lines, but > they are below a certain match threshold? Yes, exactly. The threshold is currently 0 i.e. a single matching bigram is all that's required for two lines to be considered matching, but in future the threshold could be configurable in the same manner as -M & -C options. In the t8020-blame-fuzzy.sh test script in my patch, where it's expected that a line will be attributed to the "ignored" commit you'll see "Final". So far this is just "}" lines. > Another question was whether or not you wanted this to apply per commit > or for an entire blame session. It looks like your current code applies > it overall, and not for a specific commit. This is a really interesting question for this feature. Initially I just wanted to be able to say "Hey, Git, ignore this revision please." But then Git says "OK, but how exactly? I can ignore whitespace and I can detect moves & copies so do you want me to do those?" And then I'm thinking, actually yes -M10 would be great because I know that this revision also reordered a bunch of #includes and I still want people to be able to see where they came from. However other sets of options might work better for other changes. On looking at the problem this way it seems that fuzzy matching belongs in the same class as -w, -M & -C. As these options apply for an entire blame session, it would be consistent to allow applying the fuzzy matching likewise. As a bonus, having the ability to apply the -F option for the entire blame session seems quite nice for other use cases. > I'd much prefer it to be per-commit Yes, we definitely need a way to say "fuzzy match this commit only" otherwise you lose the ability to detect small but significant changes in other commits. I haven't explored this fully, but I'm thinking that the revision options file might look something like this: # Just use the defaults, whatever they may be 6e8063eee1d30bc80c7802e94ed0caa8949c6323 # This commit changed loads of tabs to spaces 35ee755a8c43bcb3c2786522d423f006c23d32df -w # This commit reordered lots of short lines c5b679e14b761a7bfd6ae93cfffbf66e3c4e25a5 -M5 # This commit copied some stuff and changed CamelCase to snake_case 58b9cd43da695ee339b7679cf0c9f31e1f8ef67f -w -C15 -F For the command-line, potentially we could make -w/-M/-C/-F specified after --ignore-rev apply to only that revision e.g.: git blame myfile --ignore-rev 35ee755a8c43bcb3c2786522d423f006c23d32df -M -F But as I say, I haven't explored this fully. > For those cases where you can't find a match, I could imagine marking > them as unblamable. The purpose of 'unblamable' in my patchset was to > signal to not bother looking up further commit info for a final blame > entry. It was largely so that the user (me!) wouldn't see a commit > blamed when I explicitly told git to not tell me about that commit. I can see how the unblameable concept makes sense for the heuristic you have right now. However, once you have a heuristic that can tell you with a high degree of certainty that a line really did come from a commit that you're merely not interested in, then I suggest that it's better to just point at that commit. > Both approaches sound fine though. :) > The first thing that comes to mind for me is to plug your fuzzy logic > into my patch set. Please do! It should be easy to pluck fuzzy_find_matching_lines() and its dependencies out. Just to set your expectations, I have not yet optimised it and it is highly wasteful right now both in terms of time and memory. > But maybe we need more info for your function. The extra info needed is the parent & child file content, plus the indices in the strings of where new lines start. This information is already calculated in the course of doing the diff but it looks like a fair amount of plumbing work will be needed to make the information available to the chunk callback. That was my reason for initially plonking the fuzzy matching in a separate pass. Thanks, -Michael
On Mon, Mar 25, 2019 at 11:21:19PM +0000, Michael Platings wrote: > > I work on a project that needs a major reformatting, and one thing > > delaying me was the lack of an ability to ignore commits during blame. > > I think we understand each other well then - I'm working on a plan to > change the variable naming rule in LLVM, and naturally other > developers aren't keen on making git blame less useful. This is sort of a tangent to the thread, but have you looked into tools that provide an interactive "re-blame from the parent" operation? I use tig for this. Quite often my blame turns up on some boring line (whitespace fixing, minor tweaking of a function interface, etc), and then I want to keep digging on the "same" line, as counted by line count (but it's OK if it's off by one or two lines, since I'm looking at a blame of the whole file). Obviously this isn't as automated as saying "ignore commit X, it's just variable renaming". But it also eliminates the need to a priori figure out all such X that affect the lines you care about. You get an answer, your human mind says "nope, that's not interesting", and you press a button to dig further. I think there's room for both solutions to co-exist, but just suggesting you to try out the one that's already been implemented if you haven't. ;) -Peff
On Mon, Mar 25, 2019 at 4:37 PM Jeff King <peff@peff.net> wrote: > > On Mon, Mar 25, 2019 at 11:21:19PM +0000, Michael Platings wrote: > > > > I work on a project that needs a major reformatting, and one thing > > > delaying me was the lack of an ability to ignore commits during blame. > > > > I think we understand each other well then - I'm working on a plan to > > change the variable naming rule in LLVM, and naturally other > > developers aren't keen on making git blame less useful. > > This is sort of a tangent to the thread, but have you looked into tools > that provide an interactive "re-blame from the parent" operation? I use > tig for this. Quite often my blame turns up on some boring line > (whitespace fixing, minor tweaking of a function interface, etc), and > then I want to keep digging on the "same" line, as counted by line count > (but it's OK if it's off by one or two lines, since I'm looking at a > blame of the whole file). > +1 for the usefulness of this approach. It really helps figure things out in a way that doesn't require me to track all "uninteresting" commits, and also works when I *am* trying to find that uninteresting commit too. > Obviously this isn't as automated as saying "ignore commit X, it's just > variable renaming". But it also eliminates the need to a priori figure > out all such X that affect the lines you care about. You get an answer, > your human mind says "nope, that's not interesting", and you press a > button to dig further. > > I think there's room for both solutions to co-exist, but just suggesting > you to try out the one that's already been implemented if you haven't. ;) > > -Peff That's also my sentiment. Thanks, Jake
> Obviously this isn't as automated as saying "ignore commit X, it's just > variable renaming". But it also eliminates the need to a priori figure > out all such X that affect the lines you care about. You get an answer, > your human mind says "nope, that's not interesting", and you press a > button to dig further. Hi Peff, for the use case you describe of someone stumbling across a renaming commit, your approach is clearly better. However the use case Barret & I are facing is of deliberately choosing to make a large refactoring/renaming commit, and not wanting everyone else working on the project to have to press that extra button every time they run git blame. I think it's really important that we make this dead easy for everyone to use. The ultimate in ease of use would be for git blame to automatically pick up ignore settings without the user having to even know that it's happening. But that breaks the principle of least astonishment. The next simplest thing I can think of is to add a configuration option blame.ignoreRevs which would have the same effect, except the user has to opt in. Barret has implemented blame.ignoreRevsFile, but I think the world will be a more consistent and happier place if we dictate the location that the revisions are loaded from, in the same way as .gitignore. Deciding what that location should be is one of those bikeshed arguments which is perhaps why Barret dodged it :) -Michael
On Wed, Mar 27, 2019 at 3:27 AM Michael Platings <michael@platin.gs> wrote: > I think it's really important that we make this dead easy for everyone > to use. The ultimate in ease of use would be for git blame to > automatically pick up ignore settings without the user having to even > know that it's happening. But that breaks the principle of least > astonishment. The next simplest thing I can think of is to add a > configuration option blame.ignoreRevs which would have the same > effect, except the user has to opt in. > Barret has implemented blame.ignoreRevsFile, but I think the world > will be a more consistent and happier place if we dictate the location > that the revisions are loaded from, in the same way as .gitignore. > Deciding what that location should be is one of those bikeshed > arguments which is perhaps why Barret dodged it :) And bikeshedding. Another good place to keep these revs is git-notes, which probably could result in faster lookups too and can be made visible in git-log. But that's in addition to --ignoreRevsFile, not replacing it.
> Another good place to keep these revs is git-notes, > which probably could result in faster lookups too and can be made > visible in git-log. Oh wow, I really like this. A major concern I had about the revisions file was that you don't know what a revision ID will be until it's upstream. If you can specify *in the commit message itself* what options should apply to git blame for that revision then that problem is solved. And if you change your mind later, or want to ignore a pre-existing revision then git-notes solves that problem. So I'm thinking you just have a commit message like this: " Make all function names snake_case git-blame-ignore: fuzzy " And users who have blame.ignoreRevs set will have the -F/--fuzzy option applied to that commit. > But that's in addition to --ignoreRevsFile, not replacing it. I disagree. ignoreRevsFile has the major problem that the file will need updating every time you rebase a commit to be ignored, and you'll need to remember to edit it for cherry picks. Let's not have that option as I think it will add unhelpful complexity. -Michael
On Wed, Mar 27, 2019 at 3:26 PM Michael Platings <michael@platin.gs> wrote: > > > Another good place to keep these revs is git-notes, > > which probably could result in faster lookups too and can be made > > visible in git-log. > > Oh wow, I really like this. A major concern I had about the revisions > file was that you don't know what a revision ID will be until it's > upstream. If you can specify *in the commit message itself* what > options should apply to git blame for that revision then that problem > is solved. And if you change your mind later, or want to ignore a > pre-existing revision then git-notes solves that problem. > > So I'm thinking you just have a commit message like this: > " > Make all function names snake_case > git-blame-ignore: fuzzy > " > And users who have blame.ignoreRevs set will have the -F/--fuzzy > option applied to that commit. Yeah some trailer in the commit itself is also good if you know in advance it should be treated differently. I think we have git-interpret-trailers to help extract these info. > > But that's in addition to --ignoreRevsFile, not replacing it. > > I disagree. ignoreRevsFile has the major problem that the file will > need updating every time you rebase a commit to be ignored, and you'll > need to remember to edit it for cherry picks. Let's not have that > option as I think it will add unhelpful complexity. OK I was just trying to say I did not object any current suggestions (because I didn't know much in the first place). I'll just leave this for other people to discuss :)
Hi - On 3/25/19 7:21 PM, Michael Platings wrote: >> The first thing that comes to mind for me is to plug your fuzzy logic >> into my patch set. > Please do! It should be easy to pluck fuzzy_find_matching_lines() and > its dependencies out. Just to set your expectations, I have not yet > optimised it and it is highly wasteful right now both in terms of time > and memory. I edited my patch set to allow changing the heuristic. I also made a commit that uses your fingerprinting code to match target lines to parent lines. I'll send it all out in another email and CC you. The last commit is still a work in progress. Regarding stuff like the name of the ignore file, in my first version, I went with whatever git hyper-blame does. That was shot down, rightly so, I think. With a git-config setting, you can name the file whatever you want, or add multiple files. With my current patchset, you can disable the file too with --ignore-revs-file="". As far as using notes or per-commit info, that might be nice, though it's not a huge burden to have a separate commit - you can wait til after things get merged (so we have the final object name (hash)) and it's not hugely burdensome. But I get that's just my opinion. =) Barret
Thanks Barret. I've cooled off on the git-notes idea since learning that notes branches have to be pulled explicitly. And different people may have different ideas about which types of commits they want to ignore, so not predefining the name of the ignore file(s) does seem like the best option, even if it's not perfect. And maybe having --fuzzy as a separate option would be nice, maybe not - either way it can wait. In short, I've come full circle and wish to do what I can to assist your proposal (+ fuzzy matching). I've rewritten the fuzzy matching function so it's now much faster and more modest in its memory use. I hope to share the patch tomorrow. Following that, I'll do what I can to assist with reviewing your patches. Cheers, -Michael
From: Michael Platings <michael@platin.gs> Hi Git devs, Some of you may be familiar with the git-hyper-blame tool [1]. It's "useful if you have a commit that makes sweeping changes that are unlikely to be what you are looking for in a blame, such as mass reformatting or renaming." git-hyper-blame is useful but (a) it's not convenient to install; (b) it's missing functionality available in regular git blame; (c) it's method of matching lines between chunks is too simplistic for many use cases; and (d) it's not Git so it doesn't integrate well with tools that expect Git e.g. vim plugins. Therefore I'm hoping to add similar and hopefully superior functionality to Git itself. I have a very rough patch so I'd like to get your thoughts on the general approach, particularly in terms of its user-visible behaviour. My initial idea was to lift the design directly from git-hyper-blame. However the approach of picking single revisions to somehow ignore doesn't sit well with the -w, -M & -C options, which have a similar intent but apply to all revisions. I'd like to get your thoughts on whether we could allow applying the -M or -w options to specific revisions. For example, imagine it was agreed that all the #includes in a project should be reordered. In that case, it would be useful to be able to specify that the -M option should be used for blames on that revision specifically, so that in future when someone wants to know why a #include was added they don't have to run git blame twice to find out. Options that are specific to a particular revision could be stored in a ".gitrevisions" file or similar. If the principle of allowing blame options to be applied per-revision is agreeable then I'd like to add a -F/--fuzzy option, to sit alongside -w, -M & -C. I've implemented a prototype "fuzzy" option, patch attached. The option operates at the level of diff chunks. For each line in the "after" half of the chunk it uses a heuristic to choose which line in the "before" half of the chunk matches best. The heuristic I'm using at the moment is of matching "bigrams" as described in [2]. The initial pass typically gives reasonable results, but can jumble up the lines. As in the reformatting/renaming use case the content should stay in the same order, it's worth going to extra effort to avoid jumbling lines. Therefore, after the initial pass, the line that can be matched with the most confidence is used to partition the chunk into halves before and after it. The process is then repeated recursively on the halves above and below the partition line. I feel like a similar algorithm has probably already been invented in a better form - if anyone knows of such a thing then please let me know! I look forward to hearing your thoughts. Thanks, -Michael [1] https://commondatastorage.googleapis.com/chrome-infra-docs/flat/depot_tools/docs/html/git-hyper-blame.html [2] https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient Michael Platings (1): Add git blame --fuzzy option. blame.c | 352 +++++++++++++++++++++++++++++++++++++++++++++++-- blame.h | 1 + builtin/blame.c | 3 + t/t8020-blame-fuzzy.sh | 264 +++++++++++++++++++++++++++++++++++++ 4 files changed, 609 insertions(+), 11 deletions(-) create mode 100755 t/t8020-blame-fuzzy.sh