Message ID | 20241213042312.2890841-1-jltobler@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | batch blob diff generation | expand |
On Thu, Dec 12, 2024 at 10:23:09PM -0600, Justin Tobler wrote: > To enable support for batch diffs of multiple blob pairs, this > series introduces a new diff plumbing command git-diff-blob(1). Similar > to git-diff-tree(1), it provides a "--stdin" option that reads a pair of > blobs on each line of input and generates the diffs. This is intended to > be used for scripting purposes where more fine-grained control for diff > generation is desired. Below is an example for each usage: > > $ git diff-blob HEAD~5000:README.md HEAD:README.md > > $ git diff-blob --stdin <<EOF > 88f126184c52bfe4859ec189d018872902e02a84 665ce5f5a83647619fba9157fa9b0141ae8b228b > HEAD~5000:README.md HEAD:README.md > EOF In the first example, I think just using "git diff" would work (though it is not a plumbing command). But the stdin example is what's interesting here anyway, since it can handle arbitrary inputs. So let's focus on that. Feeding just blob ids has a big drawback: we don't have any context! So you get bogus filenames in the patch, no mode data, and so on. Feeding the paths along with their commits, as you do on the second line, gives you those things from the lookup context. But it also has some problems. One, it's needlessly expensive; we have to traverse HEAD~5000, and then dig into its tree to find the blobs (which presumably you already did, since how else would you end up with those oids). And two, there are parsing ambiguities, since arbitrary revision names can contain spaces. E.g., are we looking for the file "README.md HEAD:README.md" in HEAD~5000? So ideally we'd have an input format that encapsulates that extra context data and provides some mechanism for quoting. And it turns out we do: the --raw diff format. If the program takes that format, then you can manually feed it two arbitrary blob oids if you have them (and put whatever you like for the mode/path context), like: git diff-blob --stdin <<\EOF :100644 100644 88f126184c52bfe4859ec189d018872902e02a84 665ce5f5a83647619fba9157fa9b0141ae8b228b M README.md EOF Or you can get the real context yourself (though it seems to me that this is a gap in what "cat-file --batch" should be able to do in a single process): git ls-tree HEAD~5000 README.md >out read mode_a blob oid_a path <out git ls-tree HEAD README.md >out read mode_b blob oid_b path <out printf ":$mode_a $mode_b $oid_a $oid_b M\tREADME.md" | git diff-blob --stdin But it also means you can use --raw output directly. So: git diff-tree --raw -r HEAD~5000 HEAD -- README.md | git diff-blob --stdin Now that command by itself doesn't look all that useful; you could have just asked for patches from diff-tree. But by splitting the two, you can filter the set of paths in between (for example, to omit some entries, or to batch a large diff into more manageable chunks for pagination, etc). The patch might look something like this: https://lore.kernel.org/git/20161201204042.6yslbyrg7l6ghhww@sigill.intra.peff.net/ :) That is what has been powering the diffs at github.com since 2016 or so. And continues to do so, as far as I know. I don't have access to their internal repository anymore, but I've continued to rebase the topic forward in my personal repo. You can fetch it from: https://github.com/peff/git jk/diff-pairs in case that is helpful. -Peff
Jeff King <peff@peff.net> writes: > Feeding just blob ids has a big drawback: we don't have any context! So > you get bogus filenames in the patch, no mode data, and so on. And the lack of filenames and the tree object name at the root level means you do not get anything out of the attribute subsystem, which in turn may affect a few more things. Unfortunately the format used in the output from "diff --raw" does not capture this. Does this want to be a building block for the server side diff? Would it be a bit too low level for each "request" to comparing only two blob objects? Can we place a lot more assumptions on the relationship among blob pairs that appear in the --stdin request (e.g., they all come from the same tree and share the same attr stack to look up attributes applicable for them)?
On Fri, Dec 13, 2024 at 07:17:59PM +0900, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > Feeding just blob ids has a big drawback: we don't have any context! So > > you get bogus filenames in the patch, no mode data, and so on. > > And the lack of filenames and the tree object name at the root level > means you do not get anything out of the attribute subsystem, which > in turn may affect a few more things. > > Unfortunately the format used in the output from "diff --raw" does > not capture this. Don't we just use the working tree .gitattributes by default, and ignore what's in the endpoints? In a bare repo we wouldn't have that, but I think the recently added attr.tree and --attr-source options would work there. You can't use attributes from multiple trees in a single request, but I doubt that would be a big drawback. -Peff
On 24/12/13 03:12AM, Jeff King wrote: > In the first example, I think just using "git diff" would work (though > it is not a plumbing command). But the stdin example is what's > interesting here anyway, since it can handle arbitrary inputs. So let's > focus on that. > > Feeding just blob ids has a big drawback: we don't have any context! So > you get bogus filenames in the patch, no mode data, and so on. > > Feeding the paths along with their commits, as you do on the second > line, gives you those things from the lookup context. But it also has > some problems. One, it's needlessly expensive; we have to traverse > HEAD~5000, and then dig into its tree to find the blobs (which > presumably you already did, since how else would you end up with those > oids). And two, there are parsing ambiguities, since arbitrary revision > names can contain spaces. E.g., are we looking for the file "README.md > HEAD:README.md" in HEAD~5000? > > So ideally we'd have an input format that encapsulates that extra > context data and provides some mechanism for quoting. And it turns out > we do: the --raw diff format. I had not considered using the raw diff format as the input source. As you pointed out, using blob IDs alone loses some of the useful context. By using path-scoped revisions, we can still get this context, but at an added cost of having to traverse the tree to get the underlying information. As you also mentioned, this is potentially wasteful if, for example, the blobs diffs you are trying to generate are a subset of git-diff-tree(1) output and thus the context is already known ahead of time. Which is exactly what we are hoping to accomplish. > If the program takes that format, then you can manually feed it two > arbitrary blob oids if you have them (and put whatever you like for the > mode/path context), like: > > git diff-blob --stdin <<\EOF > :100644 100644 88f126184c52bfe4859ec189d018872902e02a84 665ce5f5a83647619fba9157fa9b0141ae8b228b M README.md > EOF > > Or you can get the real context yourself (though it seems to me that > this is a gap in what "cat-file --batch" should be able to do in a > single process): > > git ls-tree HEAD~5000 README.md >out > read mode_a blob oid_a path <out > git ls-tree HEAD README.md >out > read mode_b blob oid_b path <out > printf ":$mode_a $mode_b $oid_a $oid_b M\tREADME.md" | > git diff-blob --stdin > > But it also means you can use --raw output directly. So: > > git diff-tree --raw -r HEAD~5000 HEAD -- README.md | > git diff-blob --stdin > > Now that command by itself doesn't look all that useful; you could have > just asked for patches from diff-tree. But by splitting the two, you can > filter the set of paths in between (for example, to omit some entries, > or to batch a large diff into more manageable chunks for pagination, > etc). Yup, this is exactly what I'm hoping to achieve! As a single commit may contain an unbounded number changes, being able to control diff generation at the blob level is quite useful. Using the raw diff format as input looks like a rather elegant solution and I think it makes sense to use it here for the "--stdin" option over just reading two blobs. > The patch might look something like this: > > https://lore.kernel.org/git/20161201204042.6yslbyrg7l6ghhww@sigill.intra.peff.net/ > > :) That is what has been powering the diffs at github.com since 2016 or > so. And continues to do so, as far as I know. I don't have access to > their internal repository anymore, but I've continued to rebase the > topic forward in my personal repo. You can fetch it from: > > https://github.com/peff/git jk/diff-pairs > > in case that is helpful. Thanks Peff! From looking at the mentioned thread and branch, it looks like I'm basically trying to accomplish the same thing here. Just a bit late to the conversation. :) While the use-case is rather narrow, I think it would be nice to see this functionality provided upstream. I see this as a means to faciliate more fine-grained control of the blob diffs we actually want to compute at a given time and it seems like it would be reasonable to expose as part of the diff plumbing. I would certainly be interested in adapting this series to instead use raw input from git-diff-tree(1) or trying to revive the previous series if that is preferred. If there is interest in continuing, some lingering questions I have: Being that the primary purpose of git-diff-blob(1) here is to handle generating blob diffs as specified by stdin, is there any reason to have a normal mode that accepts a blob pair as arguments? Or would it be best to limit the input mechanism to stdin entirely? If the user wanted to compute a single blob diff they could just use git-diff(1) already so providing this as a part of git-diff-blob(1) is a bit redundant. Having it as an option for the user does seem a bit more friendly though. -Justin
Jeff King <peff@peff.net> writes: > So ideally we'd have an input format that encapsulates that extra > context data and provides some mechanism for quoting. And it turns out > we do: the --raw diff format. Funny. The raw diff format indeed was designed as an interchange format from various "compare two sets of things" front-ends (like diff-files, diff-cache, and diff-tree) that emits the raw format, to be read by "diff-helper" (initially called "diff-tree-helper") that takes the raw format and - matches removed and added paths with similar contents to detect renames and copies - computes the output in various formats including "patch". So I guess we came a full circle, finally ;-). Looking in the archive for messages exchanged between junkio@ and torvalds@ mentioning diff before 2005-05-30 finds some interesting gems. https://lore.kernel.org/git/7v1x8zsamn.fsf_-_@assigned-by-dhcp.cox.net/
Jeff King <peff@peff.net> writes: >> Unfortunately the format used in the output from "diff --raw" does >> not capture this. > > Don't we just use the working tree .gitattributes by default, and ignore > what's in the endpoints? In a bare repo we wouldn't have that, but I > think the recently added attr.tree and --attr-source options would work > there. Yeah, you're right. I forgot about attr.tree (does not seem to be documented, by the way) and --attr-source & GIT_ATTR_SOURCE. I imagine that this feature is primarily meant to be used on the server side, and in bare repositories, only "diff-tree" makes sense among the diff-* family of commands, which (as server environments lack "the index" nor "the working tree") would already be using these mechanisms, so there is no new issues introduced here. > You can't use attributes from multiple trees in a single request, but I > doubt that would be a big drawback. I think it is also true with the normal diff-tree and friends; I do not think it looks up attributes from each tree independently when you run "git diff-tree -r A B" to compare the blob in tree A that is CRLF with the blob at the same path in tree B. So we should be OK. Thanks.
Junio C Hamano <gitster@pobox.com> writes: > Yeah, you're right. I forgot about attr.tree (does not seem to be > documented, by the way) We do have an entry in Documentation/config/attr.txt that describes the three; I simply assumed it is not documented as I didn't see it mentioned in Documentation/git.txt where --attr-source & GIT_ATTR_SOURCE are described. We may want to add something like this, perhaps? ----- >8 ----- Subject: doc: give attr.tree a bit more visibility In "git help config" output, attr.tree mentions both --attr-source and GIT_ATTR_SOURCE, but the description of --attr-source and GIT_ATTR_SOURCE that appear in "git help git", attr.tree is missing. Add it so that these three are described together in both places. Signed-off-by: Junio C Hamano <gitster@pobox.com> --- Documentation/git.txt | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git c/Documentation/git.txt w/Documentation/git.txt index d15a869762..13f6785408 100644 --- c/Documentation/git.txt +++ w/Documentation/git.txt @@ -228,7 +228,10 @@ If you just want to run git as if it was started in `<path>` then use --attr-source=<tree-ish>:: Read gitattributes from <tree-ish> instead of the worktree. See linkgit:gitattributes[5]. This is equivalent to setting the - `GIT_ATTR_SOURCE` environment variable. + `GIT_ATTR_SOURCE` environment variable. The `attr.tree` + configuration variable is used as a fallback when this option + or the environment variable are not in use. + GIT COMMANDS ------------
Junio C Hamano <gitster@pobox.com> writes: > Jeff King <peff@peff.net> writes: > >> So ideally we'd have an input format that encapsulates that extra >> context data and provides some mechanism for quoting. And it turns out >> we do: the --raw diff format. > > Funny. The raw diff format indeed was designed as an interchange > format from various "compare two sets of things" front-ends (like > diff-files, diff-cache, and diff-tree) that emits the raw format, to > be read by "diff-helper" (initially called "diff-tree-helper") that > takes the raw format and > > - matches removed and added paths with similar contents to detect > renames and copies > > - computes the output in various formats including "patch". > > So I guess we came a full circle, finally ;-). Looking in the archive > for messages exchanged between junkio@ and torvalds@ mentioning diff > before 2005-05-30 finds some interesting gems. > > https://lore.kernel.org/git/7v1x8zsamn.fsf_-_@assigned-by-dhcp.cox.net/ So, if we were to do what Justin tried to do honoring the overall design of our diff machinery, I think what we can do is as follows: * Use the "diff --raw" output format as the input, but with a bit of twist. (1) a narrow special case that takes only a single diff_filepair of <old> and <new> blobs, and immediately run diff_queue() on that single diff_filepair, which is Justin's use case. For this mode of operation, "flush after reach record of input" may be sufficient. (2) as a general "interchange format" to feed "comparison between two sets of <object, path>" into our diff machinery, we are better off if we can treat the input stream as multiple records that describes comparison between two sets. Imagine "git log --oneline --first-parent -2 --raw HEAD", where one set of "diff --raw" records show the changed blobs with their paths between HEAD~1 and HEAD, and another set does so for HEAD~2 and HEAD~1. We need to be able to tell where the first set ends and the second set starts, so that rename detection and other things, if requested, can be done within each set. My recommendation is to use a single blank line as a separator, e.g. :100644 100644 ce31f93061 9829984b0a M Documentation/git-refs.txt :100644 100644 8b3882cff1 4a74f7c7bd M refs.c :100755 100755 1bfff3a7af f59bc4860f M t/t1460-refs-migrate.sh :100644 100644 c11213f520 8953d1c6d3 M refs/files-backend.c :100644 100644 b2e3ba877d bec5962deb M refs/reftable-backend.c so an application that wants to compare only one diff_filepair at a time would issue something like :100644 100644 ce31f93061 9829984b0a M Documentation/git-refs.txt :100644 100644 8b3882cff1 4a74f7c7bd M refs.c :100755 100755 1bfff3a7af f59bc4860f M t/t1460-refs-migrate.sh so the parsing machinery does not have to worry about case (1) above. * Parse and append the input into diff_queue(), until you see an blank line. - If at EOF you are done, but if you have something accumulated in diff_queue(), show them (like below) first. In any case, at EOF, you are done. * Run diffcore_std() followed by diff_flush() to have the contents of the queue nicely formatted and emptied. Go back to parsing more input lines.
On Sat, Dec 14, 2024 at 06:17:01PM -0800, Junio C Hamano wrote: > Junio C Hamano <gitster@pobox.com> writes: > > > Yeah, you're right. I forgot about attr.tree (does not seem to be > > documented, by the way) > > We do have an entry in Documentation/config/attr.txt that describes > the three; I simply assumed it is not documented as I didn't see it > mentioned in Documentation/git.txt where --attr-source & > GIT_ATTR_SOURCE are described. > > We may want to add something like this, perhaps? > > ----- >8 ----- > Subject: doc: give attr.tree a bit more visibility Yeah, that looks good to me. I recall that the performance of attr.tree is not great for _some_ commands (like pack-objects). So it's perhaps reasonable to use for single commands like "git diff" but not to set in your on-disk config. It's possible we'd want to warn people about that before advertising it more widely? I dunno. -Peff
On Fri, Dec 13, 2024 at 10:41:25AM -0600, Justin Tobler wrote: > > The patch might look something like this: > > > > https://lore.kernel.org/git/20161201204042.6yslbyrg7l6ghhww@sigill.intra.peff.net/ > > > > :) That is what has been powering the diffs at github.com since 2016 or > > so. And continues to do so, as far as I know. I don't have access to > > their internal repository anymore, but I've continued to rebase the > > topic forward in my personal repo. You can fetch it from: > > > > https://github.com/peff/git jk/diff-pairs > > > > in case that is helpful. > > Thanks Peff! From looking at the mentioned thread and branch, it looks > like I'm basically trying to accomplish the same thing here. Just a bit > late to the conversation. :) > > While the use-case is rather narrow, I think it would be nice to see > this functionality provided upstream. I see this as a means to faciliate > more fine-grained control of the blob diffs we actually want to compute > at a given time and it seems like it would be reasonable to expose as > part of the diff plumbing. I would certainly be interested in adapting > this series to instead use raw input from git-diff-tree(1) or trying to > revive the previous series if that is preferred. Yeah, if you want to take it in that direction, either by adapting the idea, or by starting with diff-pairs and polishing it up, I'm happy either way. GitHub folks may be happy if you keep the name "diff-pairs" and match the interface. ;) > If there is interest in continuing, some lingering questions I have: > > Being that the primary purpose of git-diff-blob(1) here is to handle > generating blob diffs as specified by stdin, is there any reason to have > a normal mode that accepts a blob pair as arguments? Or would it be best > to limit the input mechanism to stdin entirely? If the user wanted to > compute a single blob diff they could just use git-diff(1) already so > providing this as a part of git-diff-blob(1) is a bit redundant. Having > it as an option for the user does seem a bit more friendly though. I don't have a strong opinion. I agree it _could_ be more friendly, but it also raises questions about how that mode/path context is filled in. And also about other modes. E.g., "git diff HEAD:foo bar" will diff between a blob and a working tree. Should a new plumbing command support that, too? If those aren't things you immediately care about, I'd probably punt on it for now. I think it could be added later without losing compatibility (command-line arguments as appropriate for a single pair, and since the stdin format starts with ":mode" there's room to extend it). -Peff
On Sun, Dec 15, 2024 at 03:24:11PM -0800, Junio C Hamano wrote: > > Funny. The raw diff format indeed was designed as an interchange > > format from various "compare two sets of things" front-ends (like > > diff-files, diff-cache, and diff-tree) that emits the raw format, to > > be read by "diff-helper" (initially called "diff-tree-helper") that > > takes the raw format and > > > > - matches removed and added paths with similar contents to detect > > renames and copies > > > > - computes the output in various formats including "patch". > > > > So I guess we came a full circle, finally ;-). Looking in the archive > > for messages exchanged between junkio@ and torvalds@ mentioning diff > > before 2005-05-30 finds some interesting gems. > > > > https://lore.kernel.org/git/7v1x8zsamn.fsf_-_@assigned-by-dhcp.cox.net/ :) That same thread was linked when I posted the original diff-pairs many years ago. > So, if we were to do what Justin tried to do honoring the overall > design of our diff machinery, I think what we can do is as follows: > > * Use the "diff --raw" output format as the input, but with a bit > of twist. > > (1) a narrow special case that takes only a single diff_filepair > of <old> and <new> blobs, and immediately run diff_queue() on > that single diff_filepair, which is Justin's use case. For > this mode of operation, "flush after reach record of input" > may be sufficient. My understanding was that he does not actually care about this case (just feeding two blobs), but is actually processing --raw output in the first place. Or did you just mean that we'd still be feeding raw output, but just getting the flush behavior? > (2) as a general "interchange format" to feed "comparison between > two sets of <object, path>" into our diff machinery, we are > better off if we can treat the input stream as multiple > records that describes comparison between two sets. Imagine > "git log --oneline --first-parent -2 --raw HEAD", where one > set of "diff --raw" records show the changed blobs with their > paths between HEAD~1 and HEAD, and another set does so for > HEAD~2 and HEAD~1. We need to be able to tell where the > first set ends and the second set starts, so that rename > detection and other things, if requested, can be done within > each set. Seems reasonable. For the use of diff-pairs at GitHub, I always just did full-tree things like rename detection in the initial diff-tree invocation. Since my goal was splitting/filtering, doing it after would yield wrong answers (since diff-pairs never sees the complete set). But it's possible for somebody to want to filter the intermediate results, then do full-tree stuff on the result (or even just delay the cost of rename detection). And certainly it's possible to want to feed a whole bunch of unrelated diff segments without having to spawn a process for each. So it's not something I wanted, but I agree it's good to plan for. > My recommendation is to use a single blank line as a separator, > e.g. > > :100644 100644 ce31f93061 9829984b0a M Documentation/git-refs.txt > :100644 100644 8b3882cff1 4a74f7c7bd M refs.c > :100755 100755 1bfff3a7af f59bc4860f M t/t1460-refs-migrate.sh > > :100644 100644 c11213f520 8953d1c6d3 M refs/files-backend.c > :100644 100644 b2e3ba877d bec5962deb M refs/reftable-backend.c > > so an application that wants to compare only one diff_filepair > at a time would issue something like > > :100644 100644 ce31f93061 9829984b0a M Documentation/git-refs.txt > > :100644 100644 8b3882cff1 4a74f7c7bd M refs.c > > :100755 100755 1bfff3a7af f59bc4860f M t/t1460-refs-migrate.sh > > so the parsing machinery does not have to worry about case (1) above. Yeah, that seems good. And it is backwards-compatible with the existing diff-pairs format (which just barfs on a blank line). That's not a big concern for the project, but it is nice that it makes life a bit simpler for folks who would eventually be tasked with switching from it to this new tool. ;) > * Parse and append the input into diff_queue(), until you see an > blank line. > > - If at EOF you are done, but if you have something accumulated > in diff_queue(), show them (like below) first. In any case, at > EOF, you are done. Yep, makes sense. > * Run diffcore_std() followed by diff_flush() to have the contents > of the queue nicely formatted and emptied. Go back to parsing > more input lines. That makes sense. I don't think my diff-pairs runs diffcore_std() at all. The plumbing defaults mean it would always be a noop unless you explicitly passed in rename, etc, options, and I never wanted to do that. We might have to check the interaction of diffcore code on a set of queued diffs that already have values for renames, etc. I.e., that: git diff-tree --raw -M | git diff-pairs -M does not barf, since the input step in diff-pairs is going to set status to 'R', etc, in the pairs. -Peff
Jeff King <peff@peff.net> writes: > I recall that the performance of attr.tree is not great for _some_ > commands (like pack-objects). So it's perhaps reasonable to use for > single commands like "git diff" but not to set in your on-disk config. > It's possible we'd want to warn people about that before advertising it > more widely? I dunno. Or we disable the unusably-inefficient feature before doing so. Would attr.tree be much less efficient than GIT_ATTR_SOURCE?
On Mon, Dec 16, 2024 at 08:29:41AM -0800, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > I recall that the performance of attr.tree is not great for _some_ > > commands (like pack-objects). So it's perhaps reasonable to use for > > single commands like "git diff" but not to set in your on-disk config. > > It's possible we'd want to warn people about that before advertising it > > more widely? I dunno. > > Or we disable the unusably-inefficient feature before doing so. > Would attr.tree be much less efficient than GIT_ATTR_SOURCE? Whether it's unusably inefficient depends on what you throw at it. IIRC, the performance difference for pack-objects on git.git was fairly negligible. The problem in linux.git is that besides being big, it has a deep(er) directory structure. So collecting all of the attributes for a file like drivers/gpu/drm/foo/bar.h needs to open all of those intermediate trees. So I'd be inclined to leave it in place, in case somebody is actually happily using it. GIT_ATTR_SOURCE suffers all of the same problems; it's just that you'd presumably only use it with a few select commands (as far as I know, pack-objects is the worst case because it's looking up one attribute on every single blob in all of history). -Peff
Jeff King <peff@peff.net> writes: >> > I recall that the performance of attr.tree is not great for _some_ >> > commands (like pack-objects). So it's perhaps reasonable to use for >> > single commands like "git diff" but not to set in your on-disk config. >> > It's possible we'd want to warn people about that before advertising it >> > more widely? I dunno. >> >> Or we disable the unusably-inefficient feature before doing so. >> Would attr.tree be much less efficient than GIT_ATTR_SOURCE? > > Whether it's unusably inefficient depends on what you throw at it. IIRC, > the performance difference for pack-objects on git.git was fairly > negligible. The problem in linux.git is that besides being big, it has a > deep(er) directory structure. So collecting all of the attributes for a > file like drivers/gpu/drm/foo/bar.h needs to open all of those > intermediate trees. > > So I'd be inclined to leave it in place, in case somebody is actually > happily using it. > > GIT_ATTR_SOURCE suffers all of the same problems; it's just that you'd > presumably only use it with a few select commands (as far as I know, > pack-objects is the worst case because it's looking up one attribute on > every single blob in all of history). Ah, OK. So your "caution" was about the underlying mechanism to allow attributes corrected from the specified tree, and not specifically about using "attr.tree" to specify that tree? That was what got me confused. If that is the case, I do not think the documentation patch that started this exchange that adds attr.tree to where GIT_ATTR_SOURCE and --attr-source are already mentioned makes anything worse. Thanks.