Message ID | 20181111062312.16342-11-newren@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Series | fast export and import fixes and features |
On Sat, Nov 10, 2018 at 10:23:12PM -0800, Elijah Newren wrote: > fast-export output is traditionally used as an input to a fast-import > program, but it is also useful to help gather statistics about the > history of a repository (particularly when --no-data is also passed). > For example, two of the types of information we may want to collect > could include: > 1) general information about renames that have occurred > 2) what the biggest objects in a repository are and what names > they appear under. > > The first bit of information can be gathered by just passing -M to > fast-export. The second piece of information can partially be gotten > from running > git cat-file --batch-check --batch-all-objects > However, that only shows what the biggest objects in the repository are > and their sizes, not what names those objects appear as or what commits > they were introduced in. We can get that information from fast-export, > but when we only see > R oldname newname > instead of > R oldname newname > M 100644 $SHA1 newname > then it makes the job more difficult. Add an option which allows us to > force the latter output even when commits have exact renames of files. fast-export seems like a funny tool to look up paths. What about "git log --find-object=$SHA1" ? -Peff
On Sat, Nov 10, 2018 at 11:23 PM Jeff King <peff@peff.net> wrote: > > On Sat, Nov 10, 2018 at 10:23:12PM -0800, Elijah Newren wrote: > > > fast-export output is traditionally used as an input to a fast-import > > program, but it is also useful to help gather statistics about the > > history of a repository (particularly when --no-data is also passed). > > For example, two of the types of information we may want to collect > > could include: > > 1) general information about renames that have occurred > > 2) what the biggest objects in a repository are and what names > > they appear under. > > > > The first bit of information can be gathered by just passing -M to > > fast-export. The second piece of information can partially be gotten > > from running > > git cat-file --batch-check --batch-all-objects > > However, that only shows what the biggest objects in the repository are > > and their sizes, not what names those objects appear as or what commits > > they were introduced in. We can get that information from fast-export, > > but when we only see > > R oldname newname > > instead of > > R oldname newname > > M 100644 $SHA1 newname > > then it makes the job more difficult. Add an option which allows us to > > force the latter output even when commits have exact renames of files. > > fast-export seems like a funny tool to look up paths. What about "git > log --find-object=$SHA1" ? Eek, and give me O(N*M) behavior, where N is the number of commits in the repository and M is the number of renames that occur in its history? Also, that's the inverse of the lookup I need anyway (I have the commit and filename, but am missing the SHA). One of the problems with filter-branch that people often run into is they know what they want at a high-level (e.g. 
extract the history of this directory for a new repository, or rewrite the history of this repo to appear at a subdirectory so it can be merged into a bigger repo and people passing filenames to log will still get the history of those files, or I want to remove some of the big stuff in my history), but often times that's not quite enough. They need help finding big objects, or may be unaware that the subset of files they want used to be known by alternative names. I want a simple --analyze mode that can report on all files that have been renamed (so users don't just say "all I care about is these N files, give me a rewritten history just including those" -- we can point out to them whether those N files used to be known by other names), as well as reporting on all big files and if they've been deleted, and aggregations of the "big files" information across directories and file extensions.
On Sun, Nov 11, 2018 at 12:42:58AM -0800, Elijah Newren wrote: > > > fast-export output is traditionally used as an input to a fast-import > > > program, but it is also useful to help gather statistics about the > > > history of a repository (particularly when --no-data is also passed). > > > For example, two of the types of information we may want to collect > > > could include: > > > 1) general information about renames that have occurred > > > 2) what the biggest objects in a repository are and what names > > > they appear under. > > > > > > The first bit of information can be gathered by just passing -M to > > > fast-export. The second piece of information can partially be gotten > > > from running > > > git cat-file --batch-check --batch-all-objects > > > However, that only shows what the biggest objects in the repository are > > > and their sizes, not what names those objects appear as or what commits > > > they were introduced in. We can get that information from fast-export, > > > but when we only see > > > R oldname newname > > > instead of > > > R oldname newname > > > M 100644 $SHA1 newname > > > then it makes the job more difficult. Add an option which allows us to > > > force the latter output even when commits have exact renames of files. > > > > fast-export seems like a funny tool to look up paths. What about "git > > log --find-object=$SHA1" ? > > Eek, and give me O(N*M) behavior, where N is the number of commits in > the repository and M is the number of renames that occur in its > history? Also, that's the inverse of the lookup I need anyway (I have > the commit and filename, but am missing the SHA). Maybe I don't understand what you're trying to accomplish. I was thinking specifically of your "cat-file can tell you the large objects, but you don't know their names/commits" from above. 
I would do:

  git log --raw $(
    git cat-file --batch-check='%(objectsize:disk) %(objectname)' --batch-all-objects |
    sort -rn | head -3 |
    awk '{print "--find-object=" $2 }'
  )

I'm not sure how renames enter into it at all.

> One of the problems with filter-branch that people often run into is
> they know what they want at a high-level (e.g. extract the history of
> this directory for a new repository, or rewrite the history of this
> repo to appear at a subdirectory so it can be merged into a bigger
> repo and people passing filenames to log will still get the history of
> those files, or I want to remove some of the big stuff in my history),
> but often times that's not quite enough. They need help finding big
> objects, or may be unaware that the subset of files they want used to
> be known by alternative names.
>
> I want a simple --analyze mode that can report on all files that have
> been renamed (so users don't just say "all I care about is these N
> files, give me a rewritten history just including those" -- we can
> point out to them whether those N files used to be known by other
> names), as well as reporting on all big files and if they've been
> deleted, and aggregations of the "big files" information across
> directories and file extensions.

So this seems like a separate problem than what the commit message talks about.

There I think you'd want to assemble the list with something like "git log --follow --name-only paths-of-interest" except that --follow sucks too much to handle more than one path at a time.

But if you wanted to do it manually, then:

  git log --diff-filter=R --name-only

would be enough to let you track it down, wouldn't it?

-Peff
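For readers who would rather do the sort/head/awk step inside an analysis script, here is a minimal sketch of the same selection in Python. It assumes the input is the text printed by the cat-file command above, one "<size> <oid>" pair per line; the function name `find_object_args` is made up for illustration and is not part of git.

```python
def find_object_args(batch_check_output, top_n=3):
    """Return --find-object=<oid> arguments for the top_n largest objects.

    Assumption: every non-empty input line looks like "<size> <oid>", i.e.
    the output of
    git cat-file --batch-check='%(objectsize:disk) %(objectname)' --batch-all-objects
    """
    entries = []
    for line in batch_check_output.splitlines():
        if not line.strip():
            continue
        size, oid = line.split()
        entries.append((int(size), oid))
    # Largest first, mirroring `sort -rn | head -N` in the shell pipeline.
    entries.sort(key=lambda entry: entry[0], reverse=True)
    return ["--find-object=" + oid for _, oid in entries[:top_n]]
```

The returned list can be appended to a `git log --raw` argument vector. Passing a larger `top_n` covers the "report on all objects" case, at the price of a much longer command line.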
On Mon, Nov 12, 2018 at 4:58 AM Jeff King <peff@peff.net> wrote: > On Sun, Nov 11, 2018 at 12:42:58AM -0800, Elijah Newren wrote: > > Maybe I don't understand what you're trying to accomplish. I was > thinking specifically of your "cat-file can tell you the large objects, > but you don't know their names/commits" from above. Fair enough. And just to be clear, the first 9 patches were fixes and features around trying to rewrite history; patch 10 is orthogonal and was used for a separate run to just gather data. It is entirely possible I could gather that data other ways. > I would do: > > git log --raw $( > git cat-file --batch-check='%(objectsize:disk) %(objectname)' --batch-all-objects | > sort -rn | head -3 | > awk '{print "--find-object=" $2 }' > ) > > I'm not sure how renames enter into it at all. How did I miss objectsize:disk?? Especially since it is right next to objectsize in the manpage to boot? That's awesome, thanks for that pointer. I do have a separate cat-file --batch-check --batch-all-objects process already, since I can't get sizes out of either log or fast-export. However, I wouldn't use your 'head -3' since I'm not looking for the N biggest, but reporting on _all_ objects (in reverse size order) and letting the user look over the report and deciding where to stop reading. So, this is a big and expensive log command. Granted, we will need a big and expensive log command, but let's keep in mind that we have this one. > > One of the problems with filter-branch that people often run into is > > they know what they want at a high-level (e.g. extract the history of > > this directory for a new repository, or rewrite the history of this > > repo to appear at a subdirectory so it can be merged into a bigger > > repo and people passing filenames to log will still get the history of > > those files, or I want to remove some of the big stuff in my history), > > but often times that's not quite enough. 
They need help finding big
> > objects, or may be unaware that the subset of files they want used to
> > be known by alternative names.
> >
> > I want a simple --analyze mode that can report on all files that have
> > been renamed (so users don't just say "all I care about is these N
> > files, give me a rewritten history just including those" -- we can
> > point out to them whether those N files used to be known by other
> > names), as well as reporting on all big files and if they've been
> > deleted, and aggregations of the "big files" information across
> > directories and file extensions.
>
> So this seems like a separate problem than what the commit message talks
> about.
>
> There I think you'd want to assemble the list with something like "git
> log --follow --name-only paths-of-interest" except that --follow sucks
> too much to handle more than one path at a time.
>
> But if you wanted to do it manually, then:
>
> git log --diff-filter=R --name-only
>
> would be enough to let you track it down, wouldn't it?

Without a -M you'd only catch 100% renames, right? Those aren't the only ones I'd want to catch, so I'd need to add -M. You are right that we could get basic renames this way, but it doesn't cover everything I need. Let's use this as a starting point, though, and build up to what I need...

I also want to know when files were deleted. I've generally found that people are more okay with purging parts of history [corresponding to large objects] that were deleted longer ago than more recent stuff, for a variety of reasons. So we could either run yet another log, or modify the command to:

  git log -M --diff-filter=RD --name-status

However, I don't just want to know when files were deleted, I'd like to know when directories are deleted.
I only knew how to derive that from knowing what files existed within those directories, so that would take me to: git log -M --diff-filter=RAD --name-status [Edit: I just saw your other email and for the first time learned about the -t rev-list option which might simplify this a little, although "need to worry about deleted files being reinstated" below might require the 'A' anyway.] At this point, let's remember that we had another full git-log invocation for mapping object sizes to filenames. We might as well coalesce the two log commands into one, by extending this latest one to: git log -M --diff-filter=RAMD --no-abbrev --raw Also, I wanted commit date rather than author date, so we need to extend the headers a bit. Also, for reasons I won't bother detailing, I think I want to traverse commits in reverse topological order. So our command is: git log --pretty=fuller --topo-order --reverse -M --diff-filter=RAMD --no-abbrev --raw But that still leaves us with four problems, three of which we can solve with further extensions to this command: 1) There are some weird edge cases with deletions and renames. Lots of them in fact. At a simple level, branching and merging and multiple refs means that "is-this-deleted" isn't a binary flag for a given filename (but rather a binary flag per-ref). Also, it makes "the set of names associated with a single 'file' as perceived by the user" possibly rather ill-defined as well. This can get really hairy, but I'd at least like to handle the very basic cases of (a) "user re-instates filename that used to be deleted" (i.e. the file isn't deleted anymore) and (b) "user re-instates a filename that used to exist but was renamed to something else" (in such cases, we can't just treat the two filenames as being different names of the same content). Handling the (b) usecase sanely requires some topology information, so we need parents as well. 
So our command extends to:

  git log --parents --pretty=fuller --topo-order --reverse -M --diff-filter=RAMD --no-abbrev --raw

2) log is not plumbing, so parsing the stuff before the file modifications is not a good idea. This could be fixed by using --format:

  git log --format='%H%n%P%n%cd' --date=short --topo-order --reverse -M --diff-filter=RAMD --no-abbrev --raw

3) log won't show changes for merge commits by default; we'd need to add -c:

  git log --format='%H%n%P%n%cd' --date=short --topo-order --reverse -M --diff-filter=RAMD --no-abbrev --raw -c

4) log is not plumbing, revisited: although at this point I've specified the log output explicitly enough that it ought to be safe to parse, there are a few things that make me slightly worried. I can depend on fast-export to be stable; it only gives 'M' and 'D' unless you explicitly ask for more types (e.g. -M to detect renames will add 'R'). With log, I'm not so sure; do I need to worry about new types appearing in the future? Also, should I just drop --diff-filter=RAMD since it covers just about everything anyway? Also, while --raw is stable, is the combination of -c and --raw stable? Is --date=short stable (most likely, but still seems more likely to change than fast-export would be)? Is there something else I need to be worried about? Granted, each of those is only a small worry with log, but they add up and give me pause about whether I should be parsing its output in another tool.

So we've come up with an alternate way to get the data I need, though with some worries. I could potentially switch to using this and drop patch 10/10. Maybe there's even a good reason to prefer using log. But at the time I was thinking in terms of "I already have a tool that parses fast-export output and I know it's stable...and it has access to all the information I need so why not just get the information from it?" So I did that, and then realized towards the end that although it had all the needed info, it stripped one piece from me.
Namely, when it had a 100% rename, I'd only get

  R oldname newname

and wouldn't know the sha1sum of newname (for mapping object sizes to all their names). If I cached the information about all file shas for all trees I could pull it from that cache (which could be expensive memory-wise for large repos), or I could use the original-oid directive and keep another long running "git cat-file --batch-check='%(objectname)'" process and just pass it "$ORIGINAL_OID:$NEWNAME" lines as I come across them. However, fast-export had the information and did special work to try to avoid showing it when it thought it wouldn't be needed, so why not just add a flag to tell it to just give me the filemodify?

At this point, if folks don't like this patch, I'm more likely to use the supplementary cat-file process than switching to log, unless someone can ameliorate my concerns with it and suggest a good reason why it's actually better.

Anyway, I hope it makes a little more sense why I created this patch. Does it, or have I just made things even more confusing?

...and if you've read this far, I'm impressed. Thanks for reading.
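To make the "cache the file shas" alternative concrete, here is a rough sketch of recovering the blob id for an exact rename from earlier 'M' directives. It is an illustration, not the author's tool, and it rests on strong assumptions that are not from the thread: linear history, unquoted paths without spaces, and input already reduced to the 'M', 'R' and 'D' file-change lines of each commit. The function name `blob_names` is hypothetical.

```python
def blob_names(commits):
    """Map blob id -> set of paths it appeared under.

    `commits` is a list of commits in history order, each a list of
    fast-export file-change lines such as "M 100644 <oid> path",
    "R old new" or "D path" (as printed with --no-data -M).
    Assumptions: linear history, unquoted paths without spaces.
    """
    current = {}  # path -> blob id in the tree being walked
    names = {}    # blob id -> every path it has appeared under
    for lines in commits:
        touched = set()
        for line in lines:
            parts = line.split(" ")
            if parts[0] == "M" and len(parts) == 4:
                _, _mode, oid, path = parts
                current[path] = oid
                touched.add(path)
            elif parts[0] == "R" and len(parts) == 3:
                _, old, new = parts
                oid = current.pop(old, None)
                if oid is not None:
                    # An exact rename has no 'M' line, so the id can only
                    # come from the cache; a later 'M' may overwrite it.
                    current[new] = oid
                    touched.add(new)
            elif parts[0] == "D" and len(parts) == 2:
                current.pop(parts[1], None)
                touched.discard(parts[1])
        # Record names only once the commit's final state is known.
        for path in touched:
            names.setdefault(current[path], set()).add(path)
    return names
```

The `current` dictionary is the memory cost mentioned above: it holds one entry per live path, and a real tool handling branches would need one such map per open line of history.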
On Mon, Nov 12, 2018 at 10:08:10AM -0800, Elijah Newren wrote: > > I would do: > > > > git log --raw $( > > git cat-file --batch-check='%(objectsize:disk) %(objectname)' --batch-all-objects | > > sort -rn | head -3 | > > awk '{print "--find-object=" $2 }' > > ) > > > > I'm not sure how renames enter into it at all. > > How did I miss objectsize:disk?? Especially since it is right next to > objectsize in the manpage to boot? That's awesome, thanks for that > pointer. > > I do have a separate cat-file --batch-check --batch-all-objects > process already, since I can't get sizes out of either log or > fast-export. However, I wouldn't use your 'head -3' since I'm not > looking for the N biggest, but reporting on _all_ objects (in reverse > size order) and letting the user look over the report and deciding > where to stop reading. So, this is a big and expensive log command. > Granted, we will need a big and expensive log command, but let's keep > in mind that we have this one. It is an expensive log command, but it's the same expense as running fast-export, no? And I think maybe that is the disconnect. I am looking at this problem as "how do you answer question X in a repository". And I think you are looking at as "I am receiving a fast-export stream, and I need to answer question X on the fly". And that would explain why you want to get extra annotations into the fast-export stream. Is that right? > > There I think you'd want to assemble the list with something like "git > > log --follow --name-only paths-of-interest" except that --follow sucks > > too much to handle more than one path at a time. > > > > But if you wanted to do it manually, then: > > > > git log --diff-filter=R --name-only > > > > would be enough to let you track it down, wouldn't it? > > Without a -M you'd only catch 100% renames, right? Those aren't the > only ones I'd want to catch, so I'd need to add -M. 
You are right > that we could get basic renames this way, but it doesn't cover > everything I need. Let's use this as a starting point, though, and > build up to what I need... No, renames are on by default these days, and that includes inexact renames. That said, if you're scripting you probably ought to be doing: git rev-list HEAD | git diff-tree --stdin and there yes, you'd have to enable "-M" yourself (you touched on scripting and formatting below; diff-tree can accept the format options you'd want). > I also want to know when files were deleted. I've generally found > that people are more okay with purging parts of history [corresponding > to large ojbects] that were deleted longer ago than more recent stuff, > for a variety of reasons. So we could either run yet another log, or > modify the command to: > > git log -M --diff-filter=RD --name-status > > However, I don't just want to know when files were deleted, I'd like > to know when directories are deleted. I only knew how to derive that > from knowing what files existed within those directories, so that > would take me to: > > git log -M --diff-filter=RAD --name-status > > [Edit: I just saw your other email and for the first time learned > about the -t rev-list option which might simplify this a little, > although "need to worry about deleted files being reinstated" below > might require the 'A' anyway.] Yeah, I think "-t" would help your tree deletion problem. > At this point, let's remember that we had another full git-log > invocation for mapping object sizes to filenames. We might as well > coalesce the two log commands into one, by extending this latest one > to: > > git log -M --diff-filter=RAMD --no-abbrev --raw What is there besides RAMD? :) > I could potentially switch to using this and drop patch 10/10. So I'm still not _entirely_ clear on what you're trying to do with 10/10. I think maybe the "disconnect" part I wrote above explains it. 
If that's correct, then I think framing it in terms of the operations that you'd be able to perform _without running a separate traverse_ would make it more obvious. > Anyway, I hope it makes a little more sense why I created this patch. > Does it, or have I just made things even more confusing? Some of both, I think. > ...and if you've read this far, I'm impressed. Thanks for reading. I'll admit I skimmed near the end. ;) -Peff
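For anyone scripting against the rev-list | diff-tree pipeline suggested above, this is a small sketch of parsing one line of `--raw` output. It assumes the default (non `-z`) layout of a colon, two modes, two object ids and a status letter with optional score, followed by tab-separated paths, and it assumes paths that git did not need to quote. Combined-diff lines from `-c`, which start with more than one colon, are skipped. The name `parse_raw_line` is made up for illustration.

```python
def parse_raw_line(line):
    """Parse one non-combined `--raw` line, for example
    ':100644 100644 <old-oid> <new-oid> R100<TAB>old<TAB>new'.

    Returns None for anything else (format lines, combined '::' lines).
    Assumption: paths contain no tabs and were not quoted by git.
    """
    if not line.startswith(":") or line.startswith("::"):
        return None
    meta, *paths = line[1:].split("\t")
    old_mode, new_mode, old_oid, new_oid, status = meta.split(" ")
    return {
        "status": status[0],  # letter only; drops a score such as "100"
        "old_mode": old_mode,
        "new_mode": new_mode,
        "old_oid": old_oid,
        "new_oid": new_oid,
        "paths": paths,       # [path], or [old, new] for renames and copies
    }
```

Because the status letter is returned as data, a script can handle 'R', 'A', 'M' and 'D' explicitly and decide for itself what to do with any letter it does not recognize.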
On Tue, Nov 13, 2018 at 6:45 AM Jeff King <peff@peff.net> wrote: > It is an expensive log command, but it's the same expense as running > fast-export, no? And I think maybe that is the disconnect. I would expect an expensive log command to generally be the same expense as running fast-export, yes. But I would expect two expensive log commands to be twice the expense of a single fast-export (and you suggested two log commands: both the --find-object= one and the --diff-filter one). > I am looking at this problem as "how do you answer question X in a > repository". And I think you are looking at as "I am receiving a > fast-export stream, and I need to answer question X on the fly". > > And that would explain why you want to get extra annotations into the > fast-export stream. Is that right? I'm not trying to get information on the fly during a rewrite or anything like that. This is an optional pre-rewrite step (from a separate invocation of the tool) where I have multiple questions I want to answer. I'd like to answer them all relatively quickly, if possible, and I think all of them should be answerable with a single history traversal (plus a cat-file --batch-all-objects call to get object sizes, since I don't know of another way to get those). I'd be fine with switching from fast-export to log or something else if it met the needs better. As far as I can tell, you're trying to split each question apart and do a history traversal for each, and I don't see why that's better. Simpler, perhaps, but it seems worse for performance. Am I missing something? > > > There I think you'd want to assemble the list with something like "git > > > log --follow --name-only paths-of-interest" except that --follow sucks > > > too much to handle more than one path at a time. > > > > > > But if you wanted to do it manually, then: > > > > > > git log --diff-filter=R --name-only > > > > > > would be enough to let you track it down, wouldn't it? 
> > > > Without a -M you'd only catch 100% renames, right? Those aren't the > > only ones I'd want to catch, so I'd need to add -M. You are right > > that we could get basic renames this way, but it doesn't cover > > everything I need. Let's use this as a starting point, though, and > > build up to what I need... > > No, renames are on by default these days, and that includes inexact > renames. That said, if you're scripting you probably ought to be doing: > > git rev-list HEAD | git diff-tree --stdin > > and there yes, you'd have to enable "-M" yourself (you touched on > scripting and formatting below; diff-tree can accept the format options > you'd want). Ah, I didn't know renames were on by default; I somehow missed that. Also, the rev-list to diff-tree pipe is nice, but I also need parent and commit timestamp information. .... > Yeah, I think "-t" would help your tree deletion problem. Absolutely, thanks for the hint. Much appreciated. :-) > > At this point, let's remember that we had another full git-log > > invocation for mapping object sizes to filenames. We might as well > > coalesce the two log commands into one, by extending this latest one > > to: > > > > git log -M --diff-filter=RAMD --no-abbrev --raw > > What is there besides RAMD? :) Well, as you pointed out above, log detects renames by default, whereas it didn't used to. So, if someone had written some similar-ish history walking/parsing tool years ago that didn't depend need renames and was based on log output, there's a good chance their tool might start failing when rename detection was turned on by default, because instead of getting both a 'D' and an 'M' change, they'd get an unexpected 'R'. For my case, do I have to worry about similar future changes? Will copy detection ('C') or break detection ('B') become the default in the future? Do I have to worry about typechanges ('T")? Will new change types be added? 
I mean, the fast-export output could maybe change too, but it seems much less likely than with log. > > I could potentially switch to using this and drop patch 10/10. > > So I'm still not _entirely_ clear on what you're trying to do with > 10/10. I think maybe the "disconnect" part I wrote above explains it. If > that's correct, then I think framing it in terms of the operations that > you'd be able to perform _without running a separate traverse_ would > make it more obvious. Let me try to put it as briefly as I can. With as few traversals as possible, I want to: * Get all blob sizes * Map blob shas to filename(s) they appeared under in the history * Find when files and directories were deleted (and whether they were later reinstated, since that means they aren't actually gone) * Find sets of filenames referring to the same logical 'file'. (e.g. foo->bar in commit A and bar->baz in commit B mean that {foo,bar,baz} refer to the same 'file' so that a user has an easy report to look at to find out that if they just want to "keep baz and its history" then they need foo & bar & baz. I need to know about things like another foo or bar being introduced after the rename though, since that breaks the connection between filenames) * Do a few aggregations on the above data as well (e.g. all copies of postgres.exe add up to 20M -- why were those checked in anyway?, *.webm files in aggregate are .5G, your long-deleted src/video-server/ directory from that aborted experimental project years ago takes up 2G of your history, etc.) Right now, my best solution for this combination of questions is 'cat-file --batch-all-objects' plus fast-export, if I get patch 10/10 in place. I'm totally open to better solutions, including ones that don't use fast-export.
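The "sets of filenames referring to the same logical file" item in the list above can be sketched as a union-find over rename pairs. This is a deliberately simplified illustration: it ignores the caveat raised in the same item (a name being reintroduced after the rename) and ignores branch topology, both of which a real report would need to handle. The function name `rename_groups` is hypothetical.

```python
def rename_groups(renames):
    """Group names connected by renames, e.g. foo->bar and bar->baz
    yield one group {foo, bar, baz}.

    `renames` is an iterable of (old, new) pairs. Simplification: a name
    that is later reintroduced as a different file is NOT split out.
    """
    parent = {}

    def find(name):
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for old, new in renames:
        root_old, root_new = find(old), find(new)
        if root_old != root_new:
            parent[root_old] = root_new

    groups = {}
    for name in list(parent):
        groups.setdefault(find(name), set()).add(name)
    # Sorted lists make the result deterministic for reports and tests.
    return sorted(sorted(group) for group in groups.values())
```

A user asking to "keep baz and its history" could then be shown the whole group containing baz.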
On Tue, Nov 13, 2018 at 09:10:36AM -0800, Elijah Newren wrote: > > I am looking at this problem as "how do you answer question X in a > > repository". And I think you are looking at as "I am receiving a > > fast-export stream, and I need to answer question X on the fly". > > > > And that would explain why you want to get extra annotations into the > > fast-export stream. Is that right? > > I'm not trying to get information on the fly during a rewrite or > anything like that. This is an optional pre-rewrite step (from a > separate invocation of the tool) where I have multiple questions I > want to answer. I'd like to answer them all relatively quickly, if > possible, and I think all of them should be answerable with a single > history traversal (plus a cat-file --batch-all-objects call to get > object sizes, since I don't know of another way to get those). I'd be > fine with switching from fast-export to log or something else if it > met the needs better. Ah, OK. Yes, if we're just trying to query, then I think you should be able to do what you want with the existing traversal and diff tools. And if not, we should think about a new feature there, and not try to shoe-horn it into fast-export. > As far as I can tell, you're trying to split each question apart and > do a history traversal for each, and I don't see why that's better. > Simpler, perhaps, but it seems worse for performance. Am I missing > something? I was only trying to address each possible query individually. I agree that if you are querying both things, you should be able to do it in a single traversal (and that is strictly better). It may require a little more parsing of the output (e.g., `--find-object` is easy to implement yourself looking at --raw output). > Ah, I didn't know renames were on by default; I somehow missed that. > Also, the rev-list to diff-tree pipe is nice, but I also need parent > and commit timestamp information. 
diff-tree will format the commit info as well (before git-log was a C builtin, it was just a rev-list/diff-tree pipeline in a shell script). So you can do:

  git rev-list ... | git diff-tree --stdin --format='%h %ct %p' --raw -r -M

and get a dump very similar to what fast-export would give you.

> > > git log -M --diff-filter=RAMD --no-abbrev --raw
> >
> > What is there besides RAMD? :)
>
> Well, as you pointed out above, log detects renames by default,
> whereas it didn't used to.
> So, if someone had written some similar-ish history walking/parsing
> tool years ago that didn't depend need renames and was based on log
> output, there's a good chance their tool might start failing when
> rename detection was turned on by default, because instead of getting
> both a 'D' and an 'M' change, they'd get an unexpected 'R'.

Mostly I just meant: your diff-filter includes basically everything, so why bother filtering? You're going to have to parse the result anyway, and you can throw away uninteresting bits there.

> For my case, do I have to worry about similar future changes? Will
> copy detection ('C') or break detection ('B') become the default in
> the future? Do I have to worry about typechanges ('T")? Will new
> change types be added? I mean, the fast-export output could maybe
> change too, but it seems much less likely than with log.

If you use diff-tree, then it won't ever enable copy or break detection without you explicitly asking for it.

> Let me try to put it as briefly as I can. With as few traversals as
> possible, I want to:
> * Get all blob sizes
> * Map blob shas to filename(s) they appeared under in the history
> * Find when files and directories were deleted (and whether they
> were later reinstated, since that means they aren't actually gone)
> * Find sets of filenames referring to the same logical 'file'. (e.g.
> foo->bar in commit A and bar->baz in commit B mean that {foo,bar,baz} > refer to the same 'file' so that a user has an easy report to look at > to find out that if they just want to "keep baz and its history" then > they need foo & bar & baz. I need to know about things like another > foo or bar being introduced after the rename though, since that breaks > the connection between filenames) > * Do a few aggregations on the above data as well (e.g. all copies > of postgres.exe add up to 20M -- why were those checked in anyway?, > *.webm files in aggregate are .5G, your long-deleted src/video-server/ > directory from that aborted experimental project years ago takes up 2G > of your history, etc.) > > Right now, my best solution for this combination of questions is > 'cat-file --batch-all-objects' plus fast-export, if I get patch 10/10 > in place. I'm totally open to better solutions, including ones that > don't use fast-export. OK, I think I understand your problem better now. I don't think there's anything fast-export can show that log/diff-tree could not, aside from actual blob contents. But I don't think you want them (and if you did, you can use "cat-file --batch" to selectively request them). I think there's a general problem with any serialized output (log or fast-export) that things like rename tracking depend on the topology. If I rename "foo" to "bar" on one branch, and "bar" to "baz" on another branch, without reconstructing the parent graph you don't realize that those two things were on parallel branches, and not a sequence. But with the parent ids, you can delve as deep as you like in your analysis script. -Peff
diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt index 4e40f0b99a..946a5aee1f 100644 --- a/Documentation/git-fast-export.txt +++ b/Documentation/git-fast-export.txt @@ -128,6 +128,17 @@ marks the same across runs. for intermediary filters (e.g. for rewriting commit messages which refer to older commits, or for stripping blobs by id). +--always-show-modify-after-rename:: + When a rename is detected, fast-export normally issues both a + 'R' (rename) and a 'M' (modify) directive. However, if the + contents of the old and new filename match exactly, it will + only issue the rename directive. Use this flag to have it + always issue the modify directive after the rename, which may + be useful for tools which are using the fast-export stream as + a mechanism for gathering statistics about a repository. Note + that this option only has effect when rename detection is + active (see the -M option). + --refspec:: Apply the specified refspec to each ref exported. Multiple of them can be specified. 
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index cc01dcc90c..db606d1fd0 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -38,6 +38,7 @@ static int use_done_feature;
 static int no_data;
 static int full_tree;
 static int reference_excluded_commits;
+static int always_show_modify_after_rename;
 static int show_original_ids;
 static struct string_list extra_refs = STRING_LIST_INIT_NODUP;
 static struct string_list tag_refs = STRING_LIST_INIT_NODUP;
@@ -407,7 +408,8 @@ static void show_filemodify(struct diff_queue_struct *q,
 				putchar('\n');
 
 				if (oideq(&ospec->oid, &spec->oid) &&
-				    ospec->mode == spec->mode)
+				    ospec->mode == spec->mode &&
+				    !always_show_modify_after_rename)
 					break;
 			}
 			/* fallthrough */
@@ -1099,6 +1101,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 			 &reference_excluded_commits, N_("Reference parents which are not in fast-export stream by sha1sum")),
 		OPT_BOOL(0, "show-original-ids", &show_original_ids,
 			 N_("Show original sha1sums of blobs/commits")),
+		OPT_BOOL(0, "always-show-modify-after-rename",
+			 &always_show_modify_after_rename,
+			 N_("Always provide 'M' directive after 'R'")),
 		OPT_END()
 	};
 
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 5ad6669910..d0c30672ac 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -638,4 +638,40 @@ test_expect_success 'merge commit gets exported with --import-marks' '
 	)
 '
 
+test_expect_success 'rename detection and --always-show-modify-after-rename' '
+	test_create_repo renames &&
+	(
+		cd renames &&
+		test_seq 0 9 >single_digit &&
+		test_seq 10 98 >double_digit &&
+		git add . &&
+		git commit -m initial &&
+
+		echo 99 >>double_digit &&
+		git mv single_digit single-digit &&
+		git mv double_digit double-digit &&
+		git add double-digit &&
+		git commit -m renames &&
+
+		# First, check normal fast-export -M output
+		git fast-export -M --no-data master >out &&
+
+		grep double-digit out >out2 &&
+		test_line_count = 2 out2 &&
+
+		grep single-digit out >out2 &&
+		test_line_count = 1 out2 &&
+
+		# Now, test with --always-show-modify-after-rename; should
+		# have an extra "M" directive for "single-digit".
+		git fast-export -M --no-data --always-show-modify-after-rename master >out &&
+
+		grep double-digit out >out2 &&
+		test_line_count = 2 out2 &&
+
+		grep single-digit out >out2 &&
+		test_line_count = 2 out2
+	)
+'
+
 test_done
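For readers who want to see why the extra 'M' directive matters to a
statistics tool, here is a rough sketch of a stream consumer. It reads a
fast-export stream produced with -M and --no-data and prints
"oldname newname blob-id" for each rename, or "-" in place of the blob id
when no 'M' line for the new name follows the 'R' line (the exact-rename
case without the proposed option). The function name rename_blobs is
hypothetical, and the sketch assumes paths contain no spaces or quoting.

```shell
#!/bin/sh
# Sketch only: map each rename in a fast-export stream (stdin) to the blob
# id given by an immediately following "M <mode> <id> <newname>" line.
# rename_blobs is a made-up name; paths with spaces or quotes are not handled.
rename_blobs() {
	LC_ALL=C awk '
		skip > 0        { skip -= length($0) + 1; next }  # inside a data payload
		/^data [0-9]+$/ { skip = $2 + 0; next }           # payload of $2 bytes follows
		pending {
			pending = 0
			if ($1 == "M" && $4 == new) { print old, new, $3; next }
			print old, new, "-"                       # rename without a modify line
		}
		$1 == "R" && NF == 3 { old = $2; new = $3; pending = 1 }
		END { if (pending) print old, new, "-" }
	'
}

# Possible usage inside a repository:
#   git fast-export -M --no-data master | rename_blobs
```

With a stream that always carries the 'M' line after 'R', every row would
have a blob id and the "-" fallback would never be needed.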
fast-export output is traditionally used as an input to a fast-import
program, but it is also useful to help gather statistics about the
history of a repository (particularly when --no-data is also passed).
For example, two of the types of information we may want to collect
could include:
  1) general information about renames that have occurred
  2) what the biggest objects in a repository are and what names
     they appear under.

The first bit of information can be gathered by just passing -M to
fast-export. The second piece of information can partially be gotten
from running
    git cat-file --batch-check --batch-all-objects
However, that only shows what the biggest objects in the repository are
and their sizes, not what names those objects appear as or what commits
they were introduced in. We can get that information from fast-export,
but when we only see
    R oldname newname
instead of
    R oldname newname
    M 100644 $SHA1 newname
then it makes the job more difficult. Add an option which allows us to
force the latter output even when commits have exact renames of files.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-fast-export.txt | 11 ++++++++++
 builtin/fast-export.c             |  7 +++++-
 t/t9350-fast-export.sh            | 36 +++++++++++++++++++++++++++++++
 3 files changed, 53 insertions(+), 1 deletion(-)
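The "biggest objects" half of the problem described above can be sketched as
a small filter over the cat-file output. It assumes the default
--batch-check line format of "<object-id> <type> <size>" and prints the
largest blobs as "size object-id". The function name biggest_blobs and its
optional count argument are illustrative only.

```shell
#!/bin/sh
# Sketch only: read "<object-id> <type> <size>" lines on stdin and print
# the N largest blobs (default 10) as "<size> <object-id>".
# biggest_blobs is a made-up name.
biggest_blobs() {
	awk '$2 == "blob" { print $3, $1 }' | sort -k1,1nr | head -n "${1:-10}"
}

# Possible usage inside a repository:
#   git cat-file --batch-check --batch-all-objects | biggest_blobs 20
```

This yields only object ids and sizes; the names those blobs appear under
still have to come from another source such as the fast-export stream.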