[10/10] fast-export: add --always-show-modify-after-rename

Message ID	20181111062312.16342-11-newren@gmail.com (mailing list archive)
State	New, archived
Headers	Return-Path: <git-owner@kernel.org> From: Elijah Newren <newren@gmail.com> To: git@vger.kernel.org Cc: larsxschneider@gmail.com, sandals@crustytoothpaste.net, peff@peff.net, me@ttaylorr.com, jrnieder@gmail.com, Elijah Newren <newren@gmail.com> Subject: [PATCH 10/10] fast-export: add --always-show-modify-after-rename Date: Sat, 10 Nov 2018 22:23:12 -0800 Message-Id: <20181111062312.16342-11-newren@gmail.com> In-Reply-To: <20181111062312.16342-1-newren@gmail.com> References: <CABPp-BEefqYADr8SVvh6uFWkp96PDv7qfKK1c9O1WUnPy3wqrw@mail.gmail.com> <20181111062312.16342-1-newren@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: git-owner@vger.kernel.org Precedence: bulk
Series	fast export and import fixes and features
  [00/10] fast export and import fixes and features
  [01/10] git-fast-import.txt: fix documentation for --quiet option
  [02/10] git-fast-export.txt: clarify misleading documentation about rev-list args
  [03/10] fast-export: use value from correct enum
  [04/10] fast-export: avoid dying when filtering by paths and old tags exist
  [05/10] fast-export: move commit rewriting logic into a function for reuse
  [06/10] fast-export: when using paths, avoid corrupt stream with non-existent mark
  [07/10] fast-export: ensure we export requested refs
  [08/10] fast-export: add --reference-excluded-parents option
  [09/10] fast-export: add a --show-original-ids option to show original names
  [10/10] fast-export: add --always-show-modify-after-rename

Elijah Newren Nov. 11, 2018, 6:23 a.m. UTC

fast-export output is traditionally used as an input to a fast-import
program, but it is also useful to help gather statistics about the
history of a repository (particularly when --no-data is also passed).
For example, two of the types of information we may want to collect
could include:
  1) general information about renames that have occurred
  2) what the biggest objects in a repository are and what names
     they appear under.

The first bit of information can be gathered by just passing -M to
fast-export.  The second piece of information can partially be gotten
from running
    git cat-file --batch-check --batch-all-objects
However, that only shows what the biggest objects in the repository are
and their sizes, not what names those objects appear as or what commits
they were introduced in.  We can get that information from fast-export,
but when we only see
    R oldname newname
instead of
    R oldname newname
    M 100644 $SHA1 newname
then it makes the job more difficult.  Add an option which allows us to
force the latter output even when commits have exact renames of files.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-fast-export.txt | 11 ++++++++++
 builtin/fast-export.c             |  7 +++++-
 t/t9350-fast-export.sh            | 36 +++++++++++++++++++++++++++++++
 3 files changed, 53 insertions(+), 1 deletion(-)
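
As an illustration of the kind of consumer the commit message describes, here is a minimal sketch that extracts "sha path" pairs from a `fast-export --no-data` stream. The helper name `blob_names` is hypothetical and not part of the patch. It assumes commit messages end with a newline and that paths contain no spaces or special characters (fast-export quotes such paths).

```shell
# Hypothetical helper: read a `git fast-export --no-data` stream on stdin and
# print "<sha> <path>" for every filemodify ('M') line.
# Assumptions: commit messages end with a newline; paths are unquoted.
# Usage (in a repository): git fast-export --no-data -M --all | blob_names
blob_names() {
  LC_ALL=C awk '
    skip > 0             { skip -= length($0) + 1; next }  # inside a "data" payload
    $1 == "data"         { skip = $2; next }               # $2 bytes of payload follow
    $1 == "M" && NF >= 4 { print $3, $4 }                  # M <mode> <sha> <path>
  '
}
```

When a commit contains an exact rename, the stream carries only `R oldname newname`, so this sketch prints nothing for `newname`. That is the gap the proposed option is meant to close.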

Jeff King Nov. 11, 2018, 7:23 a.m. UTC | #1

On Sat, Nov 10, 2018 at 10:23:12PM -0800, Elijah Newren wrote:

> fast-export output is traditionally used as an input to a fast-import
> program, but it is also useful to help gather statistics about the
> history of a repository (particularly when --no-data is also passed).
> For example, two of the types of information we may want to collect
> could include:
>   1) general information about renames that have occurred
>   2) what the biggest objects in a repository are and what names
>      they appear under.
> 
> The first bit of information can be gathered by just passing -M to
> fast-export.  The second piece of information can partially be gotten
> from running
>     git cat-file --batch-check --batch-all-objects
> However, that only shows what the biggest objects in the repository are
> and their sizes, not what names those objects appear as or what commits
> they were introduced in.  We can get that information from fast-export,
> but when we only see
>     R oldname newname
> instead of
>     R oldname newname
>     M 100644 $SHA1 newname
> then it makes the job more difficult.  Add an option which allows us to
> force the latter output even when commits have exact renames of files.

fast-export seems like a funny tool to look up paths. What about "git
log --find-object=$SHA1" ?

-Peff

Elijah Newren Nov. 11, 2018, 8:42 a.m. UTC | #2

On Sat, Nov 10, 2018 at 11:23 PM Jeff King <peff@peff.net> wrote:
>
> On Sat, Nov 10, 2018 at 10:23:12PM -0800, Elijah Newren wrote:
>
> > fast-export output is traditionally used as an input to a fast-import
> > program, but it is also useful to help gather statistics about the
> > history of a repository (particularly when --no-data is also passed).
> > For example, two of the types of information we may want to collect
> > could include:
> >   1) general information about renames that have occurred
> >   2) what the biggest objects in a repository are and what names
> >      they appear under.
> >
> > The first bit of information can be gathered by just passing -M to
> > fast-export.  The second piece of information can partially be gotten
> > from running
> >     git cat-file --batch-check --batch-all-objects
> > However, that only shows what the biggest objects in the repository are
> > and their sizes, not what names those objects appear as or what commits
> > they were introduced in.  We can get that information from fast-export,
> > but when we only see
> >     R oldname newname
> > instead of
> >     R oldname newname
> >     M 100644 $SHA1 newname
> > then it makes the job more difficult.  Add an option which allows us to
> > force the latter output even when commits have exact renames of files.
>
> fast-export seems like a funny tool to look up paths. What about "git
> log --find-object=$SHA1" ?

Eek, and give me O(N*M) behavior, where N is the number of commits in
the repository and M is the number of renames that occur in its
history?  Also, that's the inverse of the lookup I need anyway (I have
the commit and filename, but am missing the SHA).

One of the problems with filter-branch that people often run into is
they know what they want at a high-level (e.g. extract the history of
this directory for a new repository, or rewrite the history of this
repo to appear at a subdirectory so it can be merged into a bigger
repo and people passing filenames to log will still get the history of
those files, or I want to remove some of the big stuff in my history),
but often times that's not quite enough.  They need help finding big
objects, or may be unaware that the subset of files they want used to
be known by alternative names.

I want a simple --analyze mode that can report on all files that have
been renamed (so users don't just say "all I care about is these N
files, give me a rewritten history just including those" -- we can
point out to them whether those N files used to be known by other
names), as well as reporting on all big files and if they've been
deleted, and aggregations of the "big files" information across
directories and file extensions.

Jeff King Nov. 12, 2018, 12:58 p.m. UTC | #3

On Sun, Nov 11, 2018 at 12:42:58AM -0800, Elijah Newren wrote:

> > > fast-export output is traditionally used as an input to a fast-import
> > > program, but it is also useful to help gather statistics about the
> > > history of a repository (particularly when --no-data is also passed).
> > > For example, two of the types of information we may want to collect
> > > could include:
> > >   1) general information about renames that have occurred
> > >   2) what the biggest objects in a repository are and what names
> > >      they appear under.
> > >
> > > The first bit of information can be gathered by just passing -M to
> > > fast-export.  The second piece of information can partially be gotten
> > > from running
> > >     git cat-file --batch-check --batch-all-objects
> > > However, that only shows what the biggest objects in the repository are
> > > and their sizes, not what names those objects appear as or what commits
> > > they were introduced in.  We can get that information from fast-export,
> > > but when we only see
> > >     R oldname newname
> > > instead of
> > >     R oldname newname
> > >     M 100644 $SHA1 newname
> > > then it makes the job more difficult.  Add an option which allows us to
> > > force the latter output even when commits have exact renames of files.
> >
> > fast-export seems like a funny tool to look up paths. What about "git
> > log --find-object=$SHA1" ?
> 
> Eek, and give me O(N*M) behavior, where N is the number of commits in
> the repository and M is the number of renames that occur in its
> history?  Also, that's the inverse of the lookup I need anyway (I have
> the commit and filename, but am missing the SHA).

Maybe I don't understand what you're trying to accomplish. I was
thinking specifically of your "cat-file can tell you the large objects,
but you don't know their names/commits" from above.

I would do:

   git log --raw $(
     git cat-file --batch-check='%(objectsize:disk) %(objectname)' --batch-all-objects |
     sort -rn | head -3 |
     awk '{print "--find-object=" $2 }'
   )

I'm not sure how renames enter into it at all.

> One of the problems with filter-branch that people often run into is
> they know what they want at a high-level (e.g. extract the history of
> this directory for a new repository, or rewrite the history of this
> repo to appear at a subdirectory so it can be merged into a bigger
> repo and people passing filenames to log will still get the history of
> those files, or I want to remove some of the big stuff in my history),
> but often times that's not quite enough.  They need help finding big
> objects, or may be unaware that the subset of files they want used to
> be known by alternative names.
> 
> I want a simple --analyze mode that can report on all files that have
> been renamed (so users don't just say "all I care about is these N
> files, give me a rewritten history just including those" -- we can
> point out to them whether those N files used to be known by other
> names), as well as reporting on all big files and if they've been
> deleted, and aggregations of the "big files" information across
> directories and file extensions.

So this seems like a separate problem than what the commit message talks
about.

There I think you'd want to assemble the list with something like "git
log --follow --name-only paths-of-interest" except that --follow sucks
too much to handle more than one path at a time.

But if you wanted to do it manually, then:

  git log --diff-filter=R --name-only

would be enough to let you track it down, wouldn't it?

-Peff

Elijah Newren Nov. 12, 2018, 6:08 p.m. UTC | #4

On Mon, Nov 12, 2018 at 4:58 AM Jeff King <peff@peff.net> wrote:
> On Sun, Nov 11, 2018 at 12:42:58AM -0800, Elijah Newren wrote:
>
> Maybe I don't understand what you're trying to accomplish. I was
> thinking specifically of your "cat-file can tell you the large objects,
> but you don't know their names/commits" from above.

Fair enough.  And just to be clear, the first 9 patches were fixes and
features around trying to rewrite history; patch 10 is orthogonal and
was used for a separate run to just gather data.  It is entirely
possible I could gather that data other ways.

> I would do:
>
>    git log --raw $(
>      git cat-file --batch-check='%(objectsize:disk) %(objectname)' --batch-all-objects |
>      sort -rn | head -3 |
>      awk '{print "--find-object=" $2 }'
>    )
>
> I'm not sure how renames enter into it at all.

How did I miss objectsize:disk??  Especially since it is right next to
objectsize in the manpage to boot?  That's awesome, thanks for that
pointer.

I do have a separate cat-file --batch-check --batch-all-objects
process already, since I can't get sizes out of either log or
fast-export.  However, I wouldn't use your 'head -3' since I'm not
looking for the N biggest, but reporting on _all_ objects (in reverse
size order) and letting the user look over the report and decide
where to stop reading.  So, this is a big and expensive log command.
Granted, we will need a big and expensive log command, but let's keep
in mind that we have this one.
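
A report over all blobs in reverse size order, as described above, could be sketched roughly as follows. The helper name `blobs_by_size` is hypothetical. It assumes its input comes from `git cat-file --batch-check` with the format string shown in the usage comment.

```shell
# Hypothetical helper: keep only blobs and list them largest first.
# Expects "<size> <sha> <type>" lines on stdin, e.g. from:
#   git cat-file --batch-all-objects \
#       --batch-check='%(objectsize:disk) %(objectname) %(objecttype)'
blobs_by_size() {
  awk '$3 == "blob" { print $1, $2 }' | sort -rn
}
```

Nothing is truncated with `head`, so the reader of the report decides where to stop.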

> > One of the problems with filter-branch that people often run into is
> > they know what they want at a high-level (e.g. extract the history of
> > this directory for a new repository, or rewrite the history of this
> > repo to appear at a subdirectory so it can be merged into a bigger
> > repo and people passing filenames to log will still get the history of
> > those files, or I want to remove some of the big stuff in my history),
> > but often times that's not quite enough.  They need help finding big
> > objects, or may be unaware that the subset of files they want used to
> > be known by alternative names.
> >
> > I want a simple --analyze mode that can report on all files that have
> > been renamed (so users don't just say "all I care about is these N
> > files, give me a rewritten history just including those" -- we can
> > point out to them whether those N files used to be known by other
> > names), as well as reporting on all big files and if they've been
> > deleted, and aggregations of the "big files" information across
> > directories and file extensions.
>
> So this seems like a separate problem than what the commit message talks
> about.
>
> There I think you'd want to assemble the list with something like "git
> log --follow --name-only paths-of-interest" except that --follow sucks
> too much to handle more than one path at a time.
>
> But if you wanted to do it manually, then:
>
>   git log --diff-filter=R --name-only
>
> would be enough to let you track it down, wouldn't it?

Without a -M you'd only catch 100% renames, right?  Those aren't the
only ones I'd want to catch, so I'd need to add -M.  You are right
that we could get basic renames this way, but it doesn't cover
everything I need.  Let's use this as a starting point, though, and
build up to what I need...

I also want to know when files were deleted.  I've generally found
that people are more okay with purging parts of history [corresponding
to large objects] that were deleted longer ago than more recent stuff,
for a variety of reasons.  So we could either run yet another log, or
modify the command to:

  git log -M --diff-filter=RD --name-status

However, I don't just want to know when files were deleted, I'd like
to know when directories are deleted.  I only knew how to derive that
from knowing what files existed within those directories, so that
would take me to:

  git log -M --diff-filter=RAD --name-status

[Edit: I just saw your other email and for the first time learned
about the -t rev-list option which might simplify this a little,
although "need to worry about deleted files being reinstated" below
might require the 'A' anyway.]

At this point, let's remember that we had another full git-log
invocation for mapping object sizes to filenames.  We might as well
coalesce the two log commands into one, by extending this latest one
to:

  git log -M --diff-filter=RAMD --no-abbrev --raw

Also, I wanted commit date rather than author date, so we need to
extend the headers a bit.  Also, for reasons I won't bother detailing,
I think I want to traverse commits in reverse topological order.  So
our command is:

  git log --pretty=fuller --topo-order --reverse -M --diff-filter=RAMD \
      --no-abbrev --raw

But that still leaves us with four problems, three of which we can
solve with further extensions to this command:

1) There are some weird edge cases with deletions and renames.  Lots
of them in fact.  At a simple level, branching and merging and
multiple refs means that "is-this-deleted" isn't a binary flag for a
given filename (but rather a binary flag per-ref).  Also, it makes
"the set of names associated with a single 'file' as perceived by the
user" possibly rather ill-defined as well.  This can get really hairy,
but I'd at least like to handle the very basic cases of (a) "user
re-instates filename that used to be deleted" (i.e. the file isn't
deleted anymore) and (b) "user re-instates a filename that used to
exist but was renamed to something else" (in such cases, we can't just
treat the two filenames as being different names of the same content).
Handling the (b) usecase sanely requires some topology information, so
we need parents as well.  So our command extends to:

   git log --parents --pretty=fuller --topo-order --reverse -M \
       --diff-filter=RAMD --no-abbrev --raw

2) log is not plumbing, so parsing the stuff before the file
modifications is not a good idea. This could be fixed by using
--format:

  git log --format='%H%n%P%n%cd' --date=short --topo-order --reverse \
      -M --diff-filter=RAMD --no-abbrev --raw

3) log won't show changes for merge commits by default; we'd need to add -c:

  git log --format='%H%n%P%n%cd' --date=short --topo-order --reverse \
      -M --diff-filter=RAMD --no-abbrev --raw -c

4) log is not plumbing, revisited: although at this point I've
specified the log output explicitly enough that it ought to be safe to
parse, there are a few things that make me slightly worried.  I can
depend on fast-export to be stable; it only gives 'M' and 'D' unless
you explicitly ask for more types (e.g. -M to detect renames will add
'R').  With log, I'm not so sure; do I need to worry about new types
appearing in the future?  Also, should I just drop --diff-filter=RAMD
since it covers just about everything anyway?  Also, while --raw is
stable, is the combination of -c and --raw stable?  Is --date=short
stable (most likely, but still seems more likely to change than
fast-export would be)?  Is there something else I need to be worried
about?  Granted, each of those is only a small worry with log, but
they add up and give me pause about whether I should be parsing its
output in another tool.

So we've come up with an alternate way to get the data I need, though
with some worries.

I could potentially switch to using this and drop patch 10/10.  Maybe
there's even a good reason to prefer using log.  But at the time I was
thinking in terms of "I already have a tool that parses fast-export
output and I know it's stable...and it has access to all the
information I need so why not just get the information from it?"  So I
did that, and then realized towards the end that although it had all
the needed info, it stripped one piece from me.  Namely, when it had a
100% rename, I'd only get
   R oldname newname
and wouldn't know the sha1sum of newname (for mapping object sizes to
all their names).  If I cached the information about all file shas for
all trees I could pull it from that cache (which could be expensive
memory-wise for large repos), or I could use the original-oid
directive and keep another long-running "git cat-file
--batch-check='%(objectname)'" process and just pass it
"$ORIGINAL_OID:$NEWNAME" lines as I come across them.  However,
fast-export had the information and did special work to try to avoid
showing it when it thought it wouldn't be needed, so why not just add
a flag to tell it to just give me the filemodify?

At this point, if folks don't like this patch, I'm more likely to use
the supplementary cat-file process than switching to log, unless
someone can ameliorate my concerns with it and suggest a good reason
why it's actually better.
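
The supplementary cat-file approach mentioned above could be sketched as follows. The helper name `rename_queries` is hypothetical. It assumes the stream was produced with the `--show-original-ids` option from patch 09/10, that commit messages end with a newline, and that paths are unquoted (no spaces or special characters).

```shell
# Hypothetical helper: for every rename in a fast-export stream, print
# "<original commit id>:<new path>", a form that `git cat-file --batch-check`
# accepts as an object name on stdin.
# Usage (in a repository):
#   git fast-export --no-data -M --show-original-ids --all |
#     rename_queries | git cat-file --batch-check='%(objectname)'
rename_queries() {
  LC_ALL=C awk '
    skip > 0             { skip -= length($0) + 1; next }  # inside a "data" payload
    $1 == "data"         { skip = $2; next }
    $1 == "original-oid" { oid = $2; next }                # id of the current commit
    $1 == "R" && NF == 3 { print oid ":" $3 }              # R <old> <new>
  '
}
```

Inexact renames are queried as well, which is redundant (an `M` line follows them anyway) but harmless for a sketch.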

Anyway, I hope it makes a little more sense why I created this patch.
Does it, or have I just made things even more confusing?

...and if you've read this far, I'm impressed.  Thanks for reading.

Jeff King Nov. 13, 2018, 2:45 p.m. UTC | #5

On Mon, Nov 12, 2018 at 10:08:10AM -0800, Elijah Newren wrote:

> > I would do:
> >
> >    git log --raw $(
> >      git cat-file --batch-check='%(objectsize:disk) %(objectname)' --batch-all-objects |
> >      sort -rn | head -3 |
> >      awk '{print "--find-object=" $2 }'
> >    )
> >
> > I'm not sure how renames enter into it at all.
> 
> How did I miss objectsize:disk??  Especially since it is right next to
> objectsize in the manpage to boot?  That's awesome, thanks for that
> pointer.
> 
> I do have a separate cat-file --batch-check --batch-all-objects
> process already, since I can't get sizes out of either log or
> fast-export.  However, I wouldn't use your 'head -3' since I'm not
> looking for the N biggest, but reporting on _all_ objects (in reverse
> size order) and letting the user look over the report and decide
> where to stop reading.  So, this is a big and expensive log command.
> Granted, we will need a big and expensive log command, but let's keep
> in mind that we have this one.

It is an expensive log command, but it's the same expense as running
fast-export, no? And I think maybe that is the disconnect.

I am looking at this problem as "how do you answer question X in a
repository". And I think you are looking at it as "I am receiving a
fast-export stream, and I need to answer question X on the fly".

And that would explain why you want to get extra annotations into the
fast-export stream. Is that right?

> > There I think you'd want to assemble the list with something like "git
> > log --follow --name-only paths-of-interest" except that --follow sucks
> > too much to handle more than one path at a time.
> >
> > But if you wanted to do it manually, then:
> >
> >   git log --diff-filter=R --name-only
> >
> > would be enough to let you track it down, wouldn't it?
> 
> Without a -M you'd only catch 100% renames, right?  Those aren't the
> only ones I'd want to catch, so I'd need to add -M.  You are right
> that we could get basic renames this way, but it doesn't cover
> everything I need.  Let's use this as a starting point, though, and
> build up to what I need...

No, renames are on by default these days, and that includes inexact
renames. That said, if you're scripting you probably ought to be doing:

  git rev-list HEAD | git diff-tree --stdin

and there yes, you'd have to enable "-M" yourself (you touched on
scripting and formatting below; diff-tree can accept the format options
you'd want).

> I also want to know when files were deleted.  I've generally found
> that people are more okay with purging parts of history [corresponding
> to large objects] that were deleted longer ago than more recent stuff,
> for a variety of reasons.  So we could either run yet another log, or
> modify the command to:
> 
>   git log -M --diff-filter=RD --name-status
> 
> However, I don't just want to know when files were deleted, I'd like
> to know when directories are deleted.  I only knew how to derive that
> from knowing what files existed within those directories, so that
> would take me to:
> 
>   git log -M --diff-filter=RAD --name-status
> 
> [Edit: I just saw your other email and for the first time learned
> about the -t rev-list option which might simplify this a little,
> although "need to worry about deleted files being reinstated" below
> might require the 'A' anyway.]

Yeah, I think "-t" would help your tree deletion problem.

> At this point, let's remember that we had another full git-log
> invocation for mapping object sizes to filenames.  We might as well
> coalesce the two log commands into one, by extending this latest one
> to:
> 
>   git log -M --diff-filter=RAMD --no-abbrev --raw

What is there besides RAMD? :)

> I could potentially switch to using this and drop patch 10/10.

So I'm still not _entirely_ clear on what you're trying to do with
10/10. I think maybe the "disconnect" part I wrote above explains it. If
that's correct, then I think framing it in terms of the operations that
you'd be able to perform _without running a separate traverse_ would
make it more obvious.

> Anyway, I hope it makes a little more sense why I created this patch.
> Does it, or have I just made things even more confusing?

Some of both, I think.

> ...and if you've read this far, I'm impressed.  Thanks for reading.

I'll admit I skimmed near the end. ;)

-Peff

Elijah Newren Nov. 13, 2018, 5:10 p.m. UTC | #6

On Tue, Nov 13, 2018 at 6:45 AM Jeff King <peff@peff.net> wrote:
> It is an expensive log command, but it's the same expense as running
> fast-export, no? And I think maybe that is the disconnect.

I would expect an expensive log command to generally be the same
expense as running fast-export, yes.  But I would expect two expensive
log commands to be twice the expense of a single fast-export (and you
suggested two log commands: both the --find-object= one and the
--diff-filter one).

> I am looking at this problem as "how do you answer question X in a
> repository". And I think you are looking at it as "I am receiving a
> fast-export stream, and I need to answer question X on the fly".
>
> And that would explain why you want to get extra annotations into the
> fast-export stream. Is that right?

I'm not trying to get information on the fly during a rewrite or
anything like that.  This is an optional pre-rewrite step (from a
separate invocation of the tool) where I have multiple questions I
want to answer.  I'd like to answer them all relatively quickly, if
possible, and I think all of them should be answerable with a single
history traversal (plus a cat-file --batch-all-objects call to get
object sizes, since I don't know of another way to get those).  I'd be
fine with switching from fast-export to log or something else if it
met the needs better.

As far as I can tell, you're trying to split each question apart and
do a history traversal for each, and I don't see why that's better.
Simpler, perhaps, but it seems worse for performance.  Am I missing
something?

> > > There I think you'd want to assemble the list with something like "git
> > > log --follow --name-only paths-of-interest" except that --follow sucks
> > > too much to handle more than one path at a time.
> > >
> > > But if you wanted to do it manually, then:
> > >
> > >   git log --diff-filter=R --name-only
> > >
> > > would be enough to let you track it down, wouldn't it?
> >
> > Without a -M you'd only catch 100% renames, right?  Those aren't the
> > only ones I'd want to catch, so I'd need to add -M.  You are right
> > that we could get basic renames this way, but it doesn't cover
> > everything I need.  Let's use this as a starting point, though, and
> > build up to what I need...
>
> No, renames are on by default these days, and that includes inexact
> renames. That said, if you're scripting you probably ought to be doing:
>
>   git rev-list HEAD | git diff-tree --stdin
>
> and there yes, you'd have to enable "-M" yourself (you touched on
> scripting and formatting below; diff-tree can accept the format options
> you'd want).

Ah, I didn't know renames were on by default; I somehow missed that.
Also, the rev-list to diff-tree pipe is nice, but I also need parent
and commit timestamp information.

....
> Yeah, I think "-t" would help your tree deletion problem.

Absolutely, thanks for the hint.  Much appreciated.  :-)

> > At this point, let's remember that we had another full git-log
> > invocation for mapping object sizes to filenames.  We might as well
> > coalesce the two log commands into one, by extending this latest one
> > to:
> >
> >   git log -M --diff-filter=RAMD --no-abbrev --raw
>
> What is there besides RAMD? :)

Well, as you pointed out above, log detects renames by default,
whereas it didn't use to.  So, if someone had written some similar-ish
history walking/parsing tool years ago that didn't need renames and
was based on log output, there's a good chance their tool might start
failing when rename detection was turned on by default, because
instead of getting both a 'D' and an 'M' change, they'd get an
unexpected 'R'.

For my case, do I have to worry about similar future changes?  Will
copy detection ('C') or break detection ('B') become the default in
the future?  Do I have to worry about typechanges ('T')?  Will new
change types be added?  I mean, the fast-export output could maybe
change too, but it seems much less likely than with log.

> > I could potentially switch to using this and drop patch 10/10.
>
> So I'm still not _entirely_ clear on what you're trying to do with
> 10/10. I think maybe the "disconnect" part I wrote above explains it. If
> that's correct, then I think framing it in terms of the operations that
> you'd be able to perform _without running a separate traverse_ would
> make it more obvious.

Let me try to put it as briefly as I can.  With as few traversals as
possible, I want to:
  * Get all blob sizes
  * Map blob shas to filename(s) they appeared under in the history
  * Find when files and directories were deleted (and whether they
were later reinstated, since that means they aren't actually gone)
  * Find sets of filenames referring to the same logical 'file'. (e.g.
foo->bar in commit A and bar->baz in commit B mean that {foo,bar,baz}
refer to the same 'file' so that a user has an easy report to look at
to find out that if they just want to "keep baz and its history" then
they need foo & bar & baz.  I need to know about things like another
foo or bar being introduced after the rename though, since that breaks
the connection between filenames)
  * Do a few aggregations on the above data as well (e.g. all copies
of postgres.exe add up to 20M -- why were those checked in anyway?,
*.webm files in aggregate are .5G, your long-deleted src/video-server/
directory from that aborted experimental project years ago takes up 2G
of your history, etc.)
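
The "sets of filenames referring to the same logical file" item can be sketched as a small union-find over rename records. The helper name `rename_groups` is hypothetical. It assumes tab-separated `--name-status` lines in oldest-first order, and it deliberately ignores the complications listed above (a name being reintroduced after a rename, and renames on parallel branches).

```shell
# Hypothetical helper: group names connected by renames.
# Expects `--name-status` output on stdin ("R<score>\t<old>\t<new>" lines);
# every other line is ignored. Prints one tab-separated group per line.
# Usage (in a repository):
#   git log --reverse -M --diff-filter=R --name-status | rename_groups
rename_groups() {
  awk -F '\t' '
    function find(x) { while (parent[x] != x) x = parent[x]; return x }
    $1 ~ /^R[0-9]*$/ && NF == 3 {
      for (i = 2; i <= 3; i++)
        if (!($i in parent)) { parent[$i] = $i; order[++n] = $i }
      a = find($2); b = find($3)
      if (a != b) parent[b] = a          # merge the two groups
    }
    END {
      for (i = 1; i <= n; i++) {
        r = find(order[i])
        if (r in group) group[r] = group[r] "\t" order[i]
        else group[r] = order[i]
      }
      for (i = 1; i <= n; i++)
        if (parent[order[i]] == order[i]) print group[order[i]]
    }
  '
}
```

With foo->bar followed by bar->baz, the three names end up in one group, which is the report a user would need before asking to "keep baz and its history".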

Right now, my best solution for this combination of questions is
'cat-file --batch-all-objects' plus fast-export, if I get patch 10/10
in place.  I'm totally open to better solutions, including ones that
don't use fast-export.

Jeff King Nov. 14, 2018, 7:14 a.m. UTC | #7

On Tue, Nov 13, 2018 at 09:10:36AM -0800, Elijah Newren wrote:

> > I am looking at this problem as "how do you answer question X in a
> > repository". And I think you are looking at it as "I am receiving a
> > fast-export stream, and I need to answer question X on the fly".
> >
> > And that would explain why you want to get extra annotations into the
> > fast-export stream. Is that right?
> 
> I'm not trying to get information on the fly during a rewrite or
> anything like that.  This is an optional pre-rewrite step (from a
> separate invocation of the tool) where I have multiple questions I
> want to answer.  I'd like to answer them all relatively quickly, if
> possible, and I think all of them should be answerable with a single
> history traversal (plus a cat-file --batch-all-objects call to get
> object sizes, since I don't know of another way to get those).  I'd be
> fine with switching from fast-export to log or something else if it
> met the needs better.

Ah, OK. Yes, if we're just trying to query, then I think you should be
able to do what you want with the existing traversal and diff tools. And
if not, we should think about a new feature there, and not try to
shoe-horn it into fast-export.

> As far as I can tell, you're trying to split each question apart and
> do a history traversal for each, and I don't see why that's better.
> Simpler, perhaps, but it seems worse for performance.  Am I missing
> something?

I was only trying to address each possible query individually. I agree
that if you are querying both things, you should be able to do it in a
single traversal (and that is strictly better). It may require a little
more parsing of the output (e.g., `--find-object` is easy to implement
yourself looking at --raw output).
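
As a rough illustration of doing the `--find-object` matching yourself on `--raw` output, consider the sketch below. The helper name `find_in_raw` is hypothetical. It assumes each commit is introduced by a header line whose first word is the commit id (for example from `--format='%H %ct %P'`), that blob ids are unabbreviated, and that merge commits are not shown in combined form.

```shell
# Hypothetical helper: print "<commit> <path>" for each raw diff entry whose
# old or new blob id equals the first argument.
# Usage (in a repository):
#   git rev-list --all |
#     git diff-tree --stdin --format='%H %ct %P' --raw -r -M --no-abbrev |
#     find_in_raw "$sha"
find_in_raw() {
  awk -F '\t' -v want="$1" '
    /^:/ {
      split($1, meta, " ")     # :oldmode newmode oldsha newsha status
      if (meta[3] == want || meta[4] == want) print commit, $NF
      next
    }
    NF { split($0, hdr, " "); commit = hdr[1] }   # header line from --format
  '
}
```

Because the matching happens in the consumer, the same traversal can feed other analyses at the same time. Root and merge commits would need extra diff-tree flags such as `--root` or `-m`, which this sketch leaves out.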

> Ah, I didn't know renames were on by default; I somehow missed that.
> Also, the rev-list to diff-tree pipe is nice, but I also need parent
> and commit timestamp information.

diff-tree will format the commit info as well (before git-log was a C
builtin, it was just a rev-list/diff-tree pipeline in a shell script).
So you can do:

  git rev-list ... |
  git diff-tree --stdin --format='%h %ct %p' --raw -r -M

and get a dump very similar to what fast-export would give you.
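
One way to consume such a dump is sketched below: it reports paths that were deleted and never reinstated, together with the commit timestamp of the deletion. The helper name `still_deleted` is hypothetical. It assumes the header format from the pipeline above (timestamp as the second word), commits fed oldest first, and an essentially linear history, so it does not handle the per-branch subtleties discussed elsewhere in this thread.

```shell
# Hypothetical helper: print "<commit timestamp> <path>" for every path whose
# most recent change is a deletion.
# Usage (in a repository):
#   git rev-list --reverse --all |
#     git diff-tree --stdin --format='%h %ct %p' --raw -r -M |
#     still_deleted
still_deleted() {
  awk -F '\t' '
    /^:/ {
      split($1, meta, " ")                   # :oldmode newmode oldsha newsha status
      if (substr(meta[5], 1, 1) == "D") gone[$2] = when
      else delete gone[$NF]                  # added, modified or renamed-to: present again
      next
    }
    NF { split($0, hdr, " "); when = hdr[2] }   # header line: <id> <ct> <parents>
    END { for (p in gone) print gone[p], p }
  ' | sort -k2
}
```

The output is sorted by path only to make it deterministic; a real report would more likely sort by timestamp or join against object sizes.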

> > >   git log -M --diff-filter=RAMD --no-abbrev --raw
> >
> > What is there besides RAMD? :)
> 
> Well, as you pointed out above, log detects renames by default,
> whereas it didn't use to.  So, if someone had written some similar-ish
> history walking/parsing tool years ago that didn't need renames and
> was based on log output, there's a good chance their tool might start
> failing when rename detection was turned on by default, because
> instead of getting both a 'D' and an 'M' change, they'd get an
> unexpected 'R'.

Mostly I just meant: your diff-filter includes basically everything, so
why bother filtering? You're going to have to parse the result anyway,
and you can throw away uninteresting bits there.

> For my case, do I have to worry about similar future changes?  Will
> copy detection ('C') or break detection ('B') become the default in
> the future?  Do I have to worry about typechanges ('T')?  Will new
> change types be added?  I mean, the fast-export output could maybe
> change too, but it seems much less likely than with log.

If you use diff-tree, then it won't ever enable copy or break detection
without you explicitly asking for it.

> Let me try to put it as briefly as I can.  With as few traversals as
> possible, I want to:
>   * Get all blob sizes
>   * Map blob shas to filename(s) they appeared under in the history
>   * Find when files and directories were deleted (and whether they
> were later reinstated, since that means they aren't actually gone)
>   * Find sets of filenames referring to the same logical 'file'. (e.g.
> foo->bar in commit A and bar->baz in commit B mean that {foo,bar,baz}
> refer to the same 'file' so that a user has an easy report to look at
> to find out that if they just want to "keep baz and its history" then
> they need foo & bar & baz.  I need to know about things like another
> foo or bar being introduced after the rename though, since that breaks
> the connection between filenames)
>   * Do a few aggregations on the above data as well (e.g. all copies
> of postgres.exe add up to 20M -- why were those checked in anyway?,
> *.webm files in aggregate are .5G, your long-deleted src/video-server/
> directory from that aborted experimental project years ago takes up 2G
> of your history, etc.)
> 
> Right now, my best solution for this combination of questions is
> 'cat-file --batch-all-objects' plus fast-export, if I get patch 10/10
> in place.  I'm totally open to better solutions, including ones that
> don't use fast-export.

OK, I think I understand your problem better now. I don't think there's
anything fast-export can show that log/diff-tree could not, aside from
actual blob contents. But I don't think you want them (and if you did,
you can use "cat-file --batch" to selectively request them).

I think there's a general problem with any serialized output (log or
fast-export) that things like rename tracking depend on the topology. If
I rename "foo" to "bar" on one branch, and "bar" to "baz" on another
branch, without reconstructing the parent graph you don't realize that
those two things were on parallel branches, and not a sequence.  But
with the parent ids, you can delve as deep as you like in your analysis
script.

-Peff
