mbox series

[0/3] Filter alternate references

Message ID cover.1537466087.git.me@ttaylorr.com (mailing list archive)
Headers show
Series Filter alternate references | expand

Message

Taylor Blau Sept. 20, 2018, 6:04 p.m. UTC
Hi,

This is a series to customize Git's behavior when listing references
from an alternate repository. It is motivated by the following example:

Consider an upstream repository, a fork of it, and a local copy of that
fork. Ideally, running "git pull upstream" from the local copy followed
by a "git push fork" should be a lightweight operation, ideally because
the fork already "knows" about the new objects introduced upstream.

Today, we do this by means of the special ".have" references advertised
by 'git receive-pack'. This special part of the advertisement is
designed to tell the pusher about tips that it might want to know about,
to avoid sending them again.

This optimization is a good one and works well, particularly when the
upstream repository has a relatively normal number of references. When
the upstream has a pathologically _large_ number of references, the
advertisement alone can be so time consuming, that it's faster to send
redundant objects to the fork.

To make the reference advertisement manageable even with a large number
of references, let's allow the fork to select which ones it thinks might
be "interesting", and only advertise those. This makes the advertisement
much smaller, and lets us take advantage of the ".have" references, even
when the upstream contains more references than we're advertising.

This series implements the above functionality by means of
"core.alternateRefsCommand", and "core.alternateRefsPrefixes", either a
command to run in place of "git for-each-ref", or arguments to be
appended to "git for-each-ref".

The order of precedence when listing references from an alternate is as
follows:

  1. If the fork configures "core.alternateRefsCommand", run that.

  2. If the fork configures "core.alternateRefsPrefixes", run 'git
     for-each-ref', limiting results to references that have any of the
     given values as a prefix.

  3. Otherwise, run 'git for-each-ref' in the alternate.

In a previous version of this series, I taught the configuration
property to the alternate, as in "these are the references that _I_
think _you_ will find interesting," rather than the other way around. I
ultimately decided on what is attached here so that the fork does not
have to trust the upstream to run arbitrary shell commands.

Thanks,
Taylor

Taylor Blau (3):
  transport.c: extract 'fill_alternate_refs_command'
  transport.c: introduce core.alternateRefsCommand
  transport.c: introduce core.alternateRefsPrefixes

 Documentation/config.txt | 12 +++++++++
 t/t5410-receive-pack.sh  | 58 ++++++++++++++++++++++++++++++++++++++++
 transport.c              | 34 ++++++++++++++++++-----
 3 files changed, 98 insertions(+), 6 deletions(-)
 create mode 100755 t/t5410-receive-pack.sh

--
2.19.0

Comments

Stefan Beller Sept. 20, 2018, 6:35 p.m. UTC | #1
On Thu, Sep 20, 2018 at 11:04 AM Taylor Blau <ttaylorr@github.com> wrote:
>
> Hi,
>
> This is a series to customize Git's behavior when listing references
> from an alternate repository. It is motivated by the following example:
>
> Consider an upstream repository, a fork of it, and a local copy of that
> fork. Ideally, running "git pull upstream" from the local copy followed
> by a "git push fork" should be a lightweight operation, ideally because
> the fork already "knows" about the new objects introduced upstream.
>
> Today, we do this by means of the special ".have" references advertised
> by 'git receive-pack'. This special part of the advertisement is
> designed to tell the pusher about tips that it might want to know about,
> to avoid sending them again.
>
> This optimization is a good one and works well, particularly when the
> upstream repository has a relatively normal number of references. When
> the upstream has a pathologically _large_ number of references, the
> advertisement alone can be so time consuming, that it's faster to send
> redundant objects to the fork.

(tangent:)
The current fetch protocol consists of 2 parts:
negotiation + sending the packfile, and the negotiation only tries
to trim down the size of the packfile to send, without taking its own
cost (in terms of time and band width) into account, just to produce
a perfect pack to send to the client.

When talking about designing protocol v2 for push (which has not
landed yet[1]), we had some in-office discussions whether we
want to have a proper negotiation on push, as it would help
pushing to remotes that have non-ff pushes, but not necessarily
regular pushes, as they should be fine with just the refs advertisement.

[1] https://github.com/bmwill/git/commit/57a4e6e5d18a2d4d806fc8dec644b89affd50853
bmwill@ no longer works on it though.


>
> To make the reference advertisement manageable even with a large number
> of references, let's allow the fork to select which ones it thinks might
> be "interesting", and only advertise those. This makes the advertisement
> much smaller, and lets us take advantage of the ".have" references, even
> when the upstream contains more references than we're advertising.
>
> This series implements the above functionality by means of
> "core.alternateRefsCommand", and "core.alternateRefsPrefixes", either a
> command to run in place of "git for-each-ref", or arguments to be
> appended to "git for-each-ref".
>
> The order of precedence when listing references from an alternate is as
> follows:
>
>   1. If the fork configures "core.alternateRefsCommand", run that.
>
>   2. If the fork configures "core.alternateRefsPrefixes", run 'git
>      for-each-ref', limiting results to references that have any of the
>      given values as a prefix.
>
>   3. Otherwise, run 'git for-each-ref' in the alternate.
>
> In a previous version of this series, I taught the configuration
> property to the alternate, as in "these are the references that _I_
> think _you_ will find interesting," rather than the other way around. I
> ultimately decided on what is attached here so that the fork does not
> have to trust the upstream to run arbitrary shell commands.

Would it make sense to estimate the value of each .have before
advertising them and then advertise only the <n> most valuable
.haves ?
(e.g. if a .have is only one small commit ahead of origin/master,
it may not bring a lot of value as the potential savings are small,
but if that .have contains history between master..TIP that has lots
of big blobs or objects in general, this may be valuable to know)

Stefan
Taylor Blau Sept. 20, 2018, 6:56 p.m. UTC | #2
Hi Stefan,

On Thu, Sep 20, 2018 at 11:35:23AM -0700, Stefan Beller wrote:
> > To make the reference advertisement manageable even with a large number
> > of references, let's allow the fork to select which ones it thinks might
> > be "interesting", and only advertise those. This makes the advertisement
> > much smaller, and lets us take advantage of the ".have" references, even
> > when the upstream contains more references than we're advertising.
> >
> > This series implements the above functionality by means of
> > "core.alternateRefsCommand", and "core.alternateRefsPrefixes", either a
> > command to run in place of "git for-each-ref", or arguments to be
> > appended to "git for-each-ref".
> >
> > The order of precedence when listing references from an alternate is as
> > follows:
> >
> >   1. If the fork configures "core.alternateRefsCommand", run that.
> >
> >   2. If the fork configures "core.alternateRefsPrefixes", run 'git
> >      for-each-ref', limiting results to references that have any of the
> >      given values as a prefix.
> >
> >   3. Otherwise, run 'git for-each-ref' in the alternate.
> >
> > In a previous version of this series, I taught the configuration
> > property to the alternate, as in "these are the references that _I_
> > think _you_ will find interesting," rather than the other way around. I
> > ultimately decided on what is attached here so that the fork does not
> > have to trust the upstream to run arbitrary shell commands.
>
> Would it make sense to estimate the value of each .have before
> advertising them and then advertise only the <n> most valuable
> .haves ?
> (e.g. if a .have is only one small commit ahead of origin/master,
> it may not bring a lot of value as the potential savings are small,
> but if that .have contains history between master..TIP that has lots
> of big blobs or objects in general, this may be valuable to know)

I think that this sort of filtering should be theoretically possible
by configuring "core.alternateRefsCommand", perhaps to execute a script
like:

  cd "$1" &&
  git for-each-ref --format="%(objectname) %(refname)" |
  while read objectname refname; do
    total_size="$(git rev-list --objects master...$objectname \
      | awk '{ print $1 }' \
      | git cat-file --batch-check='%(objectsize)' \
      | awk '{ sum+=$1 } END { print $sum }')"

    if [ "$total_size" -gt "$minimum_size" ]; then
      echo "$objectname $refname"
    fi
  done

But that's quite inefficient to compute, since you're walking the same
parts of the graph over and over again.

Perhaps we could teach Git to do something better? I suppose that just
"core.alternateRefPrefixes" could do this by default (or with another
knob) to further optimize the simpler case. But I think that we'd be
equally OK without it, since push over V2 obviates the need for this
sort of optimization (as you noted in the unquoted part of this
response).

My inclination is to avoid teaching this to Git, and let callers
script it into their "core.alternateRefsCommand" if they really desire
it.

Does that seem OK?


Thanks,
Taylor
Jeff King Sept. 20, 2018, 7:21 p.m. UTC | #3
On Thu, Sep 20, 2018 at 02:04:05PM -0400, Taylor Blau wrote:

> This is a series to customize Git's behavior when listing references
> from an alternate repository. It is motivated by the following example:
> 
> Consider an upstream repository, a fork of it, and a local copy of that
> fork. Ideally, running "git pull upstream" from the local copy followed
> by a "git push fork" should be a lightweight operation, ideally because
> the fork already "knows" about the new objects introduced upstream.
> 
> Today, we do this by means of the special ".have" references advertised
> by 'git receive-pack'. This special part of the advertisement is
> designed to tell the pusher about tips that it might want to know about,
> to avoid sending them again.

I think it's important to note that this is just one place where this
optimization is useful. A few others are:

  1. On fetching, the client similarly advertises the extra tips (not in
     a ref advertisement, but as part of the negotiation).

  2. We don't do it now, but we ought to use those for checking the
     connectivity of incoming objects. Otherwise we end up walking over
     history that we already know we have. Since this is purely local,
     it's not usually as big a deal, but it can matter a lot in large
     repositories, because it makes what should be O(nr_changes)
     fetches into O(size_of_repo). E.g., imagine making a fork of
     linux.git backed by the same shared-object alternate. The initial
     "fetch" should be a noop as we realize that we have everything
     already, but we spend 45s of CPU walking the whole graph.

     I have patches for this, but haven't sent them, since without the
     optimization you've done here, we'd never be able to turn it on at
     GitHub.

  3. Other scripts may want us to expose this. The patches I have for
     (2) actually implement "rev-list --alternate-refs" (since we
     implement the connectivity check there). I don't have other
     particular uses in mind, but it lets you ask questions like "which
     objects are reachable here versus in the alternate".

Your patches would affect all of those sites, I and I think that's a
good thing. It's giving a consistent view of "what can I assume is
reachable from the alternate?", which is OK to be a subset of the whole
(and already is, really, since we don't peek into the alternate's
reflogs).

> In a previous version of this series, I taught the configuration
> property to the alternate, as in "these are the references that _I_
> think _you_ will find interesting," rather than the other way around. I
> ultimately decided on what is attached here so that the fork does not
> have to trust the upstream to run arbitrary shell commands.

Right, we had a lot of discussion here (which I'm repeating not for you
but for the benefit of the list). It might seem conceptually simpler to
for the alternate itself to say "what are my important refs?". And that
nicely generalizes if you have multiple alternates. But in our use case,
"important" here is in the eye of the beholder. If a bunch of repos are
sharing object storage, and repo Y is derived from repo X, then refs
related to X are going to be most important when you're doing an
operation in Y. But in some repo Q derived from R, that wouldn't be the
case.

So I think you could make an argument either way there. But simplifying
the security boundary around core.alternateRefsCommand pushes it in
favor of having all of this decided by the repo doing the looking,
rather than the one it's looking at.

-Peff
Jeff King Sept. 20, 2018, 7:27 p.m. UTC | #4
On Thu, Sep 20, 2018 at 11:35:23AM -0700, Stefan Beller wrote:

> > This optimization is a good one and works well, particularly when the
> > upstream repository has a relatively normal number of references. When
> > the upstream has a pathologically _large_ number of references, the
> > advertisement alone can be so time consuming, that it's faster to send
> > redundant objects to the fork.
> 
> (tangent:)
> The current fetch protocol consists of 2 parts:
> negotiation + sending the packfile, and the negotiation only tries
> to trim down the size of the packfile to send, without taking its own
> cost (in terms of time and band width) into account, just to produce
> a perfect pack to send to the client.
> 
> When talking about designing protocol v2 for push (which has not
> landed yet[1]), we had some in-office discussions whether we
> want to have a proper negotiation on push, as it would help
> pushing to remotes that have non-ff pushes, but not necessarily
> regular pushes, as they should be fine with just the refs advertisement.

I don't think that materially changes anything. We already do this same
trick on fetch (but just with the client advertising the extra haves,
since it's the receiver). So if push started doing a real negotiation,
we'd still want to feed those haves in the same way.

> Would it make sense to estimate the value of each .have before
> advertising them and then advertise only the <n> most valuable
> .haves ?
> (e.g. if a .have is only one small commit ahead of origin/master,
> it may not bring a lot of value as the potential savings are small,
> but if that .have contains history between master..TIP that has lots
> of big blobs or objects in general, this may be valuable to know)

That sounds neat, but I think is mostly orthogonal here. We're primarily
interested in just narrowing down the initial set of possibilities, so
you could cull it further.

And I see Taylor just responded with the idea that you could do this in
your hook. Which is neat, but definitely not something we are planning
on doing with it immediately. ;)

-Peff