[2/3] transport.c: introduce core.alternateRefsCommand

Message ID	4c4900722cab253b3ce33cb28910c4602ce44536.1537466087.git.me@ttaylorr.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <git-owner@kernel.org> From: Taylor Blau <ttaylorr@github.com> Date: Thu, 20 Sep 2018 14:04:11 -0400 To: git@vger.kernel.org Cc: peff@peff.net Subject: [PATCH 2/3] transport.c: introduce core.alternateRefsCommand Message-ID: <4c4900722cab253b3ce33cb28910c4602ce44536.1537466087.git.me@ttaylorr.com> References: <cover.1537466087.git.me@ttaylorr.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <cover.1537466087.git.me@ttaylorr.com> User-Agent: Mutt/1.9.5 (2018-04-13) Sender: git-owner@vger.kernel.org Precedence: bulk
Series	Filter alternate references \| expand [0/3] Filter alternate references [1/3] transport.c: extract 'fill_alternate_refs_command' [2/3] transport.c: introduce core.alternateRefsCommand [3/3] transport.c: introduce core.alternateRefsPrefixes

Taylor Blau Sept. 20, 2018, 6:04 p.m. UTC

When in a repository containing one or more alternates, Git would
sometimes like to list references from its alternates. For example, 'git
receive-pack' list the objects pointed to by alternate references as
special ".have" references.

Listing ".have" references is designed to make pushing changes from
upstream to a fork a lightweight operation, by advertising to the pusher
that the fork already has the objects (via its alternate). Thus, the
client can avoid sending them.

However, when the alternate has a pathologically large number of
references, the initial advertisement is too expensive. In fact, it can
dominate any such optimization where the pusher avoids sending certain
objects.

Introduce "core.alternateRefsCommand" in order to provide a facility to
limit or filter alternate references. This can be used, for example, to
filter out "uninteresting" references from the initial advertisement in
the above scenario.

Let the repository that has alternates configure this command to avoid
trusting the alternate to provide us a safe command to run in the shell.
To behave differently on each alternate (e.g., only list tags from
alternate A, only heads from B) provide the path of the alternate as the
first argument.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config.txt |  6 +++++
 t/t5410-receive-pack.sh  | 47 ++++++++++++++++++++++++++++++++++++++++
 transport.c              | 19 ++++++++++++----
 3 files changed, 68 insertions(+), 4 deletions(-)
 create mode 100755 t/t5410-receive-pack.sh

Jeff King Sept. 20, 2018, 7:37 p.m. UTC | #1

On Thu, Sep 20, 2018 at 02:04:11PM -0400, Taylor Blau wrote:

> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index 112041f407..b908bc5825 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -616,6 +616,12 @@ core.preferSymlinkRefs::
>  	This is sometimes needed to work with old scripts that
>  	expect HEAD to be a symbolic link.
>  
> +core.alternateRefsCommand::
> +	When listing references from an alternate (e.g., in the case of ".have"), use
> +	the shell to execute the specified command instead of
> +	linkgit:git-for-each-ref[1]. The first argument is the path of the alternate.
> +	Output must be of the form: `%(objectname) SPC %(refname)`.

We discussed off-list the notion that this could just be the objectname,
since the ".have" mechanism doesn't care about the actual refnames.

There's a little prior discussion from the list:

  https://public-inbox.org/git/xmqqefzraqbu.fsf@gitster.mtv.corp.google.com/

My "rev-list --alternate-refs" patches _do_ use the refnames, since you
could do something like "--source" that cares about them. But there's
some awkwardness there, because the names are in a different namespace
than the rest of the refs. If we were to just say "nope, you do not get
to see the names of the alternates" then that awkwardness goes away. But
it also loses some information that could _possibly_ be of use to a
caller.

Back in that earlier discussion I did not have a strong opinion, but
here we are cementing that decision into a user-visible interface. So it
probably makes sense to revisit and decide once and for all.

> +test_description='git receive-pack test'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'setup' '
> +	test_commit one &&
> +	git update-ref refs/heads/a HEAD &&
> +	test_commit two &&
> +	git update-ref refs/heads/b HEAD &&
> +	test_commit three &&
> +	git update-ref refs/heads/c HEAD &&
> +	git clone --bare . fork &&
> +	git clone fork pusher &&
> +	(
> +		cd fork &&
> +		git config receive.advertisealternates true &&
> +		git update-ref -d refs/heads/a &&
> +		git update-ref -d refs/heads/b &&
> +		git update-ref -d refs/heads/c &&
> +		git update-ref -d refs/heads/master &&
> +		git update-ref -d refs/tags/one &&
> +		git update-ref -d refs/tags/two &&
> +		git update-ref -d refs/tags/three &&

Probably not worth nit-picking process count, but this could done with a
single "update-ref --stdin".

> +		printf "../../.git/objects" >objects/info/alternates

Also a nitpick, but I think "echo" would be more usual here (we handle
the lack of a trailing newline just fine, but any use of printf makes me
wonder if something tricky is going on with line endings).

> +test_expect_success 'with core.alternateRefsCommand' '
> +	test_config -C fork core.alternateRefsCommand \
> +		"git --git-dir=\"\$1\" for-each-ref \
> +		--format=\"%(objectname) %(refname)\" \
> +		refs/heads/a refs/heads/c;:" &&

This is cute and all, but might it be more readable to use
write_script() to stick it into its own script?

> +	cat >expect <<-EOF &&
> +	$(git rev-parse a) .have
> +	$(git rev-parse c) .have
> +	EOF
> +	printf "0000" | git receive-pack fork | extract_haves >actual &&

There's been a push lately to avoid having git on the left-hand side of
a fork, since we might otherwise miss its exit code (including things
like asan/valgrind errors). So maybe:

   ... receive-pack fork >actual &&
   extract_haves <actual >actual.haves &&
   test_cmp expect actual.haves

or similar?

> diff --git a/transport.c b/transport.c
> index 24ae3f375d..e7d2cdf00b 100644
> --- a/transport.c
> +++ b/transport.c
> @@ -1328,10 +1328,21 @@ char *transport_anonymize_url(const char *url)
>  static void fill_alternate_refs_command(struct child_process *cmd,
>  					const char *repo_path)
>  {
> -	cmd->git_cmd = 1;
> -	argv_array_pushf(&cmd->args, "--git-dir=%s", repo_path);
> -	argv_array_push(&cmd->args, "for-each-ref");
> -	argv_array_push(&cmd->args, "--format=%(objectname) %(refname)");
> +	const char *value;
> +
> +	if (!git_config_get_value("core.alternateRefsCommand", &value)) {
> +		cmd->use_shell = 1;
> +
> +		argv_array_push(&cmd->args, value);
> +		argv_array_push(&cmd->args, repo_path);

Setting use_shell allows the shell trickery in your test, and matches
the modern way we run config-based commands. Good.

> +	} else {
> +		cmd->git_cmd = 1;
> +
> +		argv_array_pushf(&cmd->args, "--git-dir=%s", repo_path);
> +		argv_array_push(&cmd->args, "for-each-ref");
> +		argv_array_push(&cmd->args, "--format=%(objectname) %(refname)");
> +	}
> +
>  	cmd->env = local_repo_env;
>  	cmd->out = -1;

And we still clear local_repo_env for the custom command, which is good
to avoid confusion like $GIT_DIR being set when the custom command does
"cd $1 && git ...". Good.

-Peff

Taylor Blau Sept. 20, 2018, 8 p.m. UTC | #2

On Thu, Sep 20, 2018 at 03:37:51PM -0400, Jeff King wrote:
> On Thu, Sep 20, 2018 at 02:04:11PM -0400, Taylor Blau wrote:
>
> > diff --git a/Documentation/config.txt b/Documentation/config.txt
> > index 112041f407..b908bc5825 100644
> > --- a/Documentation/config.txt
> > +++ b/Documentation/config.txt
> > @@ -616,6 +616,12 @@ core.preferSymlinkRefs::
> >  	This is sometimes needed to work with old scripts that
> >  	expect HEAD to be a symbolic link.
> >
> > +core.alternateRefsCommand::
> > +	When listing references from an alternate (e.g., in the case of ".have"), use
> > +	the shell to execute the specified command instead of
> > +	linkgit:git-for-each-ref[1]. The first argument is the path of the alternate.
> > +	Output must be of the form: `%(objectname) SPC %(refname)`.
>
> We discussed off-list the notion that this could just be the objectname,
> since the ".have" mechanism doesn't care about the actual refnames.
>
> There's a little prior discussion from the list:
>
>   https://public-inbox.org/git/xmqqefzraqbu.fsf@gitster.mtv.corp.google.com/
>
> My "rev-list --alternate-refs" patches _do_ use the refnames, since you
> could do something like "--source" that cares about them. But there's
> some awkwardness there, because the names are in a different namespace
> than the rest of the refs. If we were to just say "nope, you do not get
> to see the names of the alternates" then that awkwardness goes away. But
> it also loses some information that could _possibly_ be of use to a
> caller.
>
> Back in that earlier discussion I did not have a strong opinion, but
> here we are cementing that decision into a user-visible interface. So it
> probably makes sense to revisit and decide once and for all.

Interesting, and thanks for the link to the prior discussion. I think
that I agree mostly with your rationale in [1], which boils down (for
me) to:

  - Other callers (like 'rev-list --alternate-refs') might care about
    them. Even if we don't have those patches in Git today, it's worth
    keeping their use case(s) in mind.

  - I didn't measure either, but I can't imagine that we're paying a
    huge price for this. So, it might be easy enough to keep saying,
    "please write output as '%(objectname) SP %(refname)'", even if we
    end up throwing out the refname, anyway.

> > +test_description='git receive-pack test'
> > +
> > +. ./test-lib.sh
> > +
> > +test_expect_success 'setup' '
> > +	test_commit one &&
> > +	git update-ref refs/heads/a HEAD &&
> > +	test_commit two &&
> > +	git update-ref refs/heads/b HEAD &&
> > +	test_commit three &&
> > +	git update-ref refs/heads/c HEAD &&
> > +	git clone --bare . fork &&
> > +	git clone fork pusher &&
> > +	(
> > +		cd fork &&
> > +		git config receive.advertisealternates true &&
> > +		git update-ref -d refs/heads/a &&
> > +		git update-ref -d refs/heads/b &&
> > +		git update-ref -d refs/heads/c &&
> > +		git update-ref -d refs/heads/master &&
> > +		git update-ref -d refs/tags/one &&
> > +		git update-ref -d refs/tags/two &&
> > +		git update-ref -d refs/tags/three &&
>
> Probably not worth nit-picking process count, but this could done with a
> single "update-ref --stdin".

Sure, I don't think that 7 `update-ref`'s vs 2 (`cat` + `git update-ref
--stdin`) will make or break the series, but I can happily shorten it as
you suggest ;-).

> > +		printf "../../.git/objects" >objects/info/alternates
>
> Also a nitpick, but I think "echo" would be more usual here (we handle
> the lack of a trailing newline just fine, but any use of printf makes me
> wonder if something tricky is going on with line endings).

'echo' indeed seems to be the way to go. This 'printf' preference is a
Git LFS-ism ;-).

> > +test_expect_success 'with core.alternateRefsCommand' '
> > +	test_config -C fork core.alternateRefsCommand \
> > +		"git --git-dir=\"\$1\" for-each-ref \
> > +		--format=\"%(objectname) %(refname)\" \
> > +		refs/heads/a refs/heads/c;:" &&
>
> This is cute and all, but might it be more readable to use
> write_script() to stick it into its own script?

Good idea, I'll do that.

> > +	cat >expect <<-EOF &&
> > +	$(git rev-parse a) .have
> > +	$(git rev-parse c) .have
> > +	EOF
> > +	printf "0000" | git receive-pack fork | extract_haves >actual &&
>
> There's been a push lately to avoid having git on the left-hand side of
> a fork, since we might otherwise miss its exit code (including things
> like asan/valgrind errors). So maybe:
>
>    ... receive-pack fork >actual &&
>    extract_haves <actual >actual.haves &&
>    test_cmp expect actual.haves
>
> or similar?

Sure, I agree that it's a good idea to not miss the exit code (since we
don't have pipefail on), etc. I adopted your suggestion into my local
copy.

> > diff --git a/transport.c b/transport.c
> > index 24ae3f375d..e7d2cdf00b 100644
> > --- a/transport.c
> > +++ b/transport.c
> > @@ -1328,10 +1328,21 @@ char *transport_anonymize_url(const char *url)
> >  static void fill_alternate_refs_command(struct child_process *cmd,
> >  					const char *repo_path)
> >  {
> > -	cmd->git_cmd = 1;
> > -	argv_array_pushf(&cmd->args, "--git-dir=%s", repo_path);
> > -	argv_array_push(&cmd->args, "for-each-ref");
> > -	argv_array_push(&cmd->args, "--format=%(objectname) %(refname)");
> > +	const char *value;
> > +
> > +	if (!git_config_get_value("core.alternateRefsCommand", &value)) {
> > +		cmd->use_shell = 1;
> > +
> > +		argv_array_push(&cmd->args, value);
> > +		argv_array_push(&cmd->args, repo_path);
>
> Setting use_shell allows the shell trickery in your test, and matches
> the modern way we run config-based commands. Good.
>
> > +	} else {
> > +		cmd->git_cmd = 1;
> > +
> > +		argv_array_pushf(&cmd->args, "--git-dir=%s", repo_path);
> > +		argv_array_push(&cmd->args, "for-each-ref");
> > +		argv_array_push(&cmd->args, "--format=%(objectname) %(refname)");
> > +	}
> > +
> >  	cmd->env = local_repo_env;
> >  	cmd->out = -1;
>
> And we still clear local_repo_env for the custom command, which is good
> to avoid confusion like $GIT_DIR being set when the custom command does
> "cd $1 && git ...". Good.

Thanks,
Taylor

[1]: https://public-inbox.org/git/20170125195425.q4fpvc4ten5mfjgl@sigill.intra.peff.net/

Jeff King Sept. 20, 2018, 8:06 p.m. UTC | #3

On Thu, Sep 20, 2018 at 04:00:34PM -0400, Taylor Blau wrote:

> > My "rev-list --alternate-refs" patches _do_ use the refnames, since you
> > could do something like "--source" that cares about them. But there's
> > some awkwardness there, because the names are in a different namespace
> > than the rest of the refs. If we were to just say "nope, you do not get
> > to see the names of the alternates" then that awkwardness goes away. But
> > it also loses some information that could _possibly_ be of use to a
> > caller.
> >
> > Back in that earlier discussion I did not have a strong opinion, but
> > here we are cementing that decision into a user-visible interface. So it
> > probably makes sense to revisit and decide once and for all.
> 
> Interesting, and thanks for the link to the prior discussion. I think
> that I agree mostly with your rationale in [1], which boils down (for
> me) to:
> 
>   - Other callers (like 'rev-list --alternate-refs') might care about
>     them. Even if we don't have those patches in Git today, it's worth
>     keeping their use case(s) in mind.
> 
>   - I didn't measure either, but I can't imagine that we're paying a
>     huge price for this. So, it might be easy enough to keep saying,
>     "please write output as '%(objectname) SP %(refname)'", even if we
>     end up throwing out the refname, anyway.

TBH, the main advantage to me is that it makes the user-visible
interface way simpler. We just say "give us a list of object ids, one
per line". I guess the current spec is not too bad, especially given
that we can just provide a for-each-ref format that generates it.

> > Probably not worth nit-picking process count, but this could done with a
> > single "update-ref --stdin".
> 
> Sure, I don't think that 7 `update-ref`'s vs 2 (`cat` + `git update-ref
> --stdin`) will make or break the series, but I can happily shorten it as
> you suggest ;-).

Yeah, in retrospect I should have not have even mentioned it.
test_commit() already adds a bunch of extra processes you may or may not
care about (e.g., by making tags, or using "git add" when "commit -a"
might do).

> > > +	cat >expect <<-EOF &&
> > > +	$(git rev-parse a) .have
> > > +	$(git rev-parse c) .have
> > > +	EOF
> > > +	printf "0000" | git receive-pack fork | extract_haves >actual &&
> >
> > There's been a push lately to avoid having git on the left-hand side of
> > a fork, since we might otherwise miss its exit code (including things

Heh, I meant to say "left-hand side of a pipe", but you obviously
figured out what I meant. :)

-Peff

Junio C Hamano Sept. 21, 2018, 4:39 p.m. UTC | #4

Taylor Blau <ttaylorr@github.com> writes:

> +extract_haves () {
> +	depacketize - | grep -o '^.* \.have'

Not portable, isn't it?

cf. http://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html

Taylor Blau Sept. 21, 2018, 5:48 p.m. UTC | #5

On Fri, Sep 21, 2018 at 09:39:14AM -0700, Junio C Hamano wrote:
> Taylor Blau <ttaylorr@github.com> writes:
>
> > +extract_haves () {
> > +	depacketize - | grep -o '^.* \.have'
>
> Not portable, isn't it?
>
> cf. http://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html

Good catch. Definitely not portable, per the link that you shared above.

Since 'depacketize()' will give us a "\0", we can pull it and anything
after it out with 'sed', instead. Any lines that don't contain a "\0"
only contain an OID and the literal, ".have", and are fine as-is.

Something like this:

  extract_haves () {
    depacketize - | grep '^.* \.have' | sed -e 's/\\0.*$//g'
  }

Harder to read--at least for me--but infinitely more portable.

I'll wait until a little later today, and then send you v2. Thanks for
reviewing :-).

Thanks,
Taylor

Taylor Blau Sept. 21, 2018, 5:57 p.m. UTC | #6

On Fri, Sep 21, 2018 at 01:48:25PM -0400, Taylor Blau wrote:
> On Fri, Sep 21, 2018 at 09:39:14AM -0700, Junio C Hamano wrote:
> > Taylor Blau <ttaylorr@github.com> writes:
> >
> > > +extract_haves () {
> > > +	depacketize - | grep -o '^.* \.have'
> >
> > Not portable, isn't it?
> >
> > cf. http://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html
>
> Good catch. Definitely not portable, per the link that you shared above.
>
> Since 'depacketize()' will give us a "\0", we can pull it and anything
> after it out with 'sed', instead. Any lines that don't contain a "\0"
> only contain an OID and the literal, ".have", and are fine as-is.
>
> Something like this:
>
>   extract_haves () {
>     depacketize - | grep '^.* \.have' | sed -e 's/\\0.*$//g'
>   }
>
> Harder to read--at least for me--but infinitely more portable.

In fact, I think that we can go even further: since we don't need to
catch the beginning '^.*' (without -o), we can instead:

  extract_haves () {
    depacketize - | grep '\.have' | sed -e 's/\\0.*$//g'
  }

Thanks,
Taylor

Junio C Hamano Sept. 21, 2018, 7:59 p.m. UTC | #7

Taylor Blau <me@ttaylorr.com> writes:

> In fact, I think that we can go even further: since we don't need to
> catch the beginning '^.*' (without -o), we can instead:
>
>   extract_haves () {
>     depacketize - | grep '\.have' | sed -e 's/\\0.*$//g'
>   }

Do not pipe grep into sed, unless you have an overly elaborate set
of patterns to filter with, e.g. something along the lines of...

	sed -ne '/\.have/s/...//p'

Taylor Blau Sept. 26, 2018, 12:56 a.m. UTC | #8

On Fri, Sep 21, 2018 at 12:59:16PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > In fact, I think that we can go even further: since we don't need to
> > catch the beginning '^.*' (without -o), we can instead:
> >
> >   extract_haves () {
> >     depacketize - | grep '\.have' | sed -e 's/\\0.*$//g'
> >   }
>
> Do not pipe grep into sed, unless you have an overly elaborate set
> of patterns to filter with, e.g. something along the lines of...
>
> 	sed -ne '/\.have/s/...//p'

Thanks, I'm not sure why I thought that this was a good idea to send
(even after discussing it to myself twice publicly on the list
beforehand).

Anyway, in my local copy, I adopted Peff's suggestion below in the
thread, which is:

  extract_haves () {
    depacketize - | perl -lne '/^(\S+) \.have/ and print $1'
  }

I think that that should be OK, but I sent it here to double check
before sending you real patches.

Thanks,
Taylor

[2/3] transport.c: introduce core.alternateRefsCommand

Commit Message

Comments

Patch