[v2,2/2] cat-file: add --batch-command mode

Message ID	1b63164ad4d9ec6b5fa6f733b6095b2779298b36.1644251611.git.gitgitgadget@gmail.com (mailing list archive)
State	Superseded
Headers
	Return-Path: <git-owner@kernel.org>
	Message-Id: <1b63164ad4d9ec6b5fa6f733b6095b2779298b36.1644251611.git.gitgitgadget@gmail.com>
	In-Reply-To: <pull.1212.v2.git.git.1644251611.gitgitgadget@gmail.com>
	References: <pull.1212.git.git.1643915286.gitgitgadget@gmail.com> <pull.1212.v2.git.git.1644251611.gitgitgadget@gmail.com>
	Date: Mon, 07 Feb 2022 16:33:31 +0000
	Subject: [PATCH v2 2/2] cat-file: add --batch-command mode
	MIME-Version: 1.0
	Content-Type: text/plain; charset=UTF-8
	Content-Transfer-Encoding: 8bit
	Fcc: Sent
	To: git@vger.kernel.org
	Cc: me@ttaylorr.com, phillip.wood123@gmail.com, avarab@gmail.com, e@80x24.org, bagasdotme@gmail.com, gitster@pobox.com, Eric Sunshine <sunshine@sunshineco.com>, John Cai <johncai86@gmail.com>, John Cai <johncai86@gmail.com>
	Precedence: bulk
	From: John Cai <johncai86@gmail.com>
Series	Add cat-file --batch-command flag
	[v2,0/2] Add cat-file --batch-command flag
	[v2,1/2] cat-file: rename cmdmode to transform_mode
	[v2,2/2] cat-file: add --batch-command mode

John Cai Feb. 7, 2022, 4:33 p.m. UTC

From: John Cai <johncai86@gmail.com>

Add a new flag --batch-command that accepts commands and arguments
from stdin, similar to git-update-ref --stdin.

At GitLab, we use a pair of long running cat-file processes when
accessing object content. One for iterating over object metadata with
--batch-check, and the other to grab object contents with --batch.

However, if we had --batch-command, we wouldn't need to keep both
processes around, and instead just have one --batch-command process
where we can flip between getting object info, and getting object
contents. Since we have a pair of cat-file processes per repository,
this means we can get rid of roughly half of long lived git cat-file
processes. Given there are many repositories being accessed at any given
time, this can lead to huge savings on a given server.

git cat-file --batch-command

will enter an interactive command mode whereby the user can enter in
commands and their arguments that get queued in memory:

<command1> [arg1] [arg2] NL
<command2> [arg1] [arg2] NL

When --buffer mode is used, commands will be queued in memory until a
flush command is issued that executes them:

flush NL

The reason for a flush command is that when a consumer process (A)
talks to a git cat-file process (B) and interactively writes to and
reads from it in --buffer mode, (A) needs to be able to control when
the buffer is flushed to stdout.

Currently, from (A)'s perspective, the only way is to either

1. kill (B)'s process
2. send an invalid object to stdin.

1. is not ideal from a performance perspective as it will require
spawning a new cat-file process each time, and 2. is hacky and not a
good long term solution.

With this mechanism of queueing up commands and letting (A) issue a
flush command, process (A) can control when the buffer is flushed and
can guarantee it will receive all of the output when in --buffer mode.

This patch adds the basic structure for adding commands, which can be
extended in the future to add more commands. It also adds the following
two commands (on top of the flush command):

contents <object> NL
info <object> NL

The contents command takes an <object> argument and prints out the object
contents.

The info command takes an <object> argument and prints out the object
metadata.
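
As a rough illustration of the command structure described above, here is a
self-contained C sketch that classifies one input line against a small
table. The names cmd_kind, cmd_table and parse_line are hypothetical and do
not come from the patch. The sketch is also stricter than the patch: a
command with a missing argument, or a flush followed by extra text, is
simply reported as unknown.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical command identifiers, for illustration only. */
enum cmd_kind { CMD_UNKNOWN, CMD_CONTENTS, CMD_INFO, CMD_FLUSH };

static const struct {
	const char *name;
	enum cmd_kind kind;
	int takes_args;
} cmd_table[] = {
	{ "contents", CMD_CONTENTS, 1 },
	{ "info", CMD_INFO, 1 },
	{ "flush", CMD_FLUSH, 0 },
};

/*
 * Classify one input line (without its trailing LF). On return, *arg
 * points at the argument text, or is NULL when there is none.
 */
static enum cmd_kind parse_line(const char *line, const char **arg)
{
	size_t i;

	*arg = NULL;
	for (i = 0; i < sizeof(cmd_table) / sizeof(cmd_table[0]); i++) {
		size_t len = strlen(cmd_table[i].name);

		if (strncmp(line, cmd_table[i].name, len))
			continue;
		/* A command with arguments must be followed by one space. */
		if (cmd_table[i].takes_args && line[len] == ' ') {
			*arg = line + len + 1;
			return cmd_table[i].kind;
		}
		/* A command without arguments must end the line. */
		if (!cmd_table[i].takes_args && line[len] == '\0')
			return cmd_table[i].kind;
	}
	return CMD_UNKNOWN;
}
```

Adding a command under this scheme would mean adding one row to the table
and one enum value; the lookup loop stays unchanged.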

These can be used in the following way with --buffer:

contents <sha1> NL
object <sha1> NL
object <sha1> NL
contents <sha1> NL
flush
contents <sha1> NL
flush

When used without --buffer:

contents <sha1> NL
object <sha1> NL
object <sha1> NL
contents <sha1> NL
contents <sha1> NL

Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: John Cai <johncai86@gmail.com>
---
 Documentation/git-cat-file.txt |  19 ++++
 builtin/cat-file.c             | 124 +++++++++++++++++++++
 t/t1006-cat-file.sh            | 197 ++++++++++++++++++++++++++++++++-
 3 files changed, 339 insertions(+), 1 deletion(-)

Jonathan Tan Feb. 7, 2022, 11:34 p.m. UTC | #1

"John Cai via GitGitGadget" <gitgitgadget@gmail.com> writes:
> However, if we had --batch-command, we wouldn't need to keep both
> processes around, and instead just have one --batch-command process
> where we can flip between getting object info, and getting object
> contents. Since we have a pair of cat-file processes per repository,
> this means we can get rid of roughly half of long lived git cat-file
> processes. Given there are many repositories being accessed at any given
> time, this can lead to huge savings since on a given server.

One other benefit is that with explicit flushes, in a partial clone,
this makes it possible to batch prefetch objects.

> diff --git a/Documentation/git-cat-file.txt b/Documentation/git-cat-file.txt
> index bef76f4dd06..618dbd15338 100644
> --- a/Documentation/git-cat-file.txt
> +++ b/Documentation/git-cat-file.txt
> @@ -96,6 +96,25 @@ OPTIONS
>  	need to specify the path, separated by whitespace.  See the
>  	section `BATCH OUTPUT` below for details.
>  
> +--batch-command::
> +	Enter a command mode that reads commands and arguments from stdin.
> +	May not be combined with any other options or arguments except
> +	`--textconv` or `--filters`, in which case the input lines also need to
> +	specify the path, separated by whitespace.  See the section
> +	`BATCH OUTPUT` below for details.
> +
> +contents <object>::
> +	Print object contents for object reference <object>
> +
> +info <object>::
> +	Print object info for object reference <object>
> +
> +flush::
> +	Execute all preceding commands that were issued since the beginning or
> +	since the last flush command was issued. Only used with --buffer. When
> +	--buffer is not used, commands are flushed each time without issuing
> +	`flush`.

The way this is formatted leads me to think that "contents", etc. are
CLI arguments, not things written to stdin. Some of the commit message
probably needs to go here.

I just looked at the commit message and documentation for now.

If you have time and are interested, we at Google are thinking of a more
comprehensive "batch" process [1].

[1] https://lore.kernel.org/git/20220207190320.2960362-1-jonathantanmy@google.com/

Junio C Hamano Feb. 8, 2022, 12:49 a.m. UTC | #2

"John Cai via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: John Cai <johncai86@gmail.com>
>
> Add a new flag --batch-command that accepts commands and arguments
> from stdin, similar to git-update-ref --stdin.
>
> At GitLab, we use a pair of long running cat-file processes when
> accessing object content. One for iterating over object metadata with
> --batch-check, and the other to grab object contents with --batch.
>
> However, if we had --batch-command, we wouldn't need to keep both
> processes around, and instead just have one --batch-command process
> where we can flip between getting object info, and getting object
> contents. Since we have a pair of cat-file processes per repository,
> this means we can get rid of roughly half of long lived git cat-file
> processes. Given there are many repositories being accessed at any given
> time, this can lead to huge savings since on a given server.
>
> git cat-file --batch-command
>
> will enter an interactive command mode whereby the user can enter in
> commands and their arguments that get queued in memory:
>
> <command1> [arg1] [arg2] NL
> <command2> [arg1] [arg2] NL

If you mean you take one command with its args per line, use LF not
NL.

    $ git grep '\<NL\>' Documentation
    $ git grep '\<LF\>' Documentation

We may want to fix the sole offender in Documentation/config.txt
file (#leftoverbits).

> With this mechanism of queueing up commands and letting (A) issue a
> flush command, process (A) can control when the buffer is flushed and
> can guarantee it will receive all of the output when in --buffer mode.

Are we giving them a guarantee about when output will *not* come?  If (B) is
allowed to flush when it thinks it has too much in-core, it would
mean that (A) cannot keep issuing commands forever without reading
the response from (B), because (B) will eventually be blocked when
it tries to flush to a pipe that (A) is not reading.  I think there
should be some discussion on that, too.  IOW, --batch-command does
not allow (B) to flush until it gets "flush", or something like that.

> diff --git a/Documentation/git-cat-file.txt b/Documentation/git-cat-file.txt
> index bef76f4dd06..618dbd15338 100644
> --- a/Documentation/git-cat-file.txt
> +++ b/Documentation/git-cat-file.txt
> @@ -96,6 +96,25 @@ OPTIONS
>  	need to specify the path, separated by whitespace.  See the
>  	section `BATCH OUTPUT` below for details.
>  
> +--batch-command::
> +	Enter a command mode that reads commands and arguments from stdin.
> +	May not be combined with any other options or arguments except
> +	`--textconv` or `--filters`, in which case the input lines also need to
> +	specify the path, separated by whitespace.  See the section
> +	`BATCH OUTPUT` below for details.
> +
> +contents <object>::
> +	Print object contents for object reference <object>

Presumably this corresponds to what you get out of "--batch"?

> +info <object>::
> +	Print object info for object reference <object>

and this one "--batch-check"?

I expect that future readers will ask this same question because it
is not clear how "object contents" and "object info" are exactly
printed.  These two paragraphs may want to anticipate it and reduce
the need for readers to ask such a question.  

> +flush::
> +	Execute all preceding commands that were issued since the beginning or
> +	since the last flush command was issued. Only used with --buffer. When
> +	--buffer is not used, commands are flushed each time without issuing
> +	`flush`.

Here is a good place to also say "When --buffer is used, no output
will come until this is issued" or something.

> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index 5f015e71096..6bfab74b58a 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -26,6 +26,7 @@ struct batch_options {
>  	int unordered;
>  	int transform_mode; /* may be 'w' or 'c' for --filters or --textconv */
>  	const char *format;
> +	int command;

I am not sure if "command" is a good name.  Does it answer this
question clearly? "'command' as opposed to what?"

> @@ -508,6 +509,118 @@ static int batch_unordered_packed(const struct object_id *oid,
>  				      data);
>  }
>  
> +typedef void (*parse_cmd_fn_t)(struct batch_options *, const char *,
> +			       struct strbuf *, struct expand_data *);
> +
> +struct queued_cmd {
> +	parse_cmd_fn_t fn;
> +	const char *line;
> +};
> +
> +static void parse_cmd_contents(struct batch_options *opt,
> +			     const char *line,
> +			     struct strbuf *output,
> +			     struct expand_data *data)
> +{
> +	opt->print_contents = 1;
> +	batch_one_object(line, output, opt, data);
> +}
> +
> +static void parse_cmd_info(struct batch_options *opt,
> +			   const char *line,
> +			   struct strbuf *output,
> +			   struct expand_data *data)
> +{
> +	opt->print_contents = 0;
> +	batch_one_object(line, output, opt, data);
> +}

OK, these are as simple as expected.

> +static void flush_batch_calls(struct batch_options *opt,
> +		struct strbuf *output,
> +		struct expand_data *data,
> +		struct queued_cmd *cmds,
> +		int queued)
> +{
> +	int i;
> +	for(i = 0; i < queued; i++){

Missing SP around parentheses.

Excess brace pair around a single-statement block.

> +		cmds[i].fn(opt, cmds[i].line, output, data);
> +	}
> +	fflush(stdout);
> +}
> +
> +static const struct parse_cmd {
> +	const char *prefix;
> +	parse_cmd_fn_t fn;
> +	unsigned takes_args;
> +} commands[] = {
> +	{ "contents", parse_cmd_contents, 1},
> +	{ "info", parse_cmd_info, 1},
> +};
> +
> +static void batch_objects_command(struct batch_options *opt,
> +				    struct strbuf *output,
> +				    struct expand_data *data)
> +{
> +	struct strbuf input = STRBUF_INIT;
> +	struct queued_cmd *cmds = NULL;
> +	size_t alloc = 0, nr = 0;
> +	int queued = 0;
> +
> +	while (!strbuf_getline(&input, stdin)) {
> +		int i;
> +		const struct parse_cmd *cmd = NULL;
> +		const char *p, *cmd_end;
> +		struct queued_cmd call = {0};
> +
> +		if (!input.len)
> +			die(_("empty command in input"));
> +		if (isspace(*input.buf))
> +			die(_("whitespace before command: '%s'"), input.buf);
> +
> +		if (skip_prefix(input.buf, "flush", &cmd_end)) {
> +			if (!opt->buffer_output)
> +				die(_("flush is only for --buffer mode"));
> +			if (*cmd_end)
> +				die(_("flush takes no arguments"));
> +			if (!queued)
> +				die(_("nothing to flush"));

I am not sure if this is a good idea at all.  What do we gain from
punishing a stupid automated loop that issues flush every once in a
while after issuing a handful of real commands, and issues another flush
after running out of real commands for good measure?

> +			flush_batch_calls(opt, output, data, cmds, queued);
> +			queued = 0;
> +			continue;
> +		}
> +
> +		for (i = 0; i < ARRAY_SIZE(commands); i++) {
> +			if (!skip_prefix(input.buf, commands[i].prefix, &cmd_end))
> +				continue;
> +
> +			cmd = &commands[i];
> +			if (cmd->takes_args)
> +				p = cmd_end + 1;

If somebody adds an entry with .takes_args==false, p will stay
uninitialized and used in the call ->fn() below, or passed to
xstrdup(p).  It probably should be initialized to NULL, and
xstrdup(p) below replaced with xstrdup_or_null(p).
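
A minimal sketch of the NULL-tolerant copy this relies on, written as
standalone C so it can be tried outside the Git tree. dup_or_null is a
hypothetical stand-in for Git's xstrdup_or_null(); it aborts on allocation
failure only to keep the example short.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for xstrdup_or_null(): return a heap copy of s,
 * or NULL when s itself is NULL.
 */
static char *dup_or_null(const char *s)
{
	char *copy;

	if (!s)
		return NULL;
	copy = malloc(strlen(s) + 1);
	if (!copy)
		abort();
	strcpy(copy, s);
	return copy;
}
```

With "p" initialized to NULL at its declaration, a command that takes no
arguments would then queue a NULL line instead of reading an uninitialized
pointer.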

> +			break;
> +		}
> +
> +		if (!cmd)
> +			die(_("unknown command: '%s'"), input.buf);
> +
> +		if (!opt->buffer_output) {
> +			cmd->fn(opt, p, output, data);
> +			continue;
> +		}
> +

> +		queued++;
> +		if (queued > nr) {
> +			ALLOC_GROW(cmds, nr+1, alloc);
> +			nr++;
> +		}
> +
> +		call.fn = cmd->fn;
> +		call.line = xstrdup(p);
> +		cmds[queued-1] = call;

Can nr and queued ever go out of sync?

If cmds is the usual <array, nr, alloc> tuple we let ALLOC_GROW()
manage, alloc keeps track of how physically large the array is, and
nr indicates how many slots are filled.  Holding onto the block of
memory we used when discarding the accumulated items and reusing
that block without having to reallocate until we use the slots that
we have allocated is done by using only <nr, alloc> pair.

But the above code does not seem to understand that, and
instead thinks it has to use "nr" for the "we have made the array
this big, so we do not have to realloc up to that point" pointer,
hence its use of a separate "queued".  IOW, the array growing code
above seems confused.

It is more customary (hence easier to follow by readers who work on
our code base) to lose queued and say

		call.fn = cmd->fn;
		call.line = xstrdup_or_null(p);
		ALLOC_GROW(cmds, nr + 1, alloc);
		cmds[nr++] = call;

instead of the above 9 lines.
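
To make the <array, nr, alloc> idea concrete, here is a standalone C sketch
of a queue that keeps its allocation across flushes. It assumes nothing from
the Git tree: queue_grow is a simplified stand-in for ALLOC_GROW(), and
cmd_queue, queue_push, queue_flush and count_call are hypothetical names
used only for illustration.

```c
#include <stdlib.h>

typedef void (*cmd_fn)(const char *arg);

struct queued_cmd {
	cmd_fn fn;
	char *line;
};

struct cmd_queue {
	struct queued_cmd *items;
	size_t nr;    /* slots currently filled */
	size_t alloc; /* slots physically allocated */
};

/* Simplified stand-in for ALLOC_GROW(): make room for `want` items. */
static void queue_grow(struct cmd_queue *q, size_t want)
{
	size_t new_alloc;

	if (want <= q->alloc)
		return;
	new_alloc = (q->alloc + 16) * 3 / 2;
	if (new_alloc < want)
		new_alloc = want;
	q->items = realloc(q->items, new_alloc * sizeof(*q->items));
	if (!q->items)
		abort();
	q->alloc = new_alloc;
}

/* Append one command; the queue takes ownership of `line`. */
static void queue_push(struct cmd_queue *q, cmd_fn fn, char *line)
{
	queue_grow(q, q->nr + 1);
	q->items[q->nr].fn = fn;
	q->items[q->nr].line = line;
	q->nr++;
}

/*
 * Run every queued command, then reset nr while keeping the allocated
 * block for reuse. Flushing an empty queue is a harmless no-op.
 */
static size_t queue_flush(struct cmd_queue *q)
{
	size_t i, ran = q->nr;

	for (i = 0; i < q->nr; i++) {
		q->items[i].fn(q->items[i].line);
		free(q->items[i].line);
	}
	q->nr = 0;
	return ran;
}

/* Hypothetical callback that only counts how often it is invoked. */
static int calls;
static void count_call(const char *arg)
{
	(void)arg;
	calls++;
}
```

Because only nr is reset on flush, no separate "queued" counter is needed,
and the second round of commands reuses the block allocated for the first.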

> @@ -515,6 +628,7 @@ static int batch_objects(struct batch_options *opt)
>  	struct expand_data data;
>  	int save_warning;
>  	int retval = 0;
> +	const int command = opt->command;

This tells me that it is quite a misnomer.  This single bit is used
to differentiate between "other batch modes" and "--batch-command"
mode, which already smells like a misdesign, because we have, from
an end-user's point of view, three modes:

    --batch
    --batch-check
    --batch-command

so it would be far cleaner to have a single batch_mode enum that can
represent these three "batch modes", no?

>  	if (!opt->format)
>  		opt->format = "%(objectname) %(objecttype) %(objectsize)";
> @@ -590,6 +704,10 @@ static int batch_objects(struct batch_options *opt)
>  	save_warning = warn_on_object_refname_ambiguity;
>  	warn_on_object_refname_ambiguity = 0;
>  
> +	if (command) {
> +		batch_objects_command(opt, &output, &data);
> +		goto cleanup;
> +	}
>  	while (strbuf_getline(&input, stdin) != EOF) {
>  		if (data.split_on_whitespace) {
>  			/*
> @@ -608,6 +726,7 @@ static int batch_objects(struct batch_options *opt)
>  		batch_one_object(input.buf, &output, opt, &data);
>  	}
>  
> + cleanup:
>  	strbuf_release(&input);
>  	strbuf_release(&output);
>  	warn_on_object_refname_ambiguity = save_warning;
> @@ -636,6 +755,7 @@ static int batch_option_callback(const struct option *opt,
>  
>  	bo->enabled = 1;
>  	bo->print_contents = !strcmp(opt->long_name, "batch");
> +	bo->command = !strcmp(opt->long_name, "batch-command");

And this part needs fixing.  The original used to say

	we support "batch" and something else (it turns out that
	"batch-check" is that something else, but the above code is
	so sloppy that it does not even tell readers that), and
	the .print_contents member is how you can tell them apart.

This patch is making it worse by introducing another member that can
be independently set or unset, pretending that the four combinations
all can make sense, but that is not the case, right?

So, perhaps a good first step would be to lose .print_contents
member and add .batch_command member that is used to more explicitly
name one of the three possibilities?
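
One possible shape for the single batch_mode enum suggested above, as a
standalone C sketch. The enumerator names and the helper
batch_mode_from_option are hypothetical; the thread only asks for one value
that names which of the three batch modes is active.

```c
#include <string.h>

/* Hypothetical names; exactly one mode is active at a time. */
enum batch_mode {
	BATCH_MODE_NONE = 0,
	BATCH_MODE_CONTENTS, /* --batch */
	BATCH_MODE_CHECK,    /* --batch-check */
	BATCH_MODE_COMMAND   /* --batch-command */
};

/* Map an option's long name to its mode, or BATCH_MODE_NONE if unknown. */
static enum batch_mode batch_mode_from_option(const char *long_name)
{
	if (!strcmp(long_name, "batch"))
		return BATCH_MODE_CONTENTS;
	if (!strcmp(long_name, "batch-check"))
		return BATCH_MODE_CHECK;
	if (!strcmp(long_name, "batch-command"))
		return BATCH_MODE_COMMAND;
	return BATCH_MODE_NONE;
}
```

With one enum there is no way to express a combination that makes no sense,
which two independent flags would allow.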

Phillip Wood Feb. 8, 2022, 11 a.m. UTC | #3

Hi Jonathan and John

On 07/02/2022 23:34, Jonathan Tan wrote:
> "John Cai via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> However, if we had --batch-command, we wouldn't need to keep both
>> processes around, and instead just have one --batch-command process
>> where we can flip between getting object info, and getting object
>> contents. Since we have a pair of cat-file processes per repository,
>> this means we can get rid of roughly half of long lived git cat-file
>> processes. Given there are many repositories being accessed at any given
>> time, this can lead to huge savings since on a given server.
> 
> One other benefit is that with explicit flushes, in a partial clone,
> this makes it possible to batch prefetch objects.

Jonathan is there any overlap between what this series is trying to do 
and your proposal for a batch command[1]? For example would extending 
this series to get blob sizes be useful to you?

Best Wishes

Phillip

[1] 
https://lore.kernel.org/git/20220207190320.2960362-1-jonathantanmy@google.com/

>> diff --git a/Documentation/git-cat-file.txt b/Documentation/git-cat-file.txt
>> index bef76f4dd06..618dbd15338 100644
>> --- a/Documentation/git-cat-file.txt
>> +++ b/Documentation/git-cat-file.txt
>> @@ -96,6 +96,25 @@ OPTIONS
>>   	need to specify the path, separated by whitespace.  See the
>>   	section `BATCH OUTPUT` below for details.
>>   
>> +--batch-command::
>> +	Enter a command mode that reads commands and arguments from stdin.
>> +	May not be combined with any other options or arguments except
>> +	`--textconv` or `--filters`, in which case the input lines also need to
>> +	specify the path, separated by whitespace.  See the section
>> +	`BATCH OUTPUT` below for details.
>> +
>> +contents <object>::
>> +	Print object contents for object reference <object>
>> +
>> +info <object>::
>> +	Print object info for object reference <object>
>> +
>> +flush::
>> +	Execute all preceding commands that were issued since the beginning or
>> +	since the last flush command was issued. Only used with --buffer. When
>> +	--buffer is not used, commands are flushed each time without issuing
>> +	`flush`.
> 
> The way this is formatted leads me to think that "contents", etc. are
> CLI arguments, not things written to stdin. Some of the commit message
> probably needs to go here.
> 
> I just looked at the commit message and documentation for now.
> 
> If you have time and are interested, we at Google are thinking of a more
> comprehensive "batch" process [1].
> 
> [1] https://lore.kernel.org/git/20220207190320.2960362-1-jonathantanmy@google.com/

Phillip Wood Feb. 8, 2022, 11:06 a.m. UTC | #4

Hi John

On 07/02/2022 16:33, John Cai via GitGitGadget wrote:
> From: John Cai <johncai86@gmail.com>
> 
> Add a new flag --batch-command that accepts commands and arguments
> from stdin, similar to git-update-ref --stdin.
> 
> At GitLab, we use a pair of long running cat-file processes when
> accessing object content. One for iterating over object metadata with
> --batch-check, and the other to grab object contents with --batch.
> 
> However, if we had --batch-command, we wouldn't need to keep both
> processes around, and instead just have one --batch-command process
> where we can flip between getting object info, and getting object
> contents. Since we have a pair of cat-file processes per repository,
> this means we can get rid of roughly half of long lived git cat-file
> processes. Given there are many repositories being accessed at any given
> time, this can lead to huge savings since on a given server.
> 
> git cat-file --batch-command
> 
> will enter an interactive command mode whereby the user can enter in
> commands and their arguments that get queued in memory:
> 
> <command1> [arg1] [arg2] NL
> <command2> [arg1] [arg2] NL
> 
> When --buffer mode is used, commands will be queued in memory until a
> flush command is issued that execute them:
> 
> flush NL
> 
> The reason for a flush command is that when a consumer process (A)
> talks to a git cat-file process (B) and interactively writes to and
> reads from it in --buffer mode, (A) needs to be able to control when
> the buffer is flushed to stdout.
> 
> Currently, from (A)'s perspective, the only way is to either
> 
> 1. kill (B)'s process
> 2. send an invalid object to stdin.
> 
> 1. is not ideal from a performance perspective as it will require
> spawning a new cat-file process each time, and 2. is hacky and not a
> good long term solution.
> 
> With this mechanism of queueing up commands and letting (A) issue a
> flush command, process (A) can control when the buffer is flushed and
> can guarantee it will receive all of the output when in --buffer mode.
> 
> This patch adds the basic structure for adding command which can be
> extended in the future to add more commands. It also adds the following
> two commands (on top of the flush command):
> 
> contents <object> NL
> info <object> NL
> 
> The contents command takes an <object> argument and prints out the object
> contents.
> 
> The info command takes a <object> argument and prints out the object
> metadata.
> 
> These can be used in the following way with --buffer:
> 
> contents <sha1> NL
> object <sha1> NL

There is no object command

 >[...]
> diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
> index 145eee11df9..c57a35ea20a 100755
> --- a/t/t1006-cat-file.sh
> +++ b/t/t1006-cat-file.sh
> @@ -177,6 +177,20 @@ $content"
>   	test_cmp expect actual
>       '
>   
> +    test -z "$content" ||
> +    test_expect_success "--batch-command output of $type content is correct" '
> +	maybe_remove_timestamp "$batch_output" $no_ts >expect &&
> +	maybe_remove_timestamp "$(test_write_lines "contents $sha1" \
> +	| git cat-file --batch-command)" $no_ts >actual &&
> +	test_cmp expect actual
> +    '
> +
> +    test_expect_success "--batch-command output of $type info is correct" '
> +	echo "$sha1 $type $size" >expect &&
> +	test_write_lines "info $sha1" | git cat-file --batch-command >actual &&
> +	test_cmp expect actual
> +    '
> +
>       test_expect_success "custom --batch-check format" '
>   	echo "$type $sha1" >expect &&
>   	echo $sha1 | git cat-file --batch-check="%(objecttype) %(objectname)" >actual &&
> @@ -213,6 +227,64 @@ $content"
>       '
>   }
>   
> +run_buffer_test_flush () {
> +	type=$1
> +	sha1=$2
> +	size=$3
> +
> +	mkfifo input &&
> +	test_when_finished 'rm input; exec 8<&-' &&
> +	mkfifo output &&
> +	exec 9<>output &&
> +	test_when_finished 'rm output; exec 9<&-'
> +	(
> +		git cat-file --buffer --batch-command <input 2>err &
> +		echo $! &&
> +		wait $!
> +	) >&9 &
> +	sh_pid=$! &&
> +	read cat_file_pid <&9 &&
> +	test_when_finished "kill $cat_file_pid
> +			    kill $sh_pid; wait $sh_pid; :" &&
> +	echo "$sha1 $type $size" >expect &&
> +	test_write_lines "info $sha1" flush "info $sha1" >input
> +	# TODO - consume all available input, not just one
> +	# line (see above).
> +	# check output is flushed on exit

This test seems to have lost some code above this comment so the comment 
is no longer correct - we do not test if the output is flushed on exit 
and looking at the implementation I don't think it is.

Best Wishes

Phillip

> +	read actual <&9 &&
> +	echo "$actual" >actual &&
> +	test_cmp expect actual &&
> +	test_must_be_empty err
> +}
> +
> +run_buffer_test_no_flush () {
> +	type=$1
> +	sha1=$2
> +	size=$3
> +
> +	touch output &&
> +	test_when_finished 'rm output' &&
> +	mkfifo input &&
> +	test_when_finished 'rm input' &&
> +	mkfifo pid &&
> +	exec 9<>pid &&
> +	test_when_finished 'rm pid; exec 9<&-'
> +	(
> +		git cat-file --buffer --batch-command <input >output &
> +		echo $! &&
> +		wait $!
> +		echo $?
> +	) >&9 &
> +	sh_pid=$! &&
> +	read cat_file_pid <&9 &&
> +	test_when_finished "kill $cat_file_pid
> +			    kill $sh_pid; wait $sh_pid; :" &&
> +	test_write_lines "info $sha1" "info $sha1" &&
> +	kill $cat_file_pid &&
> +	read status <&9 &&
> +	test_must_be_empty output
> +}
> +
>   hello_content="Hello World"
>   hello_size=$(strlen "$hello_content")
>   hello_sha1=$(echo_without_newline "$hello_content" | git hash-object --stdin)
> @@ -224,6 +296,14 @@ test_expect_success "setup" '
>   
>   run_tests 'blob' $hello_sha1 $hello_size "$hello_content" "$hello_content"
>   
> +test_expect_success PIPE '--batch-command --buffer with flush for blob info' '
> +       run_buffer_test_flush blob $hello_sha1 $hello_size
> +'
> +
> +test_expect_success PIPE '--batch-command --buffer without flush for blob info' '
> +       run_buffer_test_no_flush blob $hello_sha1 $hello_size false
> +'
> +
>   test_expect_success '--batch-check without %(rest) considers whole line' '
>   	echo "$hello_sha1 blob $hello_size" >expect &&
>   	git update-index --add --cacheinfo 100644 $hello_sha1 "white space" &&
> @@ -238,6 +318,14 @@ tree_pretty_content="100644 blob $hello_sha1	hello"
>   
>   run_tests 'tree' $tree_sha1 $tree_size "" "$tree_pretty_content"
>   
> +test_expect_success PIPE '--batch-command --buffer with flush for tree info' '
> +       run_buffer_test_flush tree $tree_sha1 $tree_size
> +'
> +
> +test_expect_success PIPE '--batch-command --buffer without flush for tree info' '
> +       run_buffer_test_no_flush tree $tree_sha1 $tree_size false
> +'
> +
>   commit_message="Initial commit"
>   commit_sha1=$(echo_without_newline "$commit_message" | git commit-tree $tree_sha1)
>   commit_size=$(($(test_oid hexsz) + 137))
> @@ -249,6 +337,14 @@ $commit_message"
>   
>   run_tests 'commit' $commit_sha1 $commit_size "$commit_content" "$commit_content" 1
>   
> +test_expect_success PIPE '--batch-command --buffer with flush for commit info' '
> +       run_buffer_test_flush commit $commit_sha1 $commit_size
> +'
> +
> +test_expect_success PIPE '--batch-command --buffer without flush for commit info' '
> +       run_buffer_test_no_flush commit $commit_sha1 $commit_size false
> +'
> +
>   tag_header_without_timestamp="object $hello_sha1
>   type blob
>   tag hellotag
> @@ -263,11 +359,19 @@ tag_size=$(strlen "$tag_content")
>   
>   run_tests 'tag' $tag_sha1 $tag_size "$tag_content" "$tag_content" 1
>   
> +test_expect_success PIPE '--batch-command --buffer with flush for tag info' '
> +       run_buffer_test_flush tag $tag_sha1 $tag_size
> +'
> +
> +test_expect_success PIPE '--batch-command --buffer without flush for tag info' '
> +       run_buffer_test_no_flush tag $tag_sha1 $tag_size false
> +'
> +
>   test_expect_success \
>       "Reach a blob from a tag pointing to it" \
>       "test '$hello_content' = \"\$(git cat-file blob $tag_sha1)\""
>   
> -for batch in batch batch-check
> +for batch in batch batch-check batch-command
>   do
>       for opt in t s e p
>       do
> @@ -373,6 +477,62 @@ test_expect_success "--batch-check with multiple sha1s gives correct format" '
>       "$(echo_without_newline "$batch_check_input" | git cat-file --batch-check)"
>   '
>   
> +batch_command_info_input="info $hello_sha1
> +info $tree_sha1
> +info $commit_sha1
> +info $tag_sha1
> +info deadbeef
> +info
> +flush
> +"
> +
> +test_expect_success "--batch-command with multiple info calls gives correct format" '
> +	test "$batch_check_output" = "$(echo_without_newline \
> +	"$batch_command_info_input" | git cat-file --batch-command --buffer)"
> +'
> +
> +batch_command_contents_input="contents $hello_sha1
> +contents $commit_sha1
> +contents $tag_sha1
> +contents deadbeef
> +contents
> +flush
> +"
> +
> +test_expect_success "--batch-command with multiple contents calls gives correct format" '
> +	test "$(maybe_remove_timestamp "$batch_output" 1)" = \
> +	"$(maybe_remove_timestamp "$(echo_without_newline "$batch_command_contents_input" | git cat-file --batch-command)" 1)"
> +'
> +
> +batch_command_mixed_input="info $hello_sha1
> +contents $hello_sha1
> +info $commit_sha1
> +contents $commit_sha1
> +info $tag_sha1
> +contents $tag_sha1
> +contents deadbeef
> +info
> +flush
> +"
> +
> +batch_command_mixed_output="$hello_sha1 blob $hello_size
> +$hello_sha1 blob $hello_size
> +$hello_content
> +$commit_sha1 commit $commit_size
> +$commit_sha1 commit $commit_size
> +$commit_content
> +$tag_sha1 tag $tag_size
> +$tag_sha1 tag $tag_size
> +$tag_content
> +deadbeef missing
> + missing"
> +
> +test_expect_success "--batch-command with mixed calls gives correct format" '
> +	test "$(maybe_remove_timestamp "$batch_command_mixed_output" 1)" = \
> +	"$(maybe_remove_timestamp "$(echo_without_newline \
> +	"$batch_command_mixed_input" | git cat-file --batch-command --buffer)" 1)"
> +'
> +
>   test_expect_success 'setup blobs which are likely to delta' '
>   	test-tool genrandom foo 10240 >foo &&
>   	{ cat foo && echo plus; } >foo-plus &&
> @@ -963,5 +1123,40 @@ test_expect_success 'cat-file --batch-all-objects --batch-check ignores replace'
>   	echo "$orig commit $orig_size" >expect &&
>   	test_cmp expect actual
>   '
> +test_expect_success 'batch-command empty command' '
> +	echo "" >cmd &&
> +	test_expect_code 128 git cat-file --batch-command <cmd 2>err &&
> +	grep -E "^fatal:.*empty command in input.*" err
> +'
> +
> +test_expect_success 'batch-command whitespace before command' '
> +	echo " info deadbeef" >cmd &&
> +	test_expect_code 128 git cat-file --batch-command <cmd 2>err &&
> +	grep -E "^fatal:.*whitespace before command.*" err
> +'
> +
> +test_expect_success 'batch-command unknown command' '
> +	echo unknown_command >cmd &&
> +	test_expect_code 128 git cat-file --batch-command <cmd 2>err &&
> +	grep -E "^fatal:.*unknown command.*" err
> +'
> +
> +test_expect_success 'batch-command flush with arguments' '
> +	echo "flush arg" >cmd &&
> +	test_expect_code 128 git cat-file --batch-command --buffer <cmd 2>err &&
> +	grep -E "^fatal:.*flush takes no arguments.*" err
> +'
> +
> +test_expect_success 'batch-command flush without --buffer' '
> +	echo "flush arg" >cmd &&
> +	test_expect_code 128 git cat-file --batch-command <cmd 2>err &&
> +	grep -E "^fatal:.*flush is only for --buffer mode.*" err
> +'
> +
> +test_expect_success 'batch-command flush empty queue' '
> +	echo flush >cmd &&
> +	test_expect_code 128 git cat-file --batch-command --buffer <cmd 2>err &&
> +	grep -E "^fatal:.*nothing to flush.*" err
> +'
>   
>   test_done

Jonathan Tan Feb. 8, 2022, 5:56 p.m. UTC | #5

Phillip Wood <phillip.wood123@gmail.com> writes:
> Jonathan is there any overlap between what this series is trying to do 
> and your proposal for a batch command[1]? For example would extending 
> this series to get blob sizes be useful to you?
> 
> Best Wishes
> 
> Phillip
> 
> [1] 
> https://lore.kernel.org/git/20220207190320.2960362-1-jonathantanmy@google.com/

There is overlap, yes. I'm not sure of the best way to resolve it,
though. John mentions a substantial reduction ("roughly half") of Git
processes [1], and if they foresee needing to access things other than
object info and contents, it might be better to start with something
more extensible, like my proposal for a specific batch command. (If not,
they will encounter another increase in the number of processes.) If
they think that they can make do with this patch for the time being, I
think that's fine too: once this is merged (which will be earlier than
any extensible batch command, for sure), they (and anyone else who needs
batched object info and contents without the overhead of initializing
all the data structures in Git) can make use of this improvement.

As for getting blob sizes, I think that --batch-check can already give
it to us. If that is the case, the series is fine as-is (at least in
that aspect).

[1] https://lore.kernel.org/git/1b63164ad4d9ec6b5fa6f733b6095b2779298b36.1644251611.git.gitgitgadget@gmail.com/

Junio C Hamano Feb. 8, 2022, 6:09 p.m. UTC | #6

Jonathan Tan <jonathantanmy@google.com> writes:

> There is overlap, yes. I'm not sure of the best way to resolve it,
> though. John mentions a substantial reduction ("roughly half") of Git
> processes [1], and if they foresee needing to access things other than
> object info and contents, it might be better to start with something
> more extensible, like my proposal for a specific batch command.

I agree that it would be ideal to have just one way that is generic and
extensible enough.  I do not know if there is much difference in
that area between the two approaches, though.  The RFC I saw did
look more complex and rigidly specified with framing and such, but
that is only the syntax part---in the way in which interaction
between two processes happen, I didn't quite see fundamental
differences.  I'd expect it wouldn't be too much trouble to add new
commands to code written using either approach (although I haven't
seen yours yet ;-).

Thanks.

Jonathan Tan Feb. 9, 2022, 12:11 a.m. UTC | #7

Junio C Hamano <gitster@pobox.com> writes:
> I agree that it would be ideal to have just one way generic and
> extensible enough.  I do not know if there are much difference in
> that area between the two approaches, though.  The RFC I saw did
> look more complex and rigidly specified with framing and such, but
> that is only the syntax part---in the way in which interaction
> between two processes happen, I didn't quite see fundamental
> differences.  I'd expect it wouldn't be too much trouble to add new
> commands to code written using either approach (although I haven't
> seen yours yet ;-).
> 
> Thanks.

It is similar to this approach, except:
 - the approach I sent out uses pkt-line, which might be difficult to
   retrofit to cat-file if we need it
 - in the future, we want the Git side to be able to initiate requests
 - (possibly minor) it may be confusing if we add functionality to
   cat-file that is not about reading objects

Other than that, yes, they are similar.
