diff mbox series

[2/2] catfile.c: add --batch-command mode

Message ID ebd2a1356017905245744babf32c5b78a0af1639.1643915286.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Add cat-file --batch-command flag | expand

Commit Message

John Cai Feb. 3, 2022, 7:08 p.m. UTC
From: John Cai <johncai86@gmail.com>

Add new flag --batch-command that will accept commands and arguments
from stdin, similar to git-update-ref --stdin.

At GitLab, we use a pair of long running cat-file processes when
accessing object content. One for iterating over object metadata with
--batch-check, and the other to grab object contents with --batch.

However, if we had --batch-command, we wouldnt need to keep both
processes around, and instead just have one process where we can flip
between getting object info, and getting object contents. This means we
can get rid of roughly half of long lived git cat-file processes. This
can lead to huge savings since on a given server there could be hundreds
of git cat-file processes running.

git cat-file --batch-command

would enter an interactive command mode whereby the user can enter in
commands and their arguments:

<command> [arg1] [arg2] NL

This patch adds the basic structure for add command which can be
extended in the future to add more commands.

This patch also adds the following commands:

contents <object> NL
info <object> NL

The contents command takes an <object> argument and prints out the object
contents.

The info command takes a <object> argument and prints out the object
metadata.

In addition, we need a set of commands that enable a "read session".

When a separate process (A) is connected to a git cat-file process (B)
and is interactively writing to and reading from it in --buffer mode,
(A) needs to be able to know when the buffer is flushed to stdout.
Currently, from (A)'s perspective, the only way is to either 1. exit
(B)'s process or 2. send an invalid object to stdin. 1. is not ideal
from a performance perspective as it will require spawning a new cat-file
process each time, and 2. is hacky and not a good long term solution.

With the following commands, process (A) can begin a "session" and
send a list of object names over stdin. When "get contents" or "get info"
is issued, this list of object names will be fed into batch_one_object()
to retrieve either info or contents. Finally an fflush() will be called
to end the session.

begin NL
get contents NL
get info NL

These can be used in the following way:

begin
<sha1>
<sha1>
<sha1>
<sha1>
get info

begin
<sha1>
<sha1>
<sha1>
<sha1>
get contents

With this mechanism, process (A) can be guaranteed to receive all of the
output even in --buffer mode.

Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: John Cai <johncai86@gmail.com>
---
 Documentation/git-cat-file.txt |  27 +++++
 builtin/cat-file.c             | 184 ++++++++++++++++++++++++++++++---
 t/t1006-cat-file.sh            |  46 ++++++++-
 3 files changed, 242 insertions(+), 15 deletions(-)

Comments

Junio C Hamano Feb. 3, 2022, 7:57 p.m. UTC | #1
"John Cai via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Subject: Re: [PATCH 2/2] catfile.c: add --batch-command mode

"cat-file: add --batch-command mode" perhaps.  The patch touching
the file "catfile.c" (which does not even exist) is an irrelevant
implementation detail to spend 2 extra bytes in "git shortlog"
output.

> From: John Cai <johncai86@gmail.com>
>
> Add new flag --batch-command that will accept commands and arguments
> from stdin, similar to git-update-ref --stdin.
>
> At GitLab, we use a pair of long running cat-file processes when
> accessing object content. One for iterating over object metadata with
> --batch-check, and the other to grab object contents with --batch.
>
> However, if we had --batch-command, we wouldnt need to keep both
> processes around, and instead just have one process where we can flip
> between getting object info, and getting object contents. This means we
> can get rid of roughly half of long lived git cat-file processes. This
> can lead to huge savings since on a given server there could be hundreds
> of git cat-file processes running.

Hmph, why hundreds, not two you listed?

Do you mean "we have two per repository, but by combining, we can do
with just one per repository, halving the number of processes"?

> git cat-file --batch-command
>
> would enter an interactive command mode whereby the user can enter in
> commands and their arguments:
>
> <command> [arg1] [arg2] NL
>
> This patch adds the basic structure for add command which can be
> extended in the future to add more commands.
>
> This patch also adds the following commands:
>
> contents <object> NL
> info <object> NL
>
> The contents command takes an <object> argument and prints out the object
> contents.
>
> The info command takes a <object> argument and prints out the object
> metadata.
>
> In addition, we need a set of commands that enable a "read session".
>
> When a separate process (A) is connected to a git cat-file process (B)

This description misleads readers into thinking as if we have just
created a daemon process that is running, and an unrelated process
can connect to it, which obviously poses a question about securing
the connection.  It is my understanding that what this creates is
just a consumer process (A) starts the cat-file process (B) locally
on its behalf under process (A)'s privilege, and they talk over pipe
without allowing any third-party to participate in the exchange, so
we should avoid misleading users by saying "is connected to" here.

> and is interactively writing to and reading from it in --buffer mode,
> (A) needs to be able to know when the buffer is flushed to stdout.

If A and B are talking over a pair pipes, in order to avoid
deadlocking, both ends need to be able to control whose turn it is
to speak (and it is turn for the other side to listen).  A needs to
be able to _control_ (not "know") when the buffer it uses to write
to B gets flushed, in order to reliably say "I am done for now, it
is your turn to speak" and be assured that it reaches B.  The story
is the same for the other side.  When a request by A needs to be
responded with multiple lines of output, B needs to be able to say
"And that concludes my response, and I am ready to accept a new
request from you" and make sure it reaches A.  "know when..." is
probably a wrong phrase here.

> Currently, from (A)'s perspective, the only way is to either 1. exit
> (B)'s process or 2. send an invalid object to stdin. 1. is not ideal
> from a performance perspective as it will require spawning a new cat-file
> process each time, and 2. is hacky and not a good long term solution.

Writing enumeration as bulletted or enumerated list would make it
much easier to read, I would think.

    From (A)'s perspective, the only way is to either 

    1. exit (B)'s process or
    2. send an invalid object to stdin.

    1. is not ideal from a performance perspective, as it will
    require spawning a new cat-file process each time, and 2. is
    hacky and not a good long term solution.

I am not sure what you exactly mean by "exit" in the above.  Do you
mean "kill" instead?

> With the following commands, process (A) can begin a "session" and
> send a list of object names over stdin. When "get contents" or "get info"
> is issued, this list of object names will be fed into batch_one_object()
> to retrieve either info or contents. Finally an fflush() will be called
> to end the session.
>
> begin NL
> get contents NL
> get info NL
>
> These can be used in the following way:
>
> begin
> <sha1>
> <sha1>
> <sha1>
> <sha1>
> get info
>
> begin
> <sha1>
> <sha1>
> <sha1>
> <sha1>
> get contents
>
> With this mechanism, process (A) can be guaranteed to receive all of the
> output even in --buffer mode.

OK, so do these "get blah" serve both as command and an implicit
"flush"?

With an implicit "flush", do we really need "begin"?

Also, from the point of view of extensibility, not saying what kind
of operation is started when given "begin" is probably not a good
idea.  "get info" and "get contents" may happen to be the only
commands that are supported right now, and the parameters to them
may happen to be just list of object names and nothing else, but
what happens when a new "git frotz" command is added and its
operation is extended with something other than object names and
pathnames?  The way to parse these parameter lines for the "get"
would be different for different commands, and if "cat-file" knows
upfront what is to be done to these parameters, it can even start
prefetching and precomputing to reduce latency observed by the
client before the final "get info" command is given.

So, from that point of view,

	begin <cmd>
	<parameter>
	<parameter>
	...
	end

may be a better design, no?
John Cai Feb. 4, 2022, 4:11 a.m. UTC | #2
Hi Junio, thanks for the review!

On 3 Feb 2022, at 14:57, Junio C Hamano wrote:

> "John Cai via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> Subject: Re: [PATCH 2/2] catfile.c: add --batch-command mode
>
> "cat-file: add --batch-command mode" perhaps.  The patch touching
> the file "catfile.c" (which does not even exist) is an irrelevant
> implementation detail to spend 2 extra bytes in "git shortlog"
> output.
>
>> From: John Cai <johncai86@gmail.com>
>>
>> Add new flag --batch-command that will accept commands and arguments
>> from stdin, similar to git-update-ref --stdin.
>>
>> At GitLab, we use a pair of long running cat-file processes when
>> accessing object content. One for iterating over object metadata with
>> --batch-check, and the other to grab object contents with --batch.
>>
>> However, if we had --batch-command, we wouldnt need to keep both
>> processes around, and instead just have one process where we can flip
>> between getting object info, and getting object contents. This means we
>> can get rid of roughly half of long lived git cat-file processes. This
>> can lead to huge savings since on a given server there could be hundreds
>> of git cat-file processes running.
>
> Hmph, why hundreds, not two you listed?
>
> Do you mean "we have two per repository, but by combining, we can do
> with just one per repository, halving the number of processes"?

Yes, exactly. I'll reword this in the next version to be more clear.

>
>> git cat-file --batch-command
>>
>> would enter an interactive command mode whereby the user can enter in
>> commands and their arguments:
>>
>> <command> [arg1] [arg2] NL
>>
>> This patch adds the basic structure for add command which can be
>> extended in the future to add more commands.
>>
>> This patch also adds the following commands:
>>
>> contents <object> NL
>> info <object> NL
>>
>> The contents command takes an <object> argument and prints out the object
>> contents.
>>
>> The info command takes a <object> argument and prints out the object
>> metadata.
>>
>> In addition, we need a set of commands that enable a "read session".
>>
>> When a separate process (A) is connected to a git cat-file process (B)
>
> This description misleads readers into thinking as if we have just
> created a daemon process that is running, and an unrelated process
> can connect to it, which obviously poses a question about securing
> the connection.  It is my understanding that what this creates is
> just a consumer process (A) starts the cat-file process (B) locally
> on its behalf under process (A)'s privilege, and they talk over pipe
> without allowing any third-party to participate in the exchange, so
> we should avoid misleading users by saying "is connected to" here.

Yes this understanding is correct. Will fix wording in next version

>
>> and is interactively writing to and reading from it in --buffer mode,
>> (A) needs to be able to know when the buffer is flushed to stdout.
>
> If A and B are talking over a pair pipes, in order to avoid
> deadlocking, both ends need to be able to control whose turn it is
> to speak (and it is turn for the other side to listen).  A needs to
> be able to _control_ (not "know") when the buffer it uses to write
> to B gets flushed, in order to reliably say "I am done for now, it
> is your turn to speak" and be assured that it reaches B.  The story
> is the same for the other side.  When a request by A needs to be
> responded with multiple lines of output, B needs to be able to say
> "And that concludes my response, and I am ready to accept a new
> request from you" and make sure it reaches A.  "know when..." is
> probably a wrong phrase here.

Correct, "know" is not exactly right. _control_ would be the more accurate description.
>
>> Currently, from (A)'s perspective, the only way is to either 1. exit
>> (B)'s process or 2. send an invalid object to stdin. 1. is not ideal
>> from a performance perspective as it will require spawning a new cat-file
>> process each time, and 2. is hacky and not a good long term solution.
>
> Writing enumeration as bulletted or enumerated list would make it
> much easier to read, I would think.
>
>     From (A)'s perspective, the only way is to either
>
>     1. exit (B)'s process or
>     2. send an invalid object to stdin.
>
>     1. is not ideal from a performance perspective, as it will
>     require spawning a new cat-file process each time, and 2. is
>     hacky and not a good long term solution.
>
> I am not sure what you exactly mean by "exit" in the above.  Do you
> mean "kill" instead?

Yes

>
>> With the following commands, process (A) can begin a "session" and
>> send a list of object names over stdin. When "get contents" or "get info"
>> is issued, this list of object names will be fed into batch_one_object()
>> to retrieve either info or contents. Finally an fflush() will be called
>> to end the session.
>>
>> begin NL
>> get contents NL
>> get info NL
>>
>> These can be used in the following way:
>>
>> begin
>> <sha1>
>> <sha1>
>> <sha1>
>> <sha1>
>> get info
>>
>> begin
>> <sha1>
>> <sha1>
>> <sha1>
>> <sha1>
>> get contents
>>
>> With this mechanism, process (A) can be guaranteed to receive all of the
>> output even in --buffer mode.
>
> OK, so do these "get blah" serve both as command and an implicit
> "flush"?

yes, that's the idea.

>
> With an implicit "flush", do we really need "begin"?
>
> Also, from the point of view of extensibility, not saying what kind
> of operation is started when given "begin" is probably not a good
> idea.  "get info" and "get contents" may happen to be the only
> commands that are supported right now, and the parameters to them
> may happen to be just list of object names and nothing else, but
> what happens when a new "git frotz" command is added and its
> operation is extended with something other than object names and
> pathnames?  The way to parse these parameter lines for the "get"
> would be different for different commands, and if "cat-file" knows
> upfront what is to be done to these parameters, it can even start
> prefetching and precomputing to reduce latency observed by the
> client before the final "get info" command is given.
>
> So, from that point of view,
>
>     begin <cmd>
>     <parameter>
>     <parameter>
>     ...
>     end
>
> may be a better design, no?

Good point. Now I'm wondering if we can simplify where commands get queued up
and a "get" will execute them along with an implicit flush.

<cmd> <parameter>
<cmd> <parameter>
<cmd> <parameter>
get
Eric Sunshine Feb. 4, 2022, 6:45 a.m. UTC | #3
()()On Thu, Feb 3, 2022 at 7:20 PM John Cai via GitGitGadget
<gitgitgadget@gmail.com> wrote:
> Add new flag --batch-command that will accept commands and arguments
> from stdin, similar to git-update-ref --stdin.
>
> contents <object> NL
> info <object> NL
>
> With the following commands, process (A) can begin a "session" and
> send a list of object names over stdin. When "get contents" or "get info"
> is issued, this list of object names will be fed into batch_one_object()
> to retrieve either info or contents. Finally an fflush() will be called
> to end the session.
>
> begin
> <sha1>
> get info
>
> begin
> <sha1>
> get contents

I had the same reaction to this new command set as Junio expressed
upstream[1], and was prepared to suggest an alternative but Junio said
everything I wanted to say, so I won't say anything more about it
here.

That aside, the implementation presented here is overly loose about
malformed input and accepts all sorts of bogus invocations which it
should be reporting as errors. For instance, a lone:

    get info

without a preceding `begin` should be an error but such bogus input is
not diagnosed. `get contents` is likewise affected. Another example:

    begin
    <oid>
    <EOF>

which lacks a closing `get info` or `get contents` is silently
ignored. Similarly, malformed:

    begin
    begin
    ...

should be reported as an error but is not.

There is also a bug in which the accumulated list of OID's is never
cleared. Thus:

    begin
    <oid>
    get info
    get info

emits info about <oid> twice, once for each `get info` invocation. Similarly:

    begin
    <oid1>
    get info
    <oid2>
    get info

emits information about <oid1> twice and <oid2> once. Invoking `begin`
between the `get info` commands doesn't help because `begin` is a
no-op. Thus:

    begin
    <oid1>
    get info
    begin
    <oid2>
    get info

likewise emits information about <oid1> twice and <oid2> once.

It also incorrectly accepts non-"session" commands in the middle of a
session. For instance:

    begin
    <oid1>
    info <oid2>
    <oid3>

immediately emits information about <oid2> and then bombs out claiming
that <oid3> is an "unknown command" because the `info` command --
which should not have been allowed within a session -- prematurely
ended the "session".

The `info` and `contents` commands neglect to do any sort of
validation of their arguments, thus any and all bogus invocations are
accepted. Thus:

    info
    info <arg1> <arg2>
    info <non-oid>

are all accepted as valid invocations, misleadingly producing "<foo>
missing" messages, rather than erroring out as they should. The "<oid>
missing" message should be reserved for the case when the lone
argument to `info` or `contents` is something which looks like a
legitimate OID.

The above is a long-winded way of saying that it is important not only
to check the obvious "expect success" cases when implementing a new
feature, but it's also important to add checks for the "expect
failure" cases, such as all the above malformed inputs.

It's subjective, but it feels like this implementation is trying to be
too clever by handling all these cases via a single strbuf_getline()
loop in batch_objects_command(). It would be easier to reason about,
and would have avoided some of the above problems, for instance, if
handling of `begin` was split out to its own function which looped
over strbuf_getline() itself, thus could easily have detected
premature EOF (lacking `get info` or `get contents`) and likewise
would not have allowed `info` or `contents` commands to be executed
within a "session".

Similarly (again subjective), the generic command dispatcher seems
over-engineered and perhaps contributed to the oversight of `info` and
`contents` failing to perform validation of their arguments. A simpler
hand-rolled command-response loop is more common on this project and
often easier to reason about. Perhaps something like this
(pseudo-code):

    while (strbuf_getline(&input, stdin)) {
        char *end_cmd = strchrnul(&input.buf, ' ');
        const char *argstart = *end_cmd ? end_cmd + 1 : end_cmd;
        *end_cmd = '\0';
        if (strcmp(input.buf, "info"))
            show_info(argstart);
        else if (strcmp(input.buf, "contents"))
            show_contents(argstart);
        else if (strcmp(input.buf, "begin"))
            begin_session(argstart);
        else
            die(...);
    }

and each of the command-handler functions would perform its own
argument validation (including the case when no argument should be
present).

[1]: https://lore.kernel.org/git/xmqqo83nsoxs.fsf@gitster.g/

> +static void parse_cmd_begin(struct batch_options *opt,
> +                          const char *line,
> +                          struct strbuf *output,
> +                          struct expand_data *data,
> +                          struct string_list revs)
> +{
> +       /* nothing needs to be done here */
> +}

Either this function should be clearing the list of collected OID's or...

> +static void parse_cmd_get(struct batch_options *opt,
> +                          const char *line,
> +                          struct strbuf *output,
> +                          struct expand_data *data,
> +                          struct string_list revs)
> +{
> +       struct string_list_item *item;
> +       for_each_string_list_item(item, &revs) {
> +               batch_one_object(item->string, output, opt, data);
> +       }
> +       if (opt->buffer_output)
> +               fflush(stdout);

... this function should do so after it's done processing the OID list.

> +}
> +static void parse_cmd_get_info(struct batch_options *opt,
> +                          const char *line,
> +                          struct strbuf *output,
> +                          struct expand_data *data,
> +                          struct string_list revs)

Missing blank line before the function definition.

> +static void parse_cmd_get_objects(struct batch_options *opt,
> +                          const char *line,
> +                          struct strbuf *output,
> +                          struct expand_data *data,
> +                          struct string_list revs)
> +{
> +       opt->print_contents = 1;
> +       parse_cmd_get(opt, line, output, data, revs);
> +       if (opt->buffer_output)
> +               fflush(stdout);
> +}

Why does this function duplicate the flushing logic of parse_cmd_get()
immediately after calling parse_cmd_get()?

> +static void batch_objects_command(struct batch_options *opt,
> +                                   struct strbuf *output,
> +                                   struct expand_data *data)
> +{
> +       /* Read each line dispatch its command */

Missing "and".

> +       while (!strbuf_getline(&input, stdin)) {
> +               if (state == BATCH_STATE_COMMAND) {
> +                       if (*input.buf == '\n')
> +                               die("empty command in input");

This never triggers because strbuf_getline() removes the line
terminator before returning.

> +                       else if (isspace(*input.buf))
> +                               die("whitespace before command: %s", input.buf);
> +               }
> +
> +               for (i = 0; i < ARRAY_SIZE(commands); i++) {
> +                       if (!skip_prefix(input.buf, prefix, &cmd_end))
> +                               continue;
> +                       /*
> +                        * If the command has arguments, verify that it's
> +                        * followed by a space. Otherwise, it shall be followed
> +                        * by a line terminator.
> +                        */
> +                       c = commands[i].takes_args ? ' ' : '\n';
> +                       if (*cmd_end && *cmd_end != c)
> +                               die("arguments invalid for command: %s", commands[i].prefix);

Ditto regarding strbuf_getline() removing line-terminator.

> +                       cmd = &commands[i];
> +                       if (cmd->takes_args)
> +                               p = cmd_end + 1;
> +                       break;
> +               }
> +
> +               if (input.buf[input.len - 1] == '\n')
> +                       input.buf[--input.len] = '\0';

Ditto again. This is especially scary since it's accessing element
`input.len-1` without even checking if `input.len` is greater than
zero.

Also, it's better to use published strbuf API rather than mucking with
the internals:

    strbuf_setlen(&input, input.len - 1);

> +               if (state == BATCH_STATE_INPUT && !cmd){
> +                       string_list_append(&revs, input.buf);
> +                       continue;
> +               }
> +
> +               if (!cmd)
> +                       die("unknown command: %s", input.buf);
> +
> +               state = cmd->next_state;
> +               cmd->fn(opt, p, output, data, revs);
> +       }
> +       strbuf_release(&input);
> +       string_list_clear(&revs, 0);
> +}
Ævar Arnfjörð Bjarmason Feb. 4, 2022, 12:11 p.m. UTC | #4
On Thu, Feb 03 2022, John Cai via GitGitGadget wrote:

> From: John Cai <johncai86@gmail.com>

[Trying not to pile on and mentioning some things others haven't, but
maybe there'll be duplications]

>  	requested batch operation on all objects in the repository and
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index 858bca208ff..29d5cd6857b 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -26,6 +26,7 @@ struct batch_options {
>  	int unordered;
>  	int mode; /* may be 'w' or 'c' for --filters or --textconv */
>  	const char *format;
> +	int command;
>  };

Not sure why it's not added to "int mode", but in any case
post-ab/cat-file that might be clearer...

> +	/* Read each line dispatch its command */
> +	while (!strbuf_getline(&input, stdin)) {

I think comments that are obvious from the code are probably best
dropped. We're just doing a fairly obvious consume stdin pattern here.

> +		int i;
> +		const struct parse_cmd *cmd = NULL;
> +		const char *p, *cmd_end;
> +
> +		if (state == BATCH_STATE_COMMAND) {
> +			if (*input.buf == '\n')
> +				die("empty command in input");
> +			else if (isspace(*input.buf))
> +				die("whitespace before command: %s", input.buf);

These should use _() to mark strings for translation, and let's quote %s
like '%s'

> +		}
> +
> +		for (i = 0; i < ARRAY_SIZE(commands); i++) {
> +			const char *prefix = commands[i].prefix;
> +			char c;

nit: add a \n between variable decls and code.

> +			if (!skip_prefix(input.buf, prefix, &cmd_end))
> +				continue;
> +			/*
> +			 * If the command has arguments, verify that it's
> +			 * followed by a space. Otherwise, it shall be followed
> +			 * by a line terminator.
> +			 */

I'd say ditto on the "the code says that" comments...

> +			c = commands[i].takes_args ? ' ' : '\n';
> +			if (*cmd_end && *cmd_end != c)
> +				die("arguments invalid for command: %s", commands[i].prefix);
> +
> +			cmd = &commands[i];
> +			if (cmd->takes_args)
> +				p = cmd_end + 1;
> +			break;
> +		}
> +
> +		if (input.buf[input.len - 1] == '\n')
> +			input.buf[--input.len] = '\0';

Don't we mean strbuf_trim_trailing_newline() here, or do we not want
Windows newlines to be accepted?

But more generally doesn't one of the strbuf_getline_*() functions do
the right thing here already?

> +
> +		if (state == BATCH_STATE_INPUT && !cmd){
> +			string_list_append(&revs, input.buf);

Nit: You can save yourself some malloc() churn here with:

    string_list_append_nodup(..., strbuf_detach(&input, NULL));

I.e. we're looping over the input, here we're done, so we might as well
steal the already alloc'd string....

> +			continue;
> +		}
> +
> +		if (!cmd)
> +			die("unknown command: %s", input.buf);
> +
> +		state = cmd->next_state;
> +		cmd->fn(opt, p, output, data, revs);
> +	}
> +	strbuf_release(&input);
> +	string_list_clear(&revs, 0);

...and these will do the right thing, as strbuf will notice the string
is stolen (it'll be the slopbuf again), and due to the combination of
*_DUP and *_nodup() we'll properly free it here too.

> +}
> +
>  static int batch_objects(struct batch_options *opt)
>  {
>  	struct strbuf input = STRBUF_INIT;
> @@ -519,6 +665,7 @@ static int batch_objects(struct batch_options *opt)
>  	struct expand_data data;
>  	int save_warning;
>  	int retval = 0;
> +	const int command = opt->command;
>  
>  	if (!opt->format)
>  		opt->format = "%(objectname) %(objecttype) %(objectsize)";
> @@ -594,22 +741,25 @@ static int batch_objects(struct batch_options *opt)
>  	save_warning = warn_on_object_refname_ambiguity;
>  	warn_on_object_refname_ambiguity = 0;
>  
> -	while (strbuf_getline(&input, stdin) != EOF) {
> -		if (data.split_on_whitespace) {
> -			/*
> -			 * Split at first whitespace, tying off the beginning
> -			 * of the string and saving the remainder (or NULL) in
> -			 * data.rest.
> -			 */
> -			char *p = strpbrk(input.buf, " \t");
> -			if (p) {
> -				while (*p && strchr(" \t", *p))
> -					*p++ = '\0';
> +	if (command)
> +		batch_objects_command(opt, &output, &data);
> +	else {

Style: {} braces for all arms if one requires it.

> +		while (strbuf_getline(&input, stdin) != EOF) {
> +			if (data.split_on_whitespace) {

diff nit: maybe we can find some way to not require re-indenting the existing code. E.g.:
	
	if (command) {
		batch_objects_command(...);
	        goto cleanup;
	}

...

> +				/*
> +				 * Split at first whitespace, tying off the beginning
> +				 * of the string and saving the remainder (or NULL) in
> +				 * data.rest.
> +				 */
> +				char *p = strpbrk(input.buf, " \t");
> +				if (p) {
> +					while (*p && strchr(" \t", *p))
> +						*p++ = '\0';
> +				}
> +				data.rest = p;
>  			}
> -			data.rest = p;
> +			batch_one_object(input.buf, &output, opt, &data);
>  		}
> -
> -		batch_one_object(input.buf, &output, opt, &data);
>  	}
>  

...and then just add a "cleanup:" label here.

>  	strbuf_release(&input);
> @@ -646,6 +796,7 @@ static int batch_option_callback(const struct option *opt,
>  
>  	bo->enabled = 1;
>  	bo->print_contents = !strcmp(opt->long_name, "batch");
> +	bo->command = !strcmp(opt->long_name, "batch-command");
>  	bo->format = arg;
>  
>  	return 0;
> @@ -682,6 +833,11 @@ int cmd_cat_file(int argc, const char **argv, const char *prefix)
>  			N_("show info about objects fed from the standard input"),
>  			PARSE_OPT_OPTARG | PARSE_OPT_NONEG,
>  			batch_option_callback),
> +		OPT_CALLBACK_F(0, "batch-command", &batch, N_(""),

You're either missing a string here in "", or we don't need N_() to mark
it for translation.

> +			 N_("enters batch mode that accepts commands"),
> +			 PARSE_OPT_OPTARG | PARSE_OPT_NONEG,
> +			 batch_option_callback),
> +
>  		OPT_BOOL(0, "follow-symlinks", &batch.follow_symlinks,
>  			 N_("follow in-tree symlinks (used with --batch or --batch-check)")),
>  		OPT_BOOL(0, "batch-all-objects", &batch.all_objects,
> diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
> index 39382fa1958..7360d049113 100755
> --- a/t/t1006-cat-file.sh
> +++ b/t/t1006-cat-file.sh
> @@ -85,6 +85,34 @@ $content"
>  	test_cmp expect actual
>      '
>  
> +    test -z "$content" ||
> +    test_expect_success "--batch-command output of $type content is correct" '
> +	maybe_remove_timestamp "$batch_output" $no_ts >expect &&
> +	maybe_remove_timestamp "$(echo contents $sha1 | git cat-file --batch-command)" $no_ts >actual &&
> +	test_cmp expect actual
> +    '
> +
> +    test -z "$content" ||
> +    test_expect_success "--batch-command session for $type content is correct" '
> +	maybe_remove_timestamp "$batch_output" $no_ts >expect &&
> +	maybe_remove_timestamp \
> +		"$(test_write_lines "begin" "$sha1" "get contents" | git cat-file --batch-command)" \
> +		$no_ts >actual &&
> +	test_cmp expect actual
> +    '
> +
> +    test_expect_success "--batch-command output of $type info is correct" '
> +	echo "$sha1 $type $size" >expect &&
> +	echo "info $sha1" | git cat-file --batch-command >actual &&
> +	test_cmp expect actual
> +    '
> +
> +    test_expect_success "--batch-command session for $type info is correct" '
> +	echo "$sha1 $type $size" >expect &&
> +	test_write_lines "begin" "$sha1" "get info" | git cat-file --batch-command >actual &&
> +	test_cmp expect actual
> +    '
> +
>      test_expect_success "custom --batch-check format" '
>  	echo "$type $sha1" >expect &&
>  	echo $sha1 | git cat-file --batch-check="%(objecttype) %(objectname)" >actual &&
> @@ -141,6 +169,7 @@ test_expect_success '--batch-check without %(rest) considers whole line' '
>  '
>  
>  tree_sha1=$(git write-tree)
> +

stray newline addition.

>  tree_size=$(($(test_oid rawsz) + 13))
>  tree_pretty_content="100644 blob $hello_sha1	hello"
>  
> @@ -175,7 +204,7 @@ test_expect_success \
>      "Reach a blob from a tag pointing to it" \
>      "test '$hello_content' = \"\$(git cat-file blob $tag_sha1)\""
>  
> -for batch in batch batch-check
> +for batch in batch batch-check batch-command
>  do
>      for opt in t s e p
>      do
> @@ -281,6 +310,15 @@ test_expect_success "--batch-check with multiple sha1s gives correct format" '
>      "$(echo_without_newline "$batch_check_input" | git cat-file --batch-check)"
>  '
>  
> +test_expect_success "--batch-command with multiple sha1s gives correct format" '
> +    echo "$batch_check_output" >expect &&
> +    echo begin >input &&
> +    echo_without_newline "$batch_check_input" >>input &&
> +    echo "get info" >>input &&
> +    git cat-file --batch-command <input >actual &&
> +    test_cmp expect actual
> +'

indentation with spaces, \t correctly used for the rest.
Phillip Wood Feb. 4, 2022, 4:46 p.m. UTC | #5
Hi John

On 04/02/2022 04:11, John Cai wrote:
> Hi Junio, thanks for the review!
>  [...]
>> So, from that point of view,
>>
>>      begin <cmd>
>>      <parameter>
>>      <parameter>
>>      ...
>>      end
>>
>> may be a better design, no?
> 
> Good point. Now I'm wondering if we can simplify where commands get queued up
> and a "get" will execute them along with an implicit flush.
> 
> <cmd> <parameter>
> <cmd> <parameter>
> <cmd> <parameter>
> get

I think that would be an improvement (I'd suggest calling "get" "flush" 
instead), the begin ... get <cmd> sequence seems unnecessarily complex 
compared the the RFC of this series. If the user gives --buffer on the 
command line then we can buffer the input until we see a "flush" command 
and if the user does not give --buffer on the command line then we 
should flush the output after each command.

Best Wishes

Phillip
Phillip Wood Feb. 4, 2022, 4:51 p.m. UTC | #6
Hi John/Ævar

On 04/02/2022 12:11, Ævar Arnfjörð Bjarmason wrote:
> 
> On Thu, Feb 03 2022, John Cai via GitGitGadget wrote:
> 
>> From: John Cai <johncai86@gmail.com>
>> +		if (state == BATCH_STATE_INPUT && !cmd){
>> +			string_list_append(&revs, input.buf);
> 
> Nit: You can save yourself some malloc() churn here with:
> 
>      string_list_append_nodup(..., strbuf_detach(&input, NULL));
> 
> I.e. we're looping over the input, here we're done, so we might as well
> steal the already alloc'd string....

If we do that then the strbuf will have to reallocate its buffer when it 
reads the next input line on the next iteration of the loop. As the 
strbuf will have likely over allocated the buffer using the 
string_list_append_nodup() + strbuf_detach() will will be less efficient 
overall than using string_list_append().

Best Wishes

Phillip

>> +			continue;
>> +		}
>> +
>> +		if (!cmd)
>> +			die("unknown command: %s", input.buf);
>> +
>> +		state = cmd->next_state;
>> +		cmd->fn(opt, p, output, data, revs);
>> +	}
>> +	strbuf_release(&input);
>> +	string_list_clear(&revs, 0);
> 
> ...and these will do the right thing, as strbuf will notice the string
> is stolen (it'll be the slopbuf again), and due to the combination of
> *_DUP and *_nodup() we'll properly free it here too.
> 
>> +}
>> +
>>   static int batch_objects(struct batch_options *opt)
>>   {
>>   	struct strbuf input = STRBUF_INIT;
>> @@ -519,6 +665,7 @@ static int batch_objects(struct batch_options *opt)
>>   	struct expand_data data;
>>   	int save_warning;
>>   	int retval = 0;
>> +	const int command = opt->command;
>>   
>>   	if (!opt->format)
>>   		opt->format = "%(objectname) %(objecttype) %(objectsize)";
>> @@ -594,22 +741,25 @@ static int batch_objects(struct batch_options *opt)
>>   	save_warning = warn_on_object_refname_ambiguity;
>>   	warn_on_object_refname_ambiguity = 0;
>>   
>> -	while (strbuf_getline(&input, stdin) != EOF) {
>> -		if (data.split_on_whitespace) {
>> -			/*
>> -			 * Split at first whitespace, tying off the beginning
>> -			 * of the string and saving the remainder (or NULL) in
>> -			 * data.rest.
>> -			 */
>> -			char *p = strpbrk(input.buf, " \t");
>> -			if (p) {
>> -				while (*p && strchr(" \t", *p))
>> -					*p++ = '\0';
>> +	if (command)
>> +		batch_objects_command(opt, &output, &data);
>> +	else {
> 
> Style: {} braces for all arms if one requires it.
> 
>> +		while (strbuf_getline(&input, stdin) != EOF) {
>> +			if (data.split_on_whitespace) {
> 
> diff nit: maybe we can find some way to not require re-indenting the existing code. E.g.:
> 	
> 	if (command) {
> 		batch_objects_command(...);
> 	        goto cleanup;
> 	}
> 
> ...
> 
>> +				/*
>> +				 * Split at first whitespace, tying off the beginning
>> +				 * of the string and saving the remainder (or NULL) in
>> +				 * data.rest.
>> +				 */
>> +				char *p = strpbrk(input.buf, " \t");
>> +				if (p) {
>> +					while (*p && strchr(" \t", *p))
>> +						*p++ = '\0';
>> +				}
>> +				data.rest = p;
>>   			}
>> -			data.rest = p;
>> +			batch_one_object(input.buf, &output, opt, &data);
>>   		}
>> -
>> -		batch_one_object(input.buf, &output, opt, &data);
>>   	}
>>   
> 
> ...and then just add a "cleanup:" label here.
> 
>>   	strbuf_release(&input);
>> @@ -646,6 +796,7 @@ static int batch_option_callback(const struct option *opt,
>>   
>>   	bo->enabled = 1;
>>   	bo->print_contents = !strcmp(opt->long_name, "batch");
>> +	bo->command = !strcmp(opt->long_name, "batch-command");
>>   	bo->format = arg;
>>   
>>   	return 0;
>> @@ -682,6 +833,11 @@ int cmd_cat_file(int argc, const char **argv, const char *prefix)
>>   			N_("show info about objects fed from the standard input"),
>>   			PARSE_OPT_OPTARG | PARSE_OPT_NONEG,
>>   			batch_option_callback),
>> +		OPT_CALLBACK_F(0, "batch-command", &batch, N_(""),
> 
> You're either missing a string here in "", or we don't need N_() to mark
> it for translation.
> 
>> +			 N_("enters batch mode that accepts commands"),
>> +			 PARSE_OPT_OPTARG | PARSE_OPT_NONEG,
>> +			 batch_option_callback),
>> +
>>   		OPT_BOOL(0, "follow-symlinks", &batch.follow_symlinks,
>>   			 N_("follow in-tree symlinks (used with --batch or --batch-check)")),
>>   		OPT_BOOL(0, "batch-all-objects", &batch.all_objects,
>> diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
>> index 39382fa1958..7360d049113 100755
>> --- a/t/t1006-cat-file.sh
>> +++ b/t/t1006-cat-file.sh
>> @@ -85,6 +85,34 @@ $content"
>>   	test_cmp expect actual
>>       '
>>   
>> +    test -z "$content" ||
>> +    test_expect_success "--batch-command output of $type content is correct" '
>> +	maybe_remove_timestamp "$batch_output" $no_ts >expect &&
>> +	maybe_remove_timestamp "$(echo contents $sha1 | git cat-file --batch-command)" $no_ts >actual &&
>> +	test_cmp expect actual
>> +    '
>> +
>> +    test -z "$content" ||
>> +    test_expect_success "--batch-command session for $type content is correct" '
>> +	maybe_remove_timestamp "$batch_output" $no_ts >expect &&
>> +	maybe_remove_timestamp \
>> +		"$(test_write_lines "begin" "$sha1" "get contents" | git cat-file --batch-command)" \
>> +		$no_ts >actual &&
>> +	test_cmp expect actual
>> +    '
>> +
>> +    test_expect_success "--batch-command output of $type info is correct" '
>> +	echo "$sha1 $type $size" >expect &&
>> +	echo "info $sha1" | git cat-file --batch-command >actual &&
>> +	test_cmp expect actual
>> +    '
>> +
>> +    test_expect_success "--batch-command session for $type info is correct" '
>> +	echo "$sha1 $type $size" >expect &&
>> +	test_write_lines "begin" "$sha1" "get info" | git cat-file --batch-command >actual &&
>> +	test_cmp expect actual
>> +    '
>> +
>>       test_expect_success "custom --batch-check format" '
>>   	echo "$type $sha1" >expect &&
>>   	echo $sha1 | git cat-file --batch-check="%(objecttype) %(objectname)" >actual &&
>> @@ -141,6 +169,7 @@ test_expect_success '--batch-check without %(rest) considers whole line' '
>>   '
>>   
>>   tree_sha1=$(git write-tree)
>> +
> 
> stray newline addition.
> 
>>   tree_size=$(($(test_oid rawsz) + 13))
>>   tree_pretty_content="100644 blob $hello_sha1	hello"
>>   
>> @@ -175,7 +204,7 @@ test_expect_success \
>>       "Reach a blob from a tag pointing to it" \
>>       "test '$hello_content' = \"\$(git cat-file blob $tag_sha1)\""
>>   
>> -for batch in batch batch-check
>> +for batch in batch batch-check batch-command
>>   do
>>       for opt in t s e p
>>       do
>> @@ -281,6 +310,15 @@ test_expect_success "--batch-check with multiple sha1s gives correct format" '
>>       "$(echo_without_newline "$batch_check_input" | git cat-file --batch-check)"
>>   '
>>   
>> +test_expect_success "--batch-command with multiple sha1s gives correct format" '
>> +    echo "$batch_check_output" >expect &&
>> +    echo begin >input &&
>> +    echo_without_newline "$batch_check_input" >>input &&
>> +    echo "get info" >>input &&
>> +    git cat-file --batch-command <input >actual &&
>> +    test_cmp expect actual
>> +'
> 
> indentation with spaces, \t correctly used for the rest.
John Cai Feb. 4, 2022, 9:41 p.m. UTC | #7
Hi Eric,

I appreciate the feedback and the thorough review!

On 4 Feb 2022, at 1:45, Eric Sunshine wrote:

> ()()On Thu, Feb 3, 2022 at 7:20 PM John Cai via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> Add new flag --batch-command that will accept commands and arguments
>> from stdin, similar to git-update-ref --stdin.
>>
>> contents <object> NL
>> info <object> NL
>>
>> With the following commands, process (A) can begin a "session" and
>> send a list of object names over stdin. When "get contents" or "get info"
>> is issued, this list of object names will be fed into batch_one_object()
>> to retrieve either info or contents. Finally an fflush() will be called
>> to end the session.
>>
>> begin
>> <sha1>
>> get info
>>
>> begin
>> <sha1>
>> get contents
>
> I had the same reaction to this new command set as Junio expressed
> upstream[1], and was prepared to suggest an alternative but Junio said
> everything I wanted to say, so I won't say anything more about it
> here.
>
> That aside, the implementation presented here is overly loose about
> malformed input and accepts all sorts of bogus invocations which it
> should be reporting as errors. For instance, a lone:
>
>     get info
>
> without a preceding `begin` should be an error but such bogus input is
> not diagnosed. `get contents` is likewise affected. Another example:
>
>     begin
>     <oid>
>     <EOF>
>
> which lacks a closing `get info` or `get contents` is silently
> ignored. Similarly, malformed:
>
>     begin
>     begin
>     ...
>
> should be reported as an error but is not.
>
> There is also a bug in which the accumulated list of OID's is never
> cleared. Thus:
>
>     begin
>     <oid>
>     get info
>     get info
>
> emits info about <oid> twice, once for each `get info` invocation. Similarly:
>
>     begin
>     <oid1>
>     get info
>     <oid2>
>     get info
>
> emits information about <oid1> twice and <oid2> once. Invoking `begin`
> between the `get info` commands doesn't help because `begin` is a
> no-op. Thus:
>
>     begin
>     <oid1>
>     get info
>     begin
>     <oid2>
>     get info
>
> likewise emits information about <oid1> twice and <oid2> once.
>
> It also incorrectly accepts non-"session" commands in the middle of a
> session. For instance:
>
>     begin
>     <oid1>
>     info <oid2>
>     <oid3>
>
> immediately emits information about <oid2> and then bombs out claiming
> that <oid3> is an "unknown command" because the `info` command --
> which should not have been allowed within a session -- prematurely
> ended the "session".
>
> The `info` and `contents` commands neglect to do any sort of
> validation of their arguments, thus any and all bogus invocations are
> accepted. Thus:
>
>     info
>     info <arg1> <arg2>
>     info <non-oid>
>
> are all accepted as valid invocations, misleadingly producing "<foo>
> missing" messages, rather than erroring out as they should. The "<oid>
> missing" message should be reserved for the case when the lone
> argument to `info` or `contents` is something which looks like a
> legitimate OID.

So actually the argument to info and contents doesn't have to be an OID but an object name that
eventually gets passed to get_oid_with_context() to resolve to an actual oid. This is the
same behavior as git cat-file --batch and --batch-check, neither of which throws an error. My
goal was to maintain this behavior in --batch-command.
>
> The above is a long-winded way of saying that it is important not only
> to check the obvious "expect success" cases when implementing a new
> feature, but it's also important to add checks for the "expect
> failure" cases, such as all the above malformed inputs.

I overall agree and will add test coverage for invalid input.

> It's subjective, but it feels like this implementation is trying to be
> too clever by handling all these cases via a single strbuf_getline()
> loop in batch_objects_command(). It would be easier to reason about,
> and would have avoided some of the above problems, for instance, if
> handling of `begin` was split out to its own function which looped
> over strbuf_getline() itself, thus could easily have detected
> premature EOF (lacking `get info` or `get contents`) and likewise
> would not have allowed `info` or `contents` commands to be executed
> within a "session".
>
> Similarly (again subjective), the generic command dispatcher seems
> over-engineered and perhaps contributed to the oversight of `info` and
> `contents` failing to perform validation of their arguments. A simpler
> hand-rolled command-response loop is more common on this project and
> often easier to reason about. Perhaps something like this
> (pseudo-code):
>
>     while (strbuf_getline(&input, stdin)) {
>         char *end_cmd = strchrnul(&input.buf, ' ');
>         const char *argstart = *end_cmd ? end_cmd + 1 : end_cmd;
>         *end_cmd = '\0';
>         if (strcmp(input.buf, "info"))
>             show_info(argstart);
>         else if (strcmp(input.buf, "contents"))
>             show_contents(argstart);
>         else if (strcmp(input.buf, "begin"))
>             begin_session(argstart);
>         else
>             die(...);
>     }

I think I agree that the code in this patch is a little hard to follow,
but now I'm planning to simplify the design by getting rid of "begin"[2].

I'm hoping this change will make the code easier to reason about.

2. https://lore.kernel.org/git/767d5f5a-8395-78bc-865f-a39acc39e061@gmail.com/

>
> and each of the command-handler functions would perform its own
> argument validation (including the case when no argument should be
> present).
>
> [1]: https://lore.kernel.org/git/xmqqo83nsoxs.fsf@gitster.g/
>
>> +static void parse_cmd_begin(struct batch_options *opt,
>> +                          const char *line,
>> +                          struct strbuf *output,
>> +                          struct expand_data *data,
>> +                          struct string_list revs)
>> +{
>> +       /* nothing needs to be done here */
>> +}
>
> Either this function should be clearing the list of collected OID's or...
>
>> +static void parse_cmd_get(struct batch_options *opt,
>> +                          const char *line,
>> +                          struct strbuf *output,
>> +                          struct expand_data *data,
>> +                          struct string_list revs)
>> +{
>> +       struct string_list_item *item;
>> +       for_each_string_list_item(item, &revs) {
>> +               batch_one_object(item->string, output, opt, data);
>> +       }
>> +       if (opt->buffer_output)
>> +               fflush(stdout);
>
> ... this function should do so after it's done processing the OID list.
>
>> +}
>> +static void parse_cmd_get_info(struct batch_options *opt,
>> +                          const char *line,
>> +                          struct strbuf *output,
>> +                          struct expand_data *data,
>> +                          struct string_list revs)
>
> Missing blank line before the function definition.
>
>> +static void parse_cmd_get_objects(struct batch_options *opt,
>> +                          const char *line,
>> +                          struct strbuf *output,
>> +                          struct expand_data *data,
>> +                          struct string_list revs)
>> +{
>> +       opt->print_contents = 1;
>> +       parse_cmd_get(opt, line, output, data, revs);
>> +       if (opt->buffer_output)
>> +               fflush(stdout);
>> +}
>
> Why does this function duplicate the flushing logic of parse_cmd_get()
> immediately after calling parse_cmd_get()?
>
>> +static void batch_objects_command(struct batch_options *opt,
>> +                                   struct strbuf *output,
>> +                                   struct expand_data *data)
>> +{
>> +       /* Read each line dispatch its command */
>
> Missing "and".
>
>> +       while (!strbuf_getline(&input, stdin)) {
>> +               if (state == BATCH_STATE_COMMAND) {
>> +                       if (*input.buf == '\n')
>> +                               die("empty command in input");
>
> This never triggers because strbuf_getline() removes the line
> terminator before returning.

yes, this was an oversight. thanks for catching it!

>
>> +                       else if (isspace(*input.buf))
>> +                               die("whitespace before command: %s", input.buf);
>> +               }
>> +
>> +               for (i = 0; i < ARRAY_SIZE(commands); i++) {
>> +                       if (!skip_prefix(input.buf, prefix, &cmd_end))
>> +                               continue;
>> +                       /*
>> +                        * If the command has arguments, verify that it's
>> +                        * followed by a space. Otherwise, it shall be followed
>> +                        * by a line terminator.
>> +                        */
>> +                       c = commands[i].takes_args ? ' ' : '\n';
>> +                       if (*cmd_end && *cmd_end != c)
>> +                               die("arguments invalid for command: %s", commands[i].prefix);
>
> Ditto regarding strbuf_getline() removing line-terminator.
>
>> +                       cmd = &commands[i];
>> +                       if (cmd->takes_args)
>> +                               p = cmd_end + 1;
>> +                       break;
>> +               }
>> +
>> +               if (input.buf[input.len - 1] == '\n')
>> +                       input.buf[--input.len] = '\0';
>
> Ditto again. This is especially scary since it's accessing element
> `input.len-1` without even checking if `input.len` is greater than
> zero.

I realize we actually don't need this if block at all because strubf_getline()
strips the line terminator for us already.

>
> Also, it's better to use published strbuf API rather than mucking with
> the internals:
>
>     strbuf_setlen(&input, input.len - 1);
>
>> +               if (state == BATCH_STATE_INPUT && !cmd){
>> +                       string_list_append(&revs, input.buf);
>> +                       continue;
>> +               }
>> +
>> +               if (!cmd)
>> +                       die("unknown command: %s", input.buf);
>> +
>> +               state = cmd->next_state;
>> +               cmd->fn(opt, p, output, data, revs);
>> +       }
>> +       strbuf_release(&input);
>> +       string_list_clear(&revs, 0);
>> +}
Eric Sunshine Feb. 5, 2022, 6:52 a.m. UTC | #8
On Fri, Feb 4, 2022 at 4:42 PM John Cai <johncai86@gmail.com> wrote:
> On 4 Feb 2022, at 1:45, Eric Sunshine wrote:
> > The `info` and `contents` commands neglect to do any sort of
> > validation of their arguments, thus any and all bogus invocations are
> > accepted. Thus:
> >
> >     info
> >     info <arg1> <arg2>
> >     info <non-oid>
> >
> > are all accepted as valid invocations, misleadingly producing "<foo>
> > missing" messages, rather than erroring out as they should. The "<oid>
> > missing" message should be reserved for the case when the lone
> > argument to `info` or `contents` is something which looks like a
> > legitimate OID.
>
> So actually the argument to info and contents doesn't have to be an OID but an object name that
> eventually gets passed to get_oid_with_context() to resolve to an actual oid. This is the
> same behavior as git cat-file --batch and --batch-check, neither of which throws an error. My
> goal was to maintain this behavior in --batch-command.

Okay, that makes sense. I was misled by the commit message talking
about "<sha1>" in its examples, and didn't do my due diligence by
digging more deeply into the existing documentation and code.
diff mbox series

Patch

diff --git a/Documentation/git-cat-file.txt b/Documentation/git-cat-file.txt
index 27b27e2b300..d1ba0b12e54 100644
--- a/Documentation/git-cat-file.txt
+++ b/Documentation/git-cat-file.txt
@@ -90,6 +90,33 @@  OPTIONS
 	need to specify the path, separated by whitespace.  See the
 	section `BATCH OUTPUT` below for details.
 
+--batch-command::
+	Enter a command mode that reads commands and arguments from stdin.
+	May not be combined with any other options or arguments except
+	`--textconv` or `--filters`, in which case the input lines also need to
+	specify the path, separated by whitespace.  See the section
+	`BATCH OUTPUT` below for details.
+
+contents <object>::
+	Print object contents for object reference <object>
+
+info <object>::
+	Print object info for object reference <object>
+
+begin::
+	Begins a session to read object names off of stdin. A session can be
+	terminated with `get contents` or `get info`.
+
+get contents::
+	After a read session is begun with the `begin` command, and object
+	names have been fed into stdin, end the session and retrieve contents of
+	all the objects requested.
+
+get info::
+	After a read session is begun with the `begin` command, and object
+	names have been fed into stdin, end the session and retrieve info of
+	all the objects requested.
+
 --batch-all-objects::
 	Instead of reading a list of objects on stdin, perform the
 	requested batch operation on all objects in the repository and
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 858bca208ff..29d5cd6857b 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -26,6 +26,7 @@  struct batch_options {
 	int unordered;
 	int mode; /* may be 'w' or 'c' for --filters or --textconv */
 	const char *format;
+	int command;
 };
 
 static const char *force_path;
@@ -512,6 +513,151 @@  static int batch_unordered_packed(const struct object_id *oid,
 				      data);
 }
 
+static void parse_cmd_object(struct batch_options *opt,
+			     const char *line,
+			     struct strbuf *output,
+			     struct expand_data *data,
+			     struct string_list revs)
+{
+	opt->print_contents = 1;
+	batch_one_object(line, output, opt, data);
+}
+
+static void parse_cmd_info(struct batch_options *opt,
+			   const char *line,
+			   struct strbuf *output,
+			   struct expand_data *data,
+			   struct string_list revs)
+{
+	opt->print_contents = 0;
+	batch_one_object(line, output, opt, data);
+}
+
+static void parse_cmd_begin(struct batch_options *opt,
+			   const char *line,
+			   struct strbuf *output,
+			   struct expand_data *data,
+			   struct string_list revs)
+{
+	/* nothing needs to be done here */
+}
+
+static void parse_cmd_get(struct batch_options *opt,
+			   const char *line,
+			   struct strbuf *output,
+			   struct expand_data *data,
+			   struct string_list revs)
+{
+	struct string_list_item *item;
+	for_each_string_list_item(item, &revs) {
+		batch_one_object(item->string, output, opt, data);
+	}
+	if (opt->buffer_output)
+		fflush(stdout);
+}
+static void parse_cmd_get_info(struct batch_options *opt,
+			   const char *line,
+			   struct strbuf *output,
+			   struct expand_data *data,
+			   struct string_list revs)
+{
+	opt->print_contents = 0;
+	parse_cmd_get(opt, line, output, data, revs);
+}
+
+static void parse_cmd_get_objects(struct batch_options *opt,
+			   const char *line,
+			   struct strbuf *output,
+			   struct expand_data *data,
+			   struct string_list revs)
+{
+	opt->print_contents = 1;
+	parse_cmd_get(opt, line, output, data, revs);
+	if (opt->buffer_output)
+		fflush(stdout);
+}
+
+enum batch_state {
+	BATCH_STATE_COMMAND,
+	BATCH_STATE_INPUT,
+};
+
+typedef void (*parse_cmd_fn_t)(struct batch_options *, const char *,
+			       struct strbuf *, struct expand_data *,
+			       struct string_list revs);
+
+static const struct parse_cmd {
+	const char *prefix;
+	parse_cmd_fn_t fn;
+	unsigned takes_args;
+	enum batch_state next_state;
+} commands[] = {
+	{ "contents", parse_cmd_object, 1, BATCH_STATE_COMMAND},
+	{ "info", parse_cmd_info, 1, BATCH_STATE_COMMAND},
+	{ "begin", parse_cmd_begin, 0, BATCH_STATE_INPUT},
+	{ "get info", parse_cmd_get_info, 0, BATCH_STATE_COMMAND},
+	{ "get contents", parse_cmd_get_objects, 0, BATCH_STATE_COMMAND},
+};
+
+static void batch_objects_command(struct batch_options *opt,
+				    struct strbuf *output,
+				    struct expand_data *data)
+{
+	struct strbuf input = STRBUF_INIT;
+	enum batch_state state = BATCH_STATE_COMMAND;
+	struct string_list revs = STRING_LIST_INIT_DUP;
+
+	/* Read each line dispatch its command */
+	while (!strbuf_getline(&input, stdin)) {
+		int i;
+		const struct parse_cmd *cmd = NULL;
+		const char *p, *cmd_end;
+
+		if (state == BATCH_STATE_COMMAND) {
+			if (*input.buf == '\n')
+				die("empty command in input");
+			else if (isspace(*input.buf))
+				die("whitespace before command: %s", input.buf);
+		}
+
+		for (i = 0; i < ARRAY_SIZE(commands); i++) {
+			const char *prefix = commands[i].prefix;
+			char c;
+			if (!skip_prefix(input.buf, prefix, &cmd_end))
+				continue;
+			/*
+			 * If the command has arguments, verify that it's
+			 * followed by a space. Otherwise, it shall be followed
+			 * by a line terminator.
+			 */
+			c = commands[i].takes_args ? ' ' : '\n';
+			if (*cmd_end && *cmd_end != c)
+				die("arguments invalid for command: %s", commands[i].prefix);
+
+			cmd = &commands[i];
+			if (cmd->takes_args)
+				p = cmd_end + 1;
+			break;
+		}
+
+		if (input.buf[input.len - 1] == '\n')
+			input.buf[--input.len] = '\0';
+
+		if (state == BATCH_STATE_INPUT && !cmd){
+			string_list_append(&revs, input.buf);
+			continue;
+		}
+
+		if (!cmd)
+			die("unknown command: %s", input.buf);
+
+		state = cmd->next_state;
+		cmd->fn(opt, p, output, data, revs);
+	}
+	strbuf_release(&input);
+	string_list_clear(&revs, 0);
+}
+
 static int batch_objects(struct batch_options *opt)
 {
 	struct strbuf input = STRBUF_INIT;
@@ -519,6 +665,7 @@  static int batch_objects(struct batch_options *opt)
 	struct expand_data data;
 	int save_warning;
 	int retval = 0;
+	const int command = opt->command;
 
 	if (!opt->format)
 		opt->format = "%(objectname) %(objecttype) %(objectsize)";
@@ -594,22 +741,25 @@  static int batch_objects(struct batch_options *opt)
 	save_warning = warn_on_object_refname_ambiguity;
 	warn_on_object_refname_ambiguity = 0;
 
-	while (strbuf_getline(&input, stdin) != EOF) {
-		if (data.split_on_whitespace) {
-			/*
-			 * Split at first whitespace, tying off the beginning
-			 * of the string and saving the remainder (or NULL) in
-			 * data.rest.
-			 */
-			char *p = strpbrk(input.buf, " \t");
-			if (p) {
-				while (*p && strchr(" \t", *p))
-					*p++ = '\0';
+	if (command)
+		batch_objects_command(opt, &output, &data);
+	else {
+		while (strbuf_getline(&input, stdin) != EOF) {
+			if (data.split_on_whitespace) {
+				/*
+				 * Split at first whitespace, tying off the beginning
+				 * of the string and saving the remainder (or NULL) in
+				 * data.rest.
+				 */
+				char *p = strpbrk(input.buf, " \t");
+				if (p) {
+					while (*p && strchr(" \t", *p))
+						*p++ = '\0';
+				}
+				data.rest = p;
 			}
-			data.rest = p;
+			batch_one_object(input.buf, &output, opt, &data);
 		}
-
-		batch_one_object(input.buf, &output, opt, &data);
 	}
 
 	strbuf_release(&input);
@@ -646,6 +796,7 @@  static int batch_option_callback(const struct option *opt,
 
 	bo->enabled = 1;
 	bo->print_contents = !strcmp(opt->long_name, "batch");
+	bo->command = !strcmp(opt->long_name, "batch-command");
 	bo->format = arg;
 
 	return 0;
@@ -682,6 +833,11 @@  int cmd_cat_file(int argc, const char **argv, const char *prefix)
 			N_("show info about objects fed from the standard input"),
 			PARSE_OPT_OPTARG | PARSE_OPT_NONEG,
 			batch_option_callback),
+		OPT_CALLBACK_F(0, "batch-command", &batch, N_(""),
+			 N_("enters batch mode that accepts commands"),
+			 PARSE_OPT_OPTARG | PARSE_OPT_NONEG,
+			 batch_option_callback),
+
 		OPT_BOOL(0, "follow-symlinks", &batch.follow_symlinks,
 			 N_("follow in-tree symlinks (used with --batch or --batch-check)")),
 		OPT_BOOL(0, "batch-all-objects", &batch.all_objects,
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 39382fa1958..7360d049113 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -85,6 +85,34 @@  $content"
 	test_cmp expect actual
     '
 
+    test -z "$content" ||
+    test_expect_success "--batch-command output of $type content is correct" '
+	maybe_remove_timestamp "$batch_output" $no_ts >expect &&
+	maybe_remove_timestamp "$(echo contents $sha1 | git cat-file --batch-command)" $no_ts >actual &&
+	test_cmp expect actual
+    '
+
+    test -z "$content" ||
+    test_expect_success "--batch-command session for $type content is correct" '
+	maybe_remove_timestamp "$batch_output" $no_ts >expect &&
+	maybe_remove_timestamp \
+		"$(test_write_lines "begin" "$sha1" "get contents" | git cat-file --batch-command)" \
+		$no_ts >actual &&
+	test_cmp expect actual
+    '
+
+    test_expect_success "--batch-command output of $type info is correct" '
+	echo "$sha1 $type $size" >expect &&
+	echo "info $sha1" | git cat-file --batch-command >actual &&
+	test_cmp expect actual
+    '
+
+    test_expect_success "--batch-command session for $type info is correct" '
+	echo "$sha1 $type $size" >expect &&
+	test_write_lines "begin" "$sha1" "get info" | git cat-file --batch-command >actual &&
+	test_cmp expect actual
+    '
+
     test_expect_success "custom --batch-check format" '
 	echo "$type $sha1" >expect &&
 	echo $sha1 | git cat-file --batch-check="%(objecttype) %(objectname)" >actual &&
@@ -141,6 +169,7 @@  test_expect_success '--batch-check without %(rest) considers whole line' '
 '
 
 tree_sha1=$(git write-tree)
+
 tree_size=$(($(test_oid rawsz) + 13))
 tree_pretty_content="100644 blob $hello_sha1	hello"
 
@@ -175,7 +204,7 @@  test_expect_success \
     "Reach a blob from a tag pointing to it" \
     "test '$hello_content' = \"\$(git cat-file blob $tag_sha1)\""
 
-for batch in batch batch-check
+for batch in batch batch-check batch-command
 do
     for opt in t s e p
     do
@@ -281,6 +310,15 @@  test_expect_success "--batch-check with multiple sha1s gives correct format" '
     "$(echo_without_newline "$batch_check_input" | git cat-file --batch-check)"
 '
 
+test_expect_success "--batch-command with multiple sha1s gives correct format" '
+    echo "$batch_check_output" >expect &&
+    echo begin >input &&
+    echo_without_newline "$batch_check_input" >>input &&
+    echo "get info" >>input &&
+    git cat-file --batch-command <input >actual &&
+    test_cmp expect actual
+'
+
 test_expect_success 'setup blobs which are likely to delta' '
 	test-tool genrandom foo 10240 >foo &&
 	{ cat foo && echo plus; } >foo-plus &&
@@ -872,4 +910,10 @@  test_expect_success 'cat-file --batch-all-objects --batch-check ignores replace'
 	test_cmp expect actual
 '
 
+test_expect_success 'batch-command unknown command' '
+	echo unknown_command >cmd &&
+	test_expect_code 128 git cat-file --batch-command < cmd 2>err &&
+	grep -E "^fatal:.*unknown command.*" err
+'
+
 test_done