[v4,3/3] cat-file: add --batch-command mode

Message ID	6c51324a6623b62c385ec707a773c21375596584.1644465706.git.gitgitgadget@gmail.com (mailing list archive)
State	Superseded
Headers	Return-Path: <git-owner@kernel.org>
Message-Id: <6c51324a6623b62c385ec707a773c21375596584.1644465706.git.gitgitgadget@gmail.com>
In-Reply-To: <pull.1212.v4.git.git.1644465706.gitgitgadget@gmail.com>
References: <pull.1212.v3.git.git.1644353884.gitgitgadget@gmail.com> <pull.1212.v4.git.git.1644465706.gitgitgadget@gmail.com>
Date: Thu, 10 Feb 2022 04:01:46 +0000
Subject: [PATCH v4 3/3] cat-file: add --batch-command mode
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Fcc: Sent
To: git@vger.kernel.org
Cc: me@ttaylorr.com, phillip.wood123@gmail.com, avarab@gmail.com, e@80x24.org, bagasdotme@gmail.com, gitster@pobox.com, Eric Sunshine <sunshine@sunshineco.com>, Jonathan Tan <jonathantanmy@google.com>, John Cai <johncai86@gmail.com>, John Cai <johncai86@gmail.com>
Precedence: bulk
From: John Cai <johncai86@gmail.com>
Series	Add cat-file --batch-command flag
[v4,0/3] Add cat-file --batch-command flag
[v4,1/3] cat-file: rename cmdmode to transform_mode
[v4,2/3] cat-file: introduce batch_mode enum to replace print_contents
[v4,3/3] cat-file: add --batch-command mode

John Cai Feb. 10, 2022, 4:01 a.m. UTC

From: John Cai <johncai86@gmail.com>

Add a new flag --batch-command that accepts commands and arguments
from stdin, similar to git-update-ref --stdin.

At GitLab, we use a pair of long running cat-file processes when
accessing object content. One for iterating over object metadata with
--batch-check, and the other to grab object contents with --batch.

However, if we had --batch-command, we wouldn't need to keep both
processes around, and instead just have one --batch-command process
where we can flip between getting object info, and getting object
contents. Since we have a pair of cat-file processes per repository,
this means we can get rid of roughly half of long lived git cat-file
processes. Given there are many repositories being accessed at any given
time, this can lead to huge savings.

git cat-file --batch-command

will enter an interactive command mode whereby the user can enter in
commands and their arguments that get queued in memory:

<command1> [arg1] [arg2] LF
<command2> [arg1] [arg2] LF

When --buffer mode is used, commands will be queued in memory until a
flush command is issued to execute them:

flush LF

The reason for a flush command is that when a consumer process (A)
talks to a git cat-file process (B) and interactively writes to and
reads from it in --buffer mode, (A) needs to be able to control when
the buffer is flushed to stdout.

Currently, from (A)'s perspective, the only way is to either

1. kill (B)'s process
2. send an invalid object to stdin.

1. is not ideal from a performance perspective as it will require
spawning a new cat-file process each time, and 2. is hacky and not a
good long term solution.

With this mechanism of queueing up commands and letting (A) issue a
flush command, process (A) can control when the buffer is flushed and
can guarantee it will receive all of the output when in --buffer mode.
--batch-command also will not allow (B) to flush to stdout until a flush
is received.

This patch adds the basic structure for adding commands, which can be
extended in the future to add more commands. It also adds the following
two commands (on top of the flush command):

contents <object> LF
info <object> LF

The contents command takes an <object> argument and prints out the object
contents.

The info command takes an <object> argument and prints out the object
metadata.

These can be used in the following way with --buffer:

info <sha1> LF
contents <sha1> LF
contents <sha1> LF
info <sha1> LF
flush
info <sha1> LF
flush

When used without --buffer:

info <sha1> LF
contents <sha1> LF
contents <sha1> LF
info <sha1> LF
info <sha1> LF

Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: John Cai <johncai86@gmail.com>
---
 Documentation/git-cat-file.txt |  24 ++++
 builtin/cat-file.c             | 124 +++++++++++++++++++
 t/t1006-cat-file.sh            | 211 ++++++++++++++++++++++++++++++++-
 3 files changed, 358 insertions(+), 1 deletion(-)

Phillip Wood Feb. 10, 2022, 10:57 a.m. UTC | #1

Hi John

I've concentrated on the tests as others have commented on the
implementation.

On 10/02/2022 04:01, John Cai via GitGitGadget wrote:
> From: John Cai <johncai86@gmail.com>
> [...]
> diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
> index 145eee11df9..a20c8dae85d 100755
> --- a/t/t1006-cat-file.sh
> +++ b/t/t1006-cat-file.sh
> @@ -177,6 +177,24 @@ $content"
>   	test_cmp expect actual
>       '
>   
> +    for opt in --buffer --no-buffer
> +    do
> +	test -z "$content" ||
> +		test_expect_success "--batch-command $opt output of $type content is correct" '
> +		maybe_remove_timestamp "$batch_output" $no_ts >expect &&
> +		maybe_remove_timestamp "$(test_write_lines "contents $sha1" \
> +		| git cat-file --batch-command $opt)" $no_ts >actual &&
> +		test_cmp expect actual
> +	'
> +
> +	test_expect_success "--batch-command $opt output of $type info is correct" '
> +		echo "$sha1 $type $size" >expect &&
> +		test_write_lines "info $sha1" \
> +		| git cat-file --batch-command $opt >actual &&
> +		test_cmp expect actual
> +	'
> +    done
> +
>       test_expect_success "custom --batch-check format" '
>   	echo "$type $sha1" >expect &&
>   	echo $sha1 | git cat-file --batch-check="%(objecttype) %(objectname)" >actual &&
> @@ -213,6 +231,70 @@ $content"
>       '
>   }
>   
> +run_buffer_test_flush () {
> +	type=$1
> +	sha1=$2
> +	size=$3
> +
> +	mkfifo input &&
> +	test_when_finished 'rm input' &&
> +	mkfifo output &&
> +	exec 9<>output &&
> +	test_when_finished 'rm output; exec 9<&-'
> +	(
> +		# TODO - Ideally we'd pipe the output of cat-file
> +		# through "sed s'/$/\\/'" to make sure that that read
> +		# would consume all the available
> +		# output. Unfortunately we cannot do this as we cannot
> +		# control when sed flushes its output. We could write
> +		# a test helper in C that appended a '\' to the end of
> +		# each line and flushes its output after every line.
> +		git cat-file --buffer --batch-command <input 2>err &
> +		echo $! &&
> +		wait $!
> +	) >&9 &
> +	sh_pid=$! &&
> +	read cat_file_pid <&9 &&
> +	test_when_finished "kill $cat_file_pid
> +			    kill $sh_pid; wait $sh_pid; :" &&
> +	echo "$sha1 $type $size" >expect &&
> +	test_write_lines "info $sha1" flush "info $sha1" >input

This closes input and so cat-file exits and flushes its output -
therefore you are not testing whether flush actually flushes. When I
wrote this test in [1], this line was inside a subshell that was
redirected to the input fifo so that the read happened before cat-file
exited. This test is also not testing the exit code of cat-file or that
the output is flushed on exit. Is there a reason you can't just use the
test as I wrote it? I'm happy to explain anything that isn't clear.
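
To illustrate the point about the fifo staying open, here is a minimal sketch that does not use git at all. The function fake_batch is a hypothetical stand-in for "git cat-file --buffer --batch-command": it queues one reply per command and writes them out only on "flush" or at end of input. The reply text and the fd numbers 8 and 9 are assumptions made for the illustration. Because the first reply is read while the input fifo is still open, it can only have arrived because of the flush; the second reply arrives only once the input is closed.

```shell
# Hypothetical stand-in for a buffering batch process: replies are
# queued and written only on "flush" or when input reaches EOF.
fake_batch () {
	buf=
	while read -r cmd arg
	do
		case "$cmd" in
		flush)
			printf '%s' "$buf"
			buf=
			;;
		*)
			# Queue one reply line (the closing quote is on the next line).
			buf="${buf}${cmd}-reply ${arg}
"
			;;
		esac
	done
	# EOF on input: write whatever is still queued.
	printf '%s' "$buf"
}

dir=$(mktemp -d)
mkfifo "$dir/in" "$dir/out"
fake_batch <"$dir/in" >"$dir/out" &

# Keep both fifos open in this shell so the writer side does not
# signal EOF after the first batch of commands.
exec 8>"$dir/in" 9<"$dir/out"

printf 'info abc\nflush\ninfo def\n' >&8
read -r first <&9	# input is still open, so this reply came from "flush"

exec 8>&-		# closing the input makes the stand-in write the rest
read -r second <&9

exec 9<&-
wait
rm -rf "$dir"
```

If the commands were instead written with a plain ">in" redirection, the fifo would be closed as soon as the write finished, and both replies would appear regardless of whether flush worked.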

> +	# TODO - consume all available input, not just one
> +	# line (see above).
> +	read actual <&9 &&
> +	echo "$actual" >actual &&
> +	test_cmp expect actual &&
> +	test_must_be_empty err
> +}
> +
> +run_buffer_test_no_flush () {
> +	type=$1
> +	sha1=$2
> +	size=$3
> +
> +	touch output &&
> +	test_when_finished 'rm output' &&
> +	mkfifo input &&
> +	test_when_finished 'rm input' &&
> +	mkfifo pid &&
> +	exec 9<>pid &&
> +	test_when_finished 'rm pid; exec 9<&-'
> +	(
> +		git cat-file --buffer --batch-command <input >output &
> +		echo $! &&
> +		wait $!
> +		echo $?
> +	) >&9 &
> +	sh_pid=$! &&
> +	read cat_file_pid <&9 &&
> +	test_when_finished "kill $cat_file_pid
> +			    kill $sh_pid; wait $sh_pid; :" &&
> +	test_write_lines "info $sha1" "info $sha1" &&

This prints to stdout rather than piping into cat-file, so cat-file
would not produce any output even if it exited normally. In my
original [1] this line is inside a subshell that is redirected to the
input fifo.

> +	kill $cat_file_pid &&
> +	read status <&9 &&
> +	test_must_be_empty output
> +}
> +
>   hello_content="Hello World"
>   hello_size=$(strlen "$hello_content")
>   hello_sha1=$(echo_without_newline "$hello_content" | git hash-object --stdin)
> @@ -224,6 +306,14 @@ test_expect_success "setup" '
>   
>   run_tests 'blob' $hello_sha1 $hello_size "$hello_content" "$hello_content"
>   
> +test_expect_success PIPE '--batch-command --buffer with flush for blob info' '
> +       run_buffer_test_flush blob $hello_sha1 $hello_size
> +'
> +
> +test_expect_success PIPE '--batch-command --buffer without flush for blob info' '
> +       run_buffer_test_no_flush blob $hello_sha1 $hello_size false
> +'

If we need to run the flush tests for each object type then could they 
go inside run_tests? Personally I think I'd be happy just to test the 
flush command on one object type.

>   test_expect_success '--batch-check without %(rest) considers whole line' '
>   	echo "$hello_sha1 blob $hello_size" >expect &&
>   	git update-index --add --cacheinfo 100644 $hello_sha1 "white space" &&
> @@ -238,6 +328,14 @@ tree_pretty_content="100644 blob $hello_sha1	hello"
>   
>   run_tests 'tree' $tree_sha1 $tree_size "" "$tree_pretty_content"
>   
> +test_expect_success PIPE '--batch-command --buffer with flush for tree info' '
> +       run_buffer_test_flush tree $tree_sha1 $tree_size
> +'
> +
> +test_expect_success PIPE '--batch-command --buffer without flush for tree info' '
> +       run_buffer_test_no_flush tree $tree_sha1 $tree_size false
> +'
> +
>   commit_message="Initial commit"
>   commit_sha1=$(echo_without_newline "$commit_message" | git commit-tree $tree_sha1)
>   commit_size=$(($(test_oid hexsz) + 137))
> @@ -249,6 +347,14 @@ $commit_message"
>   
>   run_tests 'commit' $commit_sha1 $commit_size "$commit_content" "$commit_content" 1
>   
> +test_expect_success PIPE '--batch-command --buffer with flush for commit info' '
> +       run_buffer_test_flush commit $commit_sha1 $commit_size
> +'
> +
> +test_expect_success PIPE '--batch-command --buffer without flush for commit info' '
> +       run_buffer_test_no_flush commit $commit_sha1 $commit_size false
> +'
> +
>   tag_header_without_timestamp="object $hello_sha1
>   type blob
>   tag hellotag
> @@ -263,11 +369,19 @@ tag_size=$(strlen "$tag_content")
>   
>   run_tests 'tag' $tag_sha1 $tag_size "$tag_content" "$tag_content" 1
>   
> +test_expect_success PIPE '--batch-command --buffer with flush for tag info' '
> +       run_buffer_test_flush tag $tag_sha1 $tag_size
> +'
> +
> +test_expect_success PIPE '--batch-command --buffer without flush for tag info' '
> +       run_buffer_test_no_flush tag $tag_sha1 $tag_size false
> +'
> +
>   test_expect_success \
>       "Reach a blob from a tag pointing to it" \
>       "test '$hello_content' = \"\$(git cat-file blob $tag_sha1)\""
>   
> -for batch in batch batch-check
> +for batch in batch batch-check batch-command
>   do
>       for opt in t s e p
>       do
> @@ -373,6 +487,72 @@ test_expect_success "--batch-check with multiple sha1s gives correct format" '
>       "$(echo_without_newline "$batch_check_input" | git cat-file --batch-check)"
>   '
>   
> +batch_command_info_input="info $hello_sha1
> +info $tree_sha1
> +info $commit_sha1
> +info $tag_sha1
> +info deadbeef

I know there are existing uses of the constant in the file but I'm not 
thrilled about adding more.

> +flush

This flush is redundant, isn't it?

> +"
> +
> +batch_command_info_output="$hello_sha1 blob $hello_size
> +$tree_sha1 tree $tree_size
> +$commit_sha1 commit $commit_size
> +$tag_sha1 tag $tag_size
> +deadbeef missing"
> +
> +test_expect_success "--batch-command with multiple info calls gives correct format" '

double quotes are generally reserved for test titles that use parameter 
substitution which this one does not.

> +	test "$batch_command_info_output" = "$(echo_without_newline \
> +	"$batch_command_info_input" | git cat-file --batch-command --buffer)"
> +'

This test and the one below are quite hard to follow. These days we try
to avoid using test to compare strings as when it fails it does not
provide any clues as to what went wrong. Instead we use here documents
and test_cmp so that when a test fails you can see what went wrong. Also
the setup happens inside the test:

test_expect_success '--batch-command with multiple info calls gives correct format' '
	batch_command_info_input="info $hello_sha1\
	info $tree_sha1\
	info $commit_sha1\
	info $tag_sha1\
	info deadbeef\
	flush"
	
	cat >expect <<-EOF &&
	$hello_sha1 blob $hello_size
	$tree_sha1 tree $tree_size
	$commit_sha1 commit $commit_size
	$tag_sha1 tag $tag_size
	deadbeef missing
	EOF

	echo_without_newline "$batch_command_info_input" | git cat-file --batch-command --buffer >actual &&
	test_cmp expect actual
'
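
One detail to watch when adapting the suggestion above: as far as I can tell, a backslash at the end of a line inside double quotes is a line continuation in POSIX shells, so the newline is dropped and the commands would end up on a single line. A here document for the input, like the one already used for the expected output, avoids that. The sketch below only demonstrates the quoting behaviour; the object names aaa and bbb and the variable names are placeholders invented for the illustration.

```shell
tmp=$(mktemp -d)

# Backslash-newline inside double quotes joins the two lines,
# so this variable holds a single line of text.
joined="info aaa\
info bbb"

# A here document keeps one command per line.
cat >"$tmp/input" <<EOF
info aaa
info bbb
EOF

# Count the non-empty lines in each form.
joined_lines=$(printf '%s\n' "$joined" | grep -c .)
heredoc_lines=$(grep -c . "$tmp/input")

rm -rf "$tmp"
```

With the here document form, the file can be fed to the command with a plain "<input" redirection.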

> +batch_command_contents_input="contents $hello_sha1
> +contents $commit_sha1
> +contents $tag_sha1
> +contents deadbeef
> +flush
> +"
> +
> +batch_command_output="$hello_sha1 blob $hello_size
> +$hello_content
> +$commit_sha1 commit $commit_size
> +$commit_content
> +$tag_sha1 tag $tag_size
> +$tag_content
> +deadbeef missing"
> +
> +test_expect_success "--batch-command with multiple contents calls gives correct format" '
> +	test "$(maybe_remove_timestamp "$batch_command_output" 1)" = \
> +	"$(maybe_remove_timestamp "$(echo_without_newline "$batch_command_contents_input" | git cat-file --batch-command)" 1)"
> +'
> +
> +batch_command_mixed_input="info $hello_sha1
> +contents $hello_sha1
> +info $commit_sha1
> +contents $commit_sha1
> +info $tag_sha1
> +contents $tag_sha1
> +contents deadbeef
> +flush
> +"
> +
> +batch_command_mixed_output="$hello_sha1 blob $hello_size
> +$hello_sha1 blob $hello_size
> +$hello_content
> +$commit_sha1 commit $commit_size
> +$commit_sha1 commit $commit_size
> +$commit_content
> +$tag_sha1 tag $tag_size
> +$tag_sha1 tag $tag_size
> +$tag_content
> +deadbeef missing"
> +
> +test_expect_success "--batch-command with mixed calls gives correct format" '
> +	test "$(maybe_remove_timestamp "$batch_command_mixed_output" 1)" = \
> +	"$(maybe_remove_timestamp "$(echo_without_newline \
> +	"$batch_command_mixed_input" | git cat-file --batch-command --buffer)" 1)"
> +'
> +
>   test_expect_success 'setup blobs which are likely to delta' '
>   	test-tool genrandom foo 10240 >foo &&
>   	{ cat foo && echo plus; } >foo-plus &&
> @@ -963,5 +1143,34 @@ test_expect_success 'cat-file --batch-all-objects --batch-check ignores replace'
>   	echo "$orig commit $orig_size" >expect &&
>   	test_cmp expect actual
>   '
> +test_expect_success 'batch-command empty command' '
> +	echo "" >cmd &&
> +	test_expect_code 128 git cat-file --batch-command <cmd 2>err &&
> +	grep -E "^fatal:.*empty command in input.*" err
> +'

This test and the ones below look good but they don't need to pass -E to
grep as they are not using an extended regex.
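
As a quick check of the remark about -E, the following sketch matches one of the patterns from the tests with and without the flag. The pattern uses only "^", "." and "*", which behave the same in basic and extended regular expressions. The message text is copied from the test above; the temporary file and variable names are assumptions for the illustration.

```shell
err=$(mktemp)
echo "fatal: empty command in input" >"$err"

# Basic regular expression: no -E needed for ^ . and *
grep "^fatal:.*empty command in input" "$err" >/dev/null && plain=match

# Extended regular expression: same result for this pattern
grep -E "^fatal:.*empty command in input" "$err" >/dev/null && extended=match

rm -f "$err"
```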

Best Wishes

Phillip

[1] 
https://lore.kernel.org/git/e75ba9ea-fdda-6e9f-4dd6-24190117d93b@gmail.com

> +test_expect_success 'batch-command whitespace before command' '
> +	echo " info deadbeef" >cmd &&
> +	test_expect_code 128 git cat-file --batch-command <cmd 2>err &&
> +	grep -E "^fatal:.*whitespace before command.*" err
> +'
> +
> +test_expect_success 'batch-command unknown command' '
> +	echo unknown_command >cmd &&
> +	test_expect_code 128 git cat-file --batch-command <cmd 2>err &&
> +	grep -E "^fatal:.*unknown command.*" err
> +'
> +
> +test_expect_success 'batch-command flush with arguments' '
> +	echo "flush arg" >cmd &&
> +	test_expect_code 128 git cat-file --batch-command --buffer <cmd 2>err &&
> +	grep -E "^fatal:.*flush takes no arguments.*" err
> +'
> +
> +test_expect_success 'batch-command flush without --buffer' '
> +	echo "flush arg" >cmd &&
> +	test_expect_code 128 git cat-file --batch-command <cmd 2>err &&
> +	grep -E "^fatal:.*flush is only for --buffer mode.*" err
> +'
>   
>   test_done

Junio C Hamano Feb. 10, 2022, 5:05 p.m. UTC | #2

Phillip Wood <phillip.wood123@gmail.com> writes:

>> +	type=$1
>> +	sha1=$2
>> +	size=$3
>> +
>> +	mkfifo input &&
>> +	test_when_finished 'rm input' &&
>> +	mkfifo output &&
>> +	exec 9<>output &&
>> +	test_when_finished 'rm output; exec 9<&-'
>> +	(
>> +		# TODO - Ideally we'd pipe the output of cat-file
>> +		# through "sed s'/$/\\/'" to make sure that that read
>> +		# would consume all the available
>> +		# output. Unfortunately we cannot do this as we cannot
>> +		# control when sed flushes its output. We could write
>> +		# a test helper in C that appended a '\' to the end of
>> +		# each line and flushes its output after every line.
>> +		git cat-file --buffer --batch-command <input 2>err &
>> +		echo $! &&
>> +		wait $!
>> +	) >&9 &
>> +	sh_pid=$! &&
>> +	read cat_file_pid <&9 &&
>> +	test_when_finished "kill $cat_file_pid
>> +			    kill $sh_pid; wait $sh_pid; :" &&
>> +	echo "$sha1 $type $size" >expect &&
>> +	test_write_lines "info $sha1" flush "info $sha1" >input
>
> This closes input and so cat-file exits and flushes its output -
> therefore you are not testing whether flush actually flushes. When I 
> wrote this test in[1] this line was inside a subshell that was
> redirected to the input fifo so that the read happened before cat-file 
> exited.

Yeah, very good point.

> This test is also not testing the exit code of cat-file or
> that the output is flushed on exit. Is there a reason you can't just
> use the test as I wrote it? I'm happy to explain anything that isn't
> clear.

I admit I do not offhand recall what your tests did but help with
this (and more) level of detail with an offer to collaborate is
something I am very happy to see.  Thanks for working well together.

One thing that I wasn't quite sure about was how well failure cases are
tested.  If we ask, in a batch mode, "info" for two objects and then
"flush", does the asker get enough clue when to read and when to
stop reading with all four combinations of states, i.e. asking for
two missing objects, one good object and one bad object, one bad
object and one good object, two good objects, for example?

Testing such combinations reliably is tricky. If the asker needs to
react differently to different responses, a test that expects good
and then bad may not just fail but can get into deadlock. For
example, suppose the reaction to a good response is to read a lot,
while the reaction to a bad response is just to consume the "bad
object" notice. If a bug in the program being tested makes it issue
the response for a bad case when the asker is expecting a response
for a good object, the asker will keep waiting for more output that
may never come.
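
A minimal sketch of the decision the asker has to make, assuming the reply layout shown in the expected outputs of the tests in this thread: "<object> <type> <size>" followed by the contents and a trailing LF for a good object, and "<object> missing" with nothing after it for a bad one. The function name payload_len is hypothetical. It returns how many more bytes the asker should read after the header of a "contents" reply, which is exactly where a wrong guess would leave the asker blocked.

```shell
# Given the header line of a "contents" reply, print how many more
# bytes the asker must read before the next header (hypothetical helper).
payload_len () {
	case "$1" in
	*" missing")
		# Nothing follows a "missing" notice.
		echo 0
		;;
	*)
		# Header is "<object> <type> <size>"; contents are <size>
		# bytes followed by one LF.
		set -- $1
		echo $(($3 + 1))
		;;
	esac
}

good=$(payload_len "abc123 blob 11")
bad=$(payload_len "deadbeef missing")
```

For "info" replies nothing follows the header in either case, so the asker only has to read one line per command.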

John Cai Feb. 10, 2022, 6:55 p.m. UTC | #3

Hi Phillip,

Thanks again for helping with this! A few comments/questions below:

On 10 Feb 2022, at 5:57, Phillip Wood wrote:

> Hi John
>
> I've concentrated on the tests as others have commented on the implementation
>
> On 10/02/2022 04:01, John Cai via GitGitGadget wrote:
>> From: John Cai <johncai86@gmail.com>
>> [...]
>> diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
>> index 145eee11df9..a20c8dae85d 100755
>> --- a/t/t1006-cat-file.sh
>> +++ b/t/t1006-cat-file.sh
>> @@ -177,6 +177,24 @@ $content"
>>   	test_cmp expect actual
>>       '
>>  +    for opt in --buffer --no-buffer
>> +    do
>> +	test -z "$content" ||
>> +		test_expect_success "--batch-command $opt output of $type content is correct" '
>> +		maybe_remove_timestamp "$batch_output" $no_ts >expect &&
>> +		maybe_remove_timestamp "$(test_write_lines "contents $sha1" \
>> +		| git cat-file --batch-command $opt)" $no_ts >actual &&
>> +		test_cmp expect actual
>> +	'
>> +
>> +	test_expect_success "--batch-command $opt output of $type info is correct" '
>> +		echo "$sha1 $type $size" >expect &&
>> +		test_write_lines "info $sha1" \
>> +		| git cat-file --batch-command $opt >actual &&
>> +		test_cmp expect actual
>> +	'
>> +    done
>> +
>>       test_expect_success "custom --batch-check format" '
>>   	echo "$type $sha1" >expect &&
>>   	echo $sha1 | git cat-file --batch-check="%(objecttype) %(objectname)" >actual &&
>> @@ -213,6 +231,70 @@ $content"
>>       '
>>   }
>>  +run_buffer_test_flush () {
>> +	type=$1
>> +	sha1=$2
>> +	size=$3
>> +
>> +	mkfifo input &&
>> +	test_when_finished 'rm input' &&
>> +	mkfifo output &&
>> +	exec 9<>output &&
>> +	test_when_finished 'rm output; exec 9<&-'
>> +	(
>> +		# TODO - Ideally we'd pipe the output of cat-file
>> +		# through "sed s'/$/\\/'" to make sure that that read
>> +		# would consume all the available
>> +		# output. Unfortunately we cannot do this as we cannot
>> +		# control when sed flushes its output. We could write
>> +		# a test helper in C that appended a '\' to the end of
>> +		# each line and flushes its output after every line.
>> +		git cat-file --buffer --batch-command <input 2>err &
>> +		echo $! &&
>> +		wait $!
>> +	) >&9 &
>> +	sh_pid=$! &&
>> +	read cat_file_pid <&9 &&
>> +	test_when_finished "kill $cat_file_pid
>> +			    kill $sh_pid; wait $sh_pid; :" &&
>> +	echo "$sha1 $type $size" >expect &&
>> +	test_write_lines "info $sha1" flush "info $sha1" >input
>
> This closes input and so cat-file exits and flushes its output - therefore you are not testing whether flush actually flushes. When I wrote this test in[1] this line was inside a subshell that was redirected to the input fifo so that the read happened before cat-file exited. This test is also not testing the exit code of cat-file or that the output is flushed on exit. Is there a reason you can't just use the test as I wrote it? I'm happy to explain anything that isn't clear.

I've restored the tests in the form you suggested. I had removed some lines to simplify the test but as it turns out I removed some of the important aspects of the test.

Here are my modifications to the tests you helped me with. Let me know if these changes make sense, or if I'm missing something.

> @@ -3,6 +3,7 @@ run_buffer_test_flush () {
>         sha1=$2
>         size=$3
>
> +       rm -f input output &&

On my end some tests were hanging because these files were not getting removed
by test_when_finished.

>         mkfifo input &&
>         test_when_finished 'rm input'
>         mkfifo output &&
> @@ -26,7 +27,7 @@ run_buffer_test_flush () {
>         test_when_finished "kill $cat_file_pid
>                             kill $sh_pid; wait $sh_pid; :" &&
>         (
> -               test_write_lines "info $sha1" fflush "info $sha1" &&
> +               test_write_lines "info $sha1" flush "info $sha1" &&
>                 # TODO - consume all available input, not just one
>                 # line (see above).
>                 read actual <&9 &&
> @@ -48,13 +49,14 @@ run_buffer_test_no_flush () {
>         sha1=$2
>         size=$3
>
> +       touch output &&

It looks like test_must_be_empty expects the file to exist, and if nothing is
ever written to output, the file is never created.
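
To make the distinction concrete, here is a small stand-in written for this note. It is not the helper from Git's test library, only an approximation of the behaviour described above, in which a missing file counts as a failure and an existing but empty file passes. The name must_be_empty and the temporary paths are assumptions.

```shell
# Hypothetical approximation of an "is this file empty" check that
# insists on the file existing first.
must_be_empty () {
	if ! test -f "$1"
	then
		echo "'$1' does not exist" >&2
		return 1
	fi
	if test -s "$1"
	then
		echo "'$1' is not empty" >&2
		return 1
	fi
}

tmp=$(mktemp -d)

# The file was never created: the check fails.
must_be_empty "$tmp/output" 2>/dev/null && missing=pass || missing=fail

# Creating the file up front (as "touch output" does) makes it pass.
: >"$tmp/output"
must_be_empty "$tmp/output" && created=pass || created=fail

rm -rf "$tmp"
```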

>         mkfifo input &&
>         test_when_finished 'rm input'
>         mkfifo pid &&
>         exec 9<>pid &&
>         test_when_finished 'rm pid; exec 9<&-'
>         (
> -               git cat-file --buffer --batch-command <input >output &
> +               git cat-file --buffer --batch-command <input >>output &
>                 echo $! &&
>                 wait $!
>                 echo $?
> @@ -67,7 +69,7 @@ run_buffer_test_no_flush () {
>                 test_write_lines "info $sha1" "info $sha1" &&
>                 kill $cat_file_pid &&
>                 read status <&9 &&
> -               test "$status" -ne 0 &&
> -               test_must_be_empty output
> -       ) >input
> +               test "$status" -ne 0
> +       ) >input &&
> +       test_must_be_empty output

I wanted to ask about this, because the test hung here. I surmised that it was
because we are checking the output before writing to the input.

>  }

>
>> +	# TODO - consume all available input, not just one
>> +	# line (see above).
>> +	read actual <&9 &&
>> +	echo "$actual" >actual &&
>> +	test_cmp expect actual &&
>> +	test_must_be_empty err
>> +}
>> +
>> +run_buffer_test_no_flush () {
>> +	type=$1
>> +	sha1=$2
>> +	size=$3
>> +
>> +	touch output &&
>> +	test_when_finished 'rm output' &&
>> +	mkfifo input &&
>> +	test_when_finished 'rm input' &&
>> +	mkfifo pid &&
>> +	exec 9<>pid &&
>> +	test_when_finished 'rm pid; exec 9<&-'
>> +	(
>> +		git cat-file --buffer --batch-command <input >output &
>> +		echo $! &&
>> +		wait $!
>> +		echo $?
>> +	) >&9 &
>> +	sh_pid=$! &&
>> +	read cat_file_pid <&9 &&
>> +	test_when_finished "kill $cat_file_pid
>> +			    kill $sh_pid; wait $sh_pid; :" &&
>> +	test_write_lines "info $sha1" "info $sha1" &&
>
> This prints to stdout rather than piping into cat-file so it would not produce any output even if it exited normally. In my original[1] this line is inside a subshell that is redirected to the input fifo.
>
>> +	kill $cat_file_pid &&
>> +	read status <&9 &&
>> +	test_must_be_empty output
>> +}
>> +
>>   hello_content="Hello World"
>>   hello_size=$(strlen "$hello_content")
>>   hello_sha1=$(echo_without_newline "$hello_content" | git hash-object --stdin)
>> @@ -224,6 +306,14 @@ test_expect_success "setup" '
>>    run_tests 'blob' $hello_sha1 $hello_size "$hello_content" "$hello_content"
>>  +test_expect_success PIPE '--batch-command --buffer with flush for blob info' '
>> +       run_buffer_test_flush blob $hello_sha1 $hello_size
>> +'
>> +
>> +test_expect_success PIPE '--batch-command --buffer without flush for blob info' '
>> +       run_buffer_test_no_flush blob $hello_sha1 $hello_size false
>> +'
>
> If we need to run the flush tests for each object type then could they go inside run_tests? Personally I think I'd be happy just to test the flush command on one object type.

yeah, that makes sense

>
>>   test_expect_success '--batch-check without %(rest) considers whole line' '
>>   	echo "$hello_sha1 blob $hello_size" >expect &&
>>   	git update-index --add --cacheinfo 100644 $hello_sha1 "white space" &&
>> @@ -238,6 +328,14 @@ tree_pretty_content="100644 blob $hello_sha1	hello"
>>    run_tests 'tree' $tree_sha1 $tree_size "" "$tree_pretty_content"
>>  +test_expect_success PIPE '--batch-command --buffer with flush for tree info' '
>> +       run_buffer_test_flush tree $tree_sha1 $tree_size
>> +'
>> +
>> +test_expect_success PIPE '--batch-command --buffer without flush for tree info' '
>> +       run_buffer_test_no_flush tree $tree_sha1 $tree_size false
>> +'
>> +
>>   commit_message="Initial commit"
>>   commit_sha1=$(echo_without_newline "$commit_message" | git commit-tree $tree_sha1)
>>   commit_size=$(($(test_oid hexsz) + 137))
>> @@ -249,6 +347,14 @@ $commit_message"
>>    run_tests 'commit' $commit_sha1 $commit_size "$commit_content" "$commit_content" 1
>>  +test_expect_success PIPE '--batch-command --buffer with flush for commit info' '
>> +       run_buffer_test_flush commit $commit_sha1 $commit_size
>> +'
>> +
>> +test_expect_success PIPE '--batch-command --buffer without flush for commit info' '
>> +       run_buffer_test_no_flush commit $commit_sha1 $commit_size false
>> +'
>> +
>>   tag_header_without_timestamp="object $hello_sha1
>>   type blob
>>   tag hellotag
>> @@ -263,11 +369,19 @@ tag_size=$(strlen "$tag_content")
>>    run_tests 'tag' $tag_sha1 $tag_size "$tag_content" "$tag_content" 1
>>  +test_expect_success PIPE '--batch-command --buffer with flush for tag info' '
>> +       run_buffer_test_flush tag $tag_sha1 $tag_size
>> +'
>> +
>> +test_expect_success PIPE '--batch-command --buffer without flush for tag info' '
>> +       run_buffer_test_no_flush tag $tag_sha1 $tag_size false
>> +'
>> +
>>   test_expect_success \
>>       "Reach a blob from a tag pointing to it" \
>>       "test '$hello_content' = \"\$(git cat-file blob $tag_sha1)\""
>>  -for batch in batch batch-check
>> +for batch in batch batch-check batch-command
>>   do
>>       for opt in t s e p
>>       do
>> @@ -373,6 +487,72 @@ test_expect_success "--batch-check with multiple sha1s gives correct format" '
>>       "$(echo_without_newline "$batch_check_input" | git cat-file --batch-check)"
>>   '
>>  +batch_command_info_input="info $hello_sha1
>> +info $tree_sha1
>> +info $commit_sha1
>> +info $tag_sha1
>> +info deadbeef
>
> I know there are existing uses of the constant in the file but I'm not thrilled about adding more.
>
>> +flush
>
> This flush is redundant, isn't it?

true, we don't actually need it

>
>> +"
>> +
>> +batch_command_info_output="$hello_sha1 blob $hello_size
>> +$tree_sha1 tree $tree_size
>> +$commit_sha1 commit $commit_size
>> +$tag_sha1 tag $tag_size
>> +deadbeef missing"
>> +
>> +test_expect_success "--batch-command with multiple info calls gives correct format" '
>
> double quotes are generally reserved for test titles that use parameter substitution which this one does not.
>
>> +	test "$batch_command_info_output" = "$(echo_without_newline \
>> +	"$batch_command_info_input" | git cat-file --batch-command --buffer)"
>> +'
>
> This test and the one below are quite hard to follow. These days we try to avoid using test to compare strings as when it fails it does not provide any clues as to what went wrong. Instead we use here documents and test_cmp so that when a test fails you can see what went wrong. Also the setup happens inside the test:
>
> test_expect_success '--batch-command with multiple info calls gives correct format' '
>     batch_command_info_input="info $hello_sha1\
>     info $tree_sha1\
>     info $commit_sha1\
>     info $tag_sha1\
>     info deadbeef\
>     flush"
>
>     cat >expect <<-EOF &&
>     $hello_sha1 blob $hello_size
>     $tree_sha1 tree $tree_size
>     $commit_sha1 commit $commit_size
>     $tag_sha1 tag $tag_size
>     deadbeef missing
>     EOF
>
>     echo_without_newline "$batch_command_info_input" | git cat-file --batch-command --buffer >actual &&
>     test_cmp expect actual
> '

sounds good, will adjust
>
>> +batch_command_contents_input="contents $hello_sha1
>> +contents $commit_sha1
>> +contents $tag_sha1
>> +contents deadbeef
>> +flush
>> +"
>> +
>> +batch_command_output="$hello_sha1 blob $hello_size
>> +$hello_content
>> +$commit_sha1 commit $commit_size
>> +$commit_content
>> +$tag_sha1 tag $tag_size
>> +$tag_content
>> +deadbeef missing"
>> +
>> +test_expect_success "--batch-command with multiple contents calls gives correct format" '
>> +	test "$(maybe_remove_timestamp "$batch_command_output" 1)" = \
>> +	"$(maybe_remove_timestamp "$(echo_without_newline "$batch_command_contents_input" | git cat-file --batch-command)" 1)"
>> +'
>> +
>> +batch_command_mixed_input="info $hello_sha1
>> +contents $hello_sha1
>> +info $commit_sha1
>> +contents $commit_sha1
>> +info $tag_sha1
>> +contents $tag_sha1
>> +contents deadbeef
>> +flush
>> +"
>> +
>> +batch_command_mixed_output="$hello_sha1 blob $hello_size
>> +$hello_sha1 blob $hello_size
>> +$hello_content
>> +$commit_sha1 commit $commit_size
>> +$commit_sha1 commit $commit_size
>> +$commit_content
>> +$tag_sha1 tag $tag_size
>> +$tag_sha1 tag $tag_size
>> +$tag_content
>> +deadbeef missing"
>> +
>> +test_expect_success "--batch-command with mixed calls gives correct format" '
>> +	test "$(maybe_remove_timestamp "$batch_command_mixed_output" 1)" = \
>> +	"$(maybe_remove_timestamp "$(echo_without_newline \
>> +	"$batch_command_mixed_input" | git cat-file --batch-command --buffer)" 1)"
>> +'
>> +
>>   test_expect_success 'setup blobs which are likely to delta' '
>>   	test-tool genrandom foo 10240 >foo &&
>>   	{ cat foo && echo plus; } >foo-plus &&
>> @@ -963,5 +1143,34 @@ test_expect_success 'cat-file --batch-all-objects --batch-check ignores replace'
>>   	echo "$orig commit $orig_size" >expect &&
>>   	test_cmp expect actual
>>   '
>> +test_expect_success 'batch-command empty command' '
>> +	echo "" >cmd &&
>> +	test_expect_code 128 git cat-file --batch-command <cmd 2>err &&
>> +	grep -E "^fatal:.*empty command in input.*" err
>> +'
>
> This test and the ones below look good but they don't need to pass -E to grep as they are not using an extended regex.
>
> Best Wishes
>
> Phillip
>
> [1] https://lore.kernel.org/git/e75ba9ea-fdda-6e9f-4dd6-24190117d93b@gmail.com
>
>> +test_expect_success 'batch-command whitespace before command' '
>> +	echo " info deadbeef" >cmd &&
>> +	test_expect_code 128 git cat-file --batch-command <cmd 2>err &&
>> +	grep -E "^fatal:.*whitespace before command.*" err
>> +'
>> +
>> +test_expect_success 'batch-command unknown command' '
>> +	echo unknown_command >cmd &&
>> +	test_expect_code 128 git cat-file --batch-command <cmd 2>err &&
>> +	grep -E "^fatal:.*unknown command.*" err
>> +'
>> +
>> +test_expect_success 'batch-command flush with arguments' '
>> +	echo "flush arg" >cmd &&
>> +	test_expect_code 128 git cat-file --batch-command --buffer <cmd 2>err &&
>> +	grep -E "^fatal:.*flush takes no arguments.*" err
>> +'
>> +
>> +test_expect_success 'batch-command flush without --buffer' '
>> +	echo "flush arg" >cmd &&
>> +	test_expect_code 128 git cat-file --batch-command <cmd 2>err &&
>> +	grep -E "^fatal:.*flush is only for --buffer mode.*" err
>> +'
>>    test_done

Eric Sunshine Feb. 10, 2022, 10:46 p.m. UTC | #4

On Wed, Feb 9, 2022 at 11:01 PM John Cai via GitGitGadget
<gitgitgadget@gmail.com> wrote:
> Add a new flag --batch-command that accepts commands and arguments
> from stdin, similar to git-update-ref --stdin.

Some comments not offered by other reviewers...

> This patch adds the basic structure for adding commands, which can be
> extended in the future to add more commands. It also adds the following
> two commands (on top of the flush command):
>
> contents <object> LF
> info <object> LF
>
> The contents command takes an <object> argument and prints out the object
> contents.
>
> The info command takes an <object> argument and prints out the object
> metadata.
>
> These can be used in the following way with --buffer:
>
> info <sha1> LF
> contents <sha1> LF
> contents <sha1> LF
> info <sha1> LF
> flush
> info <sha1> LF
> flush

s/<sha1>/<object>/ for consistency with the usage information earlier
in the commit message, and since Git is migrating to SHA-256, and to
avoid reviewer confusion as occurred earlier[1].

Also: s/flush$/flush LF/

> When used without --buffer:
>
> info <sha1> LF
> contents <sha1> LF
> contents <sha1> LF
> info <sha1> LF
> info <sha1> LF

Ditto.

[1]: https://lore.kernel.org/git/CAPig+cTeqhOYTu9WBiY=LnZtt35hAp3Qa5RduC2yLut6p01_1w@mail.gmail.com/

> diff --git a/Documentation/git-cat-file.txt b/Documentation/git-cat-file.txt
> @@ -96,6 +96,30 @@ OPTIONS
> +--batch-command::
> +       Enter a command mode that reads commands and arguments from stdin.
> +       May not be combined with any other options or arguments except
> +       `--textconv` or `--filters`, in which case the input lines also need to
> +       specify the path, separated by whitespace.  See the section
> +       `BATCH OUTPUT` below for details.

The SYNOPSIS probably needs an update too.

Perhaps say something like "Recognized commands include:" here before
enumerating the commands themselves?

> +--
> +contents <object>::
> +       Print object contents for object reference <object>. This corresponds to
> +       the output of --batch.

s/<object>/`<object>`/
s/--batch/`--batch`/

> +info <object>::
> +       Print object info for object reference <object>. This corresponds to the
> +       output of --batch-check.

s/<object>/`<object>`/
s/--batch-check/`--batch-check`/

> +flush::
> +       Used in --buffer mode to execute all preceding commands that were issued
> +       since the beginning or since the last flush was issued. When --buffer
> +       is used, no output will come until flush is issued. When --buffer is not
> +       used, commands are flushed each time without issuing `flush`.
> +--

s/--buffer/`--buffer`/g
s/flush/`flush`/g

This says that it's legal to use `--buffer` along with
`--batch-command`, but the description of `--batch-command` itself
just above says that it can be combined only with `--textconv` or
`--filters`. (I see you copied the problematic text from the other
batch options, so they also are guilty of not mentioning `--buffer`.
This series doesn't necessarily need to fix those existing
documentation problems, but perhaps don't repeat the problem with
newly-added text?)

The description of the `--buffer` option probably also needs to be
updated to mention the new `--batch-command` option, and there may be
other places in this document which should mention it, as well.

> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> +static const struct parse_cmd {
> +       const char *prefix;
> +       parse_cmd_fn_t fn;
> +       unsigned takes_args;
> +} commands[] = {
> +       { "contents", parse_cmd_contents, 1},
> +       { "info", parse_cmd_info, 1},
> +};
> +
> +static void batch_objects_command(struct batch_options *opt,
> +                                   struct strbuf *output,
> +                                   struct expand_data *data)
> +{
> +       while (!strbuf_getline(&input, stdin)) {
> +               if (!input.len)
> +                       die(_("empty command in input"));
> +               if (isspace(*input.buf))
> +                       die(_("whitespace before command: '%s'"), input.buf);
> +
> +               if (skip_prefix(input.buf, "flush", &cmd_end)) {
> +                       if (!opt->buffer_output)
> +                               die(_("flush is only for --buffer mode"));
> +                       if (*cmd_end)
> +                               die(_("flush takes no arguments"));
> +
> +                       dispatch_calls(opt, output, data, queued_cmd, nr);
> +                       nr = 0;
> +                       continue;
> +               }
> +
> +               for (i = 0; i < ARRAY_SIZE(commands); i++) {
> +                       if (!skip_prefix(input.buf, commands[i].prefix, &cmd_end))
> +                               continue;

This prefix-matching is going to incorrectly match non-commands such
as "contentsify <object>" and "information <object>" and then treat
them as "contents fy <object>" and "info mation <object>",
respectively, with undesirable results. You need to verify that there
is a space or NUL at `*cmd_end` before treating `input.buf` as an
actual command.
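
A minimal sketch of such a boundary check, assuming a local stand-in
for skip_prefix() and a hypothetical match_command() helper (neither
name is taken from the patch):

```c
#include <stddef.h>
#include <string.h>

/* Local stand-in for Git's skip_prefix(); assumed to behave the same way. */
static int skip_prefix_local(const char *str, const char *prefix,
			     const char **out)
{
	size_t len = strlen(prefix);

	if (strncmp(str, prefix, len))
		return 0;
	*out = str + len;
	return 1;
}

/*
 * Hypothetical helper: returns 1 only when `line` starts with `name`
 * as a whole word, i.e. the prefix is followed by a space or NUL.
 * On success, *cmd_end points at that space or NUL.
 */
static int match_command(const char *line, const char *name,
			 const char **cmd_end)
{
	const char *end;

	if (!skip_prefix_local(line, name, &end))
		return 0;
	if (*end && *end != ' ')
		return 0;	/* e.g. "contentsify": not a command */
	*cmd_end = end;
	return 1;
}
```

With this, "contentsify <object>" and "information <object>" fall
through to the "unknown command" path instead of being dispatched.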

> +                       cmd = &commands[i];
> +                       if (cmd->takes_args)

What happens if `cmd->takes_args` is true but no arguments follow the
command? Should that be diagnosed as an error?

> +                               p = cmd_end + 1;

This unconditional +1 is going to make `p` point beyond the NUL
character if the input is just a bare command, such as "contents" or
"info" without any space or any argument...

> +                       break;
> +               }
> +
> +               if (!cmd)
> +                       die(_("unknown command: '%s'"), input.buf);
> +
> +               if (!opt->buffer_output) {
> +                       cmd->fn(opt, p, output, data);
> +                       continue;
> +               }
> +
> +               ALLOC_GROW(queued_cmd, nr + 1, alloc);
> +               call.fn = cmd->fn;
> +               call.line = xstrdup_or_null(p);

... which means that xstrdup_or_null() will be copying whatever random
garbage is in memory following the bare command.
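
One way to avoid stepping past the NUL could look like the sketch
below. command_args() is a hypothetical helper, and it assumes
`cmd_end` points just past the command name, at either a space or the
terminating NUL:

```c
#include <stddef.h>

/*
 * Hypothetical helper: given cmd_end pointing just past the command
 * name, return the argument string, or NULL when none was given.
 */
static const char *command_args(const char *cmd_end)
{
	if (*cmd_end != ' ')
		return NULL;	/* bare command: never step past the NUL */
	cmd_end++;
	/* "info " (trailing space only) carries no argument either */
	return *cmd_end ? cmd_end : NULL;
}
```

A caller could then treat a NULL result as an error whenever
`cmd->takes_args` is set, rather than queuing an invalid pointer.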

> +               queued_cmd[nr++] = call;
> +       }
> +
> +       if (opt->buffer_output && nr)
> +               dispatch_calls(opt, output, data, queued_cmd, nr);
> +
> +       free(queued_cmd);
> +       strbuf_release(&input);
> +}

John Cai Feb. 11, 2022, 5:45 p.m. UTC | #5

Hi Junio

On 10 Feb 2022, at 12:05, Junio C Hamano wrote:

> Phillip Wood <phillip.wood123@gmail.com> writes:
>
>>> +	type=$1
>>> +	sha1=$2
>>> +	size=$3
>>> +
>>> +	mkfifo input &&
>>> +	test_when_finished 'rm input' &&
>>> +	mkfifo output &&
>>> +	exec 9<>output &&
>>> +	test_when_finished 'rm output; exec 9<&-'
>>> +	(
>>> +		# TODO - Ideally we'd pipe the output of cat-file
>>> +		# through "sed s'/$/\\/'" to make sure that that read
>>> +		# would consume all the available
>>> +		# output. Unfortunately we cannot do this as we cannot
>>> +		# control when sed flushes its output. We could write
>>> +		# a test helper in C that appended a '\' to the end of
>>> +		# each line and flushes its output after every line.
>>> +		git cat-file --buffer --batch-command <input 2>err &
>>> +		echo $! &&
>>> +		wait $!
>>> +	) >&9 &
>>> +	sh_pid=$! &&
>>> +	read cat_file_pid <&9 &&
>>> +	test_when_finished "kill $cat_file_pid
>>> +			    kill $sh_pid; wait $sh_pid; :" &&
>>> +	echo "$sha1 $type $size" >expect &&
>>> +	test_write_lines "info $sha1" flush "info $sha1" >input
>>
>> This closes input and so cat-file exits and flushes its output -
>> therefore you are not testing whether flush actually flushes. When I
>> wrote this test in[1] this line was inside a subshell that was
>> redirected to the input fifo so that the read happened before cat-file
>> exited.
>
> Yeah, very good point.
>
>> This test is also not testing the exit code of cat-file or
>> that the output is flushed on exit. Is there a reason you can't just
>> use the test as I wrote it? I'm happy to explain anything that isn't
>> clear.
>
> I admit I do not offhand recall what your tests did but help with
> this (and more) level of detail with an offer to collaborate is
> something I am very happy to see.  Thanks for working well together.
>
> One thing that I wasn't quite sure about was how well failure cases are
> tested.  If we ask, in a batch mode, "info" for two objects and then
> "flush", does the asker get enough clue when to read and when to
> stop reading with all four combinations of states, i.e. asking for
> two missing objects, one good object and one bad object, one bad
> object and one good object, two good objects, for example?

This is a good point. We currently don't have tests that exercise these
combinations.

>
> Testing such combinations reliably is tricky---if the asker needs to
> react to different response differently, a test that expects good
> and then bad may not just fail but can get into deadlock, for
> example if the reaction to good response has to read a lot but the
> reaction to bad response is to just consume the "bad object" notice,
> when a bug in the program being tested makes it issue the response
> for a bad case when the asker is expecting a response for a good
> object, because the asker will keep waiting for more response to
> read which may not come.

Let me see if I understand you. What I'm hearing is that it's hard to test a git
process (A) that reads/writes from/to pipes without knowing exactly how (A) will
behave. By necessity, the test logic will have embedded some logic in it that
assumes certain behavior from (A), which might or might not be the case.

This can lead to a hanging test if, say, it is waiting around for (A) to output
data when due to a bug in the code, it never does. Did I get that right?

I still see value in having a test that hangs when it doesn't receive expected
output from the git process. If we had something that detected timeouts in tests,
it could catch such a case. But since we don't, tests like
run_buffer_test_flush() and run_buffer_test_no_flush() run the risk of
deadlocking if the code regresses in the future. While they would still show
that something is wrong, deadlocked tests can be inconvenient to debug.
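
To illustrate the bounded-wait idea only: the sketch below shows how a
C test helper might read from a pipe with a deadline, so that missing
output becomes a reported failure instead of a hang. The name
read_with_timeout() and its -2 "timed out" return value are
assumptions made for this example, not an existing helper in the test
suite.

```c
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms for fd to become
 * readable, then read at most len bytes. Returns the byte count,
 * 0 on EOF, -1 on error, and -2 when the wait timed out.
 */
static ssize_t read_with_timeout(int fd, char *buf, size_t len,
				 int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ready = poll(&pfd, 1, timeout_ms);

	if (ready < 0)
		return -1;
	if (ready == 0)
		return -2;	/* nothing arrived: report it, do not hang */
	return read(fd, buf, len);
}
```

A test built on such a helper would still follow the "A writes to B
and waits for B to respond" pattern, but the wait is capped, so a
regression shows up as a failed assertion.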

Junio C Hamano Feb. 11, 2022, 8:07 p.m. UTC | #6

John Cai <johncai86@gmail.com> writes:

> Let me see if I understand you. What I'm hearing is that it's hard to test a git
> process (A) that reads/writes from/to pipes without knowing exactly how (A) will
> behave. By necessity, the test logic will have embedded some logic in it that
> assumes certain behavior from (A), which might or might not be the case.
>
> This can lead to a hanging test if, say, it is waiting around for (A) to output
> data when due to a bug in the code, it never does. Did I get that right?

Exactly.  And we've seen tests that were designed to hang when
they detected bugs, which made us very unhappy, and we fixed them
to fail reliably instead of hanging.  Otherwise, such tests weren't
very useful in an unattended CI environment, where we do not want to
wait 3 hours for a timeout and leave later steps in the same script
untested.

Thanks.

John Cai Feb. 11, 2022, 9:30 p.m. UTC | #7

Hi Junio

On 11 Feb 2022, at 15:07, Junio C Hamano wrote:

> John Cai <johncai86@gmail.com> writes:
>
>> Let me see if I understand you. What I'm hearing is that it's hard to test a git
>> process (A) that reads/writes from/to pipes without knowing exactly how (A) will
>> behave. By necessity, the test logic will have embedded some logic in it that
>> assumes certain behavior from (A), which might or might not be the case.
>>
>> This can lead to a hanging test if, say, it is waiting around for (A) to output
>> data when due to a bug in the code, it never does. Did I get that right?
>
> Exactly.  And we've seen tests that were designed to hang when
> they detected bugs, which made us very unhappy, and we fixed them
> to fail reliably instead of hanging.  Otherwise, such tests weren't
> very useful in an unattended CI environment, where we do not want to
> wait 3 hours for a timeout and leave later steps in the same script
> untested.

That makes sense. Do you have an example of one of these tests? I'd like to see
how it was converted from a test that hung to a test that failed reliably. As
I'm thinking about converting run_buffer_test_flush() and run_buffer_test_no_flush()
into tests that fail rather than hang, I'm having a hard time avoiding the
pattern of A writes to B and waits for B to respond.

>
> Thanks.
