[v3,4/4] ref-filter: add support for %(contents:size)

Message ID	20200707174049.21714-5-chriscool@tuxfamily.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=WOg5=AS=vger.kernel.org=git-owner@kernel.org> From: Christian Couder <christian.couder@gmail.com> To: git@vger.kernel.org Cc: Junio C Hamano <gitster@pobox.com>, Jeff King <peff@peff.net>, Christian Couder <chriscool@tuxfamily.org> Subject: [PATCH v3 4/4] ref-filter: add support for %(contents:size) Date: Tue, 7 Jul 2020 19:40:49 +0200 Message-Id: <20200707174049.21714-5-chriscool@tuxfamily.org> In-Reply-To: <20200707174049.21714-1-chriscool@tuxfamily.org> References: <20200707174049.21714-1-chriscool@tuxfamily.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: git-owner@vger.kernel.org Precedence: bulk
Series	Add support for %(contents:size) in ref-filter \| expand [v3,0/4] Add support for %(contents:size) in ref-filter [v3,1/4] Documentation: clarify %(contents:XXXX) doc [v3,2/4] Documentation: clarify 'complete message' [v3,3/4] t6300: test refs pointing to tree and blob [v3,4/4] ref-filter: add support for %(contents:size)

Christian Couder July 7, 2020, 5:40 p.m. UTC

It's useful and efficient to be able to get the size of the
contents directly without having to pipe through `wc -c`.

Also the result of the following:

`git for-each-ref --format='%(contents)' refs/heads/my-branch | wc -c`

is off by one as `git for-each-ref` appends a newline character
after the contents, which can be seen by comparing its output
with the output from `git cat-file`.

As with %(contents), %(contents:size) is silently ignored, if a
ref points to something other than a commit or a tag:

```
$ git update-ref refs/mytrees/first HEAD^{tree}
$ git for-each-ref --format='%(contents)' refs/mytrees/first

$ git for-each-ref --format='%(contents:size)' refs/mytrees/first

```

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-for-each-ref.txt |  3 +++
 ref-filter.c                       |  7 ++++++-
 t/t6300-for-each-ref.sh            | 19 +++++++++++++++++++
 3 files changed, 28 insertions(+), 1 deletion(-)

Junio C Hamano July 7, 2020, 7:45 p.m. UTC | #1

Christian Couder <christian.couder@gmail.com> writes:

> It's useful and efficient to be able to get the size of the
> contents directly without having to pipe through `wc -c`.
>
> Also the result of the following:
>
> `git for-each-ref --format='%(contents)' refs/heads/my-branch | wc -c`
>
> is off by one as `git for-each-ref` appends a newline character
> after the contents, which can be seen by comparing its output
> with the output from `git cat-file`.
>
> As with %(contents), %(contents:size) is silently ignored, if a
> ref points to something other than a commit or a tag:
>
> ```
> $ git update-ref refs/mytrees/first HEAD^{tree}
> $ git for-each-ref --format='%(contents)' refs/mytrees/first
>
> $ git for-each-ref --format='%(contents:size)' refs/mytrees/first
>
> ```
>
> Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
> ---
>  Documentation/git-for-each-ref.txt |  3 +++
>  ref-filter.c                       |  7 ++++++-
>  t/t6300-for-each-ref.sh            | 19 +++++++++++++++++++
>  3 files changed, 28 insertions(+), 1 deletion(-)

Nice.  The only questionable thing here is if we later regret for
assuming that all sizes are always measured in bytes.  If we later
find an application that wants an efficient access to "| wc -l"
(instead of "| wc -c" that motivated this patch), we'd want to be
able to say "%(contents:lines)" and at that point we may want to go
back in time and call this "%(contents:bytes)" or something.

>  test_atom head contents 'Initial
>  '
> +test_atom head contents:size '8'

These two are tied together (any change to the test script that
causes us to update the former also forces us to update the latter),
but I do not think of a way to unify the test without writing too
much boilerplate code, so let's say this is good enough at least for
now (but I may change my opinion as I read along).

>  test_atom tag contents 'Tagging at 1151968727
>  '
> +test_atom tag contents:size '22'

Likewise.

> @@ -580,6 +582,7 @@ test_atom refs/tags/subject-body contents 'the subject line
>  first body line
>  second body line
>  '
> +test_atom refs/tags/subject-body contents:size '51'

Likewise.

Of course, we _could_ update the test_atom to do something magic
only when the 'contents' atom is being asked.  We notice that $2 is
'contents', do the usual test_expect_success for 'contents', and
then measure the byte length of $3 ourselves and test
'contents:size'.  That way, all the above test updates would become
unnecessary (and the last two hunks of this patch can also go).

That approach may even allow you to hide the details of sanitize-pgp
in the updated test_atom so that the actual tests may not have to get
updated even for signed tags.

> +# We cannot use test_atom to check contents:size of signed tags due to sanitize_pgp
> +test_tag_contents_size_pgp () {
> +	ref="$1"
> +	test_expect_success $PREREQ "basic atom: $ref contents:size" "
> +		git cat-file tag $ref | tail -n +6 | wc -c >expected &&
> +		git for-each-ref --format='%(contents:size)' $ref >actual &&
> +		test_cmp expected actual
> +	"
> +}
> +
>  PREREQ=GPG
>  test_atom refs/tags/signed-empty subject ''
>  test_atom refs/tags/signed-empty contents:subject ''
> @@ -629,6 +643,7 @@ test_atom refs/tags/signed-empty body "$sig"
>  test_atom refs/tags/signed-empty contents:body ''
>  test_atom refs/tags/signed-empty contents:signature "$sig"
>  test_atom refs/tags/signed-empty contents "$sig"
> +test_tag_contents_size_pgp refs/tags/signed-empty
>  
>  test_atom refs/tags/signed-short subject 'subject line'
>  test_atom refs/tags/signed-short contents:subject 'subject line'
> @@ -637,6 +652,7 @@ test_atom refs/tags/signed-short contents:body ''
>  test_atom refs/tags/signed-short contents:signature "$sig"
>  test_atom refs/tags/signed-short contents "subject line
>  $sig"
> +test_tag_contents_size_pgp refs/tags/signed-short
>  
>  test_atom refs/tags/signed-long subject 'subject line'
>  test_atom refs/tags/signed-long contents:subject 'subject line'
> @@ -649,6 +665,7 @@ test_atom refs/tags/signed-long contents "subject line
>  
>  body contents
>  $sig"
> +test_tag_contents_size_pgp refs/tags/signed-long
>  
>  test_expect_success 'set up refs pointing to tree and blob' '
>  	git update-ref refs/mytrees/first refs/heads/master^{tree} &&
> @@ -664,6 +681,7 @@ test_atom refs/mytrees/first body ""
>  test_atom refs/mytrees/first contents:body ""
>  test_atom refs/mytrees/first contents:signature ""
>  test_atom refs/mytrees/first contents ""
> +test_atom refs/mytrees/first contents:size ""
>  
>  test_atom refs/myblobs/first subject ""
>  test_atom refs/myblobs/first contents:subject ""
> @@ -671,6 +689,7 @@ test_atom refs/myblobs/first body ""
>  test_atom refs/myblobs/first contents:body ""
>  test_atom refs/myblobs/first contents:signature ""
>  test_atom refs/myblobs/first contents ""
> +test_atom refs/myblobs/first contents:size ""
>  
>  test_expect_success 'set up multiple-sort tags' '
>  	for when in 100000 200000

Junio C Hamano July 7, 2020, 10:21 p.m. UTC | #2

Christian Couder <christian.couder@gmail.com> writes:

> +		else if (atom->u.contents.option == C_LENGTH)
> +			v->s = xstrfmt("%ld", strlen(subpos));

Please squash something like this in.  32-bit builds are failing.

 ref-filter.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ref-filter.c b/ref-filter.c
index 8ec28f0535..73d8bfa86d 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1257,7 +1257,7 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf)
 		else if (atom->u.contents.option == C_BODY_DEP)
 			v->s = xmemdupz(bodypos, bodylen);
 		else if (atom->u.contents.option == C_LENGTH)
-			v->s = xstrfmt("%ld", strlen(subpos));
+			v->s = xstrfmt("%"PRIuMAX, (uintmax_t)strlen(subpos));
 		else if (atom->u.contents.option == C_BODY)
 			v->s = xmemdupz(bodypos, nonsiglen);
 		else if (atom->u.contents.option == C_SIG)

Junio C Hamano July 8, 2020, 11:05 p.m. UTC | #3

Christian Couder <christian.couder@gmail.com> writes:

> +# We cannot use test_atom to check contents:size of signed tags due to sanitize_pgp
> +test_tag_contents_size_pgp () {
> +	ref="$1"
> +	test_expect_success $PREREQ "basic atom: $ref contents:size" "
> +		git cat-file tag $ref | tail -n +6 | wc -c >expected &&
> +		git for-each-ref --format='%(contents:size)' $ref >actual &&
> +		test_cmp expected actual
> +	"
> +}

Of course, this will BREAK the tests on macOS and possibly others
with "wc" that emits leading whitespaces before the number.

The tip of 'seen' has been failing ever since this topic was merged
there e.g. https://travis-ci.org/github/git/git/jobs/705986794

Thanks.

Junio C Hamano July 9, 2020, 12:14 a.m. UTC | #4

Junio C Hamano <gitster@pobox.com> writes:

> Of course, we _could_ update the test_atom to do something magic
> only when the 'contents' atom is being asked.  We notice that $2 is
> 'contents', do the usual test_expect_success for 'contents', and
> then measure the byte length of $3 ourselves and test
> 'contents:size'.  That way, all the above test updates would become
> unnecessary (and the last two hunks of this patch can also go).
>
> That approach may even allow you to hide the details of sanitize-pgp
> in the updated test_atom so that the actual tests may not have to get
> updated even for signed tags.

After seeing the "wc -c" portability issues, I am now even more
inclined to say that the above is the right direction.  The
portability worries can and should be encapsulated in a single
test_atom helper function, just as it can be used to hide the
differences between signed tags, annotated tags and commits.

Thanks.

>> +# We cannot use test_atom to check contents:size of signed tags due to sanitize_pgp
>> +test_tag_contents_size_pgp () {
>> +	ref="$1"
>> +	test_expect_success $PREREQ "basic atom: $ref contents:size" "
>> +		git cat-file tag $ref | tail -n +6 | wc -c >expected &&
>> +		git for-each-ref --format='%(contents:size)' $ref >actual &&
>> +		test_cmp expected actual
>> +	"
>> +}

Christian Couder July 9, 2020, 8:10 a.m. UTC | #5

On Thu, Jul 9, 2020 at 2:14 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Junio C Hamano <gitster@pobox.com> writes:
>
> > Of course, we _could_ update the test_atom to do something magic
> > only when the 'contents' atom is being asked.  We notice that $2 is
> > 'contents', do the usual test_expect_success for 'contents', and
> > then measure the byte length of $3 ourselves and test
> > 'contents:size'.  That way, all the above test updates would become
> > unnecessary (and the last two hunks of this patch can also go).
> >
> > That approach may even allow you to hide the details of sanitize-pgp
> > in the updated test_atom so that the actual tests may not have to get
> > updated even for signed tags.
>
> After seeing the "wc -c" portability issues, I am now even more
> inclined to say that the above is the right direction.  The
> portability worries can and should be encapsulated in a single
> test_atom helper function, just as it can be used to hide the
> differences between signed tags, annotated tags and commits.

Yeah, I have been working on that and will send a new patch series soon.
The current test_atom() change looks like this:

diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh
index 371e45e5ad..e514d98574 100755
--- a/t/t6300-for-each-ref.sh
+++ b/t/t6300-for-each-ref.sh
@@ -52,6 +52,26 @@ test_atom() {
                sanitize_pgp <actual >actual.clean &&
                test_cmp expected actual.clean
        "
+       # Automatically test "contents:size" atom after testing "contents"
+       if test "$2" = "contents"
+       then
+               case "$1" in
+               refs/tags/signed-*)
+                       # We cannot use $3 as it expects sanitize_pgp to run
+                       git cat-file tag $ref | tail -n +6 | \
+                               wc -c | sed -e 's/^ *//' >expected ;;
+               refs/mytrees/*)
+                       echo >expected ;;
+               refs/myblobs/*)
+                       echo >expected ;;
+               *)
+                       printf '%s' "$3" | wc -c | sed -e 's/^ *//' >expected ;;
+               esac
+               test_expect_${4:-success} $PREREQ "basic atom: $1 $2:size" "
+                       git for-each-ref --format='%($2:size)' $ref >actual &&
+                       test_cmp expected actual
+               "
+       fi
 }

I am wondering if it's worth adding a preparatory patch to introduce
an helper function like the following in test-lib-functions.sh:

+# test_byte_count outputs the number of bytes in files or stdin
+#
+# It is like wc -c but without portability issues, as on macOS and
+# possibly other platforms leading whitespaces are emitted before the
+# number.
+
+test_byte_count () {
+       wc -c "$@" | sed -e 's/^ *//'
+}

Not sure about the name of this helper function as it works
differently than test_line_count().

Junio C Hamano July 9, 2020, 1:47 p.m. UTC | #6

Christian Couder <christian.couder@gmail.com> writes:

> Yeah, I have been working on that and will send a new patch series soon.
> The current test_atom() change looks like this:
>
> diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh
> index 371e45e5ad..e514d98574 100755
> --- a/t/t6300-for-each-ref.sh
> +++ b/t/t6300-for-each-ref.sh
> @@ -52,6 +52,26 @@ test_atom() {
>                 sanitize_pgp <actual >actual.clean &&
>                 test_cmp expected actual.clean
>         "
> +       # Automatically test "contents:size" atom after testing "contents"
> +       if test "$2" = "contents"
> +       then
> +               case "$1" in
> +               refs/tags/signed-*)
> +                       # We cannot use $3 as it expects sanitize_pgp to run
> +                       git cat-file tag $ref | tail -n +6 | \
> +                               wc -c | sed -e 's/^ *//' >expected ;;
> +               refs/mytrees/*)
> +                       echo >expected ;;
> +               refs/myblobs/*)
> +                       echo >expected ;;
> +               *)
> +                       printf '%s' "$3" | wc -c | sed -e 's/^ *//' >expected ;;
> +               esac
> +               test_expect_${4:-success} $PREREQ "basic atom: $1 $2:size" "
> +                       git for-each-ref --format='%($2:size)' $ref >actual &&
> +                       test_cmp expected actual
> +               "
> +       fi
>  }
>
> I am wondering if it's worth adding a preparatory patch to introduce
> an helper function like the following in test-lib-functions.sh:
>
> +# test_byte_count outputs the number of bytes in files or stdin
> +#
> +# It is like wc -c but without portability issues, as on macOS and
> +# possibly other platforms leading whitespaces are emitted before the
> +# number.
> +
> +test_byte_count () {
> +       wc -c "$@" | sed -e 's/^ *//'
> +}
>
> Not sure about the name of this helper function as it works
> differently than test_line_count().

Yeah, if I were writing it, I'd call it "sane_wc_c" or something.

But more importantly, I think the invocation of "|sed" is overkill.
If I were writing it, I would go more like...

	if test $2 = contents
	then
		case "$1" in 
		...)
			expect=$(git cat-file ... | wc -c)
			;;
		refs/mytrees/* | refs/myblobs/*)
			expect=0
			;;
		*)
			expect=$(printf ... | wc -c)
			;;
		esac

		# leave $expect unquoted to lose possible leading whitespaces
	        echo $expect >expect
		test_expect_success "..." '
			...
			test_cmp expect actual
		'
	fi

[v3,4/4] ref-filter: add support for %(contents:size)

Commit Message

Comments

Patch