t6300: values containing ')' are broken in ref formats

Message ID	20241105190235.13502-1-five231003@gmail.com (mailing list archive)
State	New
Headers	show Received: from mail-pl1-f169.google.com (mail-pl1-f169.google.com [209.85.214.169]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6EF2A1E8855 for <git@vger.kernel.org>; Tue, 5 Nov 2024 19:02:53 +0000 (UTC) From: Kousik Sanagavarapu <five231003@gmail.com> To: git@vger.kernel.org Cc: Christian Couder <christian.couder@gmail.com>, Kousik Sanagavarapu <five231003@gmail.com> Subject: [PATCH] t6300: values containing ')' are broken in ref formats Date: Wed, 6 Nov 2024 00:11:34 +0530 Message-ID: <20241105190235.13502-1-five231003@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	t6300: values containing ')' are broken in ref formats \| expand t6300: values containing ')' are broken in ref formats

Kousik Sanagavarapu Nov. 5, 2024, 6:41 p.m. UTC

Document that values containing ')' in formats are not parsed correctly
in ref-filter.

The problem here is that ref-filter, while parsing the format string,
looks for the first occurence of ')' and marks it to be the end of _that_
particular atom - which is obviously wrong in cases where the format is
of type

	atom:key=value

where "value" has ')' somewhere in it.

However formats having a '(' instead in "value" will parse correctly
because in a general format string we also mark start of the format by
making note of '%(' instead of just '('.

For example, in

	%(if:equals=somere)f)%(refname:short)...

the string that ref-filter should compare against is "somere)f", although
since the parsing behavior in these cases is broken, we instead compare
against "somere".

While in

	%(if:equals=somere(f)%(refname:short)...

ref-filter rightly compares against "somere(f" as expected.

As a side note it should be mentioned that values containing ')' are
legit in %(refname) (and other atoms like %(upstream)).  Meaning
the parser wouldn't err out such values as they are legal, which is also
confirmed by - say the refname coming from

	$ git branch '1)branch'

or a remote name coming from

	$ git remote add 'up)stream' 'some.url'
	$ git push --set-upstream 'up)stream' '1)branch'

since none of these fail.

Signed-off-by: Kousik Sanagavarapu <five231003@gmail.com>
---

This raises the question of what can be done to parse ')' in values of
the format correctly.  It seems to me like a clean solution would
involve a huge refactoring involving a large portion of ref-filter but I
maybe wrong.

So this patch also hopes to open up discussion on not only solving this
bug but also how in general ref-filter currently parses and formats
atoms and if it is the way in which we would like to do things in the
future, which would in turn also be helpful in the long term goal of
merging both pretty and ref-filter.

Here is also a simple script to demonstrate the difference between '('
and ')' in values in formats - as described in the commit msg -

#!/bin/sh

rm -rf /tmp/atom-test-dir &&

# create env
git init /tmp/atom-test-dir 1>/tmp/init-dump 2>>/tmp/init-dump &&
cd /tmp/atom-test-dir &&
echo "smtg" >file &&
git add file &&
git commit -s -m "initial revision" >/tmp/commit-dump &&

# using "(" in refname works good
echo "test with refname as \"bran(ch\"" &&
git branch "bran(ch" &&
printf "bran(ch\n\n" >expect &&
git for-each-ref --format="%(if:equals=bran(ch)%(refname:short)%(then)%(refname:short)%(end)" refs/heads/ >actual &&
if ! diff -u expect actual; then
	echo "		actual is different from expect"
else
	echo "		actual is the same as expect"
fi

echo "" &&

# using ")" in refname will parse wrong in ref-filter code
echo "test with refname as \"bran()ch\"" &&
git branch "bran()ch" &&
printf "bran()ch\n\n\n" >expect &&
git for-each-ref --format="%(if:equals=bran()ch)%(refname:short)%(then)%(refname:short)%(end)" refs/heads/ >actual &&
if ! diff -u expect actual; then
	echo "		actual is different from expect"
else
	echo "		actual is the same as expect"
fi

 t/t6300-for-each-ref.sh | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

Junio C Hamano Nov. 6, 2024, 1:18 a.m. UTC | #1

Kousik Sanagavarapu <five231003@gmail.com> writes:

> Document that values containing ')' in formats are not parsed correctly
> in ref-filter.

The problem is probably lack of a way to quote such a closing
parenthesis.

> However formats having a '(' instead in "value" will parse correctly
> because in a general format string we also mark start of the format by
> making note of '%(' instead of just '('.

So if you wanted to have a two-char sequence '%(' in value, you'd
see a similar problem?  If so, it is not quite a "bug" or "not
parsed correctly"---it is "because there is no way to include
closing ')' in the value (e.g., by quoting), you cannot write such a
string in the value part".

> This raises the question of what can be done to parse ')' in values of
> the format correctly.  It seems to me like a clean solution would
> involve a huge refactoring involving a large portion of ref-filter but I
> maybe wrong.

Yes, so I wouldn't even call the current behaviour "bug".  The
language is merely "limited" and the user cannot express certain
values with it at all.

Having said that, I just tried this

    $ git for-each-ref --format='%28%(refname)%29' refs/heads/master
    (refs/heads/master)

So, if there is anything that needs "fixing", wouldn't it be
documentation?

If I knew (or easily find out from "git for-each-ref --help") that
hex escapes %XX can be used, I wouldn't have written any of what I
said before "Having said that" in this response.

Jeff King Nov. 6, 2024, 2:25 a.m. UTC | #2

On Tue, Nov 05, 2024 at 05:18:10PM -0800, Junio C Hamano wrote:

> > This raises the question of what can be done to parse ')' in values of
> > the format correctly.  It seems to me like a clean solution would
> > involve a huge refactoring involving a large portion of ref-filter but I
> > maybe wrong.
> 
> Yes, so I wouldn't even call the current behaviour "bug".  The
> language is merely "limited" and the user cannot express certain
> values with it at all.

Agreed. I think we may have discussed this quoting problem before, but
it's not usually a big deal because the set of likely values is quite
limited. Most of them are just keywords or numeric values. I _think_
that the equals/notequals parameters of %(if) are the only ones.

Which isn't to say we shouldn't make things better if we can. Just that
I am not too surprised nobody has run into it before.

> Having said that, I just tried this
> 
>     $ git for-each-ref --format='%28%(refname)%29' refs/heads/master
>     (refs/heads/master)
> 
> So, if there is anything that needs "fixing", wouldn't it be
> documentation?
> 
> If I knew (or easily find out from "git for-each-ref --help") that
> hex escapes %XX can be used, I wouldn't have written any of what I
> said before "Having said that" in this response.

I tried something similar, but I don't think it quite works for the case
in question. Within %(if:equals=<foo>) we do not further expand the
<foo> value (at least from my limited tests). And so something like:

  git for-each-ref --format='%(if:equals=ref-with-%29)%(refname:short)...etc'

would never match "ref-with-(", but only a literal "ref-with-%29".

I am tempted to say the solution is to expand that "equals" value, and
possibly add some less-arcane version of the character (maybe "%)"?).
But it be a break in backwards compatibility if somebody is trying to
match literal %-chars in their "if" block.

Another option: in the rest of the "if" design we tried to keep
arbitrary text outside of the parentheses. So you could imagine a syntax
like:

  %(if:equals)ref-with-)%(foo)%(refname:short)%(then)...%(end)

where %(foo) is some placeholder that separate the two arguments to the
"equals". In sane languages that is a space or a comma, but I'm not sure
that works here. We have %(end) which would otherwise be a syntax error
here, but it feels word. I dunno. The whole language is kind of
hideously verbose. I feel sorry for anybody trying to write non-trivial
formats. :)

-Peff

Kousik Sanagavarapu Nov. 6, 2024, 2:40 a.m. UTC | #3

On Tue, Nov 05, 2024 at 05:18:10PM -0800, Junio C Hamano wrote:
> Kousik Sanagavarapu <five231003@gmail.com> writes:
> 
> > Document that values containing ')' in formats are not parsed correctly
> > in ref-filter.
> 
> The problem is probably lack of a way to quote such a closing
> parenthesis.

Yes correct.  We currently don't have a way to quote such strings in
"value" of "%(atom:someparam=value)"

> > However formats having a '(' instead in "value" will parse correctly
> > because in a general format string we also mark start of the format by
> > making note of '%(' instead of just '('.
> 
> So if you wanted to have a two-char sequence '%(' in value, you'd
> see a similar problem?  If so, it is not quite a "bug" or "not
> parsed correctly"---it is "because there is no way to include
> closing ')' in the value (e.g., by quoting), you cannot write such a
> string in the value part".
> 
> > This raises the question of what can be done to parse ')' in values of
> > the format correctly.  It seems to me like a clean solution would
> > involve a huge refactoring involving a large portion of ref-filter but I
> > maybe wrong.
> 
> Yes, so I wouldn't even call the current behaviour "bug".  The
> language is merely "limited" and the user cannot express certain
> values with it at all.
> 
> 
> Having said that, I just tried this
> 
>     $ git for-each-ref --format='%28%(refname)%29' refs/heads/master
>     (refs/heads/master)
> 
> So, if there is anything that needs "fixing", wouldn't it be
> documentation?
> 
> If I knew (or easily find out from "git for-each-ref --help") that
> hex escapes %XX can be used, I wouldn't have written any of what I
> said before "Having said that" in this response.

Hmm, but hex escapes do work as intended and the problem here is not the
')' outside the atom but within it.  To be more clear, let's take

	$ git for-each-ref --format="%(if:equals=refs/heads/step-1)start)%(refname)%(then)%(objectname:short)%(end)" refs/heads/

(Sorry for not wrapping the line above x<)

First let's notice the difference between what these two commands are
trying to do.  Your command asks to print all the refs matching
"refs/heads/master" in the format of (%(refname)) and since we support
escaping literals in the form of %xx, where xx is the hexcode of the
literal to be escaped during the parsing of the format, this would
obviously work as intended.

Now let's come to my command.  My command asks to print all the
abbreviated commit ids of the refs which compare equal to
"refs/heads/step-1)start" from all of "refs/heads/".  Now here, since
ref-filter parses the format string by making note of '%(' and ')', it
accidentally thinks that I want to compare equality with
"refs/heads/step-1" instead of "refs/heads/step-1)start", which I
actually wanted.  If my local repo contained both the "refs/heads/step1"
and "refs/heads/step1)start", wouldn't this be a bug?

So I do agree that it is a lack of quoting when entering the "value"
part of "%(atom:someparam=value)", but another part of me also thinks
that ref-filter should be intelligent enough while parsing the format
string to acknowledge where exactly the atom ends and which is the last
closing ')' and hence follows whatever I wrote below the "---" line,
till the script.

Also, I'm thinking the commit msg was not clear as it lead you (perhaps
someone else too when they visit this topic) to think about escape
literals while that not exactly is the problem I'm trying to get at.

Also if you think a change to the documentation would be more proper
than reflecting this with a test breakage, I'll do that.  My intention
with the test was that - in the future if we parse the "value" correctly
then that commit would also include a

s/test_expect_failure/test_expect_success

change.

Thanks!

Junio C Hamano Nov. 6, 2024, 3:05 a.m. UTC | #4

Jeff King <peff@peff.net> writes:

> I am tempted to say the solution is to expand that "equals" value, and
> possibly add some less-arcane version of the character (maybe "%)"?).
> But it be a break in backwards compatibility if somebody is trying to
> match literal %-chars in their "if" block.

If they were trying to write a literal %, wouldn't they be writing
%% already, not because % followed by a byte without any special
meaning happens to be passed intact by the implementation, but
because that is _the_ right thing to do, when % is used as an
introducer for escape sequences?  So I do agree it would be a change
that breaks backward compatibility but I do not think we want to
stay bug to bug compatible with the current behaviour here.  I am
not sure with the wisdom of %) though.  Wouldn't "%(foo %)" look as
if %( opens and %) closes a group in our language?

So I am very much in favor of this "if condition should be expanded
before comparison" solution.

Kousik Sanagavarapu Nov. 6, 2024, 3:54 a.m. UTC | #5

On Tue, Nov 05, 2024 at 07:05:13PM -0800, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
> 
> > I am tempted to say the solution is to expand that "equals" value, and
> > possibly add some less-arcane version of the character (maybe "%)"?).
> > But it be a break in backwards compatibility if somebody is trying to
> > match literal %-chars in their "if" block.
> 
> If they were trying to write a literal %, wouldn't they be writing
> %% already, not because % followed by a byte without any special
> meaning happens to be passed intact by the implementation, but
> because that is _the_ right thing to do, when % is used as an
> introducer for escape sequences?  So I do agree it would be a change
> that breaks backward compatibility but I do not think we want to
> stay bug to bug compatible with the current behaviour here.  I am
> not sure with the wisdom of %) though.  Wouldn't "%(foo %)" look as
> if %( opens and %) closes a group in our language?
> 
> So I am very much in favor of this "if condition should be expanded
> before comparison" solution.

I had worked on this "if condition should be expanded before comparison"
solution but Christian and I agreed that it would be better to open up
discussion incase there would be other possible and better solutions
which would also be viable in the long term.

This solution relies on how we parse out literals in ref-filter, so '%%'
would work as intended in the comparision string - ie if we want to
compare against "some%ref", we would do "%(if:equals=some%%ref)...".

Also it is obscure that someone will use this in practice as I myself
had discovered this when working on a corner case of some other
implementation related to parsing but way lower in the call-chain of
ref-filter ;) but here goes

(this applies on top of the current patch)

------------------------ >8 ------------------------
Subject: [PATCH] ref-filter: parse parentheses correctly in %(if) atoms

Having a ')' in "<string>" in ":equals=<string>" or
":notequals=<string>" wouldn't parse correctly in ref-filter as
documented in the previous commit.

One way to fix this is refactoring the way in which we parse our format
string.  Although this would mean we would have to do a huge refactoring
as this step happens very high up in the call chain.

Therefore, support including parenthesis characters in "<string>" by
instead giving their hexcode equivalents - as a for-now hack.

Do this by further abstracting "append_literal()" to "parse_literal()"
where the output is no longer stored into a ref formatting stack's
strbuf but a standard standalone strbuf.  append_literal would then
hence wrap appropriately around "parse_literal()".

Also introduce "convert_hexcode()" which also wraps around
"parse_literal()" and must be used to convert hexcode in a given string
and be silent when such a string doesn't contain hexcode or doesn't
exist (ie is NULL).

Using "convert_hexcode()" would mean that we now have an alloced string -
hence free() it once we are done with it to prevent any memory leaks.

Signed-off-by: Kousik Sanagavarapu <five231003@gmail.com>
---
 Documentation/git-for-each-ref.txt |  4 ++
 ref-filter.c                       | 76 +++++++++++++++++++++++-------
 t/t6300-for-each-ref.sh            |  6 ++-
 3 files changed, 66 insertions(+), 20 deletions(-)

diff --git a/Documentation/git-for-each-ref.txt b/Documentation/git-for-each-ref.txt
index d3764401a2..ce12400040 100644
--- a/Documentation/git-for-each-ref.txt
+++ b/Documentation/git-for-each-ref.txt
@@ -221,6 +221,10 @@ if::
 	the value between the %(if:...) and %(then) atoms with the
 	given string.
 
+	Additionally, if `<string>` must contain parenthesis, then these
+	parentheses are spelled out as hexcode.  For e.g., `1)someref`
+	would need to be `1%29someref`.
+
 symref::
 	The ref which the given symbolic ref refers to. If not a
 	symbolic ref, nothing is printed. Respects the `:short`,
diff --git a/ref-filter.c b/ref-filter.c
index 84c6036107..ebdb2daeb7 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1234,13 +1234,64 @@ static void if_then_else_handler(struct ref_formatting_stack **stack)
 	*stack = cur;
 }
 
+/*
+ * Parse out literals in the string pointed to by "cp" and store them in
+ * a strbuf - this would go on until we hit NUL or "ep".
+ *
+ * While at it, if they're of the form "%xx", where xx represents the
+ * hexcode of some character, then convert them into their equivalent
+ * characters.
+ */
+static void parse_literal(const char *cp, const char *ep,
+			  struct strbuf *s)
+{
+	while (*cp && (!ep || cp < ep)) {
+		if (*cp == '%') {
+			if (cp[1] == '%')
+				cp++;
+			else {
+				int ch = hex2chr(cp + 1);
+				if (0 <= ch) {
+					strbuf_addch(s, ch);
+					cp += 3;
+					continue;
+				}
+			}
+		}
+		strbuf_addch(s, *cp);
+		cp++;
+	}
+}
+
+/*
+ * Convert the string, pointed to by "cp", which might or might not
+ * contain hexcode (in the format of "%xx" where xx is the hexcode) to
+ * its character-equivalent string and return it.
+ *
+ * If the string does not contain any hexcode - then it is returned as
+ * is.
+ */
+static const char *convert_hexcode(const char *cp)
+{
+	struct strbuf s = STRBUF_INIT;
+
+	if (!cp)
+		return NULL;
+	/*
+	 * This has the effect of an in-place translation but
+	 * implementation-wise it is not.
+	 */
+	parse_literal(cp, NULL, &s);
+	return strbuf_detach(&s, NULL);
+}
+
 static int if_atom_handler(struct atom_value *atomv, struct ref_formatting_state *state,
 			   struct strbuf *err UNUSED)
 {
 	struct ref_formatting_stack *new_stack;
 	struct if_then_else *if_then_else = xcalloc(1, sizeof(*if_then_else));
 
-	if_then_else->str = atomv->atom->u.if_then_else.str;
+	if_then_else->str = convert_hexcode(atomv->atom->u.if_then_else.str);
 	if_then_else->cmp_status = atomv->atom->u.if_then_else.cmp_status;
 
 	push_stack_element(&state->stack);
@@ -1296,6 +1347,9 @@ static int then_atom_handler(struct atom_value *atomv UNUSED,
 			if_then_else->condition_satisfied = 1;
 	} else if (cur->output.len && !is_empty(&cur->output))
 		if_then_else->condition_satisfied = 1;
+
+	if (if_then_else->str)
+		free((char *)if_then_else->str);
 	strbuf_reset(&cur->output);
 	return 0;
 }
@@ -3425,26 +3479,12 @@ void ref_array_sort(struct ref_sorting *sorting, struct ref_array *array)
 		QSORT_S(array->items, array->nr, compare_refs, sorting);
 }
 
-static void append_literal(const char *cp, const char *ep, struct ref_formatting_state *state)
+static void append_literal(const char *cp, const char *ep,
+			   struct ref_formatting_state *state)
 {
 	struct strbuf *s = &state->stack->output;
 
-	while (*cp && (!ep || cp < ep)) {
-		if (*cp == '%') {
-			if (cp[1] == '%')
-				cp++;
-			else {
-				int ch = hex2chr(cp + 1);
-				if (0 <= ch) {
-					strbuf_addch(s, ch);
-					cp += 3;
-					continue;
-				}
-			}
-		}
-		strbuf_addch(s, *cp);
-		cp++;
-	}
+	parse_literal(cp, ep, s);
 }
 
 int format_ref_array_item(struct ref_array_item *info,
diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh
index ce5c607193..c9383d23a4 100755
--- a/t/t6300-for-each-ref.sh
+++ b/t/t6300-for-each-ref.sh
@@ -2141,7 +2141,7 @@ test_expect_success GPG 'show lack of signature with custom format' '
 	test_cmp expect actual
 '
 
-test_expect_failure 'format having values containing ) parse correctly' '
+test_expect_success 'format having values containing ) parse correctly' '
 	git branch "1)feat" &&
 	cat >expect <<-\EOF &&
 	refs/heads/1)feat
@@ -2151,7 +2151,9 @@ test_expect_failure 'format having values containing ) parse correctly' '
 	not equals
 	not equals
 	EOF
-	git for-each-ref --format="%(if:equals=1)feat)%(refname:short)%(then)%(refname)%(else)not equals%(end)" \
+
+	# 29 is the hexcode of )
+	git for-each-ref --format="%(if:equals=1%29feat)%(refname:short)%(then)%(refname)%(else)not equals%(end)" \
 		refs/heads/ >actual &&
 	test_cmp expect actual
 '

Jeff King Nov. 6, 2024, 6:51 p.m. UTC | #6

On Tue, Nov 05, 2024 at 07:05:13PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > I am tempted to say the solution is to expand that "equals" value, and
> > possibly add some less-arcane version of the character (maybe "%)"?).
> > But it be a break in backwards compatibility if somebody is trying to
> > match literal %-chars in their "if" block.
> 
> If they were trying to write a literal %, wouldn't they be writing
> %% already, not because % followed by a byte without any special
> meaning happens to be passed intact by the implementation, but
> because that is _the_ right thing to do, when % is used as an
> introducer for escape sequences?  So I do agree it would be a change
> that breaks backward compatibility but I do not think we want to
> stay bug to bug compatible with the current behaviour here.

I think "because that is the right thing to do" is what is in question.
It is not like we happen to allow "%", but you should be writing "%%" in
an if:equals value already. They mean two different things, and anybody
who is doing:

  %(if:equals=%%foo)

to match the literal "%%foo" will be broken if we change that. They are
not doing anything wrong; that is the only way to make it work now.

I wouldn't go so far as to call the current behavior a bug. It's
just...not very flexible. I also think it is unlikely that anybody would
care in practice (though I find matching refs with ")" in them already a
bit far-fetched).

If we wanted to be extra careful, we could introduce a variant of
"equals" that indicates that it will be expanded before comparison.  Or
even an extra tag, like:

  %(if:expand:equals=%%foo)

> I am not sure with the wisdom of %) though.  Wouldn't "%(foo %)" look
> as if %( opens and %) closes a group in our language?

Yeah, I agree it is ugly and possibly confusing. Normally I'd suggest
"\" for escaping, but it isn't otherwise syntactically important within
these formats (I don't think, anyway). The magic character is "%" so
that is what we have to work with.

-Peff

Jeff King Nov. 6, 2024, 6:55 p.m. UTC | #7

On Wed, Nov 06, 2024 at 09:24:08AM +0530, Kousik Sanagavarapu wrote:

> One way to fix this is refactoring the way in which we parse our format
> string.  Although this would mean we would have to do a huge refactoring
> as this step happens very high up in the call chain.
> 
> Therefore, support including parenthesis characters in "<string>" by
> instead giving their hexcode equivalents - as a for-now hack.

So if I understand this is just expanding %<hex> and nothing else? That
seems like the worst of both worlds. Now "%" is magic in these value
strings, breaking compatibility, but we didn't buy ourselves the
flexibility to do arbitrary comparisons like:

  %(if:equals=%(upstream:lstrip=3))%(refname:short)%(then)...

-Peff

Kousik Sanagavarapu Nov. 7, 2024, 2:29 a.m. UTC | #8

On Wed, Nov 06, 2024 at 01:51:02PM -0500, Jeff King wrote:
> On Tue, Nov 05, 2024 at 07:05:13PM -0800, Junio C Hamano wrote:
> 
> > Jeff King <peff@peff.net> writes:
> > 
> > > I am tempted to say the solution is to expand that "equals" value, and
> > > possibly add some less-arcane version of the character (maybe "%)"?).
> > > But it be a break in backwards compatibility if somebody is trying to
> > > match literal %-chars in their "if" block.
> > 
> > If they were trying to write a literal %, wouldn't they be writing
> > %% already, not because % followed by a byte without any special
> > meaning happens to be passed intact by the implementation, but
> > because that is _the_ right thing to do, when % is used as an
> > introducer for escape sequences?  So I do agree it would be a change
> > that breaks backward compatibility but I do not think we want to
> > stay bug to bug compatible with the current behaviour here.
> 
> I think "because that is the right thing to do" is what is in question.
> It is not like we happen to allow "%", but you should be writing "%%" in
> an if:equals value already. They mean two different things, and anybody
> who is doing:
> 
>   %(if:equals=%%foo)
> 
> to match the literal "%%foo" will be broken if we change that. They are
> not doing anything wrong; that is the only way to make it work now.

True.

> I wouldn't go so far as to call the current behavior a bug. It's
> just...not very flexible. I also think it is unlikely that anybody would
> care in practice (though I find matching refs with ")" in them already a
> bit far-fetched).

Yeah.  I really don't think anyone in practice will hit upon this case.
As I mentioned already before, I was just trying to pick out a corner
case for another implementation in ref-filter and stumbled upon this.

> If we wanted to be extra careful, we could introduce a variant of
> "equals" that indicates that it will be expanded before comparison.  Or
> even an extra tag, like:
> 
>   %(if:expand:equals=%%foo)

This seems like a nice idea, if we are thinking about not breaking
backwards compatibility but then there is also this discussion about the
formats being too verbose but I dunno.

Kousik Sanagavarapu Nov. 7, 2024, 2:34 a.m. UTC | #9

On Wed, Nov 06, 2024 at 01:55:11PM -0500, Jeff King wrote:
> On Wed, Nov 06, 2024 at 09:24:08AM +0530, Kousik Sanagavarapu wrote:
> 
> > One way to fix this is refactoring the way in which we parse our format
> > string.  Although this would mean we would have to do a huge refactoring
> > as this step happens very high up in the call chain.
> > 
> > Therefore, support including parenthesis characters in "<string>" by
> > instead giving their hexcode equivalents - as a for-now hack.
> 
> So if I understand this is just expanding %<hex> and nothing else? That
> seems like the worst of both worlds. Now "%" is magic in these value
> strings, breaking compatibility, but

Yeah, I agree that this might be the worst of both worlds after I read
your reply to Junio.  It indeed is a hack - just trying to fix the
parenthesis case and not taking into account

- backwards compatibility with regards to '%'.
- not being able to do

> we didn't buy ourselves the flexibility to do arbitrary comparisons like:
> 
>   %(if:equals=%(upstream:lstrip=3))%(refname:short)%(then)...

Junio C Hamano Nov. 7, 2024, 2:52 a.m. UTC | #10

Jeff King <peff@peff.net> writes:

> who is doing:
>
>   %(if:equals=%%foo)
>
> to match the literal "%%foo" will be broken if we change that. They are
> not doing anything wrong; that is the only way to make it work now.

Ah, you're absolutely right.  Unescaping would start breaking them.

> I wouldn't go so far as to call the current behavior a bug. It's
> just...not very flexible. I also think it is unlikely that anybody would
> care in practice (though I find matching refs with ")" in them already a
> bit far-fetched).

100% agreed.  For that matter, I find "if:equals=%%foo" equally
implausible.

> If we wanted to be extra careful, we could introduce a variant of
> "equals" that indicates that it will be expanded before comparison.  Or
> even an extra tag, like:
>
>   %(if:expand:equals=%%foo)

Surely, but if nobody screams, I am tempted to suggest fixing the
equals/notequals---we do not have to be bug-to-bug compatible with a
buggy old implementation. After all, we do expand the string being
inspected that appears between %(if) and %(then).  I do not think of
a good excuse for us to limit the string that it gets compared with
to literals.

The implementation may be a bit involved, but shouldn't be too bad.

When .str is an empty string in if_atom_handler(), we can follow
what the current code does.  If .str is not empty, allocate a new
stack element in order to parse the .str to its end by pointing
.at_end of the new stack element to a new handler (call it
if_cond_handler()), and pass the if_then_else structure it allocated
as .at_end_data to it.

And in the if_cond_handler(), grab the cur->output and overwrite the
.str member with it (while being careful to avoid leaks).  At the
end of the if_cond_handler(), pass control to if_then_else_handler()
by arranging the if_then_else_handler is called, imitating the way
how if_atom_handler() passes control to if_then_else_handler() in
the current code.

Then things like

  %(if:equals=%(upstream:lstrip=3))%(refname:short)%(then)...

would work as expected ;-)

Kousik Sanagavarapu Nov. 8, 2024, 4:11 a.m. UTC | #11

On Thu, Nov 07, 2024 at 11:52:03AM +0900, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
> 
> [...]
> 
> The implementation may be a bit involved, but shouldn't be too bad.
> 
> When .str is an empty string in if_atom_handler(), we can follow
> what the current code does.  If .str is not empty, allocate a new
> stack element in order to parse the .str to its end by pointing
> .at_end of the new stack element to a new handler (call it
> if_cond_handler()), and pass the if_then_else structure it allocated
> as .at_end_data to it.
> 
> And in the if_cond_handler(), grab the cur->output and overwrite the
> .str member with it (while being careful to avoid leaks).  At the
> end of the if_cond_handler(), pass control to if_then_else_handler()
> by arranging the if_then_else_handler is called, imitating the way
> how if_atom_handler() passes control to if_then_else_handler() in
> the current code.
> 
> Then things like
> 
>   %(if:equals=%(upstream:lstrip=3))%(refname:short)%(then)...

So if I understand correctly, we grab the .str and operate on it so that
we expand the atom within it and then do the comparision.

This seems nice, but there is a problem.  Since we always look for the
first occurring ')' in our format string to indicate the end of the atom,
we end up with

	.str = %(upstream:lstrip=3

(the call chain is

	verify_ref_format() -> parse_ref_filter_atom() -> if_atom_parser()
)

Since we have now left out a ')', this ')' gets appended to our output
buf, which would also show up in cur->output when we do the comparision
in then_atom_handler().  For example, in this case our cur->output would
be ")master" instead of "master" after we get the value of
%(refname:short), meaning our comparision always fails.

Jeff King Nov. 8, 2024, 5:16 p.m. UTC | #12

On Fri, Nov 08, 2024 at 09:41:03AM +0530, Kousik Sanagavarapu wrote:

> > Then things like
> > 
> >   %(if:equals=%(upstream:lstrip=3))%(refname:short)%(then)...
> 
> So if I understand correctly, we grab the .str and operate on it so that
> we expand the atom within it and then do the comparision.
> 
> This seems nice, but there is a problem.  Since we always look for the
> first occurring ')' in our format string to indicate the end of the atom,
> we end up with
> 
> 	.str = %(upstream:lstrip=3
> 
> (the call chain is
> 
> 	verify_ref_format() -> parse_ref_filter_atom() -> if_atom_parser()
> )
> 
> Since we have now left out a ')', this ')' gets appended to our output
> buf, which would also show up in cur->output when we do the comparision
> in then_atom_handler().  For example, in this case our cur->output would
> be ")master" instead of "master" after we get the value of
> %(refname:short), meaning our comparision always fails.

Yes, though I think the parser _could_ be improved here. This is
different the earlier case of matching "ref-with-)". In that case it is
syntactically ambiguous. For example given:

  %(if:equals=ref-with-))foo

you cannot tell the difference between:

  - matching "ref-with-)", followed by "foo"

  - matching "ref-with-", followed by ")foo"

But if the parenthesis in question is closing a %() item, like:

  %(if:equals=%(refname))foo

then we know that the inner ")" is closing %(refname), since parsing it
as "%(refname" followed by ")foo" would leave an unbalanced pair. But
finding that would require a real recursive descent parser, rather than
a blind strchr() for the closing ")"[1].

In the meantime yeah, you'd have to spell it as:

  %(if:equals=%(refname%29)

which is...deeply unsatisfying.

I have long dreamed of throwing out all of this format code in favor of
a recursive parser which generates an actual tree of nodes, and
implements all of the ref-filter/pretty.c/cat-file format placeholders.
But I think it's a non-trivial task.

-Peff

[1] Incidentally, the "%)" I proposed earlier would also fall afoul of
    this problem. The search for the closing ")" is done blindly without
    regard to possible quoting.

Kousik Sanagavarapu Nov. 8, 2024, 6:12 p.m. UTC | #13

On Fri, Nov 08, 2024 at 12:16:37PM -0500, Jeff King wrote:
> On Fri, Nov 08, 2024 at 09:41:03AM +0530, Kousik Sanagavarapu wrote:
> 
> [...]
> 
> In the meantime yeah, you'd have to spell it as:
> 
>   %(if:equals=%(refname%29)
> 
> which is...deeply unsatisfying.
> 
> I have long dreamed of throwing out all of this format code in favor of
> a recursive parser which generates an actual tree of nodes, and
> implements all of the ref-filter/pretty.c/cat-file format placeholders.

Oh!  I remember this, let me search up the thread...

Quoting you from

	https://lore.kernel.org/git/20230901191639.GA1955435@coredump.intra.peff.net/

    IMHO the code would be a lot easier to work with if the atoms were
    structured as a parse tree with child pointers (especially when you get
    into things like "if" that have sub-expressions). I think one of the
    reasons that used_atom is an array is to de-duplicate repeated mentions
    (so if you formatted "%(foo) %(foo)" it would only have to store the
    computed value once).

    But I think that is the wrong way to optimize it. We shouldn't be
    storing any strings per-atom, but rather walking the parse tree to
    produce a single output buffer. And the values should be cheap to fill
    in, because we should parse the object as necessary up front. This is
    more or less the way the pretty.c parser does it.

> But I think it's a non-trivial task.

True.

t6300: values containing ')' are broken in ref formats

Commit Message

Comments

Patch