diff mbox series

[RFC,v1,1/2] pathspec: warn: long and short forms are incompatible

Message ID 20210326024005.26962-2-stdedos+git@gmail.com (mailing list archive)
State Superseded
Headers show
Series pathspec: warn: long and short forms are incompatible | expand

Commit Message

Σταύρος Ντέντος March 26, 2021, 2:40 a.m. UTC
From: Stavros Ntentos <133706+stdedos@users.noreply.github.com>

Namely, `!` and any other long magic form (e.g. `glob`)
cannot be combined to one entry.

Issue a warning when such thing happens, and hint to the solution.

Signed-off-by: Stavros Ntentos <133706+stdedos@users.noreply.github.com>
---
 pathspec.c                  | 19 +++++++++++++++++++
 pathspec.h                  |  1 +
 t/t6132-pathspec-exclude.sh | 33 +++++++++++++++++++++++++++++++++
 3 files changed, 53 insertions(+)

Comments

Jeff King March 26, 2021, 5:28 a.m. UTC | #1
On Fri, Mar 26, 2021 at 04:40:04AM +0200, Σταύρος Ντέντος wrote:

> +void check_mishandled_exclude(const char *entry) {
> +	size_t entry_len = strlen(entry);
> +	char flags[entry_len];
> +	char path[entry_len];

We avoid using variable-length arrays in our codebase. For one thing,
they were not historically supported by all platforms (we are slowly
using more C99 features, but we are introducing them slowly and
intentionally).

But two, they are limited in size and the failure mode is not graceful.
If "entry" is larger than the available stack, then we'll get a segfault
with no option to handle it better.

> +	if (sscanf(entry, ":!(%4096[^)])%4096s", &flags, &path) != 2) {
> +		return;

We also generally avoid using scanf, because it's error-prone. The
"4096" is scary here, but I don't _think_ it's a buffer overflow,
because "path" is already the same size as "entry" (not including the
NUL terminator, but that is negated by the fact that we'll have skipped
at least ":!").

Is this "%4096[^)]" actually valid? I don't think scanf understands
regular expressions.

We'd want to avoid making an extra copy of the string anyway. So you'd
probably want to just parse left-to-right in the original string, like:

  const char *p = entry;

  /* skip past stuff we know must be there */
  if (!skip_prefix(p, ":!(", &p))
        return;

  /* this checks count_slashes() > 0 in the flags section, though I'm
   * not sure I understand what that is looking for... */
  for (; *p && *p != ')'; p++) {
        if (*p == '/')
	        return;
  }
  if (*p++ != ')')
        return;

  /* now p is pointing at "path", though we don't seem to do anything
   * with it... */

-Peff
Σταύρος Ντέντος March 26, 2021, 4:16 p.m. UTC | #2
> We avoid using variable-length arrays in our codebase. ...

Hear hear, however, I wanted to avoid the "small mess"
that allocate/free would cause (one or more of ++sloc, labels, if-nesting); ...

> But two, they are limited in size and the failure mode is not graceful. ...
... however, my main issue is that - I don't know what's a sane allocation size.

> ... The "4096" is scary here ...
While scary, it is "a safe" upper high.
The first time the string ends up in pathspec.c for processing it's here:
	entry = argv[i];
which comes from here
	parse_pathspec(pathspec, magic_mask, flags, prefix, parsed_file.v);

and I don't know what's the maximum size of `parsed_file.v[0]`

> Is this "%4096[^)]" actually valid? I don't think scanf understands
> regular expressions.

I was suprised too - but it's not regex.

see: https://www.cplusplus.com/reference/cstdio/scanf/#parameters - 4th/3rd row from the end

%s is greedy and there would be no other way to limit `sscanf` to not include `)` on its first variable.

> ... though I'm not sure I understand what that is looking for ...

I think it will help you see what I am trying to achieve if you read at the warning message / testcase
https://lore.kernel.org/git/20210326024005.26962-2-stdedos+git@gmail.com/#iZ30t:t6132-pathspec-exclude.sh

And, to clean up the testcase:
	git log --oneline --format=%s -- ':!(glob)**/file'

I guess it should be now obvious what am I targetting:

If someone naively mixes short and long pathspec magics (`!`, and `(glob)`),
short form takes precedence and ignores long magic / assumes long magic as part of path.

(If it's not obvious, all the more reason to include such warning)
Jeff King March 27, 2021, 9:41 a.m. UTC | #3
On Fri, Mar 26, 2021 at 06:16:26PM +0200, Stavros Ntentos wrote:

> > We avoid using variable-length arrays in our codebase. ...
> 
> Hear hear, however, I wanted to avoid the "small mess"
> that allocate/free would cause (one or more of ++sloc, labels, if-nesting); ...
> 
> > But two, they are limited in size and the failure mode is not graceful. ...
> ... however, my main issue is that - I don't know what's a sane allocation size.

I don't think a VLA gets you out of knowing the allocation size. Either
way, you are using strlen(entry).

But I do think avoiding allocating altogether is better (as I showed in
my previous response).

> > ... The "4096" is scary here ...
> While scary, it is "a safe" upper high.
> The first time the string ends up in pathspec.c for processing it's here:
> 	entry = argv[i];
> which comes from here
> 	parse_pathspec(pathspec, magic_mask, flags, prefix, parsed_file.v);
> 
> and I don't know what's the maximum size of `parsed_file.v[0]`

There is no reasonable maximum size you can assume. Using 4096 is most
definitely not a safe upper bound.

However, as I said, I don't think it is doing anything useful in the
first place. You have sized the destination buffers as large as the
original string, so they must be large enough to hold any subset of the
original. Dropping them would be equally correct, but less distracting
to a reader.

> > Is this "%4096[^)]" actually valid? I don't think scanf understands
> > regular expressions.
> 
> I was suprised too - but it's not regex.
> 
> see: https://www.cplusplus.com/reference/cstdio/scanf/#parameters - 4th/3rd row from the end

Thanks, this is a corner of scanf I haven't looked at (mostly because
again, we generally avoid scanf in our code base entirely).

> I think it will help you see what I am trying to achieve if you read at the warning message / testcase
> https://lore.kernel.org/git/20210326024005.26962-2-stdedos+git@gmail.com/#iZ30t:t6132-pathspec-exclude.sh
> 
> And, to clean up the testcase:
> 	git log --oneline --format=%s -- ':!(glob)**/file'
> 
> I guess it should be now obvious what am I targetting:
> 
> If someone naively mixes short and long pathspec magics (`!`, and `(glob)`),
> short form takes precedence and ignores long magic / assumes long magic as part of path.
> 
> (If it's not obvious, all the more reason to include such warning)

I understand the overall goal. I am not sure why slashes in the flags
section are a reliable indicator that this mixing is not happening and
we should not show the warning.

It also feels like any checks like this should be relying on the
existing pathspec-magic parser a bit more. I don't know the pathspec
code that well, but surely at some point it has a notion of which parts
are magic flags (e.g., after parse_element_magic in init_pathspec_item).

-Peff
Junio C Hamano March 27, 2021, 6:36 p.m. UTC | #4
Jeff King <peff@peff.net> writes:

> It also feels like any checks like this should be relying on the
> existing pathspec-magic parser a bit more. I don't know the pathspec
> code that well, but surely at some point it has a notion of which parts
> are magic flags (e.g., after parse_element_magic in init_pathspec_item).

Absolutely.  parse_element_magic() decides, by looking at the
character after the initial ':', the we are looking at the longhand
form when magic sequence is introduced by ":(".

Otherwise, : must be followed by shorthand form, where the only
usable ones are '/' (synonym for top), '!' and '^' (synonym for
exclude), and ':' (end of short-form marker), but most importantly,
'(' will *never* be a valid shorthand form (because it is not part
of the GIT_PATHSPEC_MAGIC class of bytes).

So, presumably, inside parse_short_magic(), you could detect '('
and complain it as an attempt to switch to longform.

Here is to illustrate where to hook the check.  With this, I can get
something like this:

    $ git show -s --oneline -- ':!/(foobar)frotz' ':/(bar)nitfol' .
    warning: :!/: cannot mix shortform magic with longform like (exclude)
    warning: :/: cannot mix shortform magic with longform like (exclude)
    84d06cdc06 Sync with v2.31.1

Note that this illustration does not show the best warning message.
To improve it to a bit more useful level, I think two things need to
happen on top.

 * Use advice to allow users sequelch.

 * Show a few letters after '(' (but pay attention to the end of the
   string, which ends with either '\0' or ':'), and lose the
   substring "like (exclude)" from the message.

That would have given

    hint: ":!/(foo...": cannot mix shortform and longform magic

for the above example.

I initially thought it might make it even better if we looked ahead
of what comes after '(', took the substring before ',' or ')', and
looked up pathspec_magic[] array, BUT I do not think it is a good
idea.  It would be confusing if we do not give the same advice to
":!(literol)" when the user does get one for ":!(literal)".  So the
above "two things need to happen" list deliberately excludes an
"improvement" doing so.


 pathspec.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git i/pathspec.c w/pathspec.c
index 18b3be362a..cd343d5b54 100644
--- i/pathspec.c
+++ w/pathspec.c
@@ -336,6 +336,9 @@ static const char *parse_long_magic(unsigned *magic, int *prefix_len,
 	return pos;
 }
 
+static const char mixed_pathspec_magic[] =
+	N_("%.*s: cannot mix shortform magic with longform like (exclude)");
+
 /*
  * Parse the pathspec element looking for short magic
  *
@@ -356,6 +359,16 @@ static const char *parse_short_magic(unsigned *magic, const char *elem)
 			continue;
 		}
 
+		if (ch == '(') {
+			/*
+			 * a possible common mistake: once you started
+			 * a shortform you cannot add (longform), e.g.
+			 * ":!(top)"
+			 */
+			warning(_(mixed_pathspec_magic),
+				(int)(pos-elem), elem);
+		}
+
 		if (!is_pathspec_magic(ch))
 			break;
Σταύρος Ντέντος March 28, 2021, 3:44 p.m. UTC | #5
Junio C Hamano <gitster@pobox.com> writes:
> Jeff King <peff@peff.net> writes:
>
> > It also feels like any checks like this should be relying on the
> > existing pathspec-magic parser a bit more. I don't know the pathspec
> > code that well, but surely at some point it has a notion of which parts
> > are magic flags (e.g., after parse_element_magic in init_pathspec_item).
>
> Absolutely.  parse_element_magic() decides, by looking at the
> character after the initial ':', the we are looking at the longhand
> form when magic sequence is introduced by ":(".

Absolutely from me too! I would want to re-use more of the existing code.
In response to that, patch
https://lore.kernel.org/git/20210328152629.16486-1-133706+stdedos@users.noreply.github.com/
clearly shows that it is possible to re-use the de-facto long form magic parsing
code to attempt to parse the string that continues as a long form magic.

This could theoretically pave the road for mixed short/long form parsing
(in that order), if would be so desired.

However, as probably suspected, it suffers from one big problem:
	if (ARRAY_SIZE(pathspec_magic) <= i)
		die(_("Invalid pathspec magic '%.*s' in '%s'"),
		    (int) len, pos, elem);

which is an unconditional `exit(128)`.

I don't know if it's possible to rescue (or redirect) that `die` call.
In any case, considering that, this again deviates from the
"small informative" change this commit was supposed to be.

Having said that, I have cooked a working patch with the proposed code.
diff mbox series

Patch

diff --git a/pathspec.c b/pathspec.c
index 7a229d8d22..374c529569 100644
--- a/pathspec.c
+++ b/pathspec.c
@@ -1,3 +1,5 @@ 
+#include <stdio.h>
+#include <string.h>
 #include "cache.h"
 #include "config.h"
 #include "dir.h"
@@ -586,6 +588,8 @@  void parse_pathspec(struct pathspec *pathspec,
 	for (i = 0; i < n; i++) {
 		entry = argv[i];
 
+		check_mishandled_exclude(entry);
+
 		init_pathspec_item(item + i, flags, prefix, prefixlen, entry);
 
 		if (item[i].magic & PATHSPEC_EXCLUDE)
@@ -739,3 +743,18 @@  int match_pathspec_attrs(const struct index_state *istate,
 
 	return 1;
 }
+
+void check_mishandled_exclude(const char *entry) {
+	size_t entry_len = strlen(entry);
+	char flags[entry_len];
+	char path[entry_len];
+
+	if (sscanf(entry, ":!(%4096[^)])%4096s", &flags, &path) != 2) {
+		return;
+	}
+	if (count_slashes(flags) > 0) {
+		return;
+	}
+
+	warning(_("Pathspec provided matches `:!(...)`\n\tDid you mean `:(exclude,...)`?"));
+}
diff --git a/pathspec.h b/pathspec.h
index 454ce364fa..879d4e82c6 100644
--- a/pathspec.h
+++ b/pathspec.h
@@ -157,5 +157,6 @@  char *find_pathspecs_matching_against_index(const struct pathspec *pathspec,
 int match_pathspec_attrs(const struct index_state *istate,
 			 const char *name, int namelen,
 			 const struct pathspec_item *item);
+void check_mishandled_exclude(const char* pathspec_entry);
 
 #endif /* PATHSPEC_H */
diff --git a/t/t6132-pathspec-exclude.sh b/t/t6132-pathspec-exclude.sh
index 30328b87f0..b32ddb2a56 100755
--- a/t/t6132-pathspec-exclude.sh
+++ b/t/t6132-pathspec-exclude.sh
@@ -244,4 +244,37 @@  test_expect_success 'grep --untracked PATTERN :(exclude)*FILE' '
 	test_cmp expect-grep actual-grep
 '
 
+cat > expected_warn <<"EOF"
+Pathspec provided matches `:!(...)`
+EOF
+test_expect_success 'warn pathspec :!(...) skips the parenthesized magics' '
+	git log --oneline --format=%s -- '"'"':!(glob)**/file'"'"' >actual 2>warn &&
+	cat <<EOF >expect &&
+sub2/file
+sub/sub/sub/file
+sub/file2
+sub/sub/file
+sub/file
+file
+EOF
+	cat actual &&
+	cat warn &&
+	test_cmp expect actual &&
+	grep -Ff expected_warn warn
+'
+
+test_expect_success 'do not warn that pathspec :!(...) skips the parenthesized magics (if parenthesis would not be part of the magic)' '
+	git log --oneline --format=%s -- '"'"':!(gl/ob)/file'"'"' >actual 2>warn &&
+	cat <<EOF >expect &&
+sub2/file
+sub/sub/sub/file
+sub/file2
+sub/sub/file
+sub/file
+file
+EOF
+	test_cmp expect actual &&
+	! grep -Ff expected_warn warn
+'
+
 test_done