[v2,0/8] fast-import: tighten parsing of paths

Message ID	cover.1711960552.git.thalia@archibald.dev (mailing list archive)
Headers	show Received: from mail-40136.proton.ch (mail-40136.proton.ch [185.70.40.136]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8D42FDF59 for <git@vger.kernel.org>; Mon, 1 Apr 2024 09:02:18 +0000 (UTC) Date: Mon, 01 Apr 2024 09:02:03 +0000 To: git@vger.kernel.org From: Thalia Archibald <thalia@archibald.dev> Cc: Patrick Steinhardt <ps@pks.im>, Elijah Newren <newren@gmail.com>, Thalia Archibald <thalia@archibald.dev> Subject: [PATCH v2 0/8] fast-import: tighten parsing of paths Message-ID: <cover.1711960552.git.thalia@archibald.dev> In-Reply-To: <20240322000304.76810-1-thalia@archibald.dev> References: <20240322000304.76810-1-thalia@archibald.dev> Feedback-ID: 63908566:user:proton Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable
Series	fast-import: tighten parsing of paths \| expand [v2,0/8] fast-import: tighten parsing of paths [v2,1/8] fast-import: tighten path unquoting [v2,2/8] fast-import: directly use strbufs for paths [v2,3/8] fast-import: allow unquoted empty path for root [v2,4/8] fast-import: remove dead strbuf [v2,5/8] fast-import: improve documentation for unquoted paths [v2,6/8] fast-import: document C-style escapes for paths [v2,7/8] fast-import: forbid escaped NUL in paths [v2,8/8] fast-import: make comments more precise

Thalia Archibald April 1, 2024, 9:02 a.m. UTC

> fast-import has subtle differences in how it parses file paths between each
> occurrence of <path> in the grammar. Many errors are suppressed or not checked,
> which could lead to silent data corruption. A particularly bad case is when a
> front-end sent escapes that Git doesn't recognize (e.g., hex escapes are not
> supported), it would be treated as literal bytes instead of a quoted string.
>
> Bring path parsing into line with the documented behavior and improve
> documentation to fill in missing details.

Thanks for the review, Patrick. I've made several changes, which I think address
your feedback. What's the protocol for adding `Reviewed-by`? Since I don't know
whether I, you, or Junio add it, I've refrained from attaching your name to
these patches.

Changes since v1:
* In fast-import:
  * Move `strbuf_release` outside of `parse_path_space` and `parse_path_eol`.
  * Retain allocations for static `strbuf`s.
  * Rename `allow_spaces` parameter of `parse_path` to `include_spaces`.
  * Extract change to neighboring comments as patch 8.
* In tests:
  * Test `` for the root path additionally in all tests using `""`.
  * Pass all arguments by positional variables.
  * Use `local`.
  * Use `test_when_finished` instead of manual cleanup.
* In documentation:
  * Better document conditions under which a path is considered quoted or
    unquoted.
* Reword commit messages.

Thalia


Thalia Archibald (8):
  fast-import: tighten path unquoting
  fast-import: directly use strbufs for paths
  fast-import: allow unquoted empty path for root
  fast-import: remove dead strbuf
  fast-import: improve documentation for unquoted paths
  fast-import: document C-style escapes for paths
  fast-import: forbid escaped NUL in paths
  fast-import: make comments more precise

 Documentation/git-fast-import.txt |  30 +-
 builtin/fast-import.c             | 156 ++++----
 t/t9300-fast-import.sh            | 617 +++++++++++++++++++++---------
 3 files changed, 541 insertions(+), 262 deletions(-)

Range-diff against v1:
1:  8d9e0b25cb ! 1:  e790bdf714 fast-import: tighten parsing of paths
    @@ Metadata
     Author: Thalia Archibald <thalia@archibald.dev>

      ## Commit message ##
    -    fast-import: tighten parsing of paths
    +    fast-import: tighten path unquoting

         Path parsing in fast-import is inconsistent and many unquoting errors
    -    are suppressed.
    +    are suppressed or not checked.

    -    `<path>` appears in the grammar in these places:
    +    <path> appears in the grammar in these places:

             filemodify ::= 'M' SP <mode> (<dataref> | 'inline') SP <path> LF
             filedelete ::= 'D' SP <path> LF
    @@ Commit message
         and fast-import.c parses them in five different ways:

         1. For filemodify and filedelete:
    -       If `<path>` is a valid quoted string, unquote it;
    -       otherwise, treat it as literal bytes (including any number of SP).
    +       Try to unquote <path>. If it unquotes without errors, use the
    +       unquoted version; otherwise, treat it as literal bytes to the end of
    +       the line (including any number of SP).
         2. For filecopy (source) and filerename (source):
    -       If `<path>` is a valid quoted string, unquote it;
    -       otherwise, treat it as literal bytes until the next SP.
    +       Try to unquote <path>. If it unquotes without errors, use the
    +       unquoted version; otherwise, treat it as literal bytes up to, but not
    +       including, the next SP.
         3. For filecopy (dest) and filerename (dest):
    -       Like 1., but an unquoted empty string is an error.
    +       Like 1., but an unquoted empty string is forbidden.
         4. For ls:
    -       If `<path>` starts with `"`, unquote it and report parse errors;
    -       otherwise, treat it as literal bytes (including any number of SP).
    +       If <path> starts with `"`, unquote it and report parse errors;
    +       otherwise, treat it as literal bytes to the end of the line
    +       (including any number of SP).
         5. For ls-commit:
    -       Unquote `<path>` and report parse errors.
    +       Unquote <path> and report parse errors.
            (It must start with `"` to disambiguate from ls.)

         In the first three, any errors from trying to unquote a string are
         suppressed, so a quoted string that contains invalid escapes would be
         interpreted as literal bytes. For example, `"\xff"` would fail to
         unquote (because hex escapes are not supported), and it would instead be
    -    interpreted as the byte sequence `"` `\` `x` `f` `f` `"`, which is
    +    interpreted as the byte sequence '"', '\\', 'x', 'f', 'f', '"', which is
         certainly not intended. Some front-ends erroneously use their language's
    -    standard quoting routine and could silently introduce escapes that would
    -    be incorrectly parsed due to this.
    +    standard quoting routine instead of matching Git's, which could silently
    +    introduce escapes that would be incorrectly parsed due to this and lead
    +    to data corruption.

    -    The documentation states that “To use a source path that contains SP the
    -    path must be quoted.”, so it is expected that some implementations
    -    depend on spaces being allowed in paths in the final position. Thus we
    -    have two documented ways to parse paths, so simplify the implementation
    -    to that.
    +    The documentation states “To use a source path that contains SP the path
    +    must be quoted.”, so it is expected that some implementations depend on
    +    spaces being allowed in paths in the final position. Thus we have two
    +    documented ways to parse paths, so simplify the implementation to that.

         Now we have:

         1. `parse_path_eol` for filemodify, filedelete, filecopy (dest),
            filerename (dest), ls, and ls-commit:

    -       If `<path>` starts with `"`, unquote it and report parse errors;
    -       otherwise, treat it as literal bytes (including any number of SP).
    -       Garbage after a quoted string or an unquoted empty string are errors.
    -       (In ls-commit, it must be quoted to disambiguate from ls.)
    +       If <path> starts with `"`, unquote it and report parse errors;
    +       otherwise, treat it as literal bytes to the end of the line
    +       (including any number of SP).

         2. `parse_path_space` for filecopy (source) and filerename (source):

    -       If `<path>` starts with `"`, unquote it and report parse errors;
    -       otherwise, treat it as literal bytes until the next SP.
    -       It must be followed by a SP. An unquoted empty string is an error.
    +       If <path> starts with `"`, unquote it and report parse errors;
    +       otherwise, treat it as literal bytes up to, but not including, the
    +       next SP. It must be followed by SP.
    +
    +    There remain two special cases: The dest <path> in filecopy and rename
    +    cannot be an unquoted empty string (this will be addressed subsequently)
    +    and <path> in ls-commit must be quoted to disambiguate it from ls.

         Signed-off-by: Thalia Archibald <thalia@archibald.dev>

    - ## Documentation/git-fast-import.txt ##
    -@@ Documentation/git-fast-import.txt: The value of `<path>` must be in canonical form. That is it must not:
    - * contain the special component `.` or `..` (e.g. `foo/./bar` and
    -   `foo/../bar` are invalid).
    -
    --The root of the tree can be represented by an empty string as `<path>`.
    -+The root of the tree can be represented by a quoted empty string (`""`)
    -+as `<path>`.
    -
    - It is recommended that `<path>` always be encoded using UTF-8.
    -
    -
      ## builtin/fast-import.c ##
    -@@ builtin/fast-import.c: static int parse_mapped_oid_hex(const char *hex, struct object_id *oid, const ch
    -  *
    -  *   idnum ::= ':' bigint;
    -  *
    -- * Return the first character after the value in *endptr.
    -+ * Update *endptr to point to the first character after the value.
    -  *
    -  * Complain if the following character is not what is expected,
    -  * either a space or end of the string.
    -@@ builtin/fast-import.c: static uintmax_t parse_mark_ref_eol(const char *p)
    - }
    -
    - /*
    -- * Parse the mark reference, demanding a trailing space.  Return a
    -- * pointer to the space.
    -+ * Parse the mark reference, demanding a trailing space. Update *p to
    -+ * point to the first character after the space.
    -  */
    - static uintmax_t parse_mark_ref_space(const char **p)
    - {
     @@ builtin/fast-import.c: static uintmax_t parse_mark_ref_space(const char **p)
      	return mark;
      }
    @@ builtin/fast-import.c: static uintmax_t parse_mark_ref_space(const char **p)
     +/*
     + * Parse the path string into the strbuf. It may be quoted with escape sequences
     + * or unquoted without escape sequences. When unquoted, it may only contain a
    -+ * space if `allow_spaces` is nonzero.
    ++ * space if `include_spaces` is nonzero.
     + */
    -+static void parse_path(struct strbuf *sb, const char *p, const char **endp, int allow_spaces, const char *field)
    ++static void parse_path(struct strbuf *sb, const char *p, const char **endp, int include_spaces, const char *field)
     +{
    -+	strbuf_reset(sb);
     +	if (*p == '"') {
     +		if (unquote_c_style(sb, p, endp))
     +			die("Invalid %s: %s", field, command_buf.buf);
     +	} else {
    -+		if (allow_spaces)
    ++		if (include_spaces)
     +			*endp = p + strlen(p);
     +		else
     +			*endp = strchr(p, ' ');
    -+		if (*endp == p)
    -+			die("Missing %s: %s", field, command_buf.buf);
     +		strbuf_add(sb, p, *endp - p);
     +	}
     +}
    @@ builtin/fast-import.c: static uintmax_t parse_mark_ref_space(const char **p)
      	struct object_id oid;
      	uint16_t mode, inline_data = 0;
     @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b)
    - 			die("Missing space after SHA1: %s", command_buf.buf);
      	}

    --	strbuf_reset(&uq);
    + 	strbuf_reset(&uq);
     -	if (!unquote_c_style(&uq, p, &endp)) {
     -		if (*endp)
     -			die("Garbage after path in: %s", command_buf.buf);
    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
      	static struct strbuf uq = STRBUF_INIT;
     -	const char *endp;

    --	strbuf_reset(&uq);
    + 	strbuf_reset(&uq);
     -	if (!unquote_c_style(&uq, p, &endp)) {
     -		if (*endp)
     -			die("Garbage after path in: %s", command_buf.buf);
    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
     -	const char *endp;
      	struct tree_entry leaf;

    --	strbuf_reset(&s_uq);
    + 	strbuf_reset(&s_uq);
     -	if (!unquote_c_style(&s_uq, s, &endp)) {
     -		if (*endp != ' ')
     -			die("Missing space after source: %s", command_buf.buf);
    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
     -		strbuf_add(&s_uq, s, endp - s);
     -	}
     +	parse_path_space(&s_uq, p, &p, "source");
    -+	parse_path_eol(&d_uq, p, "dest");
      	s = s_uq.buf;
    --
    +
     -	endp++;
     -	if (!*endp)
    --		die("Missing dest: %s", command_buf.buf);
    ++	if (!p)
    + 		die("Missing dest: %s", command_buf.buf);
     -
     -	d = endp;
    --	strbuf_reset(&d_uq);
    + 	strbuf_reset(&d_uq);
     -	if (!unquote_c_style(&d_uq, d, &endp)) {
     -		if (*endp)
     -			die("Garbage after dest in: %s", command_buf.buf);
     -		d = d_uq.buf;
     -	}
    ++	parse_path_eol(&d_uq, p, "dest");
     +	d = d_uq.buf;

      	memset(&leaf, 0, sizeof(leaf));
      	if (rename)
    -@@ builtin/fast-import.c: static void parse_ls(const char *p, struct branch *b)
    +@@ builtin/fast-import.c: static void print_ls(int mode, const unsigned char *hash, const char *path)
    +
    + static void parse_ls(const char *p, struct branch *b)
      {
    ++	static struct strbuf uq = STRBUF_INIT;
      	struct tree_entry *root = NULL;
      	struct tree_entry leaf = {NULL};
    -+	static struct strbuf uq = STRBUF_INIT;

    - 	/* ls SP (<tree-ish> SP)? <path> */
    - 	if (*p == '"') {
     @@ builtin/fast-import.c: static void parse_ls(const char *p, struct branch *b)
      			root->versions[1].mode = S_IFDIR;
      		load_tree(root);
    @@ builtin/fast-import.c: static void parse_ls(const char *p, struct branch *b)
     -			die("Garbage after path in: %s", command_buf.buf);
     -		p = uq.buf;
     -	}
    ++	strbuf_reset(&uq);
     +	parse_path_eol(&uq, p, "path");
     +	p = uq.buf;
      	tree_content_get(root, p, &leaf, 1);
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +# Path parsing
     +#
     +# There are two sorts of ways a path can be parsed, depending on whether it is
    -+# the last field on the line. Additionally, ls without a <tree-ish> has a
    -+# special case. Test every occurrence of <path> in the grammar against every
    -+# error case.
    ++# the last field on the line. Additionally, ls without a <dataref> has a special
    ++# case. Test every occurrence of <path> in the grammar against every error case.
     +#
     +
     +#
     +# Valid paths at the end of a line: filemodify, filedelete, filecopy (dest),
     +# filerename (dest), and ls.
     +#
    -+# commit :301 from root -- modify hello.c
    ++# commit :301 from root -- modify hello.c (for setup)
     +# commit :302 from :301 -- modify $path
     +# commit :303 from :302 -- delete $path
     +# commit :304 from :301 -- copy hello.c $path
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +# ls :305 $path
     +#
     +test_path_eol_success () {
    -+	test="$1" path="$2" unquoted_path="$3"
    ++	local test="$1" path="$2" unquoted_path="$3"
     +	test_expect_success "S: paths at EOL with $test must work" '
    ++		test_when_finished "git branch -D S-path-eol" &&
    ++
     +		git fast-import --export-marks=marks.out <<-EOF >out 2>err &&
     +		blob
     +		mark :401
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		hallo welt
     +		BLOB
     +
    -+		commit refs/heads/path-eol
    ++		commit refs/heads/S-path-eol
     +		mark :301
     +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
     +		data <<COMMIT
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		COMMIT
     +		M 100644 :401 hello.c
     +
    -+		commit refs/heads/path-eol
    ++		commit refs/heads/S-path-eol
     +		mark :302
     +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
     +		data <<COMMIT
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		from :301
     +		M 100644 :402 '"$path"'
     +
    -+		commit refs/heads/path-eol
    ++		commit refs/heads/S-path-eol
     +		mark :303
     +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
     +		data <<COMMIT
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		from :302
     +		D '"$path"'
     +
    -+		commit refs/heads/path-eol
    ++		commit refs/heads/S-path-eol
     +		mark :304
     +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
     +		data <<COMMIT
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		from :301
     +		C hello.c '"$path"'
     +
    -+		commit refs/heads/path-eol
    ++		commit refs/heads/S-path-eol
     +		mark :305
     +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
     +		data <<COMMIT
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		git ls-tree $commit_r >tree_r.out &&
     +		test_cmp tree_r.exp tree_r.out &&
     +
    -+		test_cmp out tree_r.exp &&
    -+
    -+		git branch -D path-eol
    ++		test_cmp out tree_r.exp
     +	'
     +}
     +
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +#
     +# Valid paths before a space: filecopy (source) and filerename (source).
     +#
    -+# commit :301 from root -- modify $path
    ++# commit :301 from root -- modify $path (for setup)
     +# commit :302 from :301 -- copy $path hello2.c
     +# commit :303 from :301 -- rename $path hello2.c
     +#
     +test_path_space_success () {
    -+	test="$1" path="$2" unquoted_path="$3"
    ++	local test="$1" path="$2" unquoted_path="$3"
     +	test_expect_success "S: paths before space with $test must work" '
    ++		test_when_finished "git branch -D S-path-space" &&
    ++
     +		git fast-import --export-marks=marks.out <<-EOF 2>err &&
     +		blob
     +		mark :401
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		hello world
     +		BLOB
     +
    -+		commit refs/heads/path-space
    ++		commit refs/heads/S-path-space
     +		mark :301
     +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
     +		data <<COMMIT
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		COMMIT
     +		M 100644 :401 '"$path"'
     +
    -+		commit refs/heads/path-space
    ++		commit refs/heads/S-path-space
     +		mark :302
     +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
     +		data <<COMMIT
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +		from :301
     +		C '"$path"' hello2.c
     +
    -+		commit refs/heads/path-space
    ++		commit refs/heads/S-path-space
     +		mark :303
     +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
     +		data <<COMMIT
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +
     +		printf "100644 blob $blob\thello2.c\n" >tree_r.exp &&
     +		git ls-tree $commit_r >tree_r.out &&
    -+		test_cmp tree_r.exp tree_r.out &&
    -+
    -+		git branch -D path-space
    ++		test_cmp tree_r.exp tree_r.out
     +	'
     +}
     +
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +# of <path> in the grammar against all error kinds.
     +#
     +test_path_fail () {
    -+	what="$1" path="$2" err_grep="$3"
    ++	local change="$1" what="$2" prefix="$3" path="$4" suffix="$5" err_grep="$6"
     +	test_expect_success "S: $change with $what must fail" '
     +		test_must_fail git fast-import <<-EOF 2>err &&
     +		blob
    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
     +}
     +
     +test_path_base_fail () {
    -+	test_path_fail 'unclosed " in '"$field"          '"hello.c'    "Invalid $field"
    -+	test_path_fail "invalid escape in quoted $field" '"hello\xff"' "Invalid $field"
    ++	local change="$1" prefix="$2" field="$3" suffix="$4"
    ++	test_path_fail "$change" 'unclosed " in '"$field"          "$prefix" '"hello.c'    "$suffix" "Invalid $field"
    ++	test_path_fail "$change" "invalid escape in quoted $field" "$prefix" '"hello\xff"' "$suffix" "Invalid $field"
     +}
     +test_path_eol_quoted_fail () {
    -+	test_path_base_fail
    -+	test_path_fail "garbage after quoted $field" '"hello.c"x' "Garbage after $field"
    -+	test_path_fail "space after quoted $field"   '"hello.c" ' "Garbage after $field"
    ++	local change="$1" prefix="$2" field="$3" suffix="$4"
    ++	test_path_base_fail "$change" "$prefix" "$field" "$suffix"
    ++	test_path_fail "$change" "garbage after quoted $field" "$prefix" '"hello.c"x' "$suffix" "Garbage after $field"
    ++	test_path_fail "$change" "space after quoted $field"   "$prefix" '"hello.c" ' "$suffix" "Garbage after $field"
     +}
     +test_path_eol_fail () {
    -+	test_path_eol_quoted_fail
    -+	test_path_fail 'empty unquoted path' '' "Missing $field"
    ++	local change="$1" prefix="$2" field="$3" suffix="$4"
    ++	test_path_eol_quoted_fail "$change" "$prefix" "$field" "$suffix"
     +}
     +test_path_space_fail () {
    -+	test_path_base_fail
    -+	test_path_fail 'empty unquoted path' '' "Missing $field"
    -+	test_path_fail "missing space after quoted $field" '"hello.c"x' "Missing space after $field"
    ++	local change="$1" prefix="$2" field="$3" suffix="$4"
    ++	test_path_base_fail "$change" "$prefix" "$field" "$suffix"
    ++	test_path_fail "$change" "missing space after quoted $field" "$prefix" '"hello.c"x' "$suffix" "Missing space after $field"
     +}
     +
    -+change=filemodify       prefix='M 100644 :1 ' field=path   suffix=''         test_path_eol_fail
    -+change=filedelete       prefix='D '           field=path   suffix=''         test_path_eol_fail
    -+change=filecopy         prefix='C '           field=source suffix=' world.c' test_path_space_fail
    -+change=filecopy         prefix='C hello.c '   field=dest   suffix=''         test_path_eol_fail
    -+change=filerename       prefix='R '           field=source suffix=' world.c' test_path_space_fail
    -+change=filerename       prefix='R hello.c '   field=dest   suffix=''         test_path_eol_fail
    -+change='ls (in commit)' prefix='ls :2 '       field=path   suffix=''         test_path_eol_fail
    ++test_path_eol_fail   filemodify       'M 100644 :1 ' path   ''
    ++test_path_eol_fail   filedelete       'D '           path   ''
    ++test_path_space_fail filecopy         'C '           source ' world.c'
    ++test_path_eol_fail   filecopy         'C hello.c '   dest   ''
    ++test_path_space_fail filerename       'R '           source ' world.c'
    ++test_path_eol_fail   filerename       'R hello.c '   dest   ''
    ++test_path_eol_fail   'ls (in commit)' 'ls :2 '       path   ''
     +
    -+# When 'ls' has no <tree-ish>, the <path> must be quoted.
    -+change='ls (without tree-ish in commit)' prefix='ls ' field=path suffix='' \
    -+test_path_eol_quoted_fail &&
    -+test_path_fail 'empty unquoted path' '' "Invalid dataref"
    ++# When 'ls' has no <dataref>, the <path> must be quoted.
    ++test_path_eol_quoted_fail 'ls (without dataref in commit)' 'ls ' path ''
     +
      ###
      ### series T (ls)
2:  a2aca9f9e6 ! 2:  82a6f53c13 fast-import: directly use strbufs for paths
    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
      			die("Missing space after SHA1: %s", command_buf.buf);
      	}

    +-	strbuf_reset(&uq);
     -	parse_path_eol(&uq, p, "path");
     -	p = uq.buf;
    ++	strbuf_reset(&path);
     +	parse_path_eol(&path, p, "path");

      	/* Git does not track empty, non-toplevel directories. */
    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
     -	static struct strbuf uq = STRBUF_INIT;
     +	static struct strbuf path = STRBUF_INIT;

    +-	strbuf_reset(&uq);
     -	parse_path_eol(&uq, p, "path");
     -	p = uq.buf;
     -	tree_content_remove(&b->branch_tree, p, NULL, 1);
    ++	strbuf_reset(&path);
     +	parse_path_eol(&path, p, "path");
     +	tree_content_remove(&b->branch_tree, path.buf, NULL, 1);
      }
    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
     +	static struct strbuf dest = STRBUF_INIT;
      	struct tree_entry leaf;

    +-	strbuf_reset(&s_uq);
     -	parse_path_space(&s_uq, p, &p, "source");
    --	parse_path_eol(&d_uq, p, "dest");
     -	s = s_uq.buf;
    --	d = d_uq.buf;
    ++	strbuf_reset(&source);
     +	parse_path_space(&source, p, &p, "source");
    +
    + 	if (!p)
    + 		die("Missing dest: %s", command_buf.buf);
    +-	strbuf_reset(&d_uq);
    +-	parse_path_eol(&d_uq, p, "dest");
    +-	d = d_uq.buf;
    ++	strbuf_reset(&dest);
     +	parse_path_eol(&dest, p, "dest");

      	memset(&leaf, 0, sizeof(leaf));
    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
      		&leaf.versions[1].oid,
      		leaf.versions[1].mode,
      		leaf.tree);
    -@@ builtin/fast-import.c: static void parse_ls(const char *p, struct branch *b)
    +@@ builtin/fast-import.c: static void print_ls(int mode, const unsigned char *hash, const char *path)
    +
    + static void parse_ls(const char *p, struct branch *b)
      {
    - 	struct tree_entry *root = NULL;
    - 	struct tree_entry leaf = {NULL};
     -	static struct strbuf uq = STRBUF_INIT;
     +	static struct strbuf path = STRBUF_INIT;
    + 	struct tree_entry *root = NULL;
    + 	struct tree_entry leaf = {NULL};

    - 	/* ls SP (<tree-ish> SP)? <path> */
    - 	if (*p == '"') {
     @@ builtin/fast-import.c: static void parse_ls(const char *p, struct branch *b)
      			root->versions[1].mode = S_IFDIR;
      		load_tree(root);
      	}
    +-	strbuf_reset(&uq);
     -	parse_path_eol(&uq, p, "path");
     -	p = uq.buf;
     -	tree_content_get(root, p, &leaf, 1);
    ++	strbuf_reset(&path);
     +	parse_path_eol(&path, p, "path");
     +	tree_content_get(root, path.buf, &leaf, 1);
      	/*
3:  ecaf4883d1 < -:  ---------- fast-import: release unfreed strbufs
-:  ---------- > 3:  893bbf5e73 fast-import: allow unquoted empty path for root
4:  058a38416a ! 4:  cb05a184e6 fast-import: remove dead strbuf
    @@ Metadata
      ## Commit message ##
         fast-import: remove dead strbuf

    -    The strbuf in `note_change_n` has been unused since the function was
    +    The strbuf in `note_change_n` is to copy the remainder of `p` before
    +    potentially invalidating it when reading the next line. However, `p` is
    +    not used after that point. It has been unused since the function was
         created in a8dd2e7d2b (fast-import: Add support for importing commit
         notes, 2009-10-09) and looks to be a fossil from adapting
    -    `note_change_m`. Remove it.
    +    `file_change_m`. Remove it.

         Signed-off-by: Thalia Archibald <thalia@archibald.dev>

5:  a5e8df0759 < -:  ---------- fast-import: document C-style escapes for paths
6:  9792940ba9 < -:  ---------- fast-import: forbid escaped NUL in paths
-:  ---------- > 5:  1f34b632d7 fast-import: improve documentation for unquoted paths
-:  ---------- > 6:  82a4da68af fast-import: document C-style escapes for paths
-:  ---------- > 7:  c087c6a860 fast-import: forbid escaped NUL in paths
-:  ---------- > 8:  a503c55b83 fast-import: make comments more precise
--
2.44.0

Thalia Archibald April 7, 2024, 9:19 p.m. UTC | #1

On Apr 1, 2024, at 02:02, Thalia Archibald wrote:
>> fast-import has subtle differences in how it parses file paths between each
>> occurrence of <path> in the grammar. Many errors are suppressed or not checked,
>> which could lead to silent data corruption. A particularly bad case is when a
>> front-end sent escapes that Git doesn't recognize (e.g., hex escapes are not
>> supported), it would be treated as literal bytes instead of a quoted string.
>> 
>> Bring path parsing into line with the documented behavior and improve
>> documentation to fill in missing details.
> 
> Thanks for the review, Patrick. I've made several changes, which I think address
> your feedback. What's the protocol for adding `Reviewed-by`? Since I don't know
> whether I, you, or Junio add it, I've refrained from attaching your name to
> these patches.

Hello! Friendly ping here. It’s been a week since the last emails for this patch
set, so I’d like to check in on the status.

As a new contributor to the project, I don’t yet have a full view of the
contribution flow, aside from what I’ve read. I suspect the latency is because I
may not have cc’d all the area experts. (When I sent v1, I used separate Cc
lines in send-email --compose, but it dropped all but the last. After Patrick
reviewed it, I figured I could leave the cc list as-is for v2, assuming I’d get
another review.) I’ve now cc’d everyone listed by contrib/contacts, as well as
the maintainer. For anyone not a part of the earlier discussion, the latest
version is at https://lore.kernel.org/git/cover.1711960552.git.thalia@archibald.dev/.
If that’s not the problem, I’d be keen to know what I could do better.

I have several more patch sets in the works, that I’ve held back on sending
until this one is finished, in case I’ve been doing something wrong. I want to
move forward. Thank you for your time.

Thalia

> Changes since v1:
> * In fast-import:
>  * Move `strbuf_release` outside of `parse_path_space` and `parse_path_eol`.
>  * Retain allocations for static `strbuf`s.
>  * Rename `allow_spaces` parameter of `parse_path` to `include_spaces`.
>  * Extract change to neighboring comments as patch 8.
> * In tests:
>  * Test `` for the root path additionally in all tests using `""`.
>  * Pass all arguments by positional variables.
>  * Use `local`.
>  * Use `test_when_finished` instead of manual cleanup.
> * In documentation:
>  * Better document conditions under which a path is considered quoted or
>    unquoted.
> * Reword commit messages.
> 
> Thalia
> 
> 
> Thalia Archibald (8):
>  fast-import: tighten path unquoting
>  fast-import: directly use strbufs for paths
>  fast-import: allow unquoted empty path for root
>  fast-import: remove dead strbuf
>  fast-import: improve documentation for unquoted paths
>  fast-import: document C-style escapes for paths
>  fast-import: forbid escaped NUL in paths
>  fast-import: make comments more precise
> 
> Documentation/git-fast-import.txt |  30 +-
> builtin/fast-import.c             | 156 ++++----
> t/t9300-fast-import.sh            | 617 +++++++++++++++++++++---------
> 3 files changed, 541 insertions(+), 262 deletions(-)
> 
> Range-diff against v1:
> 1:  8d9e0b25cb ! 1:  e790bdf714 fast-import: tighten parsing of paths
>    @@ Metadata
>     Author: Thalia Archibald <thalia@archibald.dev>
> 
>      ## Commit message ##
>    -    fast-import: tighten parsing of paths
>    +    fast-import: tighten path unquoting
> 
>         Path parsing in fast-import is inconsistent and many unquoting errors
>    -    are suppressed.
>    +    are suppressed or not checked.
> 
>    -    `<path>` appears in the grammar in these places:
>    +    <path> appears in the grammar in these places:
> 
>             filemodify ::= 'M' SP <mode> (<dataref> | 'inline') SP <path> LF
>             filedelete ::= 'D' SP <path> LF
>    @@ Commit message
>         and fast-import.c parses them in five different ways:
> 
>         1. For filemodify and filedelete:
>    -       If `<path>` is a valid quoted string, unquote it;
>    -       otherwise, treat it as literal bytes (including any number of SP).
>    +       Try to unquote <path>. If it unquotes without errors, use the
>    +       unquoted version; otherwise, treat it as literal bytes to the end of
>    +       the line (including any number of SP).
>         2. For filecopy (source) and filerename (source):
>    -       If `<path>` is a valid quoted string, unquote it;
>    -       otherwise, treat it as literal bytes until the next SP.
>    +       Try to unquote <path>. If it unquotes without errors, use the
>    +       unquoted version; otherwise, treat it as literal bytes up to, but not
>    +       including, the next SP.
>         3. For filecopy (dest) and filerename (dest):
>    -       Like 1., but an unquoted empty string is an error.
>    +       Like 1., but an unquoted empty string is forbidden.
>         4. For ls:
>    -       If `<path>` starts with `"`, unquote it and report parse errors;
>    -       otherwise, treat it as literal bytes (including any number of SP).
>    +       If <path> starts with `"`, unquote it and report parse errors;
>    +       otherwise, treat it as literal bytes to the end of the line
>    +       (including any number of SP).
>         5. For ls-commit:
>    -       Unquote `<path>` and report parse errors.
>    +       Unquote <path> and report parse errors.
>            (It must start with `"` to disambiguate from ls.)
> 
>         In the first three, any errors from trying to unquote a string are
>         suppressed, so a quoted string that contains invalid escapes would be
>         interpreted as literal bytes. For example, `"\xff"` would fail to
>         unquote (because hex escapes are not supported), and it would instead be
>    -    interpreted as the byte sequence `"` `\` `x` `f` `f` `"`, which is
>    +    interpreted as the byte sequence '"', '\\', 'x', 'f', 'f', '"', which is
>         certainly not intended. Some front-ends erroneously use their language's
>    -    standard quoting routine and could silently introduce escapes that would
>    -    be incorrectly parsed due to this.
>    +    standard quoting routine instead of matching Git's, which could silently
>    +    introduce escapes that would be incorrectly parsed due to this and lead
>    +    to data corruption.
> 
>    -    The documentation states that “To use a source path that contains SP the
>    -    path must be quoted.”, so it is expected that some implementations
>    -    depend on spaces being allowed in paths in the final position. Thus we
>    -    have two documented ways to parse paths, so simplify the implementation
>    -    to that.
>    +    The documentation states “To use a source path that contains SP the path
>    +    must be quoted.”, so it is expected that some implementations depend on
>    +    spaces being allowed in paths in the final position. Thus we have two
>    +    documented ways to parse paths, so simplify the implementation to that.
> 
>         Now we have:
> 
>         1. `parse_path_eol` for filemodify, filedelete, filecopy (dest),
>            filerename (dest), ls, and ls-commit:
> 
>    -       If `<path>` starts with `"`, unquote it and report parse errors;
>    -       otherwise, treat it as literal bytes (including any number of SP).
>    -       Garbage after a quoted string or an unquoted empty string are errors.
>    -       (In ls-commit, it must be quoted to disambiguate from ls.)
>    +       If <path> starts with `"`, unquote it and report parse errors;
>    +       otherwise, treat it as literal bytes to the end of the line
>    +       (including any number of SP).
> 
>         2. `parse_path_space` for filecopy (source) and filerename (source):
> 
>    -       If `<path>` starts with `"`, unquote it and report parse errors;
>    -       otherwise, treat it as literal bytes until the next SP.
>    -       It must be followed by a SP. An unquoted empty string is an error.
>    +       If <path> starts with `"`, unquote it and report parse errors;
>    +       otherwise, treat it as literal bytes up to, but not including, the
>    +       next SP. It must be followed by SP.
>    +
>    +    There remain two special cases: The dest <path> in filecopy and rename
>    +    cannot be an unquoted empty string (this will be addressed subsequently)
>    +    and <path> in ls-commit must be quoted to disambiguate it from ls.
> 
>         Signed-off-by: Thalia Archibald <thalia@archibald.dev>
> 
>    - ## Documentation/git-fast-import.txt ##
>    -@@ Documentation/git-fast-import.txt: The value of `<path>` must be in canonical form. That is it must not:
>    - * contain the special component `.` or `..` (e.g. `foo/./bar` and
>    -   `foo/../bar` are invalid).
>    -
>    --The root of the tree can be represented by an empty string as `<path>`.
>    -+The root of the tree can be represented by a quoted empty string (`""`)
>    -+as `<path>`.
>    -
>    - It is recommended that `<path>` always be encoded using UTF-8.
>    -
>    -
>      ## builtin/fast-import.c ##
>    -@@ builtin/fast-import.c: static int parse_mapped_oid_hex(const char *hex, struct object_id *oid, const ch
>    -  *
>    -  *   idnum ::= ':' bigint;
>    -  *
>    -- * Return the first character after the value in *endptr.
>    -+ * Update *endptr to point to the first character after the value.
>    -  *
>    -  * Complain if the following character is not what is expected,
>    -  * either a space or end of the string.
>    -@@ builtin/fast-import.c: static uintmax_t parse_mark_ref_eol(const char *p)
>    - }
>    -
>    - /*
>    -- * Parse the mark reference, demanding a trailing space.  Return a
>    -- * pointer to the space.
>    -+ * Parse the mark reference, demanding a trailing space. Update *p to
>    -+ * point to the first character after the space.
>    -  */
>    - static uintmax_t parse_mark_ref_space(const char **p)
>    - {
>     @@ builtin/fast-import.c: static uintmax_t parse_mark_ref_space(const char **p)
>       return mark;
>      }
>    @@ builtin/fast-import.c: static uintmax_t parse_mark_ref_space(const char **p)
>     +/*
>     + * Parse the path string into the strbuf. It may be quoted with escape sequences
>     + * or unquoted without escape sequences. When unquoted, it may only contain a
>    -+ * space if `allow_spaces` is nonzero.
>    ++ * space if `include_spaces` is nonzero.
>     + */
>    -+static void parse_path(struct strbuf *sb, const char *p, const char **endp, int allow_spaces, const char *field)
>    ++static void parse_path(struct strbuf *sb, const char *p, const char **endp, int include_spaces, const char *field)
>     +{
>    -+ strbuf_reset(sb);
>     + if (*p == '"') {
>     + if (unquote_c_style(sb, p, endp))
>     + die("Invalid %s: %s", field, command_buf.buf);
>     + } else {
>    -+ if (allow_spaces)
>    ++ if (include_spaces)
>     + *endp = p + strlen(p);
>     + else
>     + *endp = strchr(p, ' ');
>    -+ if (*endp == p)
>    -+ die("Missing %s: %s", field, command_buf.buf);
>     + strbuf_add(sb, p, *endp - p);
>     + }
>     +}
>    @@ builtin/fast-import.c: static uintmax_t parse_mark_ref_space(const char **p)
>       struct object_id oid;
>       uint16_t mode, inline_data = 0;
>     @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b)
>    - die("Missing space after SHA1: %s", command_buf.buf);
>       }
> 
>    -- strbuf_reset(&uq);
>    + strbuf_reset(&uq);
>     - if (!unquote_c_style(&uq, p, &endp)) {
>     - if (*endp)
>     - die("Garbage after path in: %s", command_buf.buf);
>    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
>       static struct strbuf uq = STRBUF_INIT;
>     - const char *endp;
> 
>    -- strbuf_reset(&uq);
>    + strbuf_reset(&uq);
>     - if (!unquote_c_style(&uq, p, &endp)) {
>     - if (*endp)
>     - die("Garbage after path in: %s", command_buf.buf);
>    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
>     - const char *endp;
>       struct tree_entry leaf;
> 
>    -- strbuf_reset(&s_uq);
>    + strbuf_reset(&s_uq);
>     - if (!unquote_c_style(&s_uq, s, &endp)) {
>     - if (*endp != ' ')
>     - die("Missing space after source: %s", command_buf.buf);
>    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
>     - strbuf_add(&s_uq, s, endp - s);
>     - }
>     + parse_path_space(&s_uq, p, &p, "source");
>    -+ parse_path_eol(&d_uq, p, "dest");
>       s = s_uq.buf;
>    --
>    +
>     - endp++;
>     - if (!*endp)
>    -- die("Missing dest: %s", command_buf.buf);
>    ++ if (!p)
>    + die("Missing dest: %s", command_buf.buf);
>     -
>     - d = endp;
>    -- strbuf_reset(&d_uq);
>    + strbuf_reset(&d_uq);
>     - if (!unquote_c_style(&d_uq, d, &endp)) {
>     - if (*endp)
>     - die("Garbage after dest in: %s", command_buf.buf);
>     - d = d_uq.buf;
>     - }
>    ++ parse_path_eol(&d_uq, p, "dest");
>     + d = d_uq.buf;
> 
>       memset(&leaf, 0, sizeof(leaf));
>       if (rename)
>    -@@ builtin/fast-import.c: static void parse_ls(const char *p, struct branch *b)
>    +@@ builtin/fast-import.c: static void print_ls(int mode, const unsigned char *hash, const char *path)
>    +
>    + static void parse_ls(const char *p, struct branch *b)
>      {
>    ++ static struct strbuf uq = STRBUF_INIT;
>       struct tree_entry *root = NULL;
>       struct tree_entry leaf = {NULL};
>    -+ static struct strbuf uq = STRBUF_INIT;
> 
>    - /* ls SP (<tree-ish> SP)? <path> */
>    - if (*p == '"') {
>     @@ builtin/fast-import.c: static void parse_ls(const char *p, struct branch *b)
>       root->versions[1].mode = S_IFDIR;
>       load_tree(root);
>    @@ builtin/fast-import.c: static void parse_ls(const char *p, struct branch *b)
>     - die("Garbage after path in: %s", command_buf.buf);
>     - p = uq.buf;
>     - }
>    ++ strbuf_reset(&uq);
>     + parse_path_eol(&uq, p, "path");
>     + p = uq.buf;
>       tree_content_get(root, p, &leaf, 1);
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     +# Path parsing
>     +#
>     +# There are two sorts of ways a path can be parsed, depending on whether it is
>    -+# the last field on the line. Additionally, ls without a <tree-ish> has a
>    -+# special case. Test every occurrence of <path> in the grammar against every
>    -+# error case.
>    ++# the last field on the line. Additionally, ls without a <dataref> has a special
>    ++# case. Test every occurrence of <path> in the grammar against every error case.
>     +#
>     +
>     +#
>     +# Valid paths at the end of a line: filemodify, filedelete, filecopy (dest),
>     +# filerename (dest), and ls.
>     +#
>    -+# commit :301 from root -- modify hello.c
>    ++# commit :301 from root -- modify hello.c (for setup)
>     +# commit :302 from :301 -- modify $path
>     +# commit :303 from :302 -- delete $path
>     +# commit :304 from :301 -- copy hello.c $path
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     +# ls :305 $path
>     +#
>     +test_path_eol_success () {
>    -+ test="$1" path="$2" unquoted_path="$3"
>    ++ local test="$1" path="$2" unquoted_path="$3"
>     + test_expect_success "S: paths at EOL with $test must work" '
>    ++ test_when_finished "git branch -D S-path-eol" &&
>    ++
>     + git fast-import --export-marks=marks.out <<-EOF >out 2>err &&
>     + blob
>     + mark :401
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     + hallo welt
>     + BLOB
>     +
>    -+ commit refs/heads/path-eol
>    ++ commit refs/heads/S-path-eol
>     + mark :301
>     + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
>     + data <<COMMIT
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     + COMMIT
>     + M 100644 :401 hello.c
>     +
>    -+ commit refs/heads/path-eol
>    ++ commit refs/heads/S-path-eol
>     + mark :302
>     + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
>     + data <<COMMIT
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     + from :301
>     + M 100644 :402 '"$path"'
>     +
>    -+ commit refs/heads/path-eol
>    ++ commit refs/heads/S-path-eol
>     + mark :303
>     + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
>     + data <<COMMIT
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     + from :302
>     + D '"$path"'
>     +
>    -+ commit refs/heads/path-eol
>    ++ commit refs/heads/S-path-eol
>     + mark :304
>     + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
>     + data <<COMMIT
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     + from :301
>     + C hello.c '"$path"'
>     +
>    -+ commit refs/heads/path-eol
>    ++ commit refs/heads/S-path-eol
>     + mark :305
>     + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
>     + data <<COMMIT
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     + git ls-tree $commit_r >tree_r.out &&
>     + test_cmp tree_r.exp tree_r.out &&
>     +
>    -+ test_cmp out tree_r.exp &&
>    -+
>    -+ git branch -D path-eol
>    ++ test_cmp out tree_r.exp
>     + '
>     +}
>     +
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     +#
>     +# Valid paths before a space: filecopy (source) and filerename (source).
>     +#
>    -+# commit :301 from root -- modify $path
>    ++# commit :301 from root -- modify $path (for setup)
>     +# commit :302 from :301 -- copy $path hello2.c
>     +# commit :303 from :301 -- rename $path hello2.c
>     +#
>     +test_path_space_success () {
>    -+ test="$1" path="$2" unquoted_path="$3"
>    ++ local test="$1" path="$2" unquoted_path="$3"
>     + test_expect_success "S: paths before space with $test must work" '
>    ++ test_when_finished "git branch -D S-path-space" &&
>    ++
>     + git fast-import --export-marks=marks.out <<-EOF 2>err &&
>     + blob
>     + mark :401
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     + hello world
>     + BLOB
>     +
>    -+ commit refs/heads/path-space
>    ++ commit refs/heads/S-path-space
>     + mark :301
>     + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
>     + data <<COMMIT
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     + COMMIT
>     + M 100644 :401 '"$path"'
>     +
>    -+ commit refs/heads/path-space
>    ++ commit refs/heads/S-path-space
>     + mark :302
>     + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
>     + data <<COMMIT
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     + from :301
>     + C '"$path"' hello2.c
>     +
>    -+ commit refs/heads/path-space
>    ++ commit refs/heads/S-path-space
>     + mark :303
>     + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
>     + data <<COMMIT
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     +
>     + printf "100644 blob $blob\thello2.c\n" >tree_r.exp &&
>     + git ls-tree $commit_r >tree_r.out &&
>    -+ test_cmp tree_r.exp tree_r.out &&
>    -+
>    -+ git branch -D path-space
>    ++ test_cmp tree_r.exp tree_r.out
>     + '
>     +}
>     +
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     +# of <path> in the grammar against all error kinds.
>     +#
>     +test_path_fail () {
>    -+ what="$1" path="$2" err_grep="$3"
>    ++ local change="$1" what="$2" prefix="$3" path="$4" suffix="$5" err_grep="$6"
>     + test_expect_success "S: $change with $what must fail" '
>     + test_must_fail git fast-import <<-EOF 2>err &&
>     + blob
>    @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must
>     +}
>     +
>     +test_path_base_fail () {
>    -+ test_path_fail 'unclosed " in '"$field"          '"hello.c'    "Invalid $field"
>    -+ test_path_fail "invalid escape in quoted $field" '"hello\xff"' "Invalid $field"
>    ++ local change="$1" prefix="$2" field="$3" suffix="$4"
>    ++ test_path_fail "$change" 'unclosed " in '"$field"          "$prefix" '"hello.c'    "$suffix" "Invalid $field"
>    ++ test_path_fail "$change" "invalid escape in quoted $field" "$prefix" '"hello\xff"' "$suffix" "Invalid $field"
>     +}
>     +test_path_eol_quoted_fail () {
>    -+ test_path_base_fail
>    -+ test_path_fail "garbage after quoted $field" '"hello.c"x' "Garbage after $field"
>    -+ test_path_fail "space after quoted $field"   '"hello.c" ' "Garbage after $field"
>    ++ local change="$1" prefix="$2" field="$3" suffix="$4"
>    ++ test_path_base_fail "$change" "$prefix" "$field" "$suffix"
>    ++ test_path_fail "$change" "garbage after quoted $field" "$prefix" '"hello.c"x' "$suffix" "Garbage after $field"
>    ++ test_path_fail "$change" "space after quoted $field"   "$prefix" '"hello.c" ' "$suffix" "Garbage after $field"
>     +}
>     +test_path_eol_fail () {
>    -+ test_path_eol_quoted_fail
>    -+ test_path_fail 'empty unquoted path' '' "Missing $field"
>    ++ local change="$1" prefix="$2" field="$3" suffix="$4"
>    ++ test_path_eol_quoted_fail "$change" "$prefix" "$field" "$suffix"
>     +}
>     +test_path_space_fail () {
>    -+ test_path_base_fail
>    -+ test_path_fail 'empty unquoted path' '' "Missing $field"
>    -+ test_path_fail "missing space after quoted $field" '"hello.c"x' "Missing space after $field"
>    ++ local change="$1" prefix="$2" field="$3" suffix="$4"
>    ++ test_path_base_fail "$change" "$prefix" "$field" "$suffix"
>    ++ test_path_fail "$change" "missing space after quoted $field" "$prefix" '"hello.c"x' "$suffix" "Missing space after $field"
>     +}
>     +
>    -+change=filemodify       prefix='M 100644 :1 ' field=path   suffix=''         test_path_eol_fail
>    -+change=filedelete       prefix='D '           field=path   suffix=''         test_path_eol_fail
>    -+change=filecopy         prefix='C '           field=source suffix=' world.c' test_path_space_fail
>    -+change=filecopy         prefix='C hello.c '   field=dest   suffix=''         test_path_eol_fail
>    -+change=filerename       prefix='R '           field=source suffix=' world.c' test_path_space_fail
>    -+change=filerename       prefix='R hello.c '   field=dest   suffix=''         test_path_eol_fail
>    -+change='ls (in commit)' prefix='ls :2 '       field=path   suffix=''         test_path_eol_fail
>    ++test_path_eol_fail   filemodify       'M 100644 :1 ' path   ''
>    ++test_path_eol_fail   filedelete       'D '           path   ''
>    ++test_path_space_fail filecopy         'C '           source ' world.c'
>    ++test_path_eol_fail   filecopy         'C hello.c '   dest   ''
>    ++test_path_space_fail filerename       'R '           source ' world.c'
>    ++test_path_eol_fail   filerename       'R hello.c '   dest   ''
>    ++test_path_eol_fail   'ls (in commit)' 'ls :2 '       path   ''
>     +
>    -+# When 'ls' has no <tree-ish>, the <path> must be quoted.
>    -+change='ls (without tree-ish in commit)' prefix='ls ' field=path suffix='' \
>    -+test_path_eol_quoted_fail &&
>    -+test_path_fail 'empty unquoted path' '' "Invalid dataref"
>    ++# When 'ls' has no <dataref>, the <path> must be quoted.
>    ++test_path_eol_quoted_fail 'ls (without dataref in commit)' 'ls ' path ''
>     +
>      ###
>      ### series T (ls)
> 2:  a2aca9f9e6 ! 2:  82a6f53c13 fast-import: directly use strbufs for paths
>    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
>       die("Missing space after SHA1: %s", command_buf.buf);
>       }
> 
>    +- strbuf_reset(&uq);
>     - parse_path_eol(&uq, p, "path");
>     - p = uq.buf;
>    ++ strbuf_reset(&path);
>     + parse_path_eol(&path, p, "path");
> 
>       /* Git does not track empty, non-toplevel directories. */
>    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
>     - static struct strbuf uq = STRBUF_INIT;
>     + static struct strbuf path = STRBUF_INIT;
> 
>    +- strbuf_reset(&uq);
>     - parse_path_eol(&uq, p, "path");
>     - p = uq.buf;
>     - tree_content_remove(&b->branch_tree, p, NULL, 1);
>    ++ strbuf_reset(&path);
>     + parse_path_eol(&path, p, "path");
>     + tree_content_remove(&b->branch_tree, path.buf, NULL, 1);
>      }
>    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
>     + static struct strbuf dest = STRBUF_INIT;
>       struct tree_entry leaf;
> 
>    +- strbuf_reset(&s_uq);
>     - parse_path_space(&s_uq, p, &p, "source");
>    -- parse_path_eol(&d_uq, p, "dest");
>     - s = s_uq.buf;
>    -- d = d_uq.buf;
>    ++ strbuf_reset(&source);
>     + parse_path_space(&source, p, &p, "source");
>    +
>    + if (!p)
>    + die("Missing dest: %s", command_buf.buf);
>    +- strbuf_reset(&d_uq);
>    +- parse_path_eol(&d_uq, p, "dest");
>    +- d = d_uq.buf;
>    ++ strbuf_reset(&dest);
>     + parse_path_eol(&dest, p, "dest");
> 
>       memset(&leaf, 0, sizeof(leaf));
>    @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b
>       &leaf.versions[1].oid,
>       leaf.versions[1].mode,
>       leaf.tree);
>    -@@ builtin/fast-import.c: static void parse_ls(const char *p, struct branch *b)
>    +@@ builtin/fast-import.c: static void print_ls(int mode, const unsigned char *hash, const char *path)
>    +
>    + static void parse_ls(const char *p, struct branch *b)
>      {
>    - struct tree_entry *root = NULL;
>    - struct tree_entry leaf = {NULL};
>     - static struct strbuf uq = STRBUF_INIT;
>     + static struct strbuf path = STRBUF_INIT;
>    + struct tree_entry *root = NULL;
>    + struct tree_entry leaf = {NULL};
> 
>    - /* ls SP (<tree-ish> SP)? <path> */
>    - if (*p == '"') {
>     @@ builtin/fast-import.c: static void parse_ls(const char *p, struct branch *b)
>       root->versions[1].mode = S_IFDIR;
>       load_tree(root);
>       }
>    +- strbuf_reset(&uq);
>     - parse_path_eol(&uq, p, "path");
>     - p = uq.buf;
>     - tree_content_get(root, p, &leaf, 1);
>    ++ strbuf_reset(&path);
>     + parse_path_eol(&path, p, "path");
>     + tree_content_get(root, path.buf, &leaf, 1);
>       /*
> 3:  ecaf4883d1 < -:  ---------- fast-import: release unfreed strbufs
> -:  ---------- > 3:  893bbf5e73 fast-import: allow unquoted empty path for root
> 4:  058a38416a ! 4:  cb05a184e6 fast-import: remove dead strbuf
>    @@ Metadata
>      ## Commit message ##
>         fast-import: remove dead strbuf
> 
>    -    The strbuf in `note_change_n` has been unused since the function was
>    +    The strbuf in `note_change_n` is to copy the remainder of `p` before
>    +    potentially invalidating it when reading the next line. However, `p` is
>    +    not used after that point. It has been unused since the function was
>         created in a8dd2e7d2b (fast-import: Add support for importing commit
>         notes, 2009-10-09) and looks to be a fossil from adapting
>    -    `note_change_m`. Remove it.
>    +    `file_change_m`. Remove it.
> 
>         Signed-off-by: Thalia Archibald <thalia@archibald.dev>
> 
> 5:  a5e8df0759 < -:  ---------- fast-import: document C-style escapes for paths
> 6:  9792940ba9 < -:  ---------- fast-import: forbid escaped NUL in paths
> -:  ---------- > 5:  1f34b632d7 fast-import: improve documentation for unquoted paths
> -:  ---------- > 6:  82a4da68af fast-import: document C-style escapes for paths
> -:  ---------- > 7:  c087c6a860 fast-import: forbid escaped NUL in paths
> -:  ---------- > 8:  a503c55b83 fast-import: make comments more precise
> --
> 2.44.0

Eric Sunshine April 7, 2024, 11:46 p.m. UTC | #2

On Sun, Apr 7, 2024 at 5:19 PM Thalia Archibald <thalia@archibald.dev> wrote:
> On Apr 1, 2024, at 02:02, Thalia Archibald wrote:
> >> fast-import has subtle differences in how it parses file paths between each
> >> occurrence of <path> in the grammar. Many errors are suppressed or not checked,
> >> which could lead to silent data corruption. A particularly bad case is when a
> >> front-end sent escapes that Git doesn't recognize (e.g., hex escapes are not
> >> supported), it would be treated as literal bytes instead of a quoted string.
> >>
> >> Bring path parsing into line with the documented behavior and improve
> >> documentation to fill in missing details.
> >
> > Thanks for the review, Patrick. I've made several changes, which I think address
> > your feedback. What's the protocol for adding `Reviewed-by`? Since I don't know
> > whether I, you, or Junio add it, I've refrained from attaching your name to
> > these patches.
>
> Hello! Friendly ping here. It’s been a week since the last emails for this patch
> set, so I’d like to check in on the status.

Pinging is certainly the correct thing to do.

Regarding `Reviewed-by:`: that trailer doesn't mean that someone
merely read and commented on a patch. Instead, it's explicitly _given_
by a reviewer to indicate that the reviewer has thoroughly reviewed
the patch set and is confident that it is ready to be merged into the
project, at which point Junio typically adds the `Reviewed-by:`.

> As a new contributor to the project, I don’t yet have a full view of the
> contribution flow, aside from what I’ve read. I suspect the latency is because I
> may not have cc’d all the area experts. (When I sent v1, I used separate Cc
> lines in send-email --compose, but it dropped all but the last. After Patrick
> reviewed it, I figured I could leave the cc list as-is for v2, assuming I’d get
> another review.) I’ve now cc’d everyone listed by contrib/contacts, as well as
> the maintainer. For anyone not a part of the earlier discussion, the latest
> version is at https://lore.kernel.org/git/cover.1711960552.git.thalia@archibald.dev/.
> If that’s not the problem, I’d be keen to know what I could do better.

Lack of response may be due to the series being overlooked, or it
could mean that nobody has any particular interest in the changes
(which is not to say that there is anything wrong with the changes),
or that people are simply busy elsewhere. It could also mean that this
reroll is good enough and reviewers have nothing else to add. So,
cc'ing potentially interested people makes sense.

> I have several more patch sets in the works, that I’ve held back on sending
> until this one is finished, in case I’ve been doing something wrong. I want to
> move forward. Thank you for your time.

If the additional patch sets are unrelated to this patch set, then I
don't see a reason to hold them back. Feel free to send them. Even if
they are related to this patch set, you may still want to send them.
After all, doing so may get the ball rolling again on this patch set.

Patrick Steinhardt April 8, 2024, 6:25 a.m. UTC | #3

On Sun, Apr 07, 2024 at 07:46:52PM -0400, Eric Sunshine wrote:
> On Sun, Apr 7, 2024 at 5:19 PM Thalia Archibald <thalia@archibald.dev> wrote:
> > On Apr 1, 2024, at 02:02, Thalia Archibald wrote:
> > >> fast-import has subtle differences in how it parses file paths between each
> > >> occurrence of <path> in the grammar. Many errors are suppressed or not checked,
> > >> which could lead to silent data corruption. A particularly bad case is when a
> > >> front-end sent escapes that Git doesn't recognize (e.g., hex escapes are not
> > >> supported), it would be treated as literal bytes instead of a quoted string.
> > >>
> > >> Bring path parsing into line with the documented behavior and improve
> > >> documentation to fill in missing details.
> > >
> > > Thanks for the review, Patrick. I've made several changes, which I think address
> > > your feedback. What's the protocol for adding `Reviewed-by`? Since I don't know
> > > whether I, you, or Junio add it, I've refrained from attaching your name to
> > > these patches.
> >
> > Hello! Friendly ping here. It’s been a week since the last emails for this patch
> > set, so I’d like to check in on the status.
> 
> Pinging is certainly the correct thing to do.
> 
> Regarding `Reviewed-by:`: that trailer doesn't mean that someone
> merely read and commented on a patch. Instead, it's explicitly _given_
> by a reviewer to indicate that the reviewer has thoroughly reviewed
> the patch set and is confident that it is ready to be merged into the
> project, at which point Junio typically adds the `Reviewed-by:`.
> 
> > As a new contributor to the project, I don’t yet have a full view of the
> > contribution flow, aside from what I’ve read. I suspect the latency is because I
> > may not have cc’d all the area experts. (When I sent v1, I used separate Cc
> > lines in send-email --compose, but it dropped all but the last. After Patrick
> > reviewed it, I figured I could leave the cc list as-is for v2, assuming I’d get
> > another review.) I’ve now cc’d everyone listed by contrib/contacts, as well as
> > the maintainer. For anyone not a part of the earlier discussion, the latest
> > version is at https://lore.kernel.org/git/cover.1711960552.git.thalia@archibald.dev/.
> > If that’s not the problem, I’d be keen to know what I could do better.
> 
> Lack of response may be due to the series being overlooked, or it
> could mean that nobody has any particular interest in the changes
> (which is not to say that there is anything wrong with the changes),
> or that people are simply busy elsewhere. It could also mean that this
> reroll is good enough and reviewers have nothing else to add. So,
> cc'ing potentially interested people makes sense.

Yeah, for this patch series I think it's mostly a lack of interest.
Which is too bad, because it does address some real issues with
git-fast-import(1). Part of the problem is also that this area does not
really have an area expert at all -- if you git-shortlog(1) for example
"builtin/fast-import.c" for the last year you will see that it didn't
get much attention at all.

Anyway, I'm currently trying to make it a habit to pick up and review
random patch series that didn't get any attention at all every once in a
while, which is also why I reviewed your first version. I'm taking these
a bit slower though, also in the hope that some initial discussion may
motivate others to chime in, as well. Which may explain why I didn't yet
review your v2.

In any case I do have it in my backlog and will get to it somewhen this
week.

> > I have several more patch sets in the works, that I’ve held back on sending
> > until this one is finished, in case I’ve been doing something wrong. I want to
> > move forward. Thank you for your time.
> 
> If the additional patch sets are unrelated to this patch set, then I
> don't see a reason to hold them back. Feel free to send them. Even if
> they are related to this patch set, you may still want to send them.
> After all, doing so may get the ball rolling again on this patch set.

Agreed. Especially given that this is your first contribution, the
quality of your patch series is quite high. So I don't see much of a
reason to hold back the other patch series in case they are unrelated.

Patrick

Thalia Archibald April 8, 2024, 7:15 a.m. UTC | #4

On Apr 7, 2024, at 23:25, Patrick Steinhardt <ps@pks.im> wrote:
> On Sun, Apr 07, 2024 at 07:46:52PM -0400, Eric Sunshine wrote:
>> 
>> Lack of response may be due to the series being overlooked, or it
>> could mean that nobody has any particular interest in the changes
>> (which is not to say that there is anything wrong with the changes),
>> or that people are simply busy elsewhere. It could also mean that this
>> reroll is good enough and reviewers have nothing else to add. So,
>> cc'ing potentially interested people makes sense.
> 
> Yeah, for this patch series I think it's mostly a lack of interest.
> Which is too bad, because it does address some real issues with
> git-fast-import(1). Part of the problem is also that this area does not
> really have an area expert at all -- if you git-shortlog(1) for example
> "builtin/fast-import.c" for the last year you will see that it didn't
> get much attention at all.

Unfortunately, my upcoming patches will probably suffer the same fate, as
they're mostly fixing parsing issues in fast-import.

> Anyway, I'm currently trying to make it a habit to pick up and review
> random patch series that didn't get any attention at all every once in a
> while, which is also why I reviewed your first version. I'm taking these
> a bit slower though, also in the hope that some initial discussion may
> motivate others to chime in, as well. Which may explain why I didn't yet
> review your v2.
> 
> In any case I do have it in my backlog and will get to it somewhen this
> week.

Thank you!

>>> I have several more patch sets in the works, that I’ve held back on sending
>>> until this one is finished, in case I’ve been doing something wrong. I want to
>>> move forward. Thank you for your time.
>> 
>> If the additional patch sets are unrelated to this patch set, then I
>> don't see a reason to hold them back. Feel free to send them. Even if
>> they are related to this patch set, you may still want to send them.
>> After all, doing so may get the ball rolling again on this patch set.
> 
> Agreed. Especially given that this is your first contribution, the
> quality of your patch series is quite high. So I don't see much of a
> reason to hold back the other patch series in case they are unrelated.

My effort comes from reimplementing fast-import parsing as a Rust library,
following the implementation, not just the documentation, so I’ve noticed many
mismatches between the concrete and abstract grammars. Perhaps it would save
reviewer time to submit those around the same time, so knowledge of fast-import
needs to be evicted and loaded from cache less.

Thalia

Patrick Steinhardt April 8, 2024, 9:07 a.m. UTC | #5

On Mon, Apr 08, 2024 at 07:15:35AM +0000, Thalia Archibald wrote:
> On Apr 7, 2024, at 23:25, Patrick Steinhardt <ps@pks.im> wrote:
> > On Sun, Apr 07, 2024 at 07:46:52PM -0400, Eric Sunshine wrote:
[snip]
> >>> I have several more patch sets in the works, that I’ve held back on sending
> >>> until this one is finished, in case I’ve been doing something wrong. I want to
> >>> move forward. Thank you for your time.
> >> 
> >> If the additional patch sets are unrelated to this patch set, then I
> >> don't see a reason to hold them back. Feel free to send them. Even if
> >> they are related to this patch set, you may still want to send them.
> >> After all, doing so may get the ball rolling again on this patch set.
> > 
> > Agreed. Especially given that this is your first contribution, the
> > quality of your patch series is quite high. So I don't see much of a
> > reason to hold back the other patch series in case they are unrelated.
> 
> My effort comes from reimplementing fast-import parsing as a Rust library,
> following the implementation, not just the documentation, so I’ve noticed many
> mismatches between the concrete and abstract grammars. Perhaps it would save
> reviewer time to submit those around the same time, so knowledge of fast-import
> needs to be evicted and loaded from cache less.

In this case I think it depends on whether or not these patch series
would conflict with each other. If they do it's preferable to land them
sequentially. If they don't conflict then it should be fine to send
separate patch series and parallelize the review to a certain degree.

Patrick

Junio C Hamano April 8, 2024, 2:52 p.m. UTC | #6

Patrick Steinhardt <ps@pks.im> writes:

>> Regarding `Reviewed-by:`: that trailer doesn't mean that someone
>> merely read and commented on a patch. Instead, it's explicitly _given_
>> by a reviewer to indicate that the reviewer has thoroughly reviewed
>> the patch set and is confident that it is ready to be merged into the
>> project, at which point Junio typically adds the `Reviewed-by:`.

Just to clarify, "adds" means "cuts the reviewed-by lines written by
the reviewers in their review messages and pasts them into the
commit message, either amending the commits after the fact or while
applying them".  I do _not_ judge from the sideline the quality of
reviews given by others and say "yeah, that is an adequate review,
so I'll forge _their_ reviewed-by trailer".

> Anyway, I'm currently trying to make it a habit to pick up and review
> random patch series that didn't get any attention at all every once in a
> while, ...

I have noticed your effort and appreciated it very much.  I wish
there were many more others who do the same.  I of course have been
playing that role for some time as the final catch-all reviewer, but
a single person does not scale.

Thanks.

[v2,0/8] fast-import: tighten parsing of paths

Message

Comments