diff mbox series

[v4,3/5] ref: add more strict checks for regular refs

Message ID ZuRzxyjAI3tp4uLK@ArchLinux (mailing list archive)
State New
Headers show
Series [v4,1/5] ref: initialize "fsck_ref_report" with zero | expand

Commit Message

shejialuo Sept. 13, 2024, 5:17 p.m. UTC
We have already used "parse_loose_ref_contents" function to check
whether the ref content is valid in files backend. However, by
using "parse_loose_ref_contents", we allow the ref's content to end with
garbage or without a newline.

Even though we never create such loose refs ourselves, we have accepted
such loose refs. So, it is entirely possible that some third-party tools
may rely on such loose refs being valid. We should not report an error
fsck message at current. We should notify the users about such
"curiously formatted" loose refs so that adequate care is taken before
we decide to tighten the rules in the future.

And it's not suitable either to report a warn fsck message to the user.
We don't yet want the "--strict" flag that controls this bit to end up
generating errors for such weirdly-formatted reference contents, as we
first want to assess whether this retroactive tightening will cause
issues for any tools out there. It may cause compatibility issues which
may break the repository. So we add the following two fsck infos to
represent the situation where the ref content ends without newline or
has trailing garbages:

1. refMissingNewline(INFO): A ref does not end with newline. This will
   be considered an error in the future.
2. trailingRefContent(INFO): A ref has trailing content. This will be
   considered an error in the future.

It might appear that we can't provide the user with any warnings by
using FSCK_INFO. However, in "fsck.c::fsck_vreport", we will convert
FSCK_INFO to FSCK_WARN and we can still warn the user about these
situations when using "git refs verify" without introducing
compatibility issues.

Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: shejialuo <shejialuo@gmail.com>
---
 Documentation/fsck-msgids.txt |  8 +++++
 fsck.h                        |  2 ++
 refs.c                        |  2 +-
 refs/files-backend.c          | 27 ++++++++++++++--
 refs/refs-internal.h          |  2 +-
 t/t0602-reffiles-fsck.sh      | 60 +++++++++++++++++++++++++++++++++++
 6 files changed, 96 insertions(+), 5 deletions(-)

Comments

Junio C Hamano Sept. 18, 2024, 7:39 p.m. UTC | #1
shejialuo <shejialuo@gmail.com> writes:

> +`refMissingNewline`::
> +	(INFO) A ref does not end with newline. This will be
> +	considered an error in the future.

It is ONLY files backend's loose-ref representation to store the
object name that is the value of the ref as hexadecimal text
terminated with a newline.  With packed backend, even if the file
ends with an incomplete line, it would be confusing to say that such
lack of terminating LF is associated with a particular ref.  With
reftable backend, the object name may not even be hexadecimal but
binary without any terminating LF.

At least you should say "A loose ref file that does not end with...",
because a ref NEVER ends or contains newline, and what you are
expecting to be terminated with LF is not even a ref, but the value
of it.

Also, isn't it too strong to say "will be" without giving any
further information, like:

    As valid implementations of Git never created such a loose ref
    file, it may become an error in the future.  Report to the
    git@vger.kernel.org mailing list if you see this error, as we
    need to know what tools created such a file.

or something?

The same comment applies to the next entry.

> @@ -619,6 +619,10 @@ int parse_loose_ref_contents(const struct git_hash_algo *algop,
>  		*failure_errno = EINVAL;
>  		return -1;
>  	}
> +
> +	if (trailing)
> +		*trailing = p;
> +
>  	return 0;

In the pre-context of this hunk, if parse_oid_hex_algoph() failed to
recognise the initial segment of buf as a valid hexadecimal object
name, it would have already returned, so we know 'p' is always valid
here.  It is the byte that comes immediately after the hexadecimal
object name.

OK.

>  	if (parse_loose_ref_contents(ref_store->repo->hash_algo,
>  				     ref_content.buf, &oid, &referent,
> -				     &type, &failure_errno)) {
> +				     &type, &trailing, &failure_errno)) {
>  		ret = fsck_report_ref(o, &report,
>  				      FSCK_MSG_BAD_REF_CONTENT,
>  				      "invalid ref content");
>  		goto cleanup;
>  	}
>  
> +	if (!(type & REF_ISSYMREF)) {

Just like we punted for S_ISLNK() in an earlier step, we for now
deal with regular refs in this step.  OK.

> +		if (!*trailing) {
> +			ret = fsck_report_ref(o, &report,
> +					      FSCK_MSG_REF_MISSING_NEWLINE,
> +					      "missing newline");
> +			goto cleanup;
> +		}
> +
> +		if (*trailing != '\n' || *(trailing + 1)) {
> +			ret = fsck_report_ref(o, &report,
> +					      FSCK_MSG_TRAILING_REF_CONTENT,
> +					      "trailing garbage in ref");
> +			goto cleanup;
> +		}

Not limited to this patch, but isn't fsck_report_ref() misdesigned,
or is it just they are used poorly in these patches?  In these two
callsites, the message string parameter does not give any more
information than what the FSCK_MSG_* enum gives.

In fact, MSG_REF_MISSING_NEWLINE at least says that the complaint is
about refs, but "missing newline" does not even say from what the
newline is missing.  For TRAILING_REF_CONTENT, people may expect to
see what garbage follows the expected contents, but that information
(i.e. contents of *trailing) is lost here.
shejialuo Sept. 22, 2024, 3:06 p.m. UTC | #2
On Wed, Sep 18, 2024 at 12:39:13PM -0700, Junio C Hamano wrote:
> shejialuo <shejialuo@gmail.com> writes:
> 
> > +`refMissingNewline`::
> > +	(INFO) A ref does not end with newline. This will be
> > +	considered an error in the future.
> 
> It is ONLY files backend's loose-ref representation to store the
> object name that is the value of the ref as hexadecimal text
> terminated with a newline.  With packed backend, even if the file
> ends with an incomplete line, it would be confusing to say that such
> lack of terminating LF is associated with a particular ref.  With
> reftable backend, the object name may not even be hexadecimal but
> binary without any terminating LF.
> 
> At least you should say "A loose ref file that does not end with...",
> because a ref NEVER ends or contains newline, and what you are
> expecting to be terminated with LF is not even a ref, but the value
> of it.
> 

Thanks, I will improve this in the next version.

> Also, isn't it too strong to say "will be" without giving any
> further information, like:
> 
>     As valid implementations of Git never created such a loose ref
>     file, it may become an error in the future.  Report to the
>     git@vger.kernel.org mailing list if you see this error, as we
>     need to know what tools created such a file.
> 
> or something?
> 

This is nice. I know the intention here.

> > +		if (!*trailing) {
> > +			ret = fsck_report_ref(o, &report,
> > +					      FSCK_MSG_REF_MISSING_NEWLINE,
> > +					      "missing newline");
> > +			goto cleanup;
> > +		}
> > +
> > +		if (*trailing != '\n' || *(trailing + 1)) {
> > +			ret = fsck_report_ref(o, &report,
> > +					      FSCK_MSG_TRAILING_REF_CONTENT,
> > +					      "trailing garbage in ref");
> > +			goto cleanup;
> > +		}
> 
> Not limited to this patch, but isn't fsck_report_ref() misdesigned,
> or is it just they are used poorly in these patches?  In these two
> callsites, the message string parameter does not give any more
> information than what the FSCK_MSG_* enum gives.
> 
> In fact, MSG_REF_MISSING_NEWLINE at least says that the complaint is
> about refs, but "missing newline" does not even say from what the
> newline is missing.  For TRAILING_REF_CONTENT, people may expect to
> see what garbage follows the expected contents, but that information
> (i.e. contents of *trailing) is lost here.

I agree with you here, I use way too general words to describe what
happens. I will improve this. Actually, I feel hard to find words for
"MSG_REF_MISSING_NEWLINE". I think we should say:

	LF should be at the end of the file.

Thanks,
Jialuo
Junio C Hamano Sept. 22, 2024, 4:48 p.m. UTC | #3
shejialuo <shejialuo@gmail.com> writes:

> I agree with you here, I use way too general words to describe what
> happens. I will improve this. Actually, I feel hard to find words for
> "MSG_REF_MISSING_NEWLINE". I think we should say:
>
> 	LF should be at the end of the file.

Giving a human-readable message when we have an enum can be done at
a lot higher layer with the current way the fsck_report_ref()
function is used (i.e. in that function, not by its callers).

That is what I meant by "misdesigned"---if one message enum always
corresponds to one human-readable message, there is not much point
in forcing callers to supply both, is there?
diff mbox series

Patch

diff --git a/Documentation/fsck-msgids.txt b/Documentation/fsck-msgids.txt
index 22c385ea22..8827137ef0 100644
--- a/Documentation/fsck-msgids.txt
+++ b/Documentation/fsck-msgids.txt
@@ -173,6 +173,14 @@ 
 `nullSha1`::
 	(WARN) Tree contains entries pointing to a null sha1.
 
+`refMissingNewline`::
+	(INFO) A ref does not end with newline. This will be
+	considered an error in the future.
+
+`trailingRefContent`::
+	(INFO) A ref has trailing content. This will be
+	considered an error in the future.
+
 `treeNotSorted`::
 	(ERROR) A tree is not properly sorted.
 
diff --git a/fsck.h b/fsck.h
index 0d99a87911..b85072df57 100644
--- a/fsck.h
+++ b/fsck.h
@@ -85,6 +85,8 @@  enum fsck_msg_type {
 	FUNC(MAILMAP_SYMLINK, INFO) \
 	FUNC(BAD_TAG_NAME, INFO) \
 	FUNC(MISSING_TAGGER_ENTRY, INFO) \
+	FUNC(REF_MISSING_NEWLINE, INFO) \
+	FUNC(TRAILING_REF_CONTENT, INFO) \
 	/* ignored (elevated when requested) */ \
 	FUNC(EXTRA_HEADER_ENTRY, IGNORE)
 
diff --git a/refs.c b/refs.c
index 74de3d3009..5e74881945 100644
--- a/refs.c
+++ b/refs.c
@@ -1758,7 +1758,7 @@  static int refs_read_special_head(struct ref_store *ref_store,
 	}
 
 	result = parse_loose_ref_contents(ref_store->repo->hash_algo, content.buf,
-					  oid, referent, type, failure_errno);
+					  oid, referent, type, NULL, failure_errno);
 
 done:
 	strbuf_release(&full_path);
diff --git a/refs/files-backend.c b/refs/files-backend.c
index b1ed2e5c04..df4ce270ae 100644
--- a/refs/files-backend.c
+++ b/refs/files-backend.c
@@ -560,7 +560,7 @@  static int read_ref_internal(struct ref_store *ref_store, const char *refname,
 	buf = sb_contents.buf;
 
 	ret = parse_loose_ref_contents(ref_store->repo->hash_algo, buf,
-				       oid, referent, type, &myerr);
+				       oid, referent, type, NULL, &myerr);
 
 out:
 	if (ret && !myerr)
@@ -597,7 +597,7 @@  static int files_read_symbolic_ref(struct ref_store *ref_store, const char *refn
 int parse_loose_ref_contents(const struct git_hash_algo *algop,
 			     const char *buf, struct object_id *oid,
 			     struct strbuf *referent, unsigned int *type,
-			     int *failure_errno)
+			     const char **trailing, int *failure_errno)
 {
 	const char *p;
 	if (skip_prefix(buf, "ref:", &buf)) {
@@ -619,6 +619,10 @@  int parse_loose_ref_contents(const struct git_hash_algo *algop,
 		*failure_errno = EINVAL;
 		return -1;
 	}
+
+	if (trailing)
+		*trailing = p;
+
 	return 0;
 }
 
@@ -3439,6 +3443,7 @@  static int files_fsck_refs_content(struct ref_store *ref_store,
 	struct strbuf referent = STRBUF_INIT;
 	struct strbuf refname = STRBUF_INIT;
 	struct fsck_ref_report report = {0};
+	const char *trailing = NULL;
 	unsigned int type = 0;
 	int failure_errno = 0;
 	struct object_id oid;
@@ -3458,13 +3463,29 @@  static int files_fsck_refs_content(struct ref_store *ref_store,
 
 	if (parse_loose_ref_contents(ref_store->repo->hash_algo,
 				     ref_content.buf, &oid, &referent,
-				     &type, &failure_errno)) {
+				     &type, &trailing, &failure_errno)) {
 		ret = fsck_report_ref(o, &report,
 				      FSCK_MSG_BAD_REF_CONTENT,
 				      "invalid ref content");
 		goto cleanup;
 	}
 
+	if (!(type & REF_ISSYMREF)) {
+		if (!*trailing) {
+			ret = fsck_report_ref(o, &report,
+					      FSCK_MSG_REF_MISSING_NEWLINE,
+					      "missing newline");
+			goto cleanup;
+		}
+
+		if (*trailing != '\n' || *(trailing + 1)) {
+			ret = fsck_report_ref(o, &report,
+					      FSCK_MSG_TRAILING_REF_CONTENT,
+					      "trailing garbage in ref");
+			goto cleanup;
+		}
+	}
+
 cleanup:
 	strbuf_release(&refname);
 	strbuf_release(&ref_content);
diff --git a/refs/refs-internal.h b/refs/refs-internal.h
index 2313c830d8..73b05f971b 100644
--- a/refs/refs-internal.h
+++ b/refs/refs-internal.h
@@ -715,7 +715,7 @@  struct ref_store {
 int parse_loose_ref_contents(const struct git_hash_algo *algop,
 			     const char *buf, struct object_id *oid,
 			     struct strbuf *referent, unsigned int *type,
-			     int *failure_errno);
+			     const char **trailing, int *failure_errno);
 
 /*
  * Fill in the generic part of refs and add it to our collection of
diff --git a/t/t0602-reffiles-fsck.sh b/t/t0602-reffiles-fsck.sh
index a1205b3a3b..a06ad044f2 100755
--- a/t/t0602-reffiles-fsck.sh
+++ b/t/t0602-reffiles-fsck.sh
@@ -101,6 +101,54 @@  test_expect_success 'regular ref content should be checked (individual)' '
 	git refs verify 2>err &&
 	test_must_be_empty err &&
 
+	printf "%s" "$(git rev-parse main)" >$branch_dir_prefix/branch-no-newline &&
+	git refs verify 2>err &&
+	cat >expect <<-EOF &&
+	warning: refs/heads/branch-no-newline: refMissingNewline: missing newline
+	EOF
+	rm $branch_dir_prefix/branch-no-newline &&
+	test_cmp expect err &&
+
+	printf "%s garbage" "$(git rev-parse main)" >$branch_dir_prefix/branch-garbage &&
+	git refs verify 2>err &&
+	cat >expect <<-EOF &&
+	warning: refs/heads/branch-garbage: trailingRefContent: trailing garbage in ref
+	EOF
+	rm $branch_dir_prefix/branch-garbage &&
+	test_cmp expect err &&
+
+	printf "%s\n\n\n" "$(git rev-parse main)" >$tag_dir_prefix/tag-garbage-1 &&
+	git refs verify 2>err &&
+	cat >expect <<-EOF &&
+	warning: refs/tags/tag-garbage-1: trailingRefContent: trailing garbage in ref
+	EOF
+	rm $tag_dir_prefix/tag-garbage-1 &&
+	test_cmp expect err &&
+
+	printf "%s\n\n\n  garbage" "$(git rev-parse main)" >$tag_dir_prefix/tag-garbage-2 &&
+	git refs verify 2>err &&
+	cat >expect <<-EOF &&
+	warning: refs/tags/tag-garbage-2: trailingRefContent: trailing garbage in ref
+	EOF
+	rm $tag_dir_prefix/tag-garbage-2 &&
+	test_cmp expect err &&
+
+	printf "%s    garbage\n\na" "$(git rev-parse main)" >$tag_dir_prefix/tag-garbage-3 &&
+	git refs verify 2>err &&
+	cat >expect <<-EOF &&
+	warning: refs/tags/tag-garbage-3: trailingRefContent: trailing garbage in ref
+	EOF
+	rm $tag_dir_prefix/tag-garbage-3 &&
+	test_cmp expect err &&
+
+	printf "%s garbage" "$(git rev-parse main)" >$tag_dir_prefix/tag-garbage-4 &&
+	test_must_fail git -c fsck.trailingRefContent=error refs verify 2>err &&
+	cat >expect <<-EOF &&
+	error: refs/tags/tag-garbage-4: trailingRefContent: trailing garbage in ref
+	EOF
+	rm $tag_dir_prefix/tag-garbage-4 &&
+	test_cmp expect err &&
+
 	printf "%sx" "$(git rev-parse main)" >$tag_dir_prefix/tag-bad-1 &&
 	test_must_fail git refs verify 2>err &&
 	cat >expect <<-EOF &&
@@ -135,6 +183,12 @@  test_expect_success 'regular ref content should be checked (aggregate)' '
 	test_commit default &&
 	mkdir -p "$branch_dir_prefix/a/b" &&
 
+	printf "%s" "$(git rev-parse main)" >$branch_dir_prefix/branch-no-newline &&
+	printf "%s garbage" "$(git rev-parse main)" >$branch_dir_prefix/branch-garbage &&
+	printf "%s\n\n\n" "$(git rev-parse main)" >$tag_dir_prefix/tag-garbage-1 &&
+	printf "%s\n\n\n  garbage" "$(git rev-parse main)" >$tag_dir_prefix/tag-garbage-2 &&
+	printf "%s    garbage\n\na" "$(git rev-parse main)" >$tag_dir_prefix/tag-garbage-3 &&
+	printf "%s garbage" "$(git rev-parse main)" >$tag_dir_prefix/tag-garbage-4 &&
 	printf "%sx" "$(git rev-parse main)" >$tag_dir_prefix/tag-bad-1 &&
 	printf "xfsazqfxcadas" >$tag_dir_prefix/tag-bad-2 &&
 	printf "xfsazqfxcadas" >$branch_dir_prefix/a/b/branch-bad &&
@@ -144,6 +198,12 @@  test_expect_success 'regular ref content should be checked (aggregate)' '
 	error: refs/heads/a/b/branch-bad: badRefContent: invalid ref content
 	error: refs/tags/tag-bad-1: badRefContent: invalid ref content
 	error: refs/tags/tag-bad-2: badRefContent: invalid ref content
+	warning: refs/heads/branch-garbage: trailingRefContent: trailing garbage in ref
+	warning: refs/heads/branch-no-newline: refMissingNewline: missing newline
+	warning: refs/tags/tag-garbage-1: trailingRefContent: trailing garbage in ref
+	warning: refs/tags/tag-garbage-2: trailingRefContent: trailing garbage in ref
+	warning: refs/tags/tag-garbage-3: trailingRefContent: trailing garbage in ref
+	warning: refs/tags/tag-garbage-4: trailingRefContent: trailing garbage in ref
 	EOF
 	sort err >sorted_err &&
 	test_cmp expect sorted_err