diff mbox series

[5/5] patch-id: tighten code to detect the patch header

Message ID 20240621231826.3280338-6-gitster@pobox.com (mailing list archive)
State New, archived
Headers show
Series Tighten patch header parsing in patch-id | expand

Commit Message

Junio C Hamano June 21, 2024, 11:18 p.m. UTC
The get_one_patchid() function unconditionally takes a line that
matches the patch header (namely, a line that begins with a full
object name, possibly prefixed by "commit" or "From" plus a space)
as the beginning of a patch.  Even when it is *not* looking for one
(namely, when the previous call found the patch header and returned,
and then we are called again to skip the log message and process the
patch whose header was found by the previous invocation).

As a consequence, a line in the commit log message that begins with
one of these patterns can be mistaken to start another patch, with
current message entirely skipped (because we haven't even reached
the patch at all).

Allow the caller to tell us if it called us already and saw the
patch header (in which case we shouldn't be looking for another one,
until we see the "diff" part of the patch; instead we simply should
be skipping these lines as part of the commit log message), and skip
the header processing logic when that is the case.  In the helper
function, it also needs to flip this "are we looking for a header?"
bit, once it finished skipping the commit log message and started
processing the patches, as the patch header of the _next_ message is
the only clue in the input that the current patch is done.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/patch-id.c  | 43 ++++++++++++++++++++++++++++++-------------
 t/t4204-patch-id.sh | 17 +++++++++++++++++
 2 files changed, 47 insertions(+), 13 deletions(-)

Comments

Patrick Steinhardt July 29, 2024, 12:07 p.m. UTC | #1
On Fri, Jun 21, 2024 at 04:18:26PM -0700, Junio C Hamano wrote:
> @@ -196,11 +211,13 @@ static void generate_id_list(unsigned flags)
>  	struct strbuf line_buf = STRBUF_INIT;
>  
>  	oidclr(&oid);
> +	flags |= GOPID_FIND_HEADER;
>  	while (!feof(stdin)) {
>  		patchlen = get_one_patchid(&n, &result, &line_buf, flags);
>  		if (patchlen)
>  			flush_current_id(&oid, &result);
>  		oidcpy(&oid, &n);
> +		flags &= ~GOPID_FIND_HEADER;
>  	}

I think I'm missing the obvious. But why don't we have to set
`GOPID_FIND_HEADER` when we have flushed the current patch ID? Is this
because we know that `get_one_patchid()` stops once it finds the next
line starting with a commit? Makes me wonder what happens when there is
non-diff garbage between patches for which we are about to generate
patch IDs.

Patrick
Junio C Hamano July 29, 2024, 8:12 p.m. UTC | #2
Patrick Steinhardt <ps@pks.im> writes:

> On Fri, Jun 21, 2024 at 04:18:26PM -0700, Junio C Hamano wrote:
>> @@ -196,11 +211,13 @@ static void generate_id_list(unsigned flags)
>>  	struct strbuf line_buf = STRBUF_INIT;
>>  
>>  	oidclr(&oid);
>> +	flags |= GOPID_FIND_HEADER;
>>  	while (!feof(stdin)) {
>>  		patchlen = get_one_patchid(&n, &result, &line_buf, flags);
>>  		if (patchlen)
>>  			flush_current_id(&oid, &result);
>>  		oidcpy(&oid, &n);
>> +		flags &= ~GOPID_FIND_HEADER;
>>  	}
>
> I think I'm missing the obvious. But why don't we have to set
> `GOPID_FIND_HEADER` when we have flushed the current patch ID? Is this
> because we know that `get_one_patchid()` stops once it finds the next
> line starting with a commit?

Yup the original control flow is rather convoluted.  The first call
stops when it finds the header that begins the log message part and
returns, but the subsequent calls are to (1) skip the log message
and then (2) parse and hash the diff part, until it finds another
header that begins the log message part of the _next_ patch and
return.  GOPID_FIND_HEADER bit is used to tell the callee when we
haven't found the header (hence we can stop at a line whose
beginning looks like a hash) or the previous round already found the
header and we positively know we are now in the "skip the log
message" phase (hence a line whose beginning looks like a hash is
not a new header).

> Makes me wonder what happens when there is
> non-diff garbage between patches for which we are about to generate
> patch IDs.

"Skip non-diff garbage until we see a patch" is the mechanism used
to skip the log message, so it would be a reasonable thing to skip
such no-diff garbage between patches, no?
Patrick Steinhardt July 30, 2024, 4:55 a.m. UTC | #3
On Mon, Jul 29, 2024 at 01:12:42PM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > On Fri, Jun 21, 2024 at 04:18:26PM -0700, Junio C Hamano wrote:
> >> @@ -196,11 +211,13 @@ static void generate_id_list(unsigned flags)
> >>  	struct strbuf line_buf = STRBUF_INIT;
> >>  
> >>  	oidclr(&oid);
> >> +	flags |= GOPID_FIND_HEADER;
> >>  	while (!feof(stdin)) {
> >>  		patchlen = get_one_patchid(&n, &result, &line_buf, flags);
> >>  		if (patchlen)
> >>  			flush_current_id(&oid, &result);
> >>  		oidcpy(&oid, &n);
> >> +		flags &= ~GOPID_FIND_HEADER;
> >>  	}
> >
> > I think I'm missing the obvious. But why don't we have to set
> > `GOPID_FIND_HEADER` when we have flushed the current patch ID? Is this
> > because we know that `get_one_patchid()` stops once it finds the next
> > line starting with a commit?
> 
> Yup the original control flow is rather convoluted.  The first call
> stops when it finds the header that begins the log message part and
> returns, but the subsequent calls are to (1) skip the log message
> and then (2) parse and hash the diff part, until it finds another
> header that begins the log message part of the _next_ patch and
> return.  GOPID_FIND_HEADER bit is used to tell the callee when we
> haven't found the header (hence we can stop at a line whose
> beginning looks like a hash) or the previous round already found the
> header and we positively know we are now in the "skip the log
> message" phase (hence a line whose beginning looks like a hash is
> not a new header).

Okay.

> > Makes me wonder what happens when there is
> > non-diff garbage between patches for which we are about to generate
> > patch IDs.
> 
> "Skip non-diff garbage until we see a patch" is the mechanism used
> to skip the log message, so it would be a reasonable thing to skip
> such no-diff garbage between patches, no?

Oh, yes, it is reasonable. I just didn't quite figure out the flow of
the above loop when reading through the code. As you say, it is somewhat
convoluted and not all that straight forward.

In any case though, your changes improve readability, so the fact that
things are not quite straight forward is not the fault of this patch
series here.

Patrick
Patrick Steinhardt July 30, 2024, 5:12 a.m. UTC | #4
On Tue, Jul 30, 2024 at 06:55:14AM +0200, Patrick Steinhardt wrote:
> On Mon, Jul 29, 2024 at 01:12:42PM -0700, Junio C Hamano wrote:
> > Patrick Steinhardt <ps@pks.im> writes:
> > > On Fri, Jun 21, 2024 at 04:18:26PM -0700, Junio C Hamano wrote:
> > "Skip non-diff garbage until we see a patch" is the mechanism used
> > to skip the log message, so it would be a reasonable thing to skip
> > such no-diff garbage between patches, no?
> 
> Oh, yes, it is reasonable. I just didn't quite figure out the flow of
> the above loop when reading through the code. As you say, it is somewhat
> convoluted and not all that straight forward.

As far as I can see we didn't have a test for this yet, so I did have a
quick go at it to reassure myself that things work as expected before
and after your change. Feel free to pick it up if you feel like it, or
to just ignore it :)

Patrick

test_expect_success 'patch-id handles diffs with garbage in between' '
	cat >diff-with-garbage <<-\EOF &&
	$(git rev-parse HEAD)
	diff --git a/bar b/bar
	index bdaf90f..31051f6 100644
	--- a/bar
	+++ b/bar
	@@ -2 +2,2 @@
	 b
	+c
	some
	garbage
	lines
	$(git rev-parse HEAD)
	diff --git a/car b/car
	index 00750ed..2ae5e34 100644
	--- a/car
	+++ b/car
	@@ -1 +1,2 @@
	 3
	+d
	EOF

	cat >diff-without-garbage <<-\EOF &&
	$(git rev-parse HEAD)
	diff --git a/bar b/bar
	index bdaf90f..31051f6 100644
	--- a/bar
	+++ b/bar
	@@ -2 +2,2 @@
	 b
	+c
	$(git rev-parse HEAD)
	diff --git a/car b/car
	index 00750ed..2ae5e34 100644
	--- a/car
	+++ b/car
	@@ -1 +1,2 @@
	 3
	+d
	EOF

	for stable in true false
	do
		test_config patchid.stable $stable &&
		git patch-id <diff-with-garbage >id-with-garbage &&
		git patch-id <diff-without-garbage >id-without-garbage &&
		test_line_count -eq 2 id-with-garbage &&
		test_cmp id-with-garbage id-without-garbage ||
		return 1
	done
'
diff mbox series

Patch

diff --git a/builtin/patch-id.c b/builtin/patch-id.c
index a649966f31..0e6aab1ca2 100644
--- a/builtin/patch-id.c
+++ b/builtin/patch-id.c
@@ -60,12 +60,14 @@  static int scan_hunk_header(const char *p, int *p_before, int *p_after)
 
 #define GOPID_STABLE   01
 #define GOPID_VERBATIM 02
+#define GOPID_FIND_HEADER 04
 
 static int get_one_patchid(struct object_id *next_oid, struct object_id *result,
 			   struct strbuf *line_buf, unsigned flags)
 {
 	int stable = flags & GOPID_STABLE;
 	int verbatim = flags & GOPID_VERBATIM;
+	int find_header = flags & GOPID_FIND_HEADER;
 	int patchlen = 0, found_next = 0;
 	int before = -1, after = -1;
 	int diff_is_binary = 0;
@@ -81,26 +83,39 @@  static int get_one_patchid(struct object_id *next_oid, struct object_id *result,
 		int len;
 
 		/*
-		 * If we see a line that begins with "<object name>",
-		 * "commit <object name>" or "From <object name>", it is
-		 * the beginning of a patch.  Return to the caller, as
-		 * we are done with the one we have been processing.
+		 * The caller hasn't seen us find a patch header and
+		 * return to it, or we have started processing patch
+		 * and may encounter the beginning of the next patch.
 		 */
-		if (skip_prefix(line, "commit ", &p))
-			;
-		else if (skip_prefix(line, "From ", &p))
-			;
-		if (!get_oid_hex(p, next_oid)) {
-			if (verbatim)
-				the_hash_algo->update_fn(&ctx, line, strlen(line));
-			found_next = 1;
-			break;
+		if (find_header) {
+			/*
+			 * If we see a line that begins with "<object name>",
+			 * "commit <object name>" or "From <object name>", it is
+			 * the beginning of a patch.  Return to the caller, as
+			 * we are done with the one we have been processing.
+			 */
+			if (skip_prefix(line, "commit ", &p))
+				;
+			else if (skip_prefix(line, "From ", &p))
+				;
+			if (!get_oid_hex(p, next_oid)) {
+				if (verbatim)
+					the_hash_algo->update_fn(&ctx, line, strlen(line));
+				found_next = 1;
+				break;
+			}
 		}
 
 		/* Ignore commit comments */
 		if (!patchlen && !starts_with(line, "diff "))
 			continue;
 
+		/*
+		 * We are past the commit log message.  Prepare to
+		 * stop at the beginning of the next patch header.
+		 */
+		find_header = 1;
+
 		/* Parsing diff header?  */
 		if (before == -1) {
 			if (starts_with(line, "GIT binary patch") ||
@@ -196,11 +211,13 @@  static void generate_id_list(unsigned flags)
 	struct strbuf line_buf = STRBUF_INIT;
 
 	oidclr(&oid);
+	flags |= GOPID_FIND_HEADER;
 	while (!feof(stdin)) {
 		patchlen = get_one_patchid(&n, &result, &line_buf, flags);
 		if (patchlen)
 			flush_current_id(&oid, &result);
 		oidcpy(&oid, &n);
+		flags &= ~GOPID_FIND_HEADER;
 	}
 	strbuf_release(&line_buf);
 }
diff --git a/t/t4204-patch-id.sh b/t/t4204-patch-id.sh
index 1627fdda1b..b1d98d4110 100755
--- a/t/t4204-patch-id.sh
+++ b/t/t4204-patch-id.sh
@@ -137,6 +137,23 @@  test_expect_success 'patch-id computes the same for various formats' '
 	test_cmp actual expect
 '
 
+hash=$(git rev-parse same:)
+for cruft in "$hash" "commit $hash is bad" "From $hash status"
+do
+	test_expect_success "patch-id with <$cruft> in log message" '
+		git format-patch -1 --stdout same >patch-0 &&
+		git patch-id <patch-0 >expect &&
+
+		{
+			sed -e "/^$/q" patch-0 &&
+			printf "random message\n%s\n\n" "$cruft" &&
+			sed -e "1,/^$/d" patch-0
+		} >patch-cruft &&
+		git patch-id <patch-cruft >actual &&
+		test_cmp actual expect
+	'
+done
+
 test_expect_success 'whitespace is irrelevant in footer' '
 	get_patch_id main &&
 	git checkout same &&