diff mbox series

[RFC,1/1] commit-graph.c: die on un-parseable commits

Message ID 34e4ec793cb0d321d16b88777cd2db64ed7b772e.1567563244.git.me@ttaylorr.com (mailing list archive)
State New, archived
Headers show
Series commit-graph.c: handle corrupt commit trees | expand

Commit Message

Taylor Blau Sept. 4, 2019, 2:22 a.m. UTC
When we write a commit graph chunk, we process a given list of 'struct
commit *'s and parse out the parent(s) and tree OID in order to write
out its information.

We do this by calling 'parse_commit_no_graph', and then checking the
result of 'get_commit_tree_oid' to write the tree OID. This process
assumes that 'parse_commit_no_graph' parses the commit successfully.
When this isn't the case, 'get_commit_tree_oid(*list)' may return NULL,
in which case trying to '->hash' it causes a SIGSEGV.

Instead, teach 'write_graph_chunk_data' to stop when a commit isn't able
to be parsed, at the peril of failing to write a commit-graph.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Jeff King Sept. 4, 2019, 3:04 a.m. UTC | #1
On Tue, Sep 03, 2019 at 10:22:57PM -0400, Taylor Blau wrote:

> When we write a commit graph chunk, we process a given list of 'struct
> commit *'s and parse out the parent(s) and tree OID in order to write
> out its information.
> 
> We do this by calling 'parse_commit_no_graph', and then checking the
> result of 'get_commit_tree_oid' to write the tree OID. This process
> assumes that 'parse_commit_no_graph' parses the commit successfully.
> When this isn't the case, 'get_commit_tree_oid(*list)' may return NULL,
> in which case trying to '->hash' it causes a SIGSEGV.
> 
> Instead, teach 'write_graph_chunk_data' to stop when a commit isn't able
> to be parsed, at the peril of failing to write a commit-graph.

Yeah, I think it makes sense for commit-graph to bail completely if it
can't parse here. In theory it could omit the entry from the
commit-graph file (and a reader would just fall back to trying to access
the object contents itself), but I think we're too late in the process
for that. And besides, this should generally only happen in a corrupt
repository.

However...

> diff --git a/commit-graph.c b/commit-graph.c
> index f2888c203b..6aa6998ecd 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -843,7 +843,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
>  		uint32_t packedDate[2];
>  		display_progress(ctx->progress, ++ctx->progress_cnt);
>  
> -		parse_commit_no_graph(*list);
> +		if (parse_commit_no_graph(*list))
> +			die(_("unable to parse commit %s"),
> +				oid_to_hex(&(*list)->object.oid));
>  		hashwrite(f, get_commit_tree_oid(*list)->hash, hash_len);

I don't think parse_commit_no_graph() returning 0 assures us that
get_commit_tree() and friends will return non-NULL.

This is similar to the case discussed recently where a failed parse of a
tag may leave "tag->tagged == NULL" even though "tag->obj.parsed" is
set.

Here an earlier parsing error could cause (*list)->object.parsed to be
true, but with (*list)->maybe_tree still NULL. Our call to
parse_commit_no_graph() here would silently return "yep, already tried
to parse this", and then we'd still segfault.

We _could_ check:

  if (parse_commit_no_graph(*list))
	die("unable to parse...");
  tree = get_commit_tree_oid(*list);
  if (!tree)
	die("unable to get tree for %s...");

but trees are just one piece of data. In fact, the situation is much
worse for parents: there a NULL parent pointer is valid data, so worse
than segfaulting, we'd write invalid data to the commit graph, skipping
one or more parents!

And I think there's literally no way for this function to tell the
difference between "no parent" and "there was an earlier error, but we
set the parsed flag anyway and the parent flag is invalid".

I think that argues against Junio's response in:

  https://public-inbox.org/git/xmqqo90bhmi3.fsf@gitster-ct.c.googlers.com/

about how we can use the parsed flag to look at "slightly corrupt"
objects. I think we'd need at least an extra flag for "corrupt", though
I think it is simpler just _not_ setting "parsed" and letting the next
caller spend the extra cycles to rediscover the problem if they're
interested.

(All of this presupposes that you can convince another piece of code in
the same process to parse the commit buffer and ignore the error. I have
no idea if that's possible or not in this case, but it sure would be
nice not to have to care).

-Peff
Taylor Blau Sept. 4, 2019, 9:18 p.m. UTC | #2
Hi Peff,

On Tue, Sep 03, 2019 at 11:04:56PM -0400, Jeff King wrote:
> On Tue, Sep 03, 2019 at 10:22:57PM -0400, Taylor Blau wrote:
>
> > When we write a commit graph chunk, we process a given list of 'struct
> > commit *'s and parse out the parent(s) and tree OID in order to write
> > out its information.
> >
> > We do this by calling 'parse_commit_no_graph', and then checking the
> > result of 'get_commit_tree_oid' to write the tree OID. This process
> > assumes that 'parse_commit_no_graph' parses the commit successfully.
> > When this isn't the case, 'get_commit_tree_oid(*list)' may return NULL,
> > in which case trying to '->hash' it causes a SIGSEGV.
> >
> > Instead, teach 'write_graph_chunk_data' to stop when a commit isn't able
> > to be parsed, at the peril of failing to write a commit-graph.
>
> Yeah, I think it makes sense for commit-graph to bail completely if it
> can't parse here. In theory it could omit the entry from the
> commit-graph file (and a reader would just fall back to trying to access
> the object contents itself), but I think we're too late in the process
> for that. And besides, this should generally only happen in a corrupt
> repository.

Yep. I sent this as an RFC PATCH because I wasn't quite sure how folks
would feel about 'die()'-ing in the middle of 'write_graph_chunk_data',
but I think that your reasoning makes sense, and it matches my own
preferences.

So, I am glad that we resolved that portion. I'll keep going below...

> However...
>
> > diff --git a/commit-graph.c b/commit-graph.c
> > index f2888c203b..6aa6998ecd 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -843,7 +843,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
> >  		uint32_t packedDate[2];
> >  		display_progress(ctx->progress, ++ctx->progress_cnt);
> >
> > -		parse_commit_no_graph(*list);
> > +		if (parse_commit_no_graph(*list))
> > +			die(_("unable to parse commit %s"),
> > +				oid_to_hex(&(*list)->object.oid));
> >  		hashwrite(f, get_commit_tree_oid(*list)->hash, hash_len);
>
> I don't think parse_commit_no_graph() returning 0 assures us that
> get_commit_tree() and friends will return non-NULL.
>
> This is similar to the case discussed recently where a failed parse of a
> tag may leave "tag->tagged == NULL" even though "tag->obj.parsed" is
> set.
>
> Here an earlier parsing error could cause (*list)->object.parsed to be
> true, but with (*list)->maybe_tree still NULL. Our call to
> parse_commit_no_graph() here would silently return "yep, already tried
> to parse this", and then we'd still segfault.
>
> We _could_ check:
>
>   if (parse_commit_no_graph(*list))
> 	die("unable to parse...");
>   tree = get_commit_tree_oid(*list);
>   if (!tree)
> 	die("unable to get tree for %s...");
>
> but trees are just one piece of data. In fact, the situation is much
> worse for parents: there a NULL parent pointer is valid data, so worse
> than segfaulting, we'd write invalid data to the commit graph, skipping
> one or more parents!
>
> And I think there's literally no way for this function to tell the
> difference between "no parent" and "there was an earlier error, but we
> set the parsed flag anyway and the parent flag is invalid".
>
> I think that argues against Junio's response in:
>
>   https://public-inbox.org/git/xmqqo90bhmi3.fsf@gitster-ct.c.googlers.com/
>
> about how we can use the parsed flag to look at "slightly corrupt"
> objects. I think we'd need at least an extra flag for "corrupt", though
> I think it is simpler just _not_ setting "parsed" and letting the next
> caller spend the extra cycles to rediscover the problem if they're
> interested.

All of this makes sense to me, so I'm wondering what part(s) of this you
feel are worth addressing in this first patch series. Presumably, there
is a longer series that we _could_ write which would introduce a new
'corrupt' field and then check for it here.

But, I'm hesitant to write those patches, since I only have this one
call-site in mind. If we introduce 'corrupt', I feel it would be best to
use it uniformly, instead of only checking it here, and relying on other
bespoke mechanisms to detect corruption elsewhere.

So, I'm content to write the pseudo-code you provided above (which is to
say, call and check both 'parse_commit_no_graph', _and_
'get_commit_tree_oid'), because I think that it's expedient, and fix the
issue which I'm pointing out here.

I don't know how to address the parents situation without going further,
so perhaps it's worth it to think of an alternative, or even leave it
out of these first patch(es).

Let me know what you think, and thanks for your thoughts so far.

> (All of this presupposes that you can convince another piece of code in
> the same process to parse the commit buffer and ignore the error. I have
> no idea if that's possible or not in this case, but it sure would be
> nice not to have to care).
>
> -Peff
Thanks,
Taylor
Jeff King Sept. 5, 2019, 6:47 a.m. UTC | #3
On Wed, Sep 04, 2019 at 05:18:47PM -0400, Taylor Blau wrote:

> All of this makes sense to me, so I'm wondering what part(s) of this you
> feel are worth addressing in this first patch series. Presumably, there
> is a longer series that we _could_ write which would introduce a new
> 'corrupt' field and then check for it here.
> 
> But, I'm hesitant to write those patches, since I only have this one
> call-site in mind. If we introduce 'corrupt', I feel it would be best to
> use it uniformly, instead of only checking it here, and relying on other
> bespoke mechanisms to detect corruption elsewhere.
> 
> So, I'm content to write the pseudo-code you provided above (which is to
> say, call and check both 'parse_commit_no_graph', _and_
> 'get_commit_tree_oid'), because I think that it's expedient, and fix the
> issue which I'm pointing out here.

I'd actually be willing to just take the patch you have here, and
consider the "parsed but we saw an error" thing as an oddity of the
object code.  IOW, we shouldn't _have_ to be double-checking here.
Looking for an error return from parse_commit() should really be all a
caller needs to do. Once that's fixed, then your code would just be
doing the right thing.

That said, there's another unhandled case, I think: lookup_tree() might
return NULL (if somebody previously saw that oid as a non-tree), and
parse_commit() wouldn't even notice and return an error!

IMHO that's also something that parse_commit() should be returning an
error for. And it's probably a lot easier to trigger versus the "parsed
earlier but corrupted" thing.

So it might be worth doing the NULL tree check here in the meantime. I
dunno.

Below is a sketch of what I'm thinking parse_commit() should do:

  - remember when an earlier parse returned an error, so we can repeat
    that error (this requires some unfortunate bit-field adjusting)

  - notice a lookup_tree failure

  - likewise, notice a lookup_parent failure

---
diff --git a/commit.c b/commit.c
index a98de16e3d..7e415932b7 100644
--- a/commit.c
+++ b/commit.c
@@ -391,7 +391,9 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep)
 	return ret;
 }
 
-int parse_commit_buffer(struct repository *r, struct commit *item, const void *buffer, unsigned long size, int check_graph)
+static int parse_commit_buffer_1(struct repository *r, struct commit *item,
+				 const void *buffer, unsigned long size,
+				 int check_graph)
 {
 	const char *tail = buffer;
 	const char *bufptr = buffer;
@@ -401,9 +403,6 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b
 	const int tree_entry_len = the_hash_algo->hexsz + 5;
 	const int parent_entry_len = the_hash_algo->hexsz + 7;
 
-	if (item->object.parsed)
-		return 0;
-	item->object.parsed = 1;
 	tail += size;
 	if (tail <= bufptr + tree_entry_len + 1 || memcmp(bufptr, "tree ", 5) ||
 			bufptr[tree_entry_len] != '\n')
@@ -412,6 +411,10 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b
 		return error("bad tree pointer in commit %s",
 			     oid_to_hex(&item->object.oid));
 	set_commit_tree(item, lookup_tree(r, &parent));
+	if (!item->maybe_tree)
+		return error("bad tree pointer %s in commit %s",
+			     oid_to_hex(&parent),
+			     oid_to_hex(&item->object.oid));
 	bufptr += tree_entry_len + 1; /* "tree " + "hex sha1" + "\n" */
 	pptr = &item->parents;
 
@@ -431,15 +434,19 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b
 		if (graft && (graft->nr_parent < 0 || grafts_replace_parents))
 			continue;
 		new_parent = lookup_commit(r, &parent);
-		if (new_parent)
-			pptr = &commit_list_insert(new_parent, pptr)->next;
+		if (!new_parent)
+			return error("bad parent %s in commit %s",
+				     oid_to_hex(&parent),
+				     oid_to_hex(&item->object.oid));
+		pptr = &commit_list_insert(new_parent, pptr)->next;
 	}
 	if (graft) {
 		int i;
 		struct commit *new_parent;
 		for (i = 0; i < graft->nr_parent; i++) {
 			new_parent = lookup_commit(r,
 						   &graft->parent[i]);
+			/* Here we ignore bogus grafts. Also should be an error? */
 			if (!new_parent)
 				continue;
 			pptr = &commit_list_insert(new_parent, pptr)->next;
@@ -453,6 +460,23 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b
 	return 0;
 }
 
+int parse_commit_buffer(struct repository *r, struct commit *item,
+			const void *buffer, unsigned long size,
+			int check_graph)
+{
+	int ret;
+
+	if (item->object.parsed)
+		return item->object.corrupt ? -1 : 0;
+	item->object.parsed = 1;
+
+	ret = parse_commit_buffer_1(r, item, buffer, size, check_graph);
+	if (ret < 0)
+		item->object.corrupt = 1;
+
+	return ret;
+}
+
 int repo_parse_commit_internal(struct repository *r,
 			       struct commit *item,
 			       int quiet_on_missing,
diff --git a/object.h b/object.h
index 0120892bbd..b83d3964ad 100644
--- a/object.h
+++ b/object.h
@@ -59,7 +59,7 @@ struct object_array {
 
 /*
  * object flag allocation:
- * revision.h:               0---------10                              25----28
+ * revision.h:               0---------10                              24----27
  * fetch-pack.c:             01
  * negotiator/default.c:       2--5
  * walker.c:                 0-2
@@ -78,13 +78,14 @@ struct object_array {
  * builtin/show-branch.c:    0-------------------------------------------26
  * builtin/unpack-objects.c:                                 2021
  */
-#define FLAG_BITS  29
+#define FLAG_BITS  28
 
 /*
  * The object type is stored in 3 bits.
  */
 struct object {
 	unsigned parsed : 1;
+	unsigned corrupt : 1;
 	unsigned type : TYPE_BITS;
 	unsigned flags : FLAG_BITS;
 	struct object_id oid;
diff --git a/revision.h b/revision.h
index 4134dc6029..5c0b831b37 100644
--- a/revision.h
+++ b/revision.h
@@ -33,7 +33,7 @@
 #define ALL_REV_FLAGS	(((1u<<11)-1) | NOT_USER_GIVEN | TRACK_LINEAR)
 
 #define TOPO_WALK_EXPLORED	(1u<<27)
-#define TOPO_WALK_INDEGREE	(1u<<28)
+#define TOPO_WALK_INDEGREE	(1u<<24)
 
 #define DECORATE_SHORT_REFS	1
 #define DECORATE_FULL_REFS	2
Junio C Hamano Sept. 5, 2019, 10:19 p.m. UTC | #4
Jeff King <peff@peff.net> writes:

> I don't think parse_commit_no_graph() returning 0 assures us that
> get_commit_tree() and friends will return non-NULL.
>
> This is similar to the case discussed recently where a failed parse of a
> tag may leave "tag->tagged == NULL" even though "tag->obj.parsed" is
> set.
>
> Here an earlier parsing error could cause (*list)->object.parsed to be
> true, but with (*list)->maybe_tree still NULL. Our call to
> parse_commit_no_graph() here would silently return "yep, already tried
> to parse this", and then we'd still segfault.
> ...
> And I think there's literally no way for this function to tell the
> difference between "no parent" and "there was an earlier error, but we
> set the parsed flag anyway and the parent flag is invalid".
>
> I think that argues against Junio's response in:

Fair enough.  Forcing later users to reattempt parsing (and failing
the same way) would be safer and it should also be sufficient as we
are talking about how to handle a broken repository, i.e. an error
case.
Jeff King Sept. 6, 2019, 6:35 a.m. UTC | #5
On Thu, Sep 05, 2019 at 03:19:48PM -0700, Junio C Hamano wrote:

> > Here an earlier parsing error could cause (*list)->object.parsed to be
> > true, but with (*list)->maybe_tree still NULL. Our call to
> > parse_commit_no_graph() here would silently return "yep, already tried
> > to parse this", and then we'd still segfault.
> > ...
> > And I think there's literally no way for this function to tell the
> > difference between "no parent" and "there was an earlier error, but we
> > set the parsed flag anyway and the parent flag is invalid".
> >
> > I think that argues against Junio's response in:
> 
> Fair enough.  Forcing later users to reattempt parsing (and failing
> the same way) would be safer and it should also be sufficient as we
> are talking about how to handle a broken repository, i.e. an error
> case.

One of the tricky things, and the reason I used a "corrupt" flag in my
earlier sketch, is that the state after we encounter a parse error is
unknown. So imagine parse_commit_buffer() sees that one of the parent
lines is bogus, and we return an error. The caller gets to see whatever
half-parsed state we managed to come up with.

So far so good. But now imagine we call parse_commit_buffer() again, and
we re-parse. How does that interact with the half-parsed state? Some of
it works OK (e.g., lookup_tree() would find the same tree). Some not so
much (I think we'd keep appending parents at each call).

I guess this might not be too bad to handle. Value fields like
timestamp_t are OK to overwrite. Pointers to objects likewise, since the
memory is owned elsewhere. If we see existing parent pointers in an
object we're parsing, we could probably free them under the assumption
they're leftover cruft. Likewise for the "tag" field of "struct tag",
which is owned by the struct and should be freed.

Blobs and trees don't actually parse anything into their structs. So
it's really just special-casing those two items.

-Peff
Jeff King Sept. 6, 2019, 6:56 a.m. UTC | #6
On Fri, Sep 06, 2019 at 02:35:03AM -0400, Jeff King wrote:

> > Fair enough.  Forcing later users to reattempt parsing (and failing
> > the same way) would be safer and it should also be sufficient as we
> > are talking about how to handle a broken repository, i.e. an error
> > case.
> 
> One of the tricky things, and the reason I used a "corrupt" flag in my
> earlier sketch, is that the state after we encounter a parse error is
> unknown. So imagine parse_commit_buffer() sees that one of the parent
> lines is bogus, and we return an error. The caller gets to see whatever
> half-parsed state we managed to come up with.
> 
> So far so good. But now imagine we call parse_commit_buffer() again, and
> we re-parse. How does that interact with the half-parsed state? Some of
> it works OK (e.g., lookup_tree() would find the same tree). Some not so
> much (I think we'd keep appending parents at each call).
> 
> I guess this might not be too bad to handle. Value fields like
> timestamp_t are OK to overwrite. Pointers to objects likewise, since the
> memory is owned elsewhere. If we see existing parent pointers in an
> object we're parsing, we could probably free them under the assumption
> they're leftover cruft. Likewise for the "tag" field of "struct tag",
> which is owned by the struct and should be freed.
> 
> Blobs and trees don't actually parse anything into their structs. So
> it's really just special-casing those two items.

So here's something a bit more concrete to play with. Using the patch
below, we maintain the invariant that if you called parse_commit() or
equivalent and got a successful return, then commit->tree is always
non-NULL. That fixes the second "missing tree" test from elsewhere in
this thread:

  https://public-inbox.org/git/042a8ba8b2a98c269f9cd1a8e88488b80d686f0d.1567720960.git.me@ttaylorr.com/

_without_ applying the third patch (though the error message we expect
in the test does not tweaked). And it likely also protects most of the
other callers of get_commit_tree_oid(), assuming somewhere in their code
path they call parse_commit() and actually check the error code.

Likewise, tag->tagged is always non-NULL after a successful tag parse,
which should fix the segfault discussed recently in:

  https://public-inbox.org/git/20190824230944.GA14132@jessup.stsp.name/

as well as probably others.

And callers are still able to view the broken objects, but they have to
ignore the error return from the parse functions.

This seems like a promising direction. I'd probably break it into a few
separate patches, and it would be nice to have some tests (especially
one where we actually do try to parse multiple times).

---
diff --git a/commit.c b/commit.c
index a98de16e3d..47e36fd13e 100644
--- a/commit.c
+++ b/commit.c
@@ -398,20 +398,37 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b
 	struct object_id parent;
 	struct commit_list **pptr;
 	struct commit_graft *graft;
+	struct tree *tree;
 	const int tree_entry_len = the_hash_algo->hexsz + 5;
 	const int parent_entry_len = the_hash_algo->hexsz + 7;
 
 	if (item->object.parsed)
 		return 0;
-	item->object.parsed = 1;
+
+	if (item->parents) {
+		/*
+		 * Presumably this is leftover from an earlier failed parse;
+		 * clear it out in preparation for us re-parsing (we'll hit the
+		 * same error, but that's good, since it lets our caller know
+		 * the result cannot be trusted.
+		 */
+		free_commit_list(item->parents);
+		item->parents = NULL;
+	}
+
 	tail += size;
 	if (tail <= bufptr + tree_entry_len + 1 || memcmp(bufptr, "tree ", 5) ||
 			bufptr[tree_entry_len] != '\n')
 		return error("bogus commit object %s", oid_to_hex(&item->object.oid));
 	if (get_oid_hex(bufptr + 5, &parent) < 0)
 		return error("bad tree pointer in commit %s",
 			     oid_to_hex(&item->object.oid));
-	set_commit_tree(item, lookup_tree(r, &parent));
+	tree = lookup_tree(r, &parent);
+	if (!tree)
+		return error("bad tree pointer %s in commit %s",
+			     oid_to_hex(&parent),
+			     oid_to_hex(&item->object.oid));
+	set_commit_tree(item, tree);
 	bufptr += tree_entry_len + 1; /* "tree " + "hex sha1" + "\n" */
 	pptr = &item->parents;
 
@@ -450,6 +467,7 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b
 	if (check_graph)
 		load_commit_graph_info(r, item);
 
+	item->object.parsed = 1;
 	return 0;
 }
 
diff --git a/tag.c b/tag.c
index 5db870edb9..3dcebb4715 100644
--- a/tag.c
+++ b/tag.c
@@ -141,7 +141,11 @@ int parse_tag_buffer(struct repository *r, struct tag *item, const void *data, u
 
 	if (item->object.parsed)
 		return 0;
-	item->object.parsed = 1;
+
+	if (item->tag) {
+		/* Left over from an earlier failed parse */
+		FREE_AND_NULL(item->tag);
+	}
 
 	if (size < the_hash_algo->hexsz + 24)
 		return -1;
@@ -167,10 +171,15 @@ int parse_tag_buffer(struct repository *r, struct tag *item, const void *data, u
 	} else if (!strcmp(type, tag_type)) {
 		item->tagged = (struct object *)lookup_tag(r, &oid);
 	} else {
-		error("Unknown type %s", type);
-		item->tagged = NULL;
+		return error("unknown tag type '%s' in %s",
+			     type, oid_to_hex(&item->object.oid));
 	}
 
+	if (!item->tagged)
+		return error("bad tag pointer to %s in %s",
+			     oid_to_hex(&oid),
+			     oid_to_hex(&item->object.oid));
+
 	if (bufptr + 4 < tail && starts_with(bufptr, "tag "))
 		; 		/* good */
 	else
@@ -187,6 +196,7 @@ int parse_tag_buffer(struct repository *r, struct tag *item, const void *data, u
 	else
 		item->date = 0;
 
+	item->object.parsed = 1;
 	return 0;
 }
Derrick Stolee Sept. 6, 2019, 4:48 p.m. UTC | #7
On 9/5/2019 2:47 AM, Jeff King wrote:
> On Wed, Sep 04, 2019 at 05:18:47PM -0400, Taylor Blau wrote:
> 
>> All of this makes sense to me, so I'm wondering what part(s) of this you
>> feel are worth addressing in this first patch series. Presumably, there
>> is a longer series that we _could_ write which would introduce a new
>> 'corrupt' field and then check for it here.
>>
>> But, I'm hesitant to write those patches, since I only have this one
>> call-site in mind. If we introduce 'corrupt', I feel it would be best to
>> use it uniformly, instead of only checking it here, and relying on other
>> bespoke mechanisms to detect corruption elsewhere.
>>
>> So, I'm content to write the pseudo-code you provided above (which is to
>> say, call and check both 'parse_commit_no_graph', _and_
>> 'get_commit_tree_oid'), because I think that it's expedient, and fix the
>> issue which I'm pointing out here.
> 
> I'd actually be willing to just take the patch you have here, and
> consider the "parsed but we saw an error" thing as an oddity of the
> object code.  IOW, we shouldn't _have_ to be double-checking here.
> Looking for an error return from parse_commit() should really be all a
> caller needs to do. Once that's fixed, then your code would just be
> doing the right thing.
> 
> That said, there's another unhandled case, I think: lookup_tree() might
> return NULL (if somebody previously saw that oid as a non-tree), and
> parse_commit() wouldn't even notice and return an error!
> 
> IMHO that's also something that parse_commit() should be returning an
> error for. And it's probably a lot easier to trigger versus the "parsed
> earlier but corrupted" thing.
> 
> So it might be worth doing the NULL tree check here in the meantime. I
> dunno.
> 
> Below is a sketch of what I'm thinking parse_commit() should do:
> 
>   - remember when an earlier parse returned an error, so we can repeat
>     that error (this requires some unfortunate bit-field adjusting)
> 
>   - notice a lookup_tree failure
> 
>   - likewise, notice a lookup_parent failure
> 
> ---
> diff --git a/commit.c b/commit.c
> index a98de16e3d..7e415932b7 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -391,7 +391,9 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep)
>  	return ret;
>  }
>  
> -int parse_commit_buffer(struct repository *r, struct commit *item, const void *buffer, unsigned long size, int check_graph)
> +static int parse_commit_buffer_1(struct repository *r, struct commit *item,
> +				 const void *buffer, unsigned long size,
> +				 int check_graph)
>  {
>  	const char *tail = buffer;
>  	const char *bufptr = buffer;
> @@ -401,9 +403,6 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b
>  	const int tree_entry_len = the_hash_algo->hexsz + 5;
>  	const int parent_entry_len = the_hash_algo->hexsz + 7;
>  
> -	if (item->object.parsed)
> -		return 0;
> -	item->object.parsed = 1;
>  	tail += size;
>  	if (tail <= bufptr + tree_entry_len + 1 || memcmp(bufptr, "tree ", 5) ||
>  			bufptr[tree_entry_len] != '\n')
> @@ -412,6 +411,10 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b
>  		return error("bad tree pointer in commit %s",
>  			     oid_to_hex(&item->object.oid));
>  	set_commit_tree(item, lookup_tree(r, &parent));
> +	if (!item->maybe_tree)
> +		return error("bad tree pointer %s in commit %s",
> +			     oid_to_hex(&parent),
> +			     oid_to_hex(&item->object.oid));
>  	bufptr += tree_entry_len + 1; /* "tree " + "hex sha1" + "\n" */
>  	pptr = &item->parents;
>  
> @@ -431,15 +434,19 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b
>  		if (graft && (graft->nr_parent < 0 || grafts_replace_parents))
>  			continue;
>  		new_parent = lookup_commit(r, &parent);
> -		if (new_parent)
> -			pptr = &commit_list_insert(new_parent, pptr)->next;
> +		if (!new_parent)
> +			return error("bad parent %s in commit %s",
> +				     oid_to_hex(&parent),
> +				     oid_to_hex(&item->object.oid));
> +		pptr = &commit_list_insert(new_parent, pptr)->next;
>  	}
>  	if (graft) {
>  		int i;
>  		struct commit *new_parent;
>  		for (i = 0; i < graft->nr_parent; i++) {
>  			new_parent = lookup_commit(r,
>  						   &graft->parent[i]);
> +			/* Here we ignore bogus grafts. Also should be an error? */
>  			if (!new_parent)
>  				continue;
>  			pptr = &commit_list_insert(new_parent, pptr)->next;
> @@ -453,6 +460,23 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b
>  	return 0;
>  }
>  
> +int parse_commit_buffer(struct repository *r, struct commit *item,
> +			const void *buffer, unsigned long size,
> +			int check_graph)
> +{
> +	int ret;
> +
> +	if (item->object.parsed)
> +		return item->object.corrupt ? -1 : 0;
> +	item->object.parsed = 1;
> +
> +	ret = parse_commit_buffer_1(r, item, buffer, size, check_graph);
> +	if (ret < 0)
> +		item->object.corrupt = 1;
> +
> +	return ret;
> +}
> +
>  int repo_parse_commit_internal(struct repository *r,
>  			       struct commit *item,
>  			       int quiet_on_missing,
> diff --git a/object.h b/object.h
> index 0120892bbd..b83d3964ad 100644
> --- a/object.h
> +++ b/object.h
> @@ -59,7 +59,7 @@ struct object_array {
>  
>  /*
>   * object flag allocation:
> - * revision.h:               0---------10                              25----28
> + * revision.h:               0---------10                              24----27
>   * fetch-pack.c:             01
>   * negotiator/default.c:       2--5
>   * walker.c:                 0-2
> @@ -78,13 +78,14 @@ struct object_array {
>   * builtin/show-branch.c:    0-------------------------------------------26
>   * builtin/unpack-objects.c:                                 2021
>   */
> -#define FLAG_BITS  29
> +#define FLAG_BITS  28
>  
>  /*
>   * The object type is stored in 3 bits.
>   */
>  struct object {
>  	unsigned parsed : 1;
> +	unsigned corrupt : 1;
>  	unsigned type : TYPE_BITS;
>  	unsigned flags : FLAG_BITS;
>  	struct object_id oid;
> diff --git a/revision.h b/revision.h
> index 4134dc6029..5c0b831b37 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -33,7 +33,7 @@
>  #define ALL_REV_FLAGS	(((1u<<11)-1) | NOT_USER_GIVEN | TRACK_LINEAR)
>  
>  #define TOPO_WALK_EXPLORED	(1u<<27)
> -#define TOPO_WALK_INDEGREE	(1u<<28)
> +#define TOPO_WALK_INDEGREE	(1u<<24)

As an aside, these flag bit modifications look fine, but would need to
be explained. I'm guessing that since you are adding a bit of data
to struct object you want to avoid increasing the struct size across
a 32-bit boundary. Are we sure that bit 24 is not used anywhere else?
(My search for "1u<<24" found nothing, and "1 << 24" found a bit in
the cache-entry flags, so this seems safe.)

Thanks,
-Stolee
Junio C Hamano Sept. 6, 2019, 4:59 p.m. UTC | #8
Jeff King <peff@peff.net> writes:

> So far so good. But now imagine we call parse_commit_buffer() again, and
> we re-parse. How does that interact with the half-parsed state? Some of
> it works OK (e.g., lookup_tree() would find the same tree). Some not so
> much (I think we'd keep appending parents at each call).
>
> I guess this might not be too bad to handle. Value fields like
> timestamp_t are OK to overwrite. Pointers to objects likewise, since the
> memory is owned elsewhere. If we see existing parent pointers in an
> object we're parsing, we could probably free them under the assumption
> they're leftover cruft. Likewise for the "tag" field of "struct tag",
> which is owned by the struct and should be freed.

Yeah, or clear them before returning with .corrupt bit set?
Jeff King Sept. 6, 2019, 5:04 p.m. UTC | #9
On Fri, Sep 06, 2019 at 12:48:05PM -0400, Derrick Stolee wrote:

> > diff --git a/revision.h b/revision.h
> > index 4134dc6029..5c0b831b37 100644
> > --- a/revision.h
> > +++ b/revision.h
> > @@ -33,7 +33,7 @@
> >  #define ALL_REV_FLAGS	(((1u<<11)-1) | NOT_USER_GIVEN | TRACK_LINEAR)
> >  
> >  #define TOPO_WALK_EXPLORED	(1u<<27)
> > -#define TOPO_WALK_INDEGREE	(1u<<28)
> > +#define TOPO_WALK_INDEGREE	(1u<<24)
> 
> As an aside, these flag bit modifications look fine, but would need to
> be explained. I'm guessing that since you are adding a bit of data
> to struct object you want to avoid increasing the struct size across
> a 32-bit boundary. Are we sure that bit 24 is not used anywhere else?
> (My search for "1u<<24" found nothing, and "1 << 24" found a bit in
> the cache-entry flags, so this seems safe.)

Yeah, I'd definitely break this up into several commits with explanation
(though see an alternate I posted that just uses the parsed flag without
any new bits).

Bit 24 isn't used according to the table in objects.h, which is
_supposed_ to be the source of truth, though of course there's no
compiler-level checking. (One aside: is there a reason TOPO_WALK_* isn't
part of ALL_REV_FLAGS?).

And yes, the goal was to keep things to the 32-bit boundary. But in the
course of this, I discovered something interesting: 64-bit systems are
now padding this up to the 8-byte boundary!

The culprit is the switch of GIT_MAX_RAWSZ for sha256. Before then, our
object_id was 20 bytes for sha1. Adding 4 bytes of flags still left us
at 24 bytes, which is both 4- and 8-byte aligned.

With the switch to sha256, object_id is now 32 bytes. Adding 4 bytes
takes us to 36, and then 8-byte aligning the struct takes us to 40
bytes, with 4 bytes of wasted padding.

I'm sorely tempted to use this as an opportunity to move commit->index
into "struct object". That would actually shrink commit object sizes by
4 bytes, and would let all object types do the commit-slab trick to
store object data with constant-time lookup. This would make it possible
to migrate some uses of flags to per-operation bitfields (so e.g., two
traversals would have their _own_ flag data, and wouldn't risk stomping
on each other's bits).

The one downside would be that the index space would become more sparse.
I.e., right now if you're only storing things for commits in a slab, you
know that every slot you allocate is for a commit. But if we allocate an
index for each object, then the commits are less likely to be together
(so wasted memory and worse cache performance). That might be solvable
by assigning a per-type index (with a few hacks to handle OBJ_NONE).

Anyway, all of that is rather off the topic of this discussion.

-Peff
Jeff King Sept. 6, 2019, 5:04 p.m. UTC | #10
On Fri, Sep 06, 2019 at 09:59:04AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > So far so good. But now imagine we call parse_commit_buffer() again, and
> > we re-parse. How does that interact with the half-parsed state? Some of
> > it works OK (e.g., lookup_tree() would find the same tree). Some not so
> > much (I think we'd keep appending parents at each call).
> >
> > I guess this might not be too bad to handle. Value fields like
> > timestamp_t are OK to overwrite. Pointers to objects likewise, since the
> > memory is owned elsewhere. If we see existing parent pointers in an
> > object we're parsing, we could probably free them under the assumption
> > they're leftover cruft. Likewise for the "tag" field of "struct tag",
> > which is owned by the struct and should be freed.
> 
> Yeah, or clear them before returning with .corrupt bit set?

This was my attempt to avoid dealing with a .corrupt bit. :)

-Peff
Derrick Stolee Sept. 6, 2019, 5:19 p.m. UTC | #11
On 9/6/2019 1:04 PM, Jeff King wrote:
> On Fri, Sep 06, 2019 at 12:48:05PM -0400, Derrick Stolee wrote:
> 
>>> diff --git a/revision.h b/revision.h
>>> index 4134dc6029..5c0b831b37 100644
>>> --- a/revision.h
>>> +++ b/revision.h
>>> @@ -33,7 +33,7 @@
>>>  #define ALL_REV_FLAGS	(((1u<<11)-1) | NOT_USER_GIVEN | TRACK_LINEAR)
>>>  
>>>  #define TOPO_WALK_EXPLORED	(1u<<27)
>>> -#define TOPO_WALK_INDEGREE	(1u<<28)
>>> +#define TOPO_WALK_INDEGREE	(1u<<24)
>>
>> As an aside, these flag bit modifications look fine, but would need to
>> be explained. I'm guessing that since you are adding a bit of data
>> to struct object you want to avoid increasing the struct size across
>> a 32-bit boundary. Are we sure that bit 24 is not used anywhere else?
>> (My search for "1u<<24" found nothing, and "1 << 24" found a bit in
>> the cache-entry flags, so this seems safe.)
> 
> Yeah, I'd definitely break this up into several commits with explanation
> (though see an alternate I posted that just uses the parsed flag without
> any new bits).
> 
> Bit 24 isn't used according to the table in objects.h, which is
> _supposed_ to be the source of truth, though of course there's no
> compiler-level checking. (One aside: is there a reason TOPO_WALK_* isn't
> part of ALL_REV_FLAGS?).

This was an oversight on my part. Sorry.

-Stolee
Derrick Stolee Sept. 6, 2019, 5:20 p.m. UTC | #12
On 9/6/2019 1:04 PM, Jeff King wrote:
> On Fri, Sep 06, 2019 at 12:48:05PM -0400, Derrick Stolee wrote:
> 
>>> diff --git a/revision.h b/revision.h
>>> index 4134dc6029..5c0b831b37 100644
>>> --- a/revision.h
>>> +++ b/revision.h
>>> @@ -33,7 +33,7 @@
>>>  #define ALL_REV_FLAGS	(((1u<<11)-1) | NOT_USER_GIVEN | TRACK_LINEAR)
>>>  
>>>  #define TOPO_WALK_EXPLORED	(1u<<27)
>>> -#define TOPO_WALK_INDEGREE	(1u<<28)
>>> +#define TOPO_WALK_INDEGREE	(1u<<24)
>>
>> As an aside, these flag bit modifications look fine, but would need to
>> be explained. I'm guessing that since you are adding a bit of data
>> to struct object you want to avoid increasing the struct size across
>> a 32-bit boundary. Are we sure that bit 24 is not used anywhere else?
>> (My search for "1u<<24" found nothing, and "1 << 24" found a bit in
>> the cache-entry flags, so this seems safe.)
> 
> Yeah, I'd definitely break this up into several commits with explanation
> (though see an alternate I posted that just uses the parsed flag without
> any new bits).
> 
> Bit 24 isn't used according to the table in objects.h, which is
> _supposed_ to be the source of truth, though of course there's no
> compiler-level checking. (One aside: is there a reason TOPO_WALK_* isn't
> part of ALL_REV_FLAGS?).
> 
> And yes, the goal was to keep things to the 32-bit boundary. But in the
> course of this, I discovered something interesting: 64-bit systems are
> now padding this up to the 8-byte boundary!
> 
> The culprit is the switch of GIT_MAX_RAWSZ for sha256. Before then, our
> object_id was 20 bytes for sha1. Adding 4 bytes of flags still left us
> at 24 bytes, which is both 4- and 8-byte aligned.
> 
> With the switch to sha256, object_id is now 32 bytes. Adding 4 bytes
> takes us to 36, and then 8-byte aligning the struct takes us to 40
> bytes, with 4 bytes of wasted padding.
> 
> I'm sorely tempted to use this as an opportunity to move commit->index
> into "struct object". That would actually shrink commit object sizes by
> 4 bytes, and would let all object types do the commit-slab trick to
> store object data with constant-time lookup. This would make it possible
> to migrate some uses of flags to per-operation bitfields (so e.g., two
> traversals would have their _own_ flag data, and wouldn't risk stomping
> on each other's bits).

This reminds me that I'm hoping to eventually get around to moving
"generation" into a commit slab. That would reduce the space for people
still working without a commit-graph, and would allow updating to
generation number v2 (which needs 64 bits of data).

-Stolee
Junio C Hamano Sept. 9, 2019, 4:39 p.m. UTC | #13
Jeff King <peff@peff.net> writes:

> On Fri, Sep 06, 2019 at 09:59:04AM -0700, Junio C Hamano wrote:
>
>> Jeff King <peff@peff.net> writes:
>> 
>> > So far so good. But now imagine we call parse_commit_buffer() again, and
>> > we re-parse. How does that interact with the half-parsed state? Some of
>> > it works OK (e.g., lookup_tree() would find the same tree). Some not so
>> > much (I think we'd keep appending parents at each call).
>> >
>> > I guess this might not be too bad to handle. Value fields like
>> > timestamp_t are OK to overwrite. Pointers to objects likewise, since the
>> > memory is owned elsewhere. If we see existing parent pointers in an
>> > object we're parsing, we could probably free them under the assumption
>> > they're leftover cruft. Likewise for the "tag" field of "struct tag",
>> > which is owned by the struct and should be freed.
>> 
>> Yeah, or clear them before returning with .corrupt bit set?
>
> This was my attempt to avoid dealing with a .corrupt bit. :)

Then, clear them before returning with .parsed bit clear?
Jeff King Sept. 9, 2019, 4:54 p.m. UTC | #14
On Mon, Sep 09, 2019 at 09:39:41AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > On Fri, Sep 06, 2019 at 09:59:04AM -0700, Junio C Hamano wrote:
> >
> >> Jeff King <peff@peff.net> writes:
> >> 
> >> > So far so good. But now imagine we call parse_commit_buffer() again, and
> >> > we re-parse. How does that interact with the half-parsed state? Some of
> >> > it works OK (e.g., lookup_tree() would find the same tree). Some not so
> >> > much (I think we'd keep appending parents at each call).
> >> >
> >> > I guess this might not be too bad to handle. Value fields like
> >> > timestamp_t are OK to overwrite. Pointers to objects likewise, since the
> >> > memory is owned elsewhere. If we see existing parent pointers in an
> >> > object we're parsing, we could probably free them under the assumption
> >> > they're leftover cruft. Likewise for the "tag" field of "struct tag",
> >> > which is owned by the struct and should be freed.
> >> 
> >> Yeah, or clear them before returning with .corrupt bit set?
> >
> > This was my attempt to avoid dealing with a .corrupt bit. :)
> 
> Then, clear them before returning with .parsed bit clear?

But then somebody calling parse_commit() and getting an error return
doesn't get to see the full extent of what we managed to parse. Which I
thought was one of your original points: people should be able to get
whatever information we can get out of even a corrupted or malformed
object.

If we keep that requirement, then we have to either clear them at the
start of the re-parsing step, or we have to avoid re-parsing entirely by
using an extra bit to say "parsed but had an error".  I've sent messy
"something like this" patches for both strategies.

I think the latter is cleaner conceptually, but the former doesn't
require giving up a flag bit.

-Peff
diff mbox series

Patch

diff --git a/commit-graph.c b/commit-graph.c
index f2888c203b..6aa6998ecd 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -843,7 +843,9 @@  static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 		uint32_t packedDate[2];
 		display_progress(ctx->progress, ++ctx->progress_cnt);
 
-		parse_commit_no_graph(*list);
+		if (parse_commit_no_graph(*list))
+			die(_("unable to parse commit %s"),
+				oid_to_hex(&(*list)->object.oid));
 		hashwrite(f, get_commit_tree_oid(*list)->hash, hash_len);
 
 		parent = (*list)->parents;