Message ID | 34e4ec793cb0d321d16b88777cd2db64ed7b772e.1567563244.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | commit-graph.c: handle corrupt commit trees | expand |
On Tue, Sep 03, 2019 at 10:22:57PM -0400, Taylor Blau wrote: > When we write a commit graph chunk, we process a given list of 'struct > commit *'s and parse out the parent(s) and tree OID in order to write > out its information. > > We do this by calling 'parse_commit_no_graph', and then checking the > result of 'get_commit_tree_oid' to write the tree OID. This process > assumes that 'parse_commit_no_graph' parses the commit successfully. > When this isn't the case, 'get_commit_tree_oid(*list)' may return NULL, > in which case trying to '->hash' it causes a SIGSEGV. > > Instead, teach 'write_graph_chunk_data' to stop when a commit isn't able > to be parsed, at the peril of failing to write a commit-graph. Yeah, I think it makes sense for commit-graph to bail completely if it can't parse here. In theory it could omit the entry from the commit-graph file (and a reader would just fall back to trying to access the object contents itself), but I think we're too late in the process for that. And besides, this should generally only happen in a corrupt repository. However... > diff --git a/commit-graph.c b/commit-graph.c > index f2888c203b..6aa6998ecd 100644 > --- a/commit-graph.c > +++ b/commit-graph.c > @@ -843,7 +843,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, > uint32_t packedDate[2]; > display_progress(ctx->progress, ++ctx->progress_cnt); > > - parse_commit_no_graph(*list); > + if (parse_commit_no_graph(*list)) > + die(_("unable to parse commit %s"), > + oid_to_hex(&(*list)->object.oid)); > hashwrite(f, get_commit_tree_oid(*list)->hash, hash_len); I don't think parse_commit_no_graph() returning 0 assures us that get_commit_tree() and friends will return non-NULL. This is similar to the case discussed recently where a failed parse of a tag may leave "tag->tagged == NULL" even though "tag->obj.parsed" is set. Here an earlier parsing error could cause (*list)->object.parsed to be true, but with (*list)->maybe_tree still NULL. Our call to parse_commit_no_graph() here would silently return "yep, already tried to parse this", and then we'd still segfault. We _could_ check: if (parse_commit_no_graph(*list)) die("unable to parse..."); tree = get_commit_tree_oid(*list); if (!tree) die("unable to get tree for %s..."); but trees are just one piece of data. In fact, the situation is much worse for parents: there a NULL parent pointer is valid data, so worse than segfaulting, we'd write invalid data to the commit graph, skipping one or more parents! And I think there's literally no way for this function to tell the difference between "no parent" and "there was an earlier error, but we set the parsed flag anyway and the parent flag is invalid". I think that argues against Junio's response in: https://public-inbox.org/git/xmqqo90bhmi3.fsf@gitster-ct.c.googlers.com/ about how we can use the parsed flag to look at "slightly corrupt" objects. I think we'd need at least an extra flag for "corrupt", though I think it is simpler just _not_ setting "parsed" and letting the next caller spend the extra cycles to rediscover the problem if they're interested. (All of this presupposes that you can convince another piece of code in the same process to parse the commit buffer and ignore the error. I have no idea if that's possible or not in this case, but it sure would be nice not to have to care). -Peff
Hi Peff, On Tue, Sep 03, 2019 at 11:04:56PM -0400, Jeff King wrote: > On Tue, Sep 03, 2019 at 10:22:57PM -0400, Taylor Blau wrote: > > > When we write a commit graph chunk, we process a given list of 'struct > > commit *'s and parse out the parent(s) and tree OID in order to write > > out its information. > > > > We do this by calling 'parse_commit_no_graph', and then checking the > > result of 'get_commit_tree_oid' to write the tree OID. This process > > assumes that 'parse_commit_no_graph' parses the commit successfully. > > When this isn't the case, 'get_commit_tree_oid(*list)' may return NULL, > > in which case trying to '->hash' it causes a SIGSEGV. > > > > Instead, teach 'write_graph_chunk_data' to stop when a commit isn't able > > to be parsed, at the peril of failing to write a commit-graph. > > Yeah, I think it makes sense for commit-graph to bail completely if it > can't parse here. In theory it could omit the entry from the > commit-graph file (and a reader would just fall back to trying to access > the object contents itself), but I think we're too late in the process > for that. And besides, this should generally only happen in a corrupt > repository. Yep. I sent this as an RFC PATCH because I wasn't quite sure how folks would feel about 'die()'-ing in the middle of 'write_graph_chunk_data', but I think that your reasoning makes sense, and it matches my own preferences. So, I am glad that we resolved that portion. I'll keep going below... > However... > > > diff --git a/commit-graph.c b/commit-graph.c > > index f2888c203b..6aa6998ecd 100644 > > --- a/commit-graph.c > > +++ b/commit-graph.c > > @@ -843,7 +843,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, > > uint32_t packedDate[2]; > > display_progress(ctx->progress, ++ctx->progress_cnt); > > > > - parse_commit_no_graph(*list); > > + if (parse_commit_no_graph(*list)) > > + die(_("unable to parse commit %s"), > > + oid_to_hex(&(*list)->object.oid)); > > hashwrite(f, get_commit_tree_oid(*list)->hash, hash_len); > > I don't think parse_commit_no_graph() returning 0 assures us that > get_commit_tree() and friends will return non-NULL. > > This is similar to the case discussed recently where a failed parse of a > tag may leave "tag->tagged == NULL" even though "tag->obj.parsed" is > set. > > Here an earlier parsing error could cause (*list)->object.parsed to be > true, but with (*list)->maybe_tree still NULL. Our call to > parse_commit_no_graph() here would silently return "yep, already tried > to parse this", and then we'd still segfault. > > We _could_ check: > > if (parse_commit_no_graph(*list)) > die("unable to parse..."); > tree = get_commit_tree_oid(*list); > if (!tree) > die("unable to get tree for %s..."); > > but trees are just one piece of data. In fact, the situation is much > worse for parents: there a NULL parent pointer is valid data, so worse > than segfaulting, we'd write invalid data to the commit graph, skipping > one or more parents! > > And I think there's literally no way for this function to tell the > difference between "no parent" and "there was an earlier error, but we > set the parsed flag anyway and the parent flag is invalid". > > I think that argues against Junio's response in: > > https://public-inbox.org/git/xmqqo90bhmi3.fsf@gitster-ct.c.googlers.com/ > > about how we can use the parsed flag to look at "slightly corrupt" > objects. I think we'd need at least an extra flag for "corrupt", though > I think it is simpler just _not_ setting "parsed" and letting the next > caller spend the extra cycles to rediscover the problem if they're > interested. All of this makes sense to me, so I'm wondering what part(s) of this you feel are worth addressing in this first patch series. Presumably, there is a longer series that we _could_ write which would introduce a new 'corrupt' field and then check for it here. But, I'm hesitant to write those patches, since I only have this one call-site in mind. If we introduce 'corrupt', I feel it would be best to use it uniformly, instead of only checking it here, and relying on other bespoke mechanisms to detect corruption elsewhere. So, I'm content to write the pseudo-code you provided above (which is to say, call and check both 'parse_commit_no_graph', _and_ 'get_commit_tree_oid'), because I think that it's expedient, and fix the issue which I'm pointing out here. I don't know how to address the parents situation without going further, so perhaps it's worth it to think of an alternative, or even leave it out of these first patch(es). Let me know what you think, and thanks for your thoughts so far. > (All of this presupposes that you can convince another piece of code in > the same process to parse the commit buffer and ignore the error. I have > no idea if that's possible or not in this case, but it sure would be > nice not to have to care). > > -Peff Thanks, Taylor
On Wed, Sep 04, 2019 at 05:18:47PM -0400, Taylor Blau wrote: > All of this makes sense to me, so I'm wondering what part(s) of this you > feel are worth addressing in this first patch series. Presumably, there > is a longer series that we _could_ write which would introduce a new > 'corrupt' field and then check for it here. > > But, I'm hesitant to write those patches, since I only have this one > call-site in mind. If we introduce 'corrupt', I feel it would be best to > use it uniformly, instead of only checking it here, and relying on other > bespoke mechanisms to detect corruption elsewhere. > > So, I'm content to write the pseudo-code you provided above (which is to > say, call and check both 'parse_commit_no_graph', _and_ > 'get_commit_tree_oid'), because I think that it's expedient, and fix the > issue which I'm pointing out here. I'd actually be willing to just take the patch you have here, and consider the "parsed but we saw an error" thing as an oddity of the object code. IOW, we shouldn't _have_ to be double-checking here. Looking for an error return from parse_commit() should really be all a caller needs to do. Once that's fixed, then your code would just be doing the right thing. That said, there's another unhandled case, I think: lookup_tree() might return NULL (if somebody previously saw that oid as a non-tree), and parse_commit() wouldn't even notice and return an error! IMHO that's also something that parse_commit() should be returning an error for. And it's probably a lot easier to trigger versus the "parsed earlier but corrupted" thing. So it might be worth doing the NULL tree check here in the meantime. I dunno. Below is a sketch of what I'm thinking parse_commit() should do: - remember when an earlier parse returned an error, so we can repeat that error (this requires some unfortunate bit-field adjusting) - notice a lookup_tree failure - likewise, notice a lookup_parent failure --- diff --git a/commit.c b/commit.c index a98de16e3d..7e415932b7 100644 --- a/commit.c +++ b/commit.c @@ -391,7 +391,9 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep) return ret; } -int parse_commit_buffer(struct repository *r, struct commit *item, const void *buffer, unsigned long size, int check_graph) +static int parse_commit_buffer_1(struct repository *r, struct commit *item, + const void *buffer, unsigned long size, + int check_graph) { const char *tail = buffer; const char *bufptr = buffer; @@ -401,9 +403,6 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b const int tree_entry_len = the_hash_algo->hexsz + 5; const int parent_entry_len = the_hash_algo->hexsz + 7; - if (item->object.parsed) - return 0; - item->object.parsed = 1; tail += size; if (tail <= bufptr + tree_entry_len + 1 || memcmp(bufptr, "tree ", 5) || bufptr[tree_entry_len] != '\n') @@ -412,6 +411,10 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b return error("bad tree pointer in commit %s", oid_to_hex(&item->object.oid)); set_commit_tree(item, lookup_tree(r, &parent)); + if (!item->maybe_tree) + return error("bad tree pointer %s in commit %s", + oid_to_hex(&parent), + oid_to_hex(&item->object.oid)); bufptr += tree_entry_len + 1; /* "tree " + "hex sha1" + "\n" */ pptr = &item->parents; @@ -431,15 +434,19 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b if (graft && (graft->nr_parent < 0 || grafts_replace_parents)) continue; new_parent = lookup_commit(r, &parent); - if (new_parent) - pptr = &commit_list_insert(new_parent, pptr)->next; + if (!new_parent) + return error("bad parent %s in commit %s", + oid_to_hex(&parent), + oid_to_hex(&item->object.oid)); + pptr = &commit_list_insert(new_parent, pptr)->next; } if (graft) { int i; struct commit *new_parent; for (i = 0; i < graft->nr_parent; i++) { new_parent = lookup_commit(r, &graft->parent[i]); + /* Here we ignore bogus grafts. Also should be an error? */ if (!new_parent) continue; pptr = &commit_list_insert(new_parent, pptr)->next; @@ -453,6 +460,23 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b return 0; } +int parse_commit_buffer(struct repository *r, struct commit *item, + const void *buffer, unsigned long size, + int check_graph) +{ + int ret; + + if (item->object.parsed) + return item->object.corrupt ? -1 : 0; + item->object.parsed = 1; + + ret = parse_commit_buffer_1(r, item, buffer, size, check_graph); + if (ret < 0) + item->object.corrupt = 1; + + return ret; +} + int repo_parse_commit_internal(struct repository *r, struct commit *item, int quiet_on_missing, diff --git a/object.h b/object.h index 0120892bbd..b83d3964ad 100644 --- a/object.h +++ b/object.h @@ -59,7 +59,7 @@ struct object_array { /* * object flag allocation: - * revision.h: 0---------10 25----28 + * revision.h: 0---------10 24----27 * fetch-pack.c: 01 * negotiator/default.c: 2--5 * walker.c: 0-2 @@ -78,13 +78,14 @@ struct object_array { * builtin/show-branch.c: 0-------------------------------------------26 * builtin/unpack-objects.c: 2021 */ -#define FLAG_BITS 29 +#define FLAG_BITS 28 /* * The object type is stored in 3 bits. */ struct object { unsigned parsed : 1; + unsigned corrupt : 1; unsigned type : TYPE_BITS; unsigned flags : FLAG_BITS; struct object_id oid; diff --git a/revision.h b/revision.h index 4134dc6029..5c0b831b37 100644 --- a/revision.h +++ b/revision.h @@ -33,7 +33,7 @@ #define ALL_REV_FLAGS (((1u<<11)-1) | NOT_USER_GIVEN | TRACK_LINEAR) #define TOPO_WALK_EXPLORED (1u<<27) -#define TOPO_WALK_INDEGREE (1u<<28) +#define TOPO_WALK_INDEGREE (1u<<24) #define DECORATE_SHORT_REFS 1 #define DECORATE_FULL_REFS 2
Jeff King <peff@peff.net> writes: > I don't think parse_commit_no_graph() returning 0 assures us that > get_commit_tree() and friends will return non-NULL. > > This is similar to the case discussed recently where a failed parse of a > tag may leave "tag->tagged == NULL" even though "tag->obj.parsed" is > set. > > Here an earlier parsing error could cause (*list)->object.parsed to be > true, but with (*list)->maybe_tree still NULL. Our call to > parse_commit_no_graph() here would silently return "yep, already tried > to parse this", and then we'd still segfault. > ... > And I think there's literally no way for this function to tell the > difference between "no parent" and "there was an earlier error, but we > set the parsed flag anyway and the parent flag is invalid". > > I think that argues against Junio's response in: Fair enough. Forcing later users to reattempt parsing (and failing the same way) would be safer and it should also be sufficient as we are talking about how to handle a broken repository, i.e. an error case.
On Thu, Sep 05, 2019 at 03:19:48PM -0700, Junio C Hamano wrote: > > Here an earlier parsing error could cause (*list)->object.parsed to be > > true, but with (*list)->maybe_tree still NULL. Our call to > > parse_commit_no_graph() here would silently return "yep, already tried > > to parse this", and then we'd still segfault. > > ... > > And I think there's literally no way for this function to tell the > > difference between "no parent" and "there was an earlier error, but we > > set the parsed flag anyway and the parent flag is invalid". > > > > I think that argues against Junio's response in: > > Fair enough. Forcing later users to reattempt parsing (and failing > the same way) would be safer and it should also be sufficient as we > are talking about how to handle a broken repository, i.e. an error > case. One of the tricky things, and the reason I used a "corrupt" flag in my earlier sketch, is that the state after we encounter a parse error is unknown. So imagine parse_commit_buffer() sees that one of the parent lines is bogus, and we return an error. The caller gets to see whatever half-parsed state we managed to come up with. So far so good. But now imagine we call parse_commit_buffer() again, and we re-parse. How does that interact with the half-parsed state? Some of it works OK (e.g., lookup_tree() would find the same tree). Some not so much (I think we'd keep appending parents at each call). I guess this might not be too bad to handle. Value fields like timestamp_t are OK to overwrite. Pointers to objects likewise, since the memory is owned elsewhere. If we see existing parent pointers in an object we're parsing, we could probably free them under the assumption they're leftover cruft. Likewise for the "tag" field of "struct tag", which is owned by the struct and should be freed. Blobs and trees don't actually parse anything into their structs. So it's really just special-casing those two items. -Peff
On Fri, Sep 06, 2019 at 02:35:03AM -0400, Jeff King wrote: > > Fair enough. Forcing later users to reattempt parsing (and failing > > the same way) would be safer and it should also be sufficient as we > > are talking about how to handle a broken repository, i.e. an error > > case. > > One of the tricky things, and the reason I used a "corrupt" flag in my > earlier sketch, is that the state after we encounter a parse error is > unknown. So imagine parse_commit_buffer() sees that one of the parent > lines is bogus, and we return an error. The caller gets to see whatever > half-parsed state we managed to come up with. > > So far so good. But now imagine we call parse_commit_buffer() again, and > we re-parse. How does that interact with the half-parsed state? Some of > it works OK (e.g., lookup_tree() would find the same tree). Some not so > much (I think we'd keep appending parents at each call). > > I guess this might not be too bad to handle. Value fields like > timestamp_t are OK to overwrite. Pointers to objects likewise, since the > memory is owned elsewhere. If we see existing parent pointers in an > object we're parsing, we could probably free them under the assumption > they're leftover cruft. Likewise for the "tag" field of "struct tag", > which is owned by the struct and should be freed. > > Blobs and trees don't actually parse anything into their structs. So > it's really just special-casing those two items. So here's something a bit more concrete to play with. Using the patch below, we maintain the invariant that if you called parse_commit() or equivalent and got a successful return, then commit->tree is always non-NULL. That fixes the second "missing tree" test from elsewhere in this thread: https://public-inbox.org/git/042a8ba8b2a98c269f9cd1a8e88488b80d686f0d.1567720960.git.me@ttaylorr.com/ _without_ applying the third patch (though the error message we expect in the test does not tweaked). And it likely also protects most of the other callers of get_commit_tree_oid(), assuming somewhere in their code path they call parse_commit() and actually check the error code. Likewise, tag->tagged is always non-NULL after a successful tag parse, which should fix the segfault discussed recently in: https://public-inbox.org/git/20190824230944.GA14132@jessup.stsp.name/ as well as probably others. And callers are still able to view the broken objects, but they have to ignore the error return from the parse functions. This seems like a promising direction. I'd probably break it into a few separate patches, and it would be nice to have some tests (especially one where we actually do try to parse multiple times). --- diff --git a/commit.c b/commit.c index a98de16e3d..47e36fd13e 100644 --- a/commit.c +++ b/commit.c @@ -398,20 +398,37 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b struct object_id parent; struct commit_list **pptr; struct commit_graft *graft; + struct tree *tree; const int tree_entry_len = the_hash_algo->hexsz + 5; const int parent_entry_len = the_hash_algo->hexsz + 7; if (item->object.parsed) return 0; - item->object.parsed = 1; + + if (item->parents) { + /* + * Presumably this is leftover from an earlier failed parse; + * clear it out in preparation for us re-parsing (we'll hit the + * same error, but that's good, since it lets our caller know + * the result cannot be trusted. + */ + free_commit_list(item->parents); + item->parents = NULL; + } + tail += size; if (tail <= bufptr + tree_entry_len + 1 || memcmp(bufptr, "tree ", 5) || bufptr[tree_entry_len] != '\n') return error("bogus commit object %s", oid_to_hex(&item->object.oid)); if (get_oid_hex(bufptr + 5, &parent) < 0) return error("bad tree pointer in commit %s", oid_to_hex(&item->object.oid)); - set_commit_tree(item, lookup_tree(r, &parent)); + tree = lookup_tree(r, &parent); + if (!tree) + return error("bad tree pointer %s in commit %s", + oid_to_hex(&parent), + oid_to_hex(&item->object.oid)); + set_commit_tree(item, tree); bufptr += tree_entry_len + 1; /* "tree " + "hex sha1" + "\n" */ pptr = &item->parents; @@ -450,6 +467,7 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b if (check_graph) load_commit_graph_info(r, item); + item->object.parsed = 1; return 0; } diff --git a/tag.c b/tag.c index 5db870edb9..3dcebb4715 100644 --- a/tag.c +++ b/tag.c @@ -141,7 +141,11 @@ int parse_tag_buffer(struct repository *r, struct tag *item, const void *data, u if (item->object.parsed) return 0; - item->object.parsed = 1; + + if (item->tag) { + /* Left over from an earlier failed parse */ + FREE_AND_NULL(item->tag); + } if (size < the_hash_algo->hexsz + 24) return -1; @@ -167,10 +171,15 @@ int parse_tag_buffer(struct repository *r, struct tag *item, const void *data, u } else if (!strcmp(type, tag_type)) { item->tagged = (struct object *)lookup_tag(r, &oid); } else { - error("Unknown type %s", type); - item->tagged = NULL; + return error("unknown tag type '%s' in %s", + type, oid_to_hex(&item->object.oid)); } + if (!item->tagged) + return error("bad tag pointer to %s in %s", + oid_to_hex(&oid), + oid_to_hex(&item->object.oid)); + if (bufptr + 4 < tail && starts_with(bufptr, "tag ")) ; /* good */ else @@ -187,6 +196,7 @@ int parse_tag_buffer(struct repository *r, struct tag *item, const void *data, u else item->date = 0; + item->object.parsed = 1; return 0; }
On 9/5/2019 2:47 AM, Jeff King wrote: > On Wed, Sep 04, 2019 at 05:18:47PM -0400, Taylor Blau wrote: > >> All of this makes sense to me, so I'm wondering what part(s) of this you >> feel are worth addressing in this first patch series. Presumably, there >> is a longer series that we _could_ write which would introduce a new >> 'corrupt' field and then check for it here. >> >> But, I'm hesitant to write those patches, since I only have this one >> call-site in mind. If we introduce 'corrupt', I feel it would be best to >> use it uniformly, instead of only checking it here, and relying on other >> bespoke mechanisms to detect corruption elsewhere. >> >> So, I'm content to write the pseudo-code you provided above (which is to >> say, call and check both 'parse_commit_no_graph', _and_ >> 'get_commit_tree_oid'), because I think that it's expedient, and fix the >> issue which I'm pointing out here. > > I'd actually be willing to just take the patch you have here, and > consider the "parsed but we saw an error" thing as an oddity of the > object code. IOW, we shouldn't _have_ to be double-checking here. > Looking for an error return from parse_commit() should really be all a > caller needs to do. Once that's fixed, then your code would just be > doing the right thing. > > That said, there's another unhandled case, I think: lookup_tree() might > return NULL (if somebody previously saw that oid as a non-tree), and > parse_commit() wouldn't even notice and return an error! > > IMHO that's also something that parse_commit() should be returning an > error for. And it's probably a lot easier to trigger versus the "parsed > earlier but corrupted" thing. > > So it might be worth doing the NULL tree check here in the meantime. I > dunno. > > Below is a sketch of what I'm thinking parse_commit() should do: > > - remember when an earlier parse returned an error, so we can repeat > that error (this requires some unfortunate bit-field adjusting) > > - notice a lookup_tree failure > > - likewise, notice a lookup_parent failure > > --- > diff --git a/commit.c b/commit.c > index a98de16e3d..7e415932b7 100644 > --- a/commit.c > +++ b/commit.c > @@ -391,7 +391,9 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep) > return ret; > } > > -int parse_commit_buffer(struct repository *r, struct commit *item, const void *buffer, unsigned long size, int check_graph) > +static int parse_commit_buffer_1(struct repository *r, struct commit *item, > + const void *buffer, unsigned long size, > + int check_graph) > { > const char *tail = buffer; > const char *bufptr = buffer; > @@ -401,9 +403,6 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b > const int tree_entry_len = the_hash_algo->hexsz + 5; > const int parent_entry_len = the_hash_algo->hexsz + 7; > > - if (item->object.parsed) > - return 0; > - item->object.parsed = 1; > tail += size; > if (tail <= bufptr + tree_entry_len + 1 || memcmp(bufptr, "tree ", 5) || > bufptr[tree_entry_len] != '\n') > @@ -412,6 +411,10 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b > return error("bad tree pointer in commit %s", > oid_to_hex(&item->object.oid)); > set_commit_tree(item, lookup_tree(r, &parent)); > + if (!item->maybe_tree) > + return error("bad tree pointer %s in commit %s", > + oid_to_hex(&parent), > + oid_to_hex(&item->object.oid)); > bufptr += tree_entry_len + 1; /* "tree " + "hex sha1" + "\n" */ > pptr = &item->parents; > > @@ -431,15 +434,19 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b > if (graft && (graft->nr_parent < 0 || grafts_replace_parents)) > continue; > new_parent = lookup_commit(r, &parent); > - if (new_parent) > - pptr = &commit_list_insert(new_parent, pptr)->next; > + if (!new_parent) > + return error("bad parent %s in commit %s", > + oid_to_hex(&parent), > + oid_to_hex(&item->object.oid)); > + pptr = &commit_list_insert(new_parent, pptr)->next; > } > if (graft) { > int i; > struct commit *new_parent; > for (i = 0; i < graft->nr_parent; i++) { > new_parent = lookup_commit(r, > &graft->parent[i]); > + /* Here we ignore bogus grafts. Also should be an error? */ > if (!new_parent) > continue; > pptr = &commit_list_insert(new_parent, pptr)->next; > @@ -453,6 +460,23 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b > return 0; > } > > +int parse_commit_buffer(struct repository *r, struct commit *item, > + const void *buffer, unsigned long size, > + int check_graph) > +{ > + int ret; > + > + if (item->object.parsed) > + return item->object.corrupt ? -1 : 0; > + item->object.parsed = 1; > + > + ret = parse_commit_buffer_1(r, item, buffer, size, check_graph); > + if (ret < 0) > + item->object.corrupt = 1; > + > + return ret; > +} > + > int repo_parse_commit_internal(struct repository *r, > struct commit *item, > int quiet_on_missing, > diff --git a/object.h b/object.h > index 0120892bbd..b83d3964ad 100644 > --- a/object.h > +++ b/object.h > @@ -59,7 +59,7 @@ struct object_array { > > /* > * object flag allocation: > - * revision.h: 0---------10 25----28 > + * revision.h: 0---------10 24----27 > * fetch-pack.c: 01 > * negotiator/default.c: 2--5 > * walker.c: 0-2 > @@ -78,13 +78,14 @@ struct object_array { > * builtin/show-branch.c: 0-------------------------------------------26 > * builtin/unpack-objects.c: 2021 > */ > -#define FLAG_BITS 29 > +#define FLAG_BITS 28 > > /* > * The object type is stored in 3 bits. > */ > struct object { > unsigned parsed : 1; > + unsigned corrupt : 1; > unsigned type : TYPE_BITS; > unsigned flags : FLAG_BITS; > struct object_id oid; > diff --git a/revision.h b/revision.h > index 4134dc6029..5c0b831b37 100644 > --- a/revision.h > +++ b/revision.h > @@ -33,7 +33,7 @@ > #define ALL_REV_FLAGS (((1u<<11)-1) | NOT_USER_GIVEN | TRACK_LINEAR) > > #define TOPO_WALK_EXPLORED (1u<<27) > -#define TOPO_WALK_INDEGREE (1u<<28) > +#define TOPO_WALK_INDEGREE (1u<<24) As an aside, these flag bit modifications look fine, but would need to be explained. I'm guessing that since you are adding a bit of data to struct object you want to avoid increasing the struct size across a 32-bit boundary. Are we sure that bit 24 is not used anywhere else? (My search for "1u<<24" found nothing, and "1 << 24" found a bit in the cache-entry flags, so this seems safe.) Thanks, -Stolee
Jeff King <peff@peff.net> writes: > So far so good. But now imagine we call parse_commit_buffer() again, and > we re-parse. How does that interact with the half-parsed state? Some of > it works OK (e.g., lookup_tree() would find the same tree). Some not so > much (I think we'd keep appending parents at each call). > > I guess this might not be too bad to handle. Value fields like > timestamp_t are OK to overwrite. Pointers to objects likewise, since the > memory is owned elsewhere. If we see existing parent pointers in an > object we're parsing, we could probably free them under the assumption > they're leftover cruft. Likewise for the "tag" field of "struct tag", > which is owned by the struct and should be freed. Yeah, or clear them before returning with .corrupt bit set?
On Fri, Sep 06, 2019 at 12:48:05PM -0400, Derrick Stolee wrote: > > diff --git a/revision.h b/revision.h > > index 4134dc6029..5c0b831b37 100644 > > --- a/revision.h > > +++ b/revision.h > > @@ -33,7 +33,7 @@ > > #define ALL_REV_FLAGS (((1u<<11)-1) | NOT_USER_GIVEN | TRACK_LINEAR) > > > > #define TOPO_WALK_EXPLORED (1u<<27) > > -#define TOPO_WALK_INDEGREE (1u<<28) > > +#define TOPO_WALK_INDEGREE (1u<<24) > > As an aside, these flag bit modifications look fine, but would need to > be explained. I'm guessing that since you are adding a bit of data > to struct object you want to avoid increasing the struct size across > a 32-bit boundary. Are we sure that bit 24 is not used anywhere else? > (My search for "1u<<24" found nothing, and "1 << 24" found a bit in > the cache-entry flags, so this seems safe.) Yeah, I'd definitely break this up into several commits with explanation (though see an alternate I posted that just uses the parsed flag without any new bits). Bit 24 isn't used according to the table in objects.h, which is _supposed_ to be the source of truth, though of course there's no compiler-level checking. (One aside: is there a reason TOPO_WALK_* isn't part of ALL_REV_FLAGS?). And yes, the goal was to keep things to the 32-bit boundary. But in the course of this, I discovered something interesting: 64-bit systems are now padding this up to the 8-byte boundary! The culprit is the switch of GIT_MAX_RAWSZ for sha256. Before then, our object_id was 20 bytes for sha1. Adding 4 bytes of flags still left us at 24 bytes, which is both 4- and 8-byte aligned. With the switch to sha256, object_id is now 32 bytes. Adding 4 bytes takes us to 36, and then 8-byte aligning the struct takes us to 40 bytes, with 4 bytes of wasted padding. I'm sorely tempted to use this as an opportunity to move commit->index into "struct object". That would actually shrink commit object sizes by 4 bytes, and would let all object types do the commit-slab trick to store object data with constant-time lookup. This would make it possible to migrate some uses of flags to per-operation bitfields (so e.g., two traversals would have their _own_ flag data, and wouldn't risk stomping on each other's bits). The one downside would be that the index space would become more sparse. I.e., right now if you're only storing things for commits in a slab, you know that every slot you allocate is for a commit. But if we allocate an index for each object, then the commits are less likely to be together (so wasted memory and worse cache performance). That might be solvable by assigning a per-type index (with a few hacks to handle OBJ_NONE). Anyway, all of that is rather off the topic of this discussion. -Peff
On Fri, Sep 06, 2019 at 09:59:04AM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > So far so good. But now imagine we call parse_commit_buffer() again, and > > we re-parse. How does that interact with the half-parsed state? Some of > > it works OK (e.g., lookup_tree() would find the same tree). Some not so > > much (I think we'd keep appending parents at each call). > > > > I guess this might not be too bad to handle. Value fields like > > timestamp_t are OK to overwrite. Pointers to objects likewise, since the > > memory is owned elsewhere. If we see existing parent pointers in an > > object we're parsing, we could probably free them under the assumption > > they're leftover cruft. Likewise for the "tag" field of "struct tag", > > which is owned by the struct and should be freed. > > Yeah, or clear them before returning with .corrupt bit set? This was my attempt to avoid dealing with a .corrupt bit. :) -Peff
On 9/6/2019 1:04 PM, Jeff King wrote: > On Fri, Sep 06, 2019 at 12:48:05PM -0400, Derrick Stolee wrote: > >>> diff --git a/revision.h b/revision.h >>> index 4134dc6029..5c0b831b37 100644 >>> --- a/revision.h >>> +++ b/revision.h >>> @@ -33,7 +33,7 @@ >>> #define ALL_REV_FLAGS (((1u<<11)-1) | NOT_USER_GIVEN | TRACK_LINEAR) >>> >>> #define TOPO_WALK_EXPLORED (1u<<27) >>> -#define TOPO_WALK_INDEGREE (1u<<28) >>> +#define TOPO_WALK_INDEGREE (1u<<24) >> >> As an aside, these flag bit modifications look fine, but would need to >> be explained. I'm guessing that since you are adding a bit of data >> to struct object you want to avoid increasing the struct size across >> a 32-bit boundary. Are we sure that bit 24 is not used anywhere else? >> (My search for "1u<<24" found nothing, and "1 << 24" found a bit in >> the cache-entry flags, so this seems safe.) > > Yeah, I'd definitely break this up into several commits with explanation > (though see an alternate I posted that just uses the parsed flag without > any new bits). > > Bit 24 isn't used according to the table in objects.h, which is > _supposed_ to be the source of truth, though of course there's no > compiler-level checking. (One aside: is there a reason TOPO_WALK_* isn't > part of ALL_REV_FLAGS?). This was an oversight on my part. Sorry. -Stolee
On 9/6/2019 1:04 PM, Jeff King wrote: > On Fri, Sep 06, 2019 at 12:48:05PM -0400, Derrick Stolee wrote: > >>> diff --git a/revision.h b/revision.h >>> index 4134dc6029..5c0b831b37 100644 >>> --- a/revision.h >>> +++ b/revision.h >>> @@ -33,7 +33,7 @@ >>> #define ALL_REV_FLAGS (((1u<<11)-1) | NOT_USER_GIVEN | TRACK_LINEAR) >>> >>> #define TOPO_WALK_EXPLORED (1u<<27) >>> -#define TOPO_WALK_INDEGREE (1u<<28) >>> +#define TOPO_WALK_INDEGREE (1u<<24) >> >> As an aside, these flag bit modifications look fine, but would need to >> be explained. I'm guessing that since you are adding a bit of data >> to struct object you want to avoid increasing the struct size across >> a 32-bit boundary. Are we sure that bit 24 is not used anywhere else? >> (My search for "1u<<24" found nothing, and "1 << 24" found a bit in >> the cache-entry flags, so this seems safe.) > > Yeah, I'd definitely break this up into several commits with explanation > (though see an alternate I posted that just uses the parsed flag without > any new bits). > > Bit 24 isn't used according to the table in objects.h, which is > _supposed_ to be the source of truth, though of course there's no > compiler-level checking. (One aside: is there a reason TOPO_WALK_* isn't > part of ALL_REV_FLAGS?). > > And yes, the goal was to keep things to the 32-bit boundary. But in the > course of this, I discovered something interesting: 64-bit systems are > now padding this up to the 8-byte boundary! > > The culprit is the switch of GIT_MAX_RAWSZ for sha256. Before then, our > object_id was 20 bytes for sha1. Adding 4 bytes of flags still left us > at 24 bytes, which is both 4- and 8-byte aligned. > > With the switch to sha256, object_id is now 32 bytes. Adding 4 bytes > takes us to 36, and then 8-byte aligning the struct takes us to 40 > bytes, with 4 bytes of wasted padding. > > I'm sorely tempted to use this as an opportunity to move commit->index > into "struct object". That would actually shrink commit object sizes by > 4 bytes, and would let all object types do the commit-slab trick to > store object data with constant-time lookup. This would make it possible > to migrate some uses of flags to per-operation bitfields (so e.g., two > traversals would have their _own_ flag data, and wouldn't risk stomping > on each other's bits). This reminds me that I'm hoping to eventually get around to moving "generation" into a commit slab. That would reduce the space for people still working without a commit-graph, and would allow updating to generation number v2 (which needs 64 bits of data). -Stolee
Jeff King <peff@peff.net> writes: > On Fri, Sep 06, 2019 at 09:59:04AM -0700, Junio C Hamano wrote: > >> Jeff King <peff@peff.net> writes: >> >> > So far so good. But now imagine we call parse_commit_buffer() again, and >> > we re-parse. How does that interact with the half-parsed state? Some of >> > it works OK (e.g., lookup_tree() would find the same tree). Some not so >> > much (I think we'd keep appending parents at each call). >> > >> > I guess this might not be too bad to handle. Value fields like >> > timestamp_t are OK to overwrite. Pointers to objects likewise, since the >> > memory is owned elsewhere. If we see existing parent pointers in an >> > object we're parsing, we could probably free them under the assumption >> > they're leftover cruft. Likewise for the "tag" field of "struct tag", >> > which is owned by the struct and should be freed. >> >> Yeah, or clear them before returning with .corrupt bit set? > > This was my attempt to avoid dealing with a .corrupt bit. :) Then, clear them before returning with .parsed bit clear?
On Mon, Sep 09, 2019 at 09:39:41AM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > On Fri, Sep 06, 2019 at 09:59:04AM -0700, Junio C Hamano wrote: > > > >> Jeff King <peff@peff.net> writes: > >> > >> > So far so good. But now imagine we call parse_commit_buffer() again, and > >> > we re-parse. How does that interact with the half-parsed state? Some of > >> > it works OK (e.g., lookup_tree() would find the same tree). Some not so > >> > much (I think we'd keep appending parents at each call). > >> > > >> > I guess this might not be too bad to handle. Value fields like > >> > timestamp_t are OK to overwrite. Pointers to objects likewise, since the > >> > memory is owned elsewhere. If we see existing parent pointers in an > >> > object we're parsing, we could probably free them under the assumption > >> > they're leftover cruft. Likewise for the "tag" field of "struct tag", > >> > which is owned by the struct and should be freed. > >> > >> Yeah, or clear them before returning with .corrupt bit set? > > > > This was my attempt to avoid dealing with a .corrupt bit. :) > > Then, clear them before returning with .parsed bit clear? But then somebody calling parse_commit() and getting an error return doesn't get to see the full extent of what we managed to parse. Which I thought was one of your original points: people should be able to get whatever information we can get out of even a corrupted or malformed object. If we keep that requirement, then we have to either clear them at the start of the re-parsing step, or we have to avoid re-parsing entirely by using an extra bit to say "parsed but had an error". I've sent messy "something like this" patches for both strategies. I think the latter is cleaner conceptually, but the former doesn't require giving up a flag bit. -Peff
diff --git a/commit-graph.c b/commit-graph.c index f2888c203b..6aa6998ecd 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -843,7 +843,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, uint32_t packedDate[2]; display_progress(ctx->progress, ++ctx->progress_cnt); - parse_commit_no_graph(*list); + if (parse_commit_no_graph(*list)) + die(_("unable to parse commit %s"), + oid_to_hex(&(*list)->object.oid)); hashwrite(f, get_commit_tree_oid(*list)->hash, hash_len); parent = (*list)->parents;
When we write a commit graph chunk, we process a given list of 'struct commit *'s and parse out the parent(s) and tree OID in order to write out its information. We do this by calling 'parse_commit_no_graph', and then checking the result of 'get_commit_tree_oid' to write the tree OID. This process assumes that 'parse_commit_no_graph' parses the commit successfully. When this isn't the case, 'get_commit_tree_oid(*list)' may return NULL, in which case trying to '->hash' it causes a SIGSEGV. Instead, teach 'write_graph_chunk_data' to stop when a commit isn't able to be parsed, at the peril of failing to write a commit-graph. Signed-off-by: Taylor Blau <me@ttaylorr.com> --- commit-graph.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)