Message ID | Y8hX+pIZUKXsyYj5@coredump.intra.peff.net (mailing list archive)
---|---
Series | hash-object: use fsck to check objects
On Wed, Jan 18, 2023 at 03:35:06PM -0500, Jeff King wrote:

> The other option is having the fsck code avoid looking past the size it
> was given. I think the intent is that this should work, from commits
> like 4d0d89755e (Make sure fsck_commit_buffer() does not run out of the
> buffer, 2014-09-11). We do use skip_prefix() and parse_oid_hex(), which
> won't respect the size, but I think[1] that's OK because we'll have
> parsed up to the end-of-header beforehand (and those functions would
> never match past there).
>
> Which would mean that 9a1a3a4d4c (mktag: allow omitting the header/body
> \n separator, 2021-01-05) and acf9de4c94 (mktag: use fsck instead of
> custom verify_tag(), 2021-01-05) were buggy, and we can just fix them.

That would look something like this:

diff --git a/fsck.c b/fsck.c
index c2c8facd2d..d220276bcb 100644
--- a/fsck.c
+++ b/fsck.c
@@ -898,6 +898,7 @@ int fsck_tag_standalone(const struct object_id *oid, const char *buffer,
 {
 	int ret = 0;
 	char *eol;
+	const char *eob = buffer + size;
 	struct strbuf sb = STRBUF_INIT;
 	const char *p;
 
@@ -960,10 +961,8 @@ int fsck_tag_standalone(const struct object_id *oid, const char *buffer,
 	} else
 		ret = fsck_ident(&buffer, oid, OBJ_TAG, options);
 
-	if (!*buffer)
-		goto done;
-
-	if (!starts_with(buffer, "\n")) {
+	if (buffer != eob && *buffer != '\n') {
 		/*
 		 * The verify_headers() check will allow
 		 * e.g. "[...]tagger <tagger>\nsome

Changing the starts_with() is not strictly necessary, but I think it
makes it more clear that we are only going to look at the one character
we confirmed is still valid inside the buffer.

This is enough to have the whole test suite pass with ASan/UBSan after
my series. But as I said earlier, I'd want to look carefully at the rest
of the fsck code to make sure there aren't any other possible inputs
that could look past the end of the buffer.

-Peff
Jeff King <peff@peff.net> writes:

> [1/6]: t1007: modernize malformed object tests

Obviously good.

> [2/6]: t1006: stop using 0-padded timestamps
> [3/6]: t7030: stop using invalid tag name

These two are pleasant to see; quite surprisingly, they revealed what
is "accepted" by mistake.

> [4/6]: t: use hash-object --literally when created malformed objects

The --literally option was invented initially primarily to allow a
bogus type of object (e.g. "hash-object -t xyzzy --literally"), but I
am happy to see that we are finding different uses. I wonder if
these objects of known types but with syntactically bad contents can
be "repack"ed from loose into packed?

> [5/6]: fsck: provide a function to fsck buffer without object struct

Obvious, clean and very nice.

> [6/6]: hash-object: use fsck for object checks
On Wed, Jan 18, 2023 at 12:59:24PM -0800, Junio C Hamano wrote:

> The --literally option was invented initially primarily to allow a
> bogus type of object (e.g. "hash-object -t xyzzy --literally") but I
> am happy to see that we are finding different uses. I wonder if
> these objects of known types but with syntactically bad contents can
> be "repack"ed from loose into packed?
>
> > [5/6]: fsck: provide a function to fsck buffer without object struct

It is indeed possible:

--- >8 ---
Initialized empty Git repository in /home/ttaylorr/src/git/t/trash directory.t9999-test/.git/
expecting success of 9999.1 'repacking corrupt loose object into packed':
	name=$(echo $ZERO_OID | sed -e "s/00/Q/g") &&
	printf "100644 fooQ$name" | q_to_nul |
	git hash-object -w --stdin -t tree >in &&

	git pack-objects .git/objects/pack/pack <in

Enumerating objects: 1, done.
Counting objects: 100% (1/1), done.
06146c77fd19c096858d6459d602be0fdf10891b
Writing objects: 100% (1/1), done.
Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
ok 1 - repacking corrupt loose object into packed
--- 8< ---

Thanks,
Taylor
On Wed, Jan 18, 2023 at 03:35:06PM -0500, Jeff King wrote:

> The other option is having the fsck code avoid looking past the size it
> was given. I think the intent is that this should work, from commits
> like 4d0d89755e (Make sure fsck_commit_buffer() does not run out of the
> buffer, 2014-09-11). We do use skip_prefix() and parse_oid_hex(), which
> won't respect the size, but I think[1] that's OK because we'll have
> parsed up to the end-of-header beforehand (and those functions would
> never match past there).
>
> Which would mean that 9a1a3a4d4c (mktag: allow omitting the header/body
> \n separator, 2021-01-05) and acf9de4c94 (mktag: use fsck instead of
> custom verify_tag(), 2021-01-05) were buggy, and we can just fix them.
>
> [1] But I said "I think" above because it can get pretty subtle. There's
>     some more discussion in this thread:
>
>       https://lore.kernel.org/git/20150625155128.C3E9738005C@gemini.denx.de/
>
>     but I haven't yet convinced myself it's safe. This is exactly the
>     kind of analysis I wish I had the power to nerd-snipe René into.

I poked at this a bit more, and it definitely isn't safe. I think the
use of skip_prefix(), etc, is OK, because they'd always stop at an
unexpected newline. But verify_headers() is only confirming that we have
a series of complete lines, and we might end with no "\n\n" (and hence
no commit/tag message).

And so the obvious case that fools us is one where the data simply ends
at a newline, but we are missing one or more headers. So a truncated
commit like:

  tree 1234567890123456789012345678901234567890

(with the newline at the end of the "tree" line, but nothing else) will
cause fsck_commit() to look past the "size" we pass it. With all of the
current callers, that means it sees a NUL and bails. So it's not
currently a bug, but it becomes one if we can feed it arbitrary buffers.
Fixing it isn't _too_ bad, and could look something like this:

diff --git a/fsck.c b/fsck.c
index c2c8facd2d..3f0bb3e350 100644
--- a/fsck.c
+++ b/fsck.c
@@ -834,6 +834,7 @@ static int fsck_commit(const struct object_id *oid,
 	unsigned author_count;
 	int err;
 	const char *buffer_begin = buffer;
+	const char *buffer_end = buffer + size;
 	const char *p;
 
 	if (verify_headers(buffer, size, oid, OBJ_COMMIT, options))
@@ -847,7 +848,7 @@ static int fsck_commit(const struct object_id *oid,
 		return err;
 	}
 	buffer = p + 1;
-	while (skip_prefix(buffer, "parent ", &buffer)) {
+	while (buffer < buffer_end && skip_prefix(buffer, "parent ", &buffer)) {
 		if (parse_oid_hex(buffer, &parent_oid, &p) || *p != '\n') {
 			err = report(options, oid, OBJ_COMMIT,
 				     FSCK_MSG_BAD_PARENT_SHA1,
 				     "invalid 'parent' line format - bad sha1");
 			if (err)
@@ -856,7 +857,7 @@ static int fsck_commit(const struct object_id *oid,
 		buffer = p + 1;
 	}
 	author_count = 0;
-	while (skip_prefix(buffer, "author ", &buffer)) {
+	while (buffer < buffer_end && skip_prefix(buffer, "author ", &buffer)) {
 		author_count++;
 		err = fsck_ident(&buffer, oid, OBJ_COMMIT, options);
 		if (err)
@@ -868,7 +869,7 @@ static int fsck_commit(const struct object_id *oid,
 		err = report(options, oid, OBJ_COMMIT,
 			     FSCK_MSG_MULTIPLE_AUTHORS,
 			     "invalid format - multiple 'author' lines");
 	if (err)
 		return err;
-	if (!skip_prefix(buffer, "committer ", &buffer))
+	if (buffer >= buffer_end || !skip_prefix(buffer, "committer ", &buffer))
 		return report(options, oid, OBJ_COMMIT, FSCK_MSG_MISSING_COMMITTER,
 			      "invalid format - expected 'committer' line");
 	err = fsck_ident(&buffer, oid, OBJ_COMMIT, options);
 	if (err)

And then the tag side would need something similar. I'd probably also
sprinkle some comments in verify_headers() and its callers documenting
our assumptions and what's OK to do (string-like parsing functions work
as long as they stop when they hit a newline). That, plus a few tests
covering the problematic cases to avoid regressions, would probably be
OK.
I think fsck_tree() is mostly fine, as the tree-iterating code detects
truncation. Though I do find the use of strlen() in decode_tree_entry()
a little suspicious (and that would be true of the current code, as
well, since it powers hash-object's existing parsing check).

-Peff
On Wed, Jan 18, 2023 at 04:38:40PM -0500, Taylor Blau wrote:

> On Wed, Jan 18, 2023 at 12:59:24PM -0800, Junio C Hamano wrote:
> > The --literally option was invented initially primarily to allow a
> > bogus type of object (e.g. "hash-object -t xyzzy --literally") but I
> > am happy to see that we are finding different uses. I wonder if
> > these objects of known types but with syntactically bad contents can
> > be "repack"ed from loose into packed?
> >
> > > [5/6]: fsck: provide a function to fsck buffer without object struct
>
> It is indeed possible:
>
> --- >8 ---
> Initialized empty Git repository in /home/ttaylorr/src/git/t/trash directory.t9999-test/.git/
> expecting success of 9999.1 'repacking corrupt loose object into packed':
> 	name=$(echo $ZERO_OID | sed -e "s/00/Q/g") &&
> 	printf "100644 fooQ$name" | q_to_nul |
> 	git hash-object -w --stdin -t tree >in &&
>
> 	git pack-objects .git/objects/pack/pack <in
>
> Enumerating objects: 1, done.
> Counting objects: 100% (1/1), done.
> 06146c77fd19c096858d6459d602be0fdf10891b
> Writing objects: 100% (1/1), done.
> Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
> ok 1 - repacking corrupt loose object into packed
> --- 8< ---

Right, we don't do any fsck-ing when packing objects. Nor should we, I
think. We should be checking objects when they come into the repository
(via index-pack/unpack-objects) or when they're created (hash-object),
but there's little need to do so when they migrate between storage
formats.

The fact that "--literally" manually writes a loose object is mostly an
implementation detail. I think if we are not writing an object with an
esoteric type, it could even just hit the regular index_fd() code path
(and drop the HASH_FORMAT_CHECK flag).

If you do write one with "-t xyzzy", I think pack-objects would barf,
but not because of fsck checks. It just couldn't represent that type
(which really makes such objects pretty pointless; you cannot ever fetch
or push them!).

-Peff
On 19.01.23 at 02:39, Jeff King wrote:
>
> Though I do find the use of strlen() in decode_tree_entry()
> a little suspicious (and that would be true of the current code, as
> well, since it powers hash-object's existing parsing check).

strlen() won't overrun the buffer because the first check in
decode_tree_entry() makes sure there is a NUL in the buffer ahead.
If get_mode() crosses it then we exit early.

Storing the result in an unsigned int can overflow on platforms where
size_t is bigger. That would result in pathlen values being too short
for really long paths, but no out-of-bounds access. They are then
stored as signed int in struct name_entry and used as such in many
places -- that seems like a bad idea, but I didn't actually check them
thoroughly.

get_mode() can overflow "mode" if there are too many octal digits. Do
we need to accept more than two handfuls in the first place? I'll send
a patch for at least rejecting overflow.

Hmm, what would be the performance impact of trees with mode fields
zero-padded to silly lengths?

René
On Sat, Jan 21, 2023 at 10:36:08AM +0100, René Scharfe wrote:

> On 19.01.23 at 02:39, Jeff King wrote:
> >
> > Though I do find the use of strlen() in decode_tree_entry()
> > a little suspicious (and that would be true of the current code, as
> > well, since it powers hash-object's existing parsing check).
>
> strlen() won't overrun the buffer because the first check in
> decode_tree_entry() makes sure there is a NUL in the buffer ahead.
> If get_mode() crosses it then we exit early.

Yeah, that was what I found after digging deeper (see my patch 7).

> Storing the result in an unsigned int can overflow on platforms where
> size_t is bigger. That would result in pathlen values being too short
> for really long paths, but no out-of-bounds access. They are then
> stored as signed int in struct name_entry and used as such in many
> places -- that seems like a bad idea, but I didn't actually check them
> thoroughly.

Yeah, I agree that the use of a signed int there looks questionable. I
do think it's orthogonal to my series here, as that tree-decoding is
used by the existing hash-object checks.

But it probably bears further examination, especially because we use it
for the fsck checks on incoming objects via receive-pack, etc, which are
meant to be the first line of defense for hosters who might receive
malicious garbage from users.

We probably ought to reject trees with enormous names via fsck anyway. I
actually have a patch to do that, but of course it depends on
decode_tree_entry() to get the length, so there's a bit of
chicken-and-egg. We probably also should outright reject gigantic trees,
which closes out a whole class of integer truncation problems. I know
GitHub has rejected trees over 100MB for years for this reason.

> get_mode() can overflow "mode" if there are too many octal digits. Do
> we need to accept more than two handfuls in the first place? I'll send
> a patch for at least rejecting overflow.

Seems reasonable. I doubt there's an interesting attack here, just
because the mode isn't used to index anything. If you feed a garbage
mode that happens to overflow to something useful, you could just as
easily have sent the useful mode in the first place.

> Hmm, what would be the performance impact of trees with mode fields
> zero-padded to silly lengths?

Certainly it would waste some time parsing the tree, but you could do
that already with a long pathname. Or just having a lot of paths in a
tree. Or a lot of trees.

-Peff
On 22.01.23 at 08:48, Jeff King wrote:
> On Sat, Jan 21, 2023 at 10:36:08AM +0100, René Scharfe wrote:
>
>> On 19.01.23 at 02:39, Jeff King wrote:
>>>
>>> Though I do find the use of strlen() in decode_tree_entry()
>>> a little suspicious (and that would be true of the current code, as
>>> well, since it powers hash-object's existing parsing check).
>>
>> strlen() won't overrun the buffer because the first check in
>> decode_tree_entry() makes sure there is a NUL in the buffer ahead.
>> If get_mode() crosses it then we exit early.
>
> Yeah, that was what I found after digging deeper (see my patch 7).
>
>> Storing the result in an unsigned int can overflow on platforms where
>> size_t is bigger. That would result in pathlen values being too short
>> for really long paths, but no out-of-bounds access. They are then
>> stored as signed int in struct name_entry and used as such in many
>> places -- that seems like a bad idea, but I didn't actually check them
>> thoroughly.
>
> Yeah, I agree that the use of a signed int there looks questionable. I
> do think it's orthogonal to my series here, as that tree-decoding is
> used by the existing hash-object checks.

Sure.

> But it probably bears further examination, especially because we use it
> for the fsck checks on incoming objects via receive-pack, etc, which are
> meant to be the first line of defense for hosters who might receive
> malicious garbage from users.
>
> We probably ought to reject trees with enormous names via fsck anyway. I
> actually have a patch to do that, but of course it depends on
> decode_tree_entry() to get the length, so there's a bit of
> chicken-and-egg.

Solvable by limiting the search for the end of the string in
decode_tree_entry() by using strnlen(3) or memchr(3) instead of
strlen(3). You just need to define some (configurable?) limit.

> We probably also should outright reject gigantic trees,
> which closes out a whole class of integer truncation problems. I know
> GitHub has rejected trees over 100MB for years for this reason.

Makes sense.

>> get_mode() can overflow "mode" if there are too many octal digits. Do
>> we need to accept more than two handfuls in the first place? I'll send
>> a patch for at least rejecting overflow.
>
> Seems reasonable. I doubt there's an interesting attack here, just
> because the mode isn't used to index anything. If you feed a garbage
> mode that happens to overflow to something useful, you could just as
> easily have sent the useful mode in the first place.
>
>> Hmm, what would be the performance impact of trees with mode fields
>> zero-padded to silly lengths?
>
> Certainly it would waste some time parsing the tree, but you could do
> that already with a long pathname. Or just having a lot of paths in a
> tree. Or a lot of trees.

That's a cup half full/empty thing, perhaps. What's the harm in
leading zeros? vs. Why allow leading zeros?

René
On Sun, Jan 22 2023, René Scharfe wrote:

> On 22.01.23 at 08:48, Jeff King wrote:
>> We probably also should outright reject gigantic trees,
>> which closes out a whole class of integer truncation problems. I know
>> GitHub has rejected trees over 100MB for years for this reason.
>
> Makes sense.

I really don't think it does; let's not forever encode arbitrary limits
in the formats because of transitory implementation details.

Those sorts of arbitrary limits are understandable for hosting
providers, and a sensible trade-off on that front. But for git as a
general tool? I'd like to be able to throw whatever garbage I've got
locally at it, and not have it complain.

We already have a deluge of int vs. unsigned int vs. size_t warnings
that we could address; we're just choosing to suppress them currently.
Instead we have hacks like cast_size_t_to_int(). Those sorts of hacks
are understandable as band-aid fixes, but let's work on fixing the real
causes.