diff mbox series

[01/10] loose_object_info(): BUG() on inflating content with unknown type

Message ID 20250225062824.GA1293961@coredump.intra.peff.net (mailing list archive)
State New
Headers show
Series some zlib inflating bug fixes | expand

Commit Message

Jeff King Feb. 25, 2025, 6:28 a.m. UTC
After unpack_loose_header() returns, it will have inflated not only the
object header, but possibly some bytes of the object content. When we
call unpack_loose_rest() to extract the actual content, it finds those
extra bytes by skipping past the header's terminating NUL in the buffer.
Like this:

  int bytes = strlen(buffer) + 1;
  n = stream->total_out - bytes;
  ...
  memcpy(buf, (char *) buffer + bytes, n);

This won't work with the OBJECT_INFO_ALLOW_UNKNOWN_TYPE flag, as there
we allow a header of arbitrary size. We put into a strbuf, but feed only
the final 32-byte chunk we read to unpack_loose_rest(). In that case
stream->total_out may unexpectedly large, and thus our "n" will be
large, causing an out-of-bounds read (we do check it against our
allocated buffer size, which prevents an out-of-bounds write).

Probably this could be made to work by feeding the strbuf to
unpack_loose_rest(), along with adjusting some types (e.g., "bytes"
would need to be a size_t, since it is no longer operating on a 32-byte
buffer).

But I don't think it's possible to actually trigger this in practice.
The only caller who passes ALLOW_UNKNOWN_TYPE is cat-file, which only
allows it with the "-t" and "-s" options (neither of which access the
content). There is one way you can _almost_ trigger it: the oid compat
routines (i.e., accessing sha1 via sha256 names and vice versa) will
convert objects on the fly (which requires access to the content) using
the same flags that were passed in. So in theory this:

  t='some very large type field that causes an extra inflate call'
  sha1_oid=$(git hash-object -w -t "$t" file)
  sha256_oid=$(git rev-parse --output-object-format=sha256 $sha1_oid)
  git cat-file --allow-unknown-type -s $sha256_oid

would try to access the content. But it doesn't work, because using
compat objects requires an entry in the .git/objects/loose-object-idx
file, and we don't generate such an entry for non-standard types (see
the "compat" section of write_object_file_literally()).

If we use "t=blob" instead, then it does access the compat object, but
it doesn't trigger the problem (because "blob" is a standard short type
name, and it fits in the initial 32-byte buffer).

So given that this is almost a memory error bug, I think it's worth
addressing. But because we can't actually trigger the situation, I'm
hesitant to try a fix that we can't run. Instead let's document the
restriction and protect ourselves from the out-of-bounds read by adding
a BUG() check.

Signed-off-by: Jeff King <peff@peff.net>
---
I found this because I was tracing the code path after
unpack_loose_header() returns to verify some assumptions in the other
patches.

It really makes me wonder if this "unknown type" stuff has any value
at all. You can create an object with any type using "hash-object
--literally -t". And you can ask about its type and size. But you can
never retrieve the object content! Nor can you pack it or transfer it,
since packs use a numeric type field.

This code was added ~2015, but I don't think anybody built more on top
of it. I wonder if we should just consider it a failed experiment and
rip out the support.

 object-file.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Patrick Steinhardt Feb. 25, 2025, 11:42 a.m. UTC | #1
On Tue, Feb 25, 2025 at 01:28:24AM -0500, Jeff King wrote:
> After unpack_loose_header() returns, it will have inflated not only the
> object header, but possibly some bytes of the object content. When we
> call unpack_loose_rest() to extract the actual content, it finds those
> extra bytes by skipping past the header's terminating NUL in the buffer.
> Like this:
> 
>   int bytes = strlen(buffer) + 1;
>   n = stream->total_out - bytes;
>   ...
>   memcpy(buf, (char *) buffer + bytes, n);
> 
> This won't work with the OBJECT_INFO_ALLOW_UNKNOWN_TYPE flag, as there
> we allow a header of arbitrary size. We put into a strbuf, but feed only

s/into/it &/

> the final 32-byte chunk we read to unpack_loose_rest(). In that case
> stream->total_out may unexpectedly large, and thus our "n" will be

s/may/& be/

> large, causing an out-of-bounds read (we do check it against our
> allocated buffer size, which prevents an out-of-bounds write).
> 
> Probably this could be made to work by feeding the strbuf to
> unpack_loose_rest(), along with adjusting some types (e.g., "bytes"
> would need to be a size_t, since it is no longer operating on a 32-byte
> buffer).

I was a bit confused initially as I was thinking in terms of `size_t`
and `uint32_t` as I misread 32-byte for 32-bit, which is an immediate
shortcut that my brain took because 32 bit is something you read all the
time. I don't really have a great idea for how to introduce the byte
chunk better though to avoid this.

> Signed-off-by: Jeff King <peff@peff.net>
> ---
> I found this because I was tracing the code path after
> unpack_loose_header() returns to verify some assumptions in the other
> patches.
> 
> It really makes me wonder if this "unknown type" stuff has any value
> at all. You can create an object with any type using "hash-object
> --literally -t". And you can ask about its type and size. But you can
> never retrieve the object content! Nor can you pack it or transfer it,
> since packs use a numeric type field.
> 
> This code was added ~2015, but I don't think anybody built more on top
> of it. I wonder if we should just consider it a failed experiment and
> rip out the support.

I certainly do not know and cannot think of any usecase for this. I also
expect that a repository with unknown object types is a recipe for weird
edge cases in case they are being read somehow.

I guess we'll know a bit more when this patch series lands? In case
nobody complains it is another indicator that unknown object types
aren't being used out there in the wild.

>  object-file.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/object-file.c b/object-file.c
> index 00c3a4b910..45b251ba04 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1580,6 +1580,8 @@ static int loose_object_info(struct repository *r,
>  
>  		if (!oi->contentp)
>  			break;
> +		if (hdrbuf.len)
> +			BUG("unpacking content with unknown types not yet supported");
>  		*oi->contentp = unpack_loose_rest(&stream, hdr, *oi->sizep, oid);
>  		if (*oi->contentp)
>  			goto cleanup;

Okay. I was wondering whether we still need `hdrbuf`, but we of course
do in order to continue reading the type and length itself. The only
thing we restrict is reading the contents of such objects.

Patrick
Junio C Hamano Feb. 26, 2025, 1:47 a.m. UTC | #2
Jeff King <peff@peff.net> writes:

> It really makes me wonder if this "unknown type" stuff has any value
> at all. You can create an object with any type using "hash-object
> --literally -t". And you can ask about its type and size. But you can
> never retrieve the object content! Nor can you pack it or transfer it,
> since packs use a numeric type field.

Correct.  IIRC, the "--literally" support was mostly for debugging,
and as you noticed, is very much limited because it can only create
funny objects that are loose.  And the debugging was not really about
adding more object types, but was more about "what would our code do
when we see an object that is corrupt whose type we do not recognise".

I personally think the "--literally" should not survive the Git 3.0
boundary.

Thanks.
diff mbox series

Patch

diff --git a/object-file.c b/object-file.c
index 00c3a4b910..45b251ba04 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1580,6 +1580,8 @@  static int loose_object_info(struct repository *r,
 
 		if (!oi->contentp)
 			break;
+		if (hdrbuf.len)
+			BUG("unpacking content with unknown types not yet supported");
 		*oi->contentp = unpack_loose_rest(&stream, hdr, *oi->sizep, oid);
 		if (*oi->contentp)
 			goto cleanup;