diff mbox series

[03/23] pack-bitmap: bounds-check size of cache extension

Message ID 1573902df00e8a14a9cb68c37f55474388b1dc2e.1605123652.git.me@ttaylorr.com (mailing list archive)
State Superseded
Headers show
Series pack-bitmap: bitmap generation improvements | expand

Commit Message

Taylor Blau Nov. 11, 2020, 7:42 p.m. UTC
From: Jeff King <peff@peff.net>

A .bitmap file may have a "name hash cache" extension, which puts a
sequence of uint32_t bytes (one per object) at the end of the file. When
we see a flag indicating this extension, we blindly subtract the
appropriate number of bytes from our available length. However, if the
.bitmap file is too short, we'll underflow our length variable and wrap
around, thinking we have a very large length. This can lead to reading
out-of-bounds bytes while loading individual ewah bitmaps.

We can fix this by checking the number of available bytes when we parse
the header. The existing "truncated bitmap" test is now split into two
tests: one where we don't have this extension at all (and hence actually
do try to read a truncated ewah bitmap) and one where we realize
up-front that we can't even fit in the cache structure. We'll check
stderr in each case to make sure we hit the error we're expecting.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 pack-bitmap.c           |  8 ++++++--
 t/t5310-pack-bitmaps.sh | 17 +++++++++++++++--
 2 files changed, 21 insertions(+), 4 deletions(-)

Comments

Martin Ågren Nov. 12, 2020, 5:47 p.m. UTC | #1
On Wed, 11 Nov 2020 at 20:43, Taylor Blau <me@ttaylorr.com> wrote:
>
> A .bitmap file may have a "name hash cache" extension, which puts a
> sequence of uint32_t bytes (one per object) at the end of the file. When

s/bytes/values/, perhaps?

> we see a flag indicating this extension, we blindly subtract the
> appropriate number of bytes from our available length. However, if the
> .bitmap file is too short, we'll underflow our length variable and wrap
> around, thinking we have a very large length. This can lead to reading
> out-of-bounds bytes while loading individual ewah bitmaps.

> +               uint32_t cache_size = st_mult(index->pack->num_objects, sizeof(uint32_t));

Hmm. If `sizeof(size_t)` is 8, then this multiplication can't possibly
overflow. A huge value of `num_objects` (say, 0xffffffff) would give a
huge return value (0xffffffff<<2) which would be truncated (0xfffffffc).
I think?

Do we want a `u32_mult()`?

> +               unsigned char *index_end = index->map + index->map_size - the_hash_algo->rawsz;

The addition should be ok or mmap has failed on us. Do we know that we
have room for the final hash there so that the subtraction is ok? Yes,
from the previous commit, we know we have room for the header, which is
even larger. But that's cheating a bit -- see below.

> +                       if (index->map + header_size + cache_size > index_end)
> +                               return error("corrupted bitmap index file (too short to fit hash cache)");
> +                       index->hashes = (void *)(index_end - cache_size);
> +                       index_end -= cache_size;

If the header size we're adding is indeed too large, the addition in the
check would be undefined behavior, if I'm not mistaken. In practical
terms, with 32-bit pointers and a huge size, we'd wrap around, decide
that everything is ok and go on to do the same erroneous subtraction as
before.

Maybe shuffle a few things over from the left to the right to only make
subtractions that we know are ok:

  if (cache_size > index_end - index->map - header_size)

One could substitute for `index_end - index_map` and end up with

  if (cache_size > index->map_size - header_size - the_hash_algo->rawsz)

Maybe that's clearer in a way, or maybe then it's not so obvious that
the subtraction that follows matches this check.

But I don't think we can fully trust those subtractions. We're
subtracting the size of two hashes (one in the header, one in the
footer), but after the previous patch, we only know that there's room
for one. So probably the previous patch could go

  +       /*
  +        * Verify that we have room for the header and the
  +        * trailing checksum hash, so we can safely subtract
  +        * their sizes from map_size. We can afford to be
  +        * a bit imprecise with the error message.
  +        */
  -       if (index->map_size < sizeof(*header) + the_hash_algo->rawsz)
  +       if (index->map_size < header_size + the_hash_algo->rawsz)

I *think* I've got most of my comments here somewhat right, but I could
easily have missed something.

Martin
Jeff King Nov. 13, 2020, 4:57 a.m. UTC | #2
On Thu, Nov 12, 2020 at 06:47:09PM +0100, Martin Ågren wrote:

> > A .bitmap file may have a "name hash cache" extension, which puts a
> > sequence of uint32_t bytes (one per object) at the end of the file. When
> 
> s/bytes/values/, perhaps?

Yeah, definitely.

> > we see a flag indicating this extension, we blindly subtract the
> > appropriate number of bytes from our available length. However, if the
> > .bitmap file is too short, we'll underflow our length variable and wrap
> > around, thinking we have a very large length. This can lead to reading
> > out-of-bounds bytes while loading individual ewah bitmaps.
> 
> > +               uint32_t cache_size = st_mult(index->pack->num_objects, sizeof(uint32_t));
> 
> Hmm. If `sizeof(size_t)` is 8, then this multiplication can't possibly
> overflow. A huge value of `num_objects` (say, 0xffffffff) would give a
> huge return value (0xffffffff<<2) which would be truncated (0xfffffffc).
> I think?

Yeah, `cache_size` should absolutely be a `size_t`. If you have more
than a billion objects, obviously your cache is going to be bigger than
that. But most importantly, somebody can _claim_ to have a huge number
of objects and foil the size checks by wrapping around.

> Do we want a `u32_mult()`?

Nah, we should be doing this as a size_t in the first place. There are
similar problems with the .idx format, btw. I have a series to deal with
that which I've been meaning to post.

> > +               unsigned char *index_end = index->map + index->map_size - the_hash_algo->rawsz;
> 
> The addition should be ok or mmap has failed on us. Do we know that we
> have room for the final hash there so that the subtraction is ok? Yes,
> from the previous commit, we know we have room for the header, which is
> even larger. But that's cheating a bit -- see below.

Yeah, I agree this ought to be checking the minimum size against the
header _plus_ the trailer.

I think the previous patch is actually where it goes wrong. The original
was checking for a minimum of:

  if (index->map_size < sizeof(*header) + the_hash_algo->rawsz)

which is the header plus the trailer. We want to readjust for the
MAX_RAWSZ part of the header, so it should be:

  size_t header_size = sizeof(*header) - GIT_MAX_RAWSZ + the_hash_algo->rawsz;
  if (index->map_size < sizeof(*header) + the_hash_algo->rawsz)

> > +                       if (index->map + header_size + cache_size > index_end)
> > +                               return error("corrupted bitmap index file (too short to fit hash cache)");
> > +                       index->hashes = (void *)(index_end - cache_size);
> > +                       index_end -= cache_size;
> 
> If the header size we're adding is indeed too large, the addition in the
> check would be undefined behavior, if I'm not mistaken. In practical
> terms, with 32-bit pointers and a huge size, we'd wrap around, decide
> that everything is ok and go on to do the same erroneous subtraction as
> before.
> 
> Maybe shuffle a few things over from the left to the right to only make
> subtractions that we know are ok:
> 
>   if (cache_size > index_end - index->map - header_size)

Yes, I agree this should be done as a subtraction as you showed to avoid
integer overflow.

> But I don't think we can fully trust those subtractions. We're
> subtracting the size of two hashes (one in the header, one in the
> footer), but after the previous patch, we only know that there's room
> for one. So probably the previous patch could go
> 
>   +       /*
>   +        * Verify that we have room for the header and the
>   +        * trailing checksum hash, so we can safely subtract
>   +        * their sizes from map_size. We can afford to be
>   +        * a bit imprecise with the error message.
>   +        */
>   -       if (index->map_size < sizeof(*header) + the_hash_algo->rawsz)
>   +       if (index->map_size < header_size + the_hash_algo->rawsz)
> 
> I *think* I've got most of my comments here somewhat right, but I could
> easily have missed something.

Right. I think that's right, and the previous patch is just buggy.

-Peff
Martin Ågren Nov. 13, 2020, 5:26 a.m. UTC | #3
On Fri, 13 Nov 2020 at 05:57, Jeff King <peff@peff.net> wrote:
>
> On Thu, Nov 12, 2020 at 06:47:09PM +0100, Martin Ågren wrote:
>
> > > +               uint32_t cache_size = st_mult(index->pack->num_objects, sizeof(uint32_t));
> >
> > Hmm. If `sizeof(size_t)` is 8, then this multiplication can't possibly
> > overflow. A huge value of `num_objects` (say, 0xffffffff) would give a
> > huge return value (0xffffffff<<2) which would be truncated (0xfffffffc).
> > I think?
>
> Yeah, `cache_size` should absolutely be a `size_t`. If you have more
> than a billion objects, obviously your cache is going to be bigger than
> that. But most importantly, somebody can _claim_ to have a huge number
> of objects and foil the size checks by wrapping around.
>
> > Do we want a `u32_mult()`?
>
> Nah, we should be doing this as a size_t in the first place. There are
> similar problems with the .idx format, btw. I have a series to deal with
> that which I've been meaning to post.

Yes, that makes sense!

> >   if (cache_size > index_end - index->map - header_size)
>
> Yes, I agree this should be done as a subtraction as you showed to avoid
> integer overflow.

> >   -       if (index->map_size < sizeof(*header) + the_hash_algo->rawsz)
> >   +       if (index->map_size < header_size + the_hash_algo->rawsz)

> Right. I think that's right, and the previous patch is just buggy.

Martin
Taylor Blau Nov. 13, 2020, 9:29 p.m. UTC | #4
On Thu, Nov 12, 2020 at 11:57:00PM -0500, Jeff King wrote:
> On Thu, Nov 12, 2020 at 06:47:09PM +0100, Martin Ågren wrote:
>
> > The addition should be ok or mmap has failed on us. Do we know that we
> > have room for the final hash there so that the subtraction is ok? Yes,
> > from the previous commit, we know we have room for the header, which is
> > even larger. But that's cheating a bit -- see below.
>
> Yeah, I agree this ought to be checking the minimum size against the
> header _plus_ the trailer.
>
> I think the previous patch is actually where it goes wrong. The original
> was checking for a minimum of:
>
>   if (index->map_size < sizeof(*header) + the_hash_algo->rawsz)
>
> which is the header plus the trailer. We want to readjust for the
> MAX_RAWSZ part of the header, so it should be:
>
>   size_t header_size = sizeof(*header) - GIT_MAX_RAWSZ + the_hash_algo->rawsz;
>   if (index->map_size < sizeof(*header) + the_hash_algo->rawsz)

I'm not sure that I follow. If you apply this to the second patch in
this series, the only thing that changes is that it factors out:

  index->map_pos += ...;

into

  size_t header_size = ...;
  // ...
  index->map_pos += header_size;

What am I missing here?

Thanks,
Taylor
Jeff King Nov. 13, 2020, 9:39 p.m. UTC | #5
On Fri, Nov 13, 2020 at 04:29:54PM -0500, Taylor Blau wrote:

> > which is the header plus the trailer. We want to readjust for the
> > MAX_RAWSZ part of the header, so it should be:
> >
> >   size_t header_size = sizeof(*header) - GIT_MAX_RAWSZ + the_hash_algo->rawsz;
> >   if (index->map_size < sizeof(*header) + the_hash_algo->rawsz)
> 
> I'm not sure that I follow. If you apply this to the second patch in
> this series, the only thing that changes is that it factors out:
> 
>   index->map_pos += ...;
> 
> into
> 
>   size_t header_size = ...;
>   // ...
>   index->map_pos += header_size;
> 
> What am I missing here?

The problem is this hunk from patch 2:

> +       size_t header_size = sizeof(*header) - GIT_MAX_RAWSZ + the_hash_algo->rawsz;
> 
> -       if (index->map_size < sizeof(*header) + the_hash_algo->rawsz)
> +       if (index->map_size < header_size)
>                 return error("Corrupted bitmap index (missing header data)");

The header struct contains a field for the hash of the pack. So the
original code as taking that full header, and adding in another
current-algo rawsz to account for the trailer.

Afterwards, we adjust header_size to swap out the MAX_RAWSZ for the
current-algo rawsz. So header_size is a correct substitution for
sizeof(*header) now. But we still have to add back in
the_hash_algo->rawsz to account for the trailer. The second "+" line is
wrong to have removed it.

The later line we adjust:

> -       index->map_pos += sizeof(*header) - GIT_MAX_RAWSZ + the_hash_algo->rawsz;
> +       index->map_pos += header_size;

is correct. It's just skipping past the header, and doesn't care about
the trailer at all (and confusing the two is probably what led me to
write the bug in the first place).

-Peff
Taylor Blau Nov. 13, 2020, 9:49 p.m. UTC | #6
On Fri, Nov 13, 2020 at 04:39:42PM -0500, Jeff King wrote:
> The problem is this hunk from patch 2:
>
> > +       size_t header_size = sizeof(*header) - GIT_MAX_RAWSZ + the_hash_algo->rawsz;
> >
> > -       if (index->map_size < sizeof(*header) + the_hash_algo->rawsz)
> > +       if (index->map_size < header_size)
> >                 return error("Corrupted bitmap index (missing header data)");
>
> The header struct contains a field for the hash of the pack. So the
> original code as taking that full header, and adding in another
> current-algo rawsz to account for the trailer.
>
> Afterwards, we adjust header_size to swap out the MAX_RAWSZ for the
> current-algo rawsz. So header_size is a correct substitution for
> sizeof(*header) now. But we still have to add back in
> the_hash_algo->rawsz to account for the trailer. The second "+" line is
> wrong to have removed it.

Thanks for your patient explanation. This hunk should instead read:

+       size_t header_size = sizeof(*header) - GIT_MAX_RAWSZ + the_hash_algo->rawsz;

-       if (index->map_size < sizeof(*header) + the_hash_algo->rawsz)
+       if (index->map_size < header_size + the_hash_algo->rawsz)
                return error("Corrupted bitmap index (missing header data)");

That error might not necessarily be right (it could say "missing header
or trailer data"), though. I'm open to if you think it should be
changed or not.

Since we didn't realize this bug at the time, the rest of the patch
message is worded correctly, I believe.

> The later line we adjust:
>
> > -       index->map_pos += sizeof(*header) - GIT_MAX_RAWSZ + the_hash_algo->rawsz;
> > +       index->map_pos += header_size;
>
> is correct. It's just skipping past the header, and doesn't care about
> the trailer at all (and confusing the two is probably what led me to
> write the bug in the first place).

Right, makes sense.

> -Peff

Thanks,
Taylor
Jeff King Nov. 13, 2020, 10:11 p.m. UTC | #7
On Fri, Nov 13, 2020 at 04:49:28PM -0500, Taylor Blau wrote:

> Thanks for your patient explanation. This hunk should instead read:
> 
> +       size_t header_size = sizeof(*header) - GIT_MAX_RAWSZ + the_hash_algo->rawsz;
> 
> -       if (index->map_size < sizeof(*header) + the_hash_algo->rawsz)
> +       if (index->map_size < header_size + the_hash_algo->rawsz)
>                 return error("Corrupted bitmap index (missing header data)");
> 
> That error might not necessarily be right (it could say "missing header
> or trailer data"), though. I'm open to if you think it should be
> changed or not.

Yeah, I agree it's misleading. In the idx code path we just say "%s is
too small", which is more accurate.

-Peff
diff mbox series

Patch

diff --git a/pack-bitmap.c b/pack-bitmap.c
index cea3bb88bf..42d4824c76 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -153,14 +153,18 @@  static int load_bitmap_header(struct bitmap_index *index)
 	/* Parse known bitmap format options */
 	{
 		uint32_t flags = ntohs(header->options);
+		uint32_t cache_size = st_mult(index->pack->num_objects, sizeof(uint32_t));
+		unsigned char *index_end = index->map + index->map_size - the_hash_algo->rawsz;
 
 		if ((flags & BITMAP_OPT_FULL_DAG) == 0)
 			return error("Unsupported options for bitmap index file "
 				"(Git requires BITMAP_OPT_FULL_DAG)");
 
 		if (flags & BITMAP_OPT_HASH_CACHE) {
-			unsigned char *end = index->map + index->map_size - the_hash_algo->rawsz;
-			index->hashes = ((uint32_t *)end) - index->pack->num_objects;
+			if (index->map + header_size + cache_size > index_end)
+				return error("corrupted bitmap index file (too short to fit hash cache)");
+			index->hashes = (void *)(index_end - cache_size);
+			index_end -= cache_size;
 		}
 	}
 
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 8318781d2b..e2c3907a68 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -343,7 +343,8 @@  test_expect_success 'pack reuse respects --incremental' '
 	test_must_be_empty actual
 '
 
-test_expect_success 'truncated bitmap fails gracefully' '
+test_expect_success 'truncated bitmap fails gracefully (ewah)' '
+	test_config pack.writebitmaphashcache false &&
 	git repack -ad &&
 	git rev-list --use-bitmap-index --count --all >expect &&
 	bitmap=$(ls .git/objects/pack/*.bitmap) &&
@@ -352,7 +353,19 @@  test_expect_success 'truncated bitmap fails gracefully' '
 	mv -f $bitmap.tmp $bitmap &&
 	git rev-list --use-bitmap-index --count --all >actual 2>stderr &&
 	test_cmp expect actual &&
-	test_i18ngrep corrupt stderr
+	test_i18ngrep corrupt.ewah.bitmap stderr
+'
+
+test_expect_success 'truncated bitmap fails gracefully (cache)' '
+	git repack -ad &&
+	git rev-list --use-bitmap-index --count --all >expect &&
+	bitmap=$(ls .git/objects/pack/*.bitmap) &&
+	test_when_finished "rm -f $bitmap" &&
+	test_copy_bytes 512 <$bitmap >$bitmap.tmp &&
+	mv -f $bitmap.tmp $bitmap &&
+	git rev-list --use-bitmap-index --count --all >actual 2>stderr &&
+	test_cmp expect actual &&
+	test_i18ngrep corrupted.bitmap.index stderr
 '
 
 # have_delta <obj> <expected_base>