dir.c: skip .gitignore, etc larger than INT_MAX

Message ID	20240531120034.GA442032@coredump.intra.peff.net (mailing list archive)
State	Accepted
Commit	a2bc523e1ea2ef9b59eb0c26331b6e7d9dc5a812
Headers	show Received: from cloud.peff.net (cloud.peff.net [104.130.231.41]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 778C029A0 for <git@vger.kernel.org>; Fri, 31 May 2024 12:00:36 +0000 (UTC) Date: Fri, 31 May 2024 08:00:34 -0400 From: Jeff King <peff@peff.net> To: git@vger.kernel.org Subject: [PATCH] dir.c: skip .gitignore, etc larger than INT_MAX Message-ID: <20240531120034.GA442032@coredump.intra.peff.net> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline
Series	dir.c: skip .gitignore, etc larger than INT_MAX \| expand dir.c: skip .gitignore, etc larger than INT_MAX

Jeff King May 31, 2024, noon UTC

We use add_patterns() to read .gitignore, .git/info/exclude, etc, as
well as other pattern-like files like sparse-checkout. The parser for
these uses an "int" as an index, meaning that files over 2GB will
generally cause signed integer overflow and out-of-bounds access.

This is unlikely to happen in any real files, but we do read .gitignore
files from the tree. A malicious tree could cause an out-of-bounds read
and segfault (we also write NULs over newlines, so in theory it could be
an out-of-bounds write, too, but as we go char-by-char, the first thing
that happens is trying to read a negative 2GB offset).

We could fix the most obvious issue by replacing one "int" with a
"size_t". But there are tons of "int" sprinkled throughout this code for
things like pattern lengths, number of patterns, and so on. Since nobody
would actually want a 2GB .gitignore file, an easy defensive measure is
to just refuse to parse them.

The "int" in question is in add_patterns_from_buffer(), so we could
catch it there. But by putting the checks in its two callers, we can
produce more useful error messages.

Signed-off-by: Jeff King <peff@peff.net>
---
Just something I noticed while working on leaks nearby.

 dir.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

Junio C Hamano May 31, 2024, 3:05 p.m. UTC | #1

Jeff King <peff@peff.net> writes:

> The "int" in question is in add_patterns_from_buffer(), so we could
> catch it there. But by putting the checks in its two callers, we can
> produce more useful error messages.

Nice trick.  I wondered if we want INT_MAX/2 or something even
lower, but because once these things are read, we only parse the
contents so allowing up to INT_MAX is fine.  It is not like we read
this and that and concatenate them into a larger buffer.

Junio C Hamano May 31, 2024, 3:10 p.m. UTC | #2

Jeff King <peff@peff.net> writes:

> We use add_patterns() to read .gitignore, .git/info/exclude, etc, as
> well as other pattern-like files like sparse-checkout. The parser for
> these uses an "int" as an index, meaning that files over 2GB will
> generally cause signed integer overflow and out-of-bounds access.

I also wondered if we need similar protection on the attribute side,
but it turns out that we process the files one line at a time
without holding everything in core.  Both exclude and attribute
subsystem are not protected against parseed _result_ consuming too
much memory, but at least with your patch, we are covered on the
input side.

I wonder if it is worth rewriting the exclude side to stream like
attribute parsing, though.  It probalby is not.

Jeff King June 4, 2024, 10:39 a.m. UTC | #3

On Fri, May 31, 2024 at 08:05:03AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > The "int" in question is in add_patterns_from_buffer(), so we could
> > catch it there. But by putting the checks in its two callers, we can
> > produce more useful error messages.
> 
> Nice trick.  I wondered if we want INT_MAX/2 or something even
> lower, but because once these things are read, we only parse the
> contents so allowing up to INT_MAX is fine.  It is not like we read
> this and that and concatenate them into a larger buffer.

Yeah, I think all of the individual string lengths would be fine. It
_might_ be possible to overflow an array of pointers or similar, though.
E.g., if you have "a\n" repeated INT_MAX times, then you have INT_MAX/2
entries. An array of pointers-to-structs, assuming 8-byte pointers,
would then need INT_MAX*4 bytes. And indeed, I think we stuff these into
a pattern_list which uses ints.

In practice I think the call to ALLOC_GROW() would catch that as the
overflowed value turns into a large size_t. But it probably would be
safer to use a much smaller limit.

I was hoping to avoid making up an arbitrary number. But your question
about gitattributes reminded me that we already did something similar in
3c50032ff5 (attr: ignore overly large gitattributes files, 2022-12-01).
There it's a hard-coded 100MB limit (without even a config option).

Maybe we should just do the same here?

-Peff

Junio C Hamano June 4, 2024, 5:45 p.m. UTC | #4

Jeff King <peff@peff.net> writes:

> I was hoping to avoid making up an arbitrary number. But your question
> about gitattributes reminded me that we already did something similar in
> 3c50032ff5 (attr: ignore overly large gitattributes files, 2022-12-01).
> There it's a hard-coded 100MB limit (without even a config option).
>
> Maybe we should just do the same here?

Hmph, I thought the 100MB was only for blobs, as we stream the input
we read from a regular file, but we do have the same limit there
presumably to match what we do to blobs?  I do not mind that, but I
do not mind leaving it a future "consolidate various size limits on
control files used by Git" patch, that unifies the limit for attrs,
excludes, gitmodules, .git/config, etc.

Jeff King June 5, 2024, 8:03 a.m. UTC | #5

On Tue, Jun 04, 2024 at 10:45:04AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > I was hoping to avoid making up an arbitrary number. But your question
> > about gitattributes reminded me that we already did something similar in
> > 3c50032ff5 (attr: ignore overly large gitattributes files, 2022-12-01).
> > There it's a hard-coded 100MB limit (without even a config option).
> >
> > Maybe we should just do the same here?
> 
> Hmph, I thought the 100MB was only for blobs, as we stream the input
> we read from a regular file, but we do have the same limit there
> presumably to match what we do to blobs?  I do not mind that, but I
> do not mind leaving it a future "consolidate various size limits on
> control files used by Git" patch, that unifies the limit for attrs,
> excludes, gitmodules, .git/config, etc.

I think we enforce the limit for files, too. Which makes sense, as we'd
read .gitattributes checked out from an untrusted tree. Even if we
stream it line by line, we still end up scaling some structures with the
total number of lines (and of course the file length also bounds the
line length, though I think we have an even shorter line length limit).

I do not mind if consolidation waits for later, but I guess the
immediate question is whether we'd prefer to reduce the limit set by my
patch to a more security-conservative value. It would be easy to swap
out INT_MAX for a 100MB #define on top.

Maybe this?

-- >8 --
Subject: [PATCH] dir.c: reduce max pattern file size to 100MB

In a2bc523e1e (dir.c: skip .gitignore, etc larger than INT_MAX,
2024-05-31) we put capped the size of some files whose parsing code and
data structures used ints. Setting the limit to INT_MAX was a natural
spot, since we know the parsing code would misbehave above that.

But it also leaves the possibility of overflow errors when we multiply
that limit to allocate memory. For instance, a file consisting only of
"a\na\n..." could have INT_MAX/2 entries. Allocating an array of
pointers for each would need INT_MAX*4 bytes on a 64-bit system, enough
to overflow a 32-bit int.

So let's give ourselves a bit more safety margin by giving a much
smaller limit. The size 100MB is somewhat arbitrary, but is based on the
similar value for attribute files added by 3c50032ff5 (attr: ignore
overly large gitattributes files, 2022-12-01).

There's no particular reason these have to be the same, but the idea is
that they are in the ballpark of "so huge that nobody would care, but
small enough to avoid malicious overflow". So lacking a better guess, it
makes sense to use the same value. The implementation here doesn't share
the same constant, but we could change that later (or even give it a
runtime config knob, though nobody has complained yet about the
attribute limit).

And likewise, let's add a few tests that exercise the limits, based on
the attr ones. In this case, though, we never read .gitignore from the
index; the blob code is exercised only for sparse filters. So we'll
trigger it that way.

Signed-off-by: Jeff King <peff@peff.net>
---
 dir.c                               | 10 ++++++++--
 t/t0008-ignores.sh                  |  8 ++++++++
 t/t6112-rev-list-filters-objects.sh | 12 ++++++++++++
 3 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/dir.c b/dir.c
index 914060edfd..ad2b7ebe2d 100644
--- a/dir.c
+++ b/dir.c
@@ -32,6 +32,12 @@
 #include "tree.h"
 #include "hex.h"

+ /*
+  * The maximum size of a pattern/exclude file. If the file exceeds this size
+  * we will ignore it.
+  */
+#define PATTERN_MAX_FILE_SIZE (100 * 1024 * 1024)
+
 /*
  * Tells read_directory_recursive how a file or directory should be treated.
  * Values are ordered by significance, e.g. if a directory contains both
@@ -1149,7 +1155,7 @@ static int add_patterns(const char *fname, const char *base, int baselen,
 		}
 	}

-	if (size > INT_MAX) {
+	if (size > PATTERN_MAX_FILE_SIZE) {
 		warning("ignoring excessively large pattern file: %s", fname);
 		free(buf);
 		return -1;
@@ -1211,7 +1217,7 @@ int add_patterns_from_blob_to_list(
 	if (r != 1)
 		return r;

-	if (size > INT_MAX) {
+	if (size > PATTERN_MAX_FILE_SIZE) {
 		warning("ignoring excessively large pattern blob: %s",
 			oid_to_hex(oid));
 		free(buf);
diff --git a/t/t0008-ignores.sh b/t/t0008-ignores.sh
index 361446b2f4..02a18d4fdb 100755
--- a/t/t0008-ignores.sh
+++ b/t/t0008-ignores.sh
@@ -945,4 +945,12 @@ test_expect_success SYMLINKS 'symlinks not respected in-tree' '
 	test_grep "unable to access.*gitignore" err
 '

+test_expect_success EXPENSIVE 'large exclude file ignored in tree' '
+	test_when_finished "rm .gitignore" &&
+	dd if=/dev/zero of=.gitignore bs=101M count=1 &&
+	git ls-files -o --exclude-standard 2>err &&
+	echo "warning: ignoring excessively large pattern file: .gitignore" >expect &&
+	test_cmp expect err
+'
+
 test_done
diff --git a/t/t6112-rev-list-filters-objects.sh b/t/t6112-rev-list-filters-objects.sh
index 43e1afd44c..0387f35a32 100755
--- a/t/t6112-rev-list-filters-objects.sh
+++ b/t/t6112-rev-list-filters-objects.sh
@@ -701,4 +701,16 @@ test_expect_success 'expand blob limit in protocol' '
 	grep "blob:limit=1024" trace
 '

+test_expect_success EXPENSIVE 'large sparse filter file ignored' '
+	blob=$(dd if=/dev/zero bs=101M count=1 |
+	       git hash-object -w --stdin) &&
+	test_must_fail \
+		git rev-list --all --objects --filter=sparse:oid=$blob 2>err &&
+	cat >expect <<-EOF &&
+	warning: ignoring excessively large pattern blob: $blob
+	fatal: unable to parse sparse filter data in $blob
+	EOF
+	test_cmp expect err
+'
+
 test_done

Junio C Hamano June 5, 2024, 4:23 p.m. UTC | #6

Jeff King <peff@peff.net> writes:

> I do not mind if consolidation waits for later, but I guess the
> immediate question is whether we'd prefer to reduce the limit set by my
> patch to a more security-conservative value. It would be easy to swap
> out INT_MAX for a 100MB #define on top.
>
> Maybe this?

Sounds sensible.  Thanks.

> -- >8 --
> Subject: [PATCH] dir.c: reduce max pattern file size to 100MB
>
> In a2bc523e1e (dir.c: skip .gitignore, etc larger than INT_MAX,
> 2024-05-31) we put capped the size of some files whose parsing code and
> data structures used ints. Setting the limit to INT_MAX was a natural
> spot, since we know the parsing code would misbehave above that.
>
> But it also leaves the possibility of overflow errors when we multiply
> that limit to allocate memory. For instance, a file consisting only of
> "a\na\n..." could have INT_MAX/2 entries. Allocating an array of
> pointers for each would need INT_MAX*4 bytes on a 64-bit system, enough
> to overflow a 32-bit int.
>
> So let's give ourselves a bit more safety margin by giving a much
> smaller limit. The size 100MB is somewhat arbitrary, but is based on the
> similar value for attribute files added by 3c50032ff5 (attr: ignore
> overly large gitattributes files, 2022-12-01).
>
> There's no particular reason these have to be the same, but the idea is
> that they are in the ballpark of "so huge that nobody would care, but
> small enough to avoid malicious overflow". So lacking a better guess, it
> makes sense to use the same value. The implementation here doesn't share
> the same constant, but we could change that later (or even give it a
> runtime config knob, though nobody has complained yet about the
> attribute limit).
>
> And likewise, let's add a few tests that exercise the limits, based on
> the attr ones. In this case, though, we never read .gitignore from the
> index; the blob code is exercised only for sparse filters. So we'll
> trigger it that way.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  dir.c                               | 10 ++++++++--
>  t/t0008-ignores.sh                  |  8 ++++++++
>  t/t6112-rev-list-filters-objects.sh | 12 ++++++++++++
>  3 files changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/dir.c b/dir.c
> index 914060edfd..ad2b7ebe2d 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -32,6 +32,12 @@
>  #include "tree.h"
>  #include "hex.h"
>  
> + /*
> +  * The maximum size of a pattern/exclude file. If the file exceeds this size
> +  * we will ignore it.
> +  */
> +#define PATTERN_MAX_FILE_SIZE (100 * 1024 * 1024)
> +
>  /*
>   * Tells read_directory_recursive how a file or directory should be treated.
>   * Values are ordered by significance, e.g. if a directory contains both
> @@ -1149,7 +1155,7 @@ static int add_patterns(const char *fname, const char *base, int baselen,
>  		}
>  	}
>  
> -	if (size > INT_MAX) {
> +	if (size > PATTERN_MAX_FILE_SIZE) {
>  		warning("ignoring excessively large pattern file: %s", fname);
>  		free(buf);
>  		return -1;
> @@ -1211,7 +1217,7 @@ int add_patterns_from_blob_to_list(
>  	if (r != 1)
>  		return r;
>  
> -	if (size > INT_MAX) {
> +	if (size > PATTERN_MAX_FILE_SIZE) {
>  		warning("ignoring excessively large pattern blob: %s",
>  			oid_to_hex(oid));
>  		free(buf);
> diff --git a/t/t0008-ignores.sh b/t/t0008-ignores.sh
> index 361446b2f4..02a18d4fdb 100755
> --- a/t/t0008-ignores.sh
> +++ b/t/t0008-ignores.sh
> @@ -945,4 +945,12 @@ test_expect_success SYMLINKS 'symlinks not respected in-tree' '
>  	test_grep "unable to access.*gitignore" err
>  '
>  
> +test_expect_success EXPENSIVE 'large exclude file ignored in tree' '
> +	test_when_finished "rm .gitignore" &&
> +	dd if=/dev/zero of=.gitignore bs=101M count=1 &&
> +	git ls-files -o --exclude-standard 2>err &&
> +	echo "warning: ignoring excessively large pattern file: .gitignore" >expect &&
> +	test_cmp expect err
> +'
> +
>  test_done
> diff --git a/t/t6112-rev-list-filters-objects.sh b/t/t6112-rev-list-filters-objects.sh
> index 43e1afd44c..0387f35a32 100755
> --- a/t/t6112-rev-list-filters-objects.sh
> +++ b/t/t6112-rev-list-filters-objects.sh
> @@ -701,4 +701,16 @@ test_expect_success 'expand blob limit in protocol' '
>  	grep "blob:limit=1024" trace
>  '
>  
> +test_expect_success EXPENSIVE 'large sparse filter file ignored' '
> +	blob=$(dd if=/dev/zero bs=101M count=1 |
> +	       git hash-object -w --stdin) &&
> +	test_must_fail \
> +		git rev-list --all --objects --filter=sparse:oid=$blob 2>err &&
> +	cat >expect <<-EOF &&
> +	warning: ignoring excessively large pattern blob: $blob
> +	fatal: unable to parse sparse filter data in $blob
> +	EOF
> +	test_cmp expect err
> +'
> +
>  test_done

dir.c: skip .gitignore, etc larger than INT_MAX

Commit Message

Comments

Patch