Message ID: 20210930132004.16075-1-chiyutianyi@gmail.com
State: New, archived
Series: [v2] receive-pack: not receive pack file with large object
On Thu, Sep 30 2021, Han Xin wrote:

> From: Han Xin <hanxin.hx@alibaba-inc.com>
>
> In addition to using 'receive.maxInputSize' to limit the overall size
> of the received packfile, a new config variable
> 'receive.maxInputObjectSize' is added to limit the push of a single
> object larger than this threshold.

Maybe an unfair knee-jerk reaction: I think we should really be pushing
this sort of thing into pre-receive hooks and/or the proc-receive hook,
i.e. see 15d3af5e22e (receive-pack: add new proc-receive hook,
2020-08-27).

The latter pre-dates c08db5a2d0d (receive-pack: allow a maximum input
size to be specified, 2016-08-24), which may or may not be relevant
here.

Anyway, I think there may be dragons here that you haven't considered.
Is the "size" here the absolute size on disk, or the delta size (I'm
offhand not familiar enough with unpack-objects.c to know)? Does this
have the same semantics no matter the transfer.unpackLimit?

Either way, if it's the absolute size you may have a 100MB object
that's a fantastic delta candidate, so it may only add a few bytes to
your repo, or it's /dev/urandom output and really is adding 100MB. If
we're relying on deltas, there's no guarantee that what the client is
sending is delta-ing against something we can delta against, although
admittedly this is also an issue with receive.maxInputSize.

Anyway, all of this is stuff that people "in the wild" don't consider
already, so maybe I'm being too much of a curmudgeon here :)

At an ex-job I re-wrote some "big push" blocking hook that had had N
iterations of other implementations getting it subtly wrong. I.e. if
you define "size" the wrong way you end up blocking things like the
revert that reverts "master" to yesterday's commit. That's something
that takes git almost no space at all to store, but which, depending
on the stupidity of the hook that's on the other end, may be blocked
as a "big push".

So I think if we're going to expand on the existing
"receive.maxInputSize" we should really be considering the edge cases
carefully. But even then I'm somewhat skeptical of the benefit of
doing this in git's own guts vs. a hook, or expanding the hook
infrastructure to accommodate it: we have "receive.maxInputSize", now
maybe "receive.maxInputObjectSize", then after that perhaps
"receive.maxInputBinaryObjectSize", "receive.maxInputObjectSizeGlob=*.mp3"
etc.

I'm not saying our hook infrastructure is good enough for this right
now, but surely anything that would be wanted as a check in this area
could be done by something that "git cat-file --batch"-style feeds
OIDs to a new hook over stdin from "index-pack" and "unpack-objects"?
With that hook aware of the temporary staged objects in the incoming-*
area of the store? I.e. if the performance aspect of not waiting until
"pre-receive" time is a concern...
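A minimal sketch of the kind of pre-receive check being suggested here,
assuming a hypothetical 100 MB limit and relying on receive-pack making
the quarantined objects visible to its hooks, might look like this:

#!/bin/sh
# pre-receive sketch: reject a push if any new object exceeds $limit bytes.
# Assumes the usual "<old> <new> <ref>" lines on stdin; the limit and the
# error text are arbitrary example values, not anything git ships.
limit=104857600		# 100 MB
zero=0000000000000000000000000000000000000000

while read old new ref
do
	test "$new" = "$zero" && continue	# ref deletion, nothing new to check

	git rev-list --objects "$new" --not --all |
	cut -d' ' -f1 |
	git cat-file --batch-check='%(objectname) %(objecttype) %(objectsize)' |
	while read oid type size
	do
		if test "$size" -gt "$limit" 2>/dev/null
		then
			echo "object $oid ($type, $size bytes) exceeds $limit bytes" >&2
			exit 1
		fi
	done || exit 1
done

As the rest of the thread points out, walking all of the new trees like
this is exactly the cost that makes a hook-only approach unattractive
for very large pushes.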
Han Xin <chiyutianyi@gmail.com> writes:

> @@ -519,6 +520,8 @@ static void *unpack_raw_entry(struct object_entry *obj,
> 		shift += 7;
> 	}
> 	obj->size = size;
> +	if (max_input_object_size && size > max_input_object_size)
> +		die(_("object exceeds maximum allowed size "));
>
> 	switch (obj->type) {
> 	case OBJ_REF_DELTA:

Here obj->size is the inflated payload size of a single entry in the
packfile. If it happens to be represented as a base object (i.e.
without delta, just deflated), it would be close to the size of the
blob in the working tree (but LF->CRLF conversion and the like may
further inflate it), but if it is a delta object, this size is just the
size of the delta data we feed patch_delta() with, and has no relevance
to the actual "file size".

Sure, it is called max_INPUT_object_size and we can say we are not
limiting the final disk size, and that might be a workable excuse to
check based on the obj->size here, but then its usefulness from the
point of view of end users, who decide to set the variable to limit
"some" usage, becomes dubious. So...
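The two notions of size being contrasted here can be compared directly
with cat-file; a small sketch, run against any packed repository:

# List every object reachable from HEAD with both notions of size:
# %(objectsize) is the inflated, end-user-facing size of the object;
# %(objectsize:disk) is what the entry occupies in the pack, which for
# a deltified object reflects the (compressed) delta representation
# rather than the full content.
git rev-list --objects HEAD |
cut -d' ' -f1 |
git cat-file --batch-check='%(objectname) %(objecttype) %(objectsize) %(objectsize:disk)'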
On Thu, Sep 30, 2021 at 10:05 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Thu, Sep 30 2021, Han Xin wrote:
>
> > From: Han Xin <hanxin.hx@alibaba-inc.com>
> >
> > In addition to using 'receive.maxInputSize' to limit the overall size
> > of the received packfile, a new config variable
> > 'receive.maxInputObjectSize' is added to limit the push of a single
> > object larger than this threshold.
>
> Maybe an unfair knee-jerk reaction: I think we should really be pushing
> this sort of thing into pre-receive hooks and/or the proc-receive hook,
> i.e. see 15d3af5e22e (receive-pack: add new proc-receive hook,
> 2020-08-27).

Last week, one user complained that he could not push to his repo on
our server, and later Han Xin discovered the user was trying to push a
very big blob object, over 10GB. In this case, the "pre-receive" hook
had no chance to execute, because "git-receive-pack" died early from
OOM: the function "unpack_non_delta_entry()" in
"builtin/unpack-objects.c" tried to allocate memory for the whole 10GB
blob, with no luck.

Han Xin is preparing another patch to resolve the OOM issue found in
"unpack_non_delta_entry()". But we think it is reasonable to prevent
such a big blob in a pack from reaching git-receive-pack in the first
place, because checking objects from the pack and loose objects in the
quarantine with a pre-receive hook would be slower.

> Anyway, I think there may be dragons here that you haven't
> considered. Is the "size" here the absolute size on disk, or the delta
> size (I'm offhand not familiar enough with unpack-objects.c to
> know). Does this have the same semantics no matter the
> transfer.unpackLimit?

Yes, depending on the setting of transfer.unpackLimit, receive-pack may
call git-index-pack to save the pack directly, or explode it by calling
git-unpack-objects. The "size" may be the absolute size on disk, or the
delta size. But we know a blob over 512MB (the default value of
core.bigFileThreshold) will not be deltified, so can we assume this
"size" is the absolute size on disk?

--
Jiang Xin
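The failure mode described above can be reproduced with something like
the following sketch (sizes and names are illustrative; /dev/urandom is
used so the blob neither compresses nor deltifies well):

# Create a commit with a single huge, incompressible blob and push it.
# With the default transfer.unpackLimit (100), a small push like this is
# exploded by unpack-objects on the server, which then tries to hold the
# whole blob in memory at once.
dd if=/dev/urandom of=huge.bin bs=1M count=10240	# ~10GB
git add huge.bin
git commit -m "add huge blob"
git push origin HEAD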
On Fri, Oct 1, 2021 at 12:50 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Han Xin <chiyutianyi@gmail.com> writes:
>
> > @@ -519,6 +520,8 @@ static void *unpack_raw_entry(struct object_entry *obj,
> > 		shift += 7;
> > 	}
> > 	obj->size = size;
> > +	if (max_input_object_size && size > max_input_object_size)
> > +		die(_("object exceeds maximum allowed size "));
> >
> > 	switch (obj->type) {
> > 	case OBJ_REF_DELTA:
>
> Here obj->size is the inflated payload size of a single entry in the
> packfile. If it happens to be represented as a base object
> (i.e. without delta, just deflated), it would be close to the size
> of the blob in the working tree (but LF->CRLF conversion and the
> like may further inflate it), but if it is a delta object, this size
> is just the size of the delta data we feed patch_delta() with, and
> has no relevance to the actual "file size".
>
> Sure, it is called max_INPUT_object_size and we can say we are not
> limiting the final disk size, and that might be a workable excuse
> to check based on the obj->size here, but then its usefulness from
> the point of view of end users, who decide to set the variable to
> limit "some" usage, becomes dubious.

Just like what I replied to Ævar, if max_input_object_size is greater
than core.bigFileThreshold, is it safe to say the size here is almost
the actual "file size"?

BTW, Han Xin will continue to resolve the OOM issue found in
"unpack_non_delta_entry()" after our National Day holiday.

--
Jiang Xin
On Fri, Oct 01, 2021 at 10:30:24AM +0800, Jiang Xin wrote:

> > Maybe an unfair knee-jerk reaction: I think we should really be pushing
> > this sort of thing into pre-receive hooks and/or the proc-receive hook,
> > i.e. see 15d3af5e22e (receive-pack: add new proc-receive hook,
> > 2020-08-27).
>
> Last week, one user complained that he could not push to his repo on
> our server, and later Han Xin discovered the user was trying to push a
> very big blob object, over 10GB. In this case, the "pre-receive" hook
> had no chance to execute, because "git-receive-pack" died early from
> OOM: the function "unpack_non_delta_entry()" in
> "builtin/unpack-objects.c" tried to allocate memory for the whole 10GB
> blob, with no luck.
>
> Han Xin is preparing another patch to resolve the OOM issue found in
> "unpack_non_delta_entry()". But we think it is reasonable to prevent
> such a big blob in a pack from reaching git-receive-pack in the first
> place, because checking objects from the pack and loose objects in the
> quarantine with a pre-receive hook would be slower.

I think index-pack handles this case correctly, at least for base
objects. In unpack_entry_data(), it will stream anything bigger than
big_file_threshold. Probably unpack-objects needs to learn the same
trick.

In general, the code in index-pack has gotten a _lot_ more attention
over the years than unpack-objects. I'd trust its security a lot more,
and it has extra performance enhancements (like multithreading and
streaming). At GitHub, we always run index-pack on incoming packs, and
never unpack-objects.

I'm tempted to say we should stop using unpack-objects entirely for
incoming packs, and then either:

  - just keep packs for incoming objects. It's not really any worse of
    a state than loose objects, and it all eventually gets rolled up by
    gc anyway. (The exception is that thin packs may duplicate base
    objects until that rollup.)

  - teach index-pack an "--unpack" option. I actually wrote some
    patches for this a while back. It's not very much code, but there
    were some rough edges and I never came back to them. I'm happy to
    share if anybody's interested.

Though I would note that for deltas, even index-pack will not stream
the contents. You generally shouldn't have very large deltas, as
clients will also avoid computing them (because that also involves
putting the whole object in memory). But the notion of "large" is not
necessarily the same between client and server. And of course somebody
can maliciously make a giant delta.

-Peff
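The first of those two options can already be approximated today with
configuration on the receiving side; a sketch, with illustrative values
rather than recommendations:

# Keep every incoming pack instead of exploding it into loose objects,
# so receive-pack always runs index-pack (which can stream large base
# objects) rather than unpack-objects.
git config receive.unpackLimit 1

# index-pack streams base objects at or above this size instead of
# holding them in memory (512m is the stock default, shown for clarity).
git config core.bigFileThreshold 512m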
On Fri, Oct 01, 2021 at 10:52:15AM +0800, Jiang Xin wrote:

> > Sure, it is called max_INPUT_object_size and we can say we are not
> > limiting the final disk size, and that might be a workable excuse
> > to check based on the obj->size here, but then its usefulness from
> > the point of view of end users, who decide to set the variable to
> > limit "some" usage, becomes dubious.
>
> Just like what I replied to Ævar, if max_input_object_size is greater
> than core.bigFileThreshold, is it safe to say the size here is almost
> the actual "file size"?

If we are storing a pack with index-pack, the on-disk size will match
exactly this input size. If we unpack it to loose, then big files don't
tend to have deltas or to compress with zlib, but that is not always
the case. I have definitely seen people try to store gigantic text
files.

If your goal is to introduce a user-facing object-size limit, then I
think the "logical" size of the uncompressed object is the only thing
that makes sense. Everything else is subject to change, and can be
gamed in weird ways.

If your goal is to avoid malicious pushers causing you to allocate too
much memory, then you might want to have some limits on the compressed
sizes you'll deal with, especially for deltas. But I don't think the
checks here do that, because I can send a small delta that reconstructs
a much larger object (which we'd eventually reconstruct in order to
compute its sha1).

-Peff
On Thu, Sep 30, 2021 at 03:42:29PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > From: Han Xin <hanxin.hx@alibaba-inc.com>
> >
> > In addition to using 'receive.maxInputSize' to limit the overall size
> > of the received packfile, a new config variable
> > 'receive.maxInputObjectSize' is added to limit the push of a single
> > object larger than this threshold.
>
> Maybe an unfair knee-jerk reaction: I think we should really be pushing
> this sort of thing into pre-receive hooks and/or the proc-receive hook,
> i.e. see 15d3af5e22e (receive-pack: add new proc-receive hook,
> 2020-08-27).

Yes, this could be done as a hook, but it's actually kind of tricky. We
have a similar max-object-size limit in our fork at GitHub, and it's
implemented inside index-pack (we don't have a match in unpack-objects,
but we always run index-pack on incoming packs).

Unlike the patches here, ours is limiting the logical size of an object
(so a 100MB blob limit is on the uncompressed size). It happens in
sha1_object(), where we have that size.

The main reason we put it there is for performance. index-pack is
already computing the size, so it's free to check it there. But it does
make some things awkward. Rather than just dying with "woah, some
object is too big", it's nice to tell the user "hey, the object
1234abcd at foo.bin is too big". To do that, we actually collect the
list of too-big objects and index them anyway (remember that we're in
the quarantine directory, so nothing is permanent until later). We
write the list to a ".large-objects" file in the quarantine directory.
And then hooks are free to produce a much nicer message (e.g., by
running something like "log --find-object" to get the path name at
which we see the blob).

It would be nice if the hook could just find the objects itself, but
there are some gotchas:

  - finding the list of pushed objects is awkward and/or expensive. You
    can use rev-list, but that's very costly, as it has to inflate all
    of the new trees. You really just want the list of what's in the
    quarantine directory, but there's no single command to get that.
    You have to run show-index on the incoming pack, plus poke through
    the loose object directories.

    It would be much easier if "cat-file --batch-all-objects" could be
    asked to only look at local objects. That would also allow some
    other optimizations to reduce the cost. For instance, even when
    iterating packs, cat-file then feeds each sha1 back to
    oid_object_info(), even though we already know the exact
    pack/offset where it can be found. Likewise, it could be taught
    some basic filtering (like "show only objects with size > X") to
    avoid dumping millions of oid/size pairs to a separate script to do
    the filtering.

    But all of that would just be approaching the speed of having
    index-pack do the check, since it has the size already there.

  - it wouldn't help index-pack at all with memory attacks by malicious
    pushers. Here somebody accidentally pushed up a big blob, and it
    caused unpack-objects to OOM. But I also remember a problem ages
    ago where some of the fsck_tree() code had an integer overflow
    vulnerability, which could only be triggered by a gigantic tree.

    In the patch we use at GitHub, we allow large blobs to make it to
    the hook level, but too-large trees, commits, and tags are
    unceremoniously aborted with an immediate die(). Real users
    accidentally try to push 2GB objects. Trees of that size are no
    accident. :)

Having index-pack enforce size limits helps a bit there. And streaming
large blobs helps protect against accidents.
But there are still OOM problems with maliciously constructed inputs,
because the delta code only works on full buffers (so it really wants
to construct the whole object before we get to the sha1_object()
limits). It would be nice if this could stream the output, but I
suspect it would require pretty major surgery to the delta code.

Instead, we've put in a crude limit by having receive-pack set
GIT_ALLOC_LIMIT in the environment, which then ensures that its child
index-pack won't allocate too much (the limit we use is much higher
than the more user-facing object limit, because the point is just to
avoid DoS attacks, and not enforce any kind of policy). It might be
nice to have a config option for setting this limit (and perhaps even
putting it at a reasonable defensive default).

So I do think there are some interesting paths forward for doing more
of this in hooks, but there's work to be done to get there.

> The latter pre-dates c08db5a2d0d (receive-pack: allow a maximum input
> size to be specified, 2016-08-24), which may or may not be relevant
> here.

Yeah, I introduced that limit. It wasn't really about object size or
memory, but just about the fact that without it, a malicious pusher can
just send you bytes forever which the server will write to disk. It
also helps with the most naive kind of large-object attacks, but it
wouldn't help with a cleverly constructed delta.

It is, unfortunately, a pretty unceremonious die(), and has caused more
than one support ticket (especially because GitHub has used 2GB for the
limit for ages, and the practical limit for repositories has been
growing).

So there. That's probably more than anybody wanted to know about push
limits. ;)

I'm happy to share any of the code from GitHub's fork in these areas.
The main reason I haven't is that some of it is just kind of ugly and
may need polishing (e.g., the whole "write to .large-objects" is really
an awful interface).

-Peff
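To make the ".large-objects" idea concrete, a hook consuming such a
file might look roughly like the sketch below. The file name, and the
fact that anything writes it at all, are specific to the fork described
above (stock git produces no such file), and GIT_QUARANTINE_PATH is
assumed to point at the quarantine directory receive-pack sets up for
its hooks:

#!/bin/sh
# pre-receive sketch: turn a list of oversized object ids into a friendlier
# error message. ".large-objects" is NOT produced by stock git; it is the
# fork-specific file described above.

# Collect the tips being pushed so "git log" can reach the new commits
# (they are not reachable from any existing ref at pre-receive time).
tips=
while read old new ref
do
	case "$new" in
	*[!0]*) tips="$tips $new" ;;	# skip all-zero ids (ref deletions)
	esac
done

large="$GIT_QUARANTINE_PATH/.large-objects"
test -n "$GIT_QUARANTINE_PATH" && test -f "$large" || exit 0

while read oid
do
	echo "error: object $oid exceeds the configured size limit; introduced by:" >&2
	# --find-object reports the commits whose diff adds (or removes) the
	# object, letting us show a path instead of a bare object id.
	git log --find-object="$oid" --name-only --format='  commit %h %s' $tips >&2
done <"$large"
exit 1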
Jeff King <peff@peff.net> writes:

> Unlike the patches here, ours is limiting the logical size of an object
> (so a 100MB blob limit is on the uncompressed size). It happens in
> sha1_object(), where we have that size.

I like it when I hear people doing things in the "right" way (the
definition of being "right" in this case is to apply the limit to the
size that is end-user facing---otherwise you cannot justify an
arbitrary limit). ;-)
diff --git a/Documentation/config/receive.txt b/Documentation/config/receive.txt
index 85d5b5a3d2..4823ead8e4 100644
--- a/Documentation/config/receive.txt
+++ b/Documentation/config/receive.txt
@@ -74,6 +74,12 @@ receive.maxInputSize::
 	accepting the pack file. If not set or set to 0, then the size
 	is unlimited.
 
+receive.maxInputObjectSize::
+	If one of the objects in the incoming pack stream is larger than
+	this limit, then git-receive-pack will error out, instead of
+	accepting the pack file. If not set or set to 0, then the size
+	is unlimited.
+
 receive.denyDeletes::
 	If set to true, git-receive-pack will deny a ref update that
 	deletes the ref. Use this to prevent such a ref deletion via a push.
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 8336466865..0e62b356c6 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -133,6 +133,7 @@ static unsigned char input_buffer[4096];
 static unsigned int input_offset, input_len;
 static off_t consumed_bytes;
 static off_t max_input_size;
+static off_t max_input_object_size;
 static unsigned deepest_delta;
 static git_hash_ctx input_ctx;
 static uint32_t input_crc32;
@@ -519,6 +520,8 @@ static void *unpack_raw_entry(struct object_entry *obj,
 		shift += 7;
 	}
 	obj->size = size;
+	if (max_input_object_size && size > max_input_object_size)
+		die(_("object exceeds maximum allowed size "));
 
 	switch (obj->type) {
 	case OBJ_REF_DELTA:
@@ -1825,6 +1828,8 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 				die(_("bad %s"), arg);
 		} else if (skip_prefix(arg, "--max-input-size=", &arg)) {
 			max_input_size = strtoumax(arg, NULL, 10);
+		} else if (skip_prefix(arg, "--max-input-object-size=", &arg)) {
+			max_input_object_size = strtoumax(arg, NULL, 10);
 		} else if (skip_prefix(arg, "--object-format=", &arg)) {
 			hash_algo = hash_algo_by_name(arg);
 			if (hash_algo == GIT_HASH_UNKNOWN)
diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c
index 2d1f97e1ca..82ff0c61ff 100644
--- a/builtin/receive-pack.c
+++ b/builtin/receive-pack.c
@@ -57,6 +57,7 @@ static int advertise_push_options;
 static int advertise_sid;
 static int unpack_limit = 100;
 static off_t max_input_size;
+static off_t max_input_object_size;
 static int report_status;
 static int report_status_v2;
 static int use_sideband;
@@ -242,6 +243,11 @@ static int receive_pack_config(const char *var, const char *value, void *cb)
 		return 0;
 	}
 
+	if (strcmp(var, "receive.maxinputobjectsize") == 0) {
+		max_input_object_size = git_config_int64(var, value);
+		return 0;
+	}
+
 	if (strcmp(var, "receive.procreceiverefs") == 0) {
 		if (!value)
 			return config_error_nonbool(var);
@@ -2237,6 +2243,9 @@ static const char *unpack(int err_fd, struct shallow_info *si)
 		if (max_input_size)
 			strvec_pushf(&child.args, "--max-input-size=%"PRIuMAX,
 				(uintmax_t)max_input_size);
+		if (max_input_object_size)
+			strvec_pushf(&child.args, "--max-input-object-size=%"PRIuMAX,
+				(uintmax_t)max_input_object_size);
 		child.no_stdout = 1;
 		child.err = err_fd;
 		child.git_cmd = 1;
@@ -2268,6 +2277,9 @@ static const char *unpack(int err_fd, struct shallow_info *si)
 		if (max_input_size)
 			strvec_pushf(&child.args, "--max-input-size=%"PRIuMAX,
 				(uintmax_t)max_input_size);
+		if (max_input_object_size)
+			strvec_pushf(&child.args, "--max-input-object-size=%"PRIuMAX,
+				(uintmax_t)max_input_object_size);
 		child.out = -1;
 		child.err = err_fd;
 		child.git_cmd = 1;
diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
index 4a9466295b..04d9fa918f 100644
--- a/builtin/unpack-objects.c
+++ b/builtin/unpack-objects.c
@@ -22,6 +22,7 @@
 static unsigned char buffer[4096];
 static unsigned int offset, len;
 static off_t consumed_bytes;
 static off_t max_input_size;
+static off_t max_input_object_size;
 static git_hash_ctx ctx;
 static struct fsck_options fsck_options = FSCK_OPTIONS_STRICT;
 static struct progress *progress;
@@ -466,6 +467,9 @@ static void unpack_one(unsigned nr)
 		shift += 7;
 	}
 
+	if (max_input_object_size && size > max_input_object_size)
+		die(_("object exceeds maximum allowed size "));
+
 	switch (type) {
 	case OBJ_COMMIT:
 	case OBJ_TREE:
@@ -568,6 +572,10 @@ int cmd_unpack_objects(int argc, const char **argv, const char *prefix)
 			max_input_size = strtoumax(arg, NULL, 10);
 			continue;
 		}
+		if (skip_prefix(arg, "--max-input-object-size=", &arg)) {
+			max_input_object_size = strtoumax(arg, NULL, 10);
+			continue;
+		}
 		usage(unpack_usage);
 	}
diff --git a/t/t5546-receive-limits.sh b/t/t5546-receive-limits.sh
index 0b0e987fdb..52c8f1f2e8 100755
--- a/t/t5546-receive-limits.sh
+++ b/t/t5546-receive-limits.sh
@@ -41,6 +41,31 @@ test_pack_input_limit () {
 		git --git-dir=dest config receive.maxInputSize 0 &&
 		git push dest HEAD
 	'
+
+	test_expect_success 'prepare destination repository (test for large object)' '
+		rm -fr dest &&
+		git --bare init dest
+	'
+
+	test_expect_success 'setting receive.maxInputObjectSize to 512 rejects push large object' '
+		git -C dest config receive.maxInputObjectSize 512 &&
+		test_must_fail git push dest HEAD
+	'
+
+	test_expect_success 'bumping limit to 2k allows push large object' '
+		git -C dest config receive.maxInputObjectSize 2k &&
+		git push dest HEAD
+	'
+
+	test_expect_success 'prepare destination repository (test for large object,again)' '
+		rm -fr dest &&
+		git --bare init dest
+	'
+
+	test_expect_success 'lifting the limit allows push' '
+		git -C dest config receive.maxInputObjectSize 0 &&
+		git push dest HEAD
+	'
 }
 
 test_expect_success "create known-size (1024 bytes) commit" '
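For reference, if this series were applied, the new knob would be set
on the server like the existing receive.maxInputSize; a sketch
(receive.maxInputObjectSize is what this patch proposes and is not part
of stock git, and the repository path is illustrative):

# Server side: refuse any push containing a single pack entry whose
# inflated payload exceeds 100 MB, as proposed by this series.
git -C /srv/repo.git config receive.maxInputObjectSize 100m

# A pusher exceeding the limit would see index-pack or unpack-objects
# fail with "object exceeds maximum allowed size".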