Message ID: 20210930132004.16075-1-chiyutianyi@gmail.com
State: New, archived
Series: [v2] receive-pack: not receive pack file with large object
On Thu, Sep 30 2021, Han Xin wrote:

> From: Han Xin <hanxin.hx@alibaba-inc.com>
>
> In addition to using 'receive.maxInputSize' to limit the overall size
> of the received packfile, a new config variable
> 'receive.maxInputObjectSize' is added to limit the push of a single
> object larger than this threshold.

Maybe an unfair knee-jerk reaction: I think we should really be pushing
this sort of thing into pre-receive hooks and/or the proc-receive hook,
i.e. see 15d3af5e22e (receive-pack: add new proc-receive hook,
2020-08-27).

The latter pre-dates c08db5a2d0d (receive-pack: allow a maximum input
size to be specified, 2016-08-24), which may or may not be relevant
here.

Anyway, I think there may be dragons here that you haven't considered.
Is the "size" here the absolute size on disk, or the delta size (I'm
offhand not familiar enough with unpack-objects.c to know)? Does this
have the same semantics no matter the transfer.unpackLimit?

Either way, if it's the absolute size you may have a 100MB object
that's a fantastic delta candidate, so it may only add a few bytes to
your repo, or it's /dev/urandom output and really is adding 100MB. If
we're relying on deltas, there's no guarantee that what the client is
sending is delta-ing against something we can delta against, although
admittedly this is also an issue with receive.maxInputSize.

Anyway, all of this is stuff that people "in the wild" don't consider
already, so maybe I'm being too much of a curmudgeon here :)

At an ex-job I re-wrote some "big push" blocking hook that had had N
iterations of other implementations getting it subtly wrong. I.e. if
you define "size" the wrong way you end up blocking things like the
revert that reverts "master" to yesterday's commit. That's something
that takes git almost no space at all to store, but which, depending
on the stupidity of the hook that's on the other end, may be blocked
as a "big push".

So I think if we're going to expand on the existing
"receive.maxInputSize" we should really be considering the edge cases
carefully. But even then I'm somewhat skeptical of the benefit of
doing this in git's own guts vs. a hook, or expanding the hook
infrastructure to accommodate it: we have "receive.maxInputSize", now
maybe "receive.maxInputObjectSize", then after that perhaps
"receive.maxInputBinaryObjectSize", "receive.maxInputObjectSizeGlob=*.mp3"
etc.

I'm not saying our hook infrastructure is good enough for this right
now, but surely anything that would be wanted as a check in this area
could be done by something that "git cat-file --batch"-style feeds
OIDs to a new hook over stdin from "index-pack" and "unpack-objects"?
With that hook aware of the temporary staged objects in the incoming-*
area of the store? I.e. if the performance aspect of not waiting until
"pre-receive" time is a concern...
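A minimal sketch of the kind of pre-receive check being suggested here,
assuming a hypothetical 100 MB limit and relying on receive-pack making
the quarantined objects visible to its hooks, might look like this:

#!/bin/sh
# pre-receive sketch: reject a push if any new object exceeds $limit bytes.
# Assumes the usual "<old> <new> <ref>" lines on stdin; the limit and the
# error text are arbitrary example values, not anything git ships.
limit=104857600		# 100 MB
zero=0000000000000000000000000000000000000000

while read old new ref
do
	test "$new" = "$zero" && continue	# ref deletion, nothing new to check

	git rev-list --objects "$new" --not --all |
	cut -d' ' -f1 |
	git cat-file --batch-check='%(objectname) %(objecttype) %(objectsize)' |
	while read oid type size
	do
		if test "$size" -gt "$limit" 2>/dev/null
		then
			echo "object $oid ($type, $size bytes) exceeds $limit bytes" >&2
			exit 1
		fi
	done || exit 1
done

As the rest of the thread points out, walking all of the new trees like
this is exactly the cost that makes a hook-only approach unattractive
for very large pushes.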
Han Xin <chiyutianyi@gmail.com> writes:

> @@ -519,6 +520,8 @@ static void *unpack_raw_entry(struct object_entry *obj,
> 		shift += 7;
> 	}
> 	obj->size = size;
> +	if (max_input_object_size && size > max_input_object_size)
> +		die(_("object exceeds maximum allowed size "));
>
> 	switch (obj->type) {
> 	case OBJ_REF_DELTA:

Here obj->size is the inflated payload size of a single entry in the
packfile. If it happens to be represented as a base object (i.e.
without delta, just deflated), it would be close to the size of the
blob in the working tree (but LF->CRLF conversion and the like may
further inflate it), but if it is a delta object, this size is just the
size of the delta data we feed patch_delta() with, and has no relevance
to the actual "file size".

Sure, it is called max_INPUT_object_size and we can say we are not
limiting the final disk size, and that might be a workable excuse to
check based on the obj->size here, but then its usefulness from the
point of view of end users, who decide to set the variable to limit
"some" usage, becomes dubious. So...
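The two notions of size being contrasted here can be compared directly
with cat-file; a small sketch, run against any packed repository:

# List every object reachable from HEAD with both notions of size:
# %(objectsize) is the inflated, end-user-facing size of the object;
# %(objectsize:disk) is what the entry occupies in the pack, which for
# a deltified object reflects the (compressed) delta representation
# rather than the full content.
git rev-list --objects HEAD |
cut -d' ' -f1 |
git cat-file --batch-check='%(objectname) %(objecttype) %(objectsize) %(objectsize:disk)'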
On Thu, Sep 30, 2021 at 10:05 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Thu, Sep 30 2021, Han Xin wrote:
>
> > From: Han Xin <hanxin.hx@alibaba-inc.com>
> >
> > In addition to using 'receive.maxInputSize' to limit the overall size
> > of the received packfile, a new config variable
> > 'receive.maxInputObjectSize' is added to limit the push of a single
> > object larger than this threshold.
>
> Maybe an unfair knee-jerk reaction: I think we should really be pushing
> this sort of thing into pre-receive hooks and/or the proc-receive hook,
> i.e. see 15d3af5e22e (receive-pack: add new proc-receive hook,
> 2020-08-27).

Last week, one user complained that he could not push to his repo on
our server, and later Han Xin discovered the user was trying to push a
very big blob object, over 10GB. In this case, the "pre-receive" hook
had no chance to execute, because "git-receive-pack" died early from
OOM: the function "unpack_non_delta_entry()" in
"builtin/unpack-objects.c" tried to allocate memory for the whole 10GB
blob, with no luck.

Han Xin is preparing another patch to resolve the OOM issue found in
"unpack_non_delta_entry()". But we think it is reasonable to prevent
such a big blob in a pack from reaching git-receive-pack in the first
place, because checking objects from the pack and loose objects in the
quarantine with a pre-receive hook would be slower.

> Anyway, I think there may be dragons here that you haven't
> considered. Is the "size" here the absolute size on disk, or the delta
> size (I'm offhand not familiar enough with unpack-objects.c to
> know). Does this have the same semantics no matter the
> transfer.unpackLimit?

Yes, depending on the setting of transfer.unpackLimit, receive-pack may
call git-index-pack to save the pack directly, or explode it by calling
git-unpack-objects. The "size" may be the absolute size on disk, or the
delta size. But we know a blob over 512MB (the default value of
core.bigFileThreshold) will not be deltified, so can we assume this
"size" is the absolute size on disk?

--
Jiang Xin
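The failure mode described above can be reproduced with something like
the following sketch (sizes and names are illustrative; /dev/urandom is
used so the blob neither compresses nor deltifies well):

# Create a commit with a single huge, incompressible blob and push it.
# With the default transfer.unpackLimit (100), a small push like this is
# exploded by unpack-objects on the server, which then tries to hold the
# whole blob in memory at once.
dd if=/dev/urandom of=huge.bin bs=1M count=10240	# ~10GB
git add huge.bin
git commit -m "add huge blob"
git push origin HEAD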
On Fri, Oct 1, 2021 at 12:50 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Han Xin <chiyutianyi@gmail.com> writes:
>
> > @@ -519,6 +520,8 @@ static void *unpack_raw_entry(struct object_entry *obj,
> > 		shift += 7;
> > 	}
> > 	obj->size = size;
> > +	if (max_input_object_size && size > max_input_object_size)
> > +		die(_("object exceeds maximum allowed size "));
> >
> > 	switch (obj->type) {
> > 	case OBJ_REF_DELTA:
>
> Here obj->size is the inflated payload size of a single entry in the
> packfile. If it happens to be represented as a base object
> (i.e. without delta, just deflated), it would be close to the size
> of the blob in the working tree (but LF->CRLF conversion and the
> like may further inflate it), but if it is a delta object, this size
> is just the size of the delta data we feed patch_delta() with, and
> has no relevance to the actual "file size".
>
> Sure, it is called max_INPUT_object_size and we can say we are not
> limiting the final disk size, and that might be a workable excuse
> to check based on the obj->size here, but then its usefulness from
> the point of view of end users, who decide to set the variable to
> limit "some" usage, becomes dubious.

Just like what I replied to Ævar, if max_input_object_size is greater
than core.bigFileThreshold, is it safe to say the size here is almost
the actual "file size"?

BTW, Han Xin will continue to resolve the OOM issue found in
"unpack_non_delta_entry()" after our National Day holiday.

--
Jiang Xin
On Fri, Oct 01, 2021 at 10:30:24AM +0800, Jiang Xin wrote:

> > Maybe an unfair knee-jerk reaction: I think we should really be pushing
> > this sort of thing into pre-receive hooks and/or the proc-receive hook,
> > i.e. see 15d3af5e22e (receive-pack: add new proc-receive hook,
> > 2020-08-27).
>
> Last week, one user complained that he could not push to his repo on
> our server, and later Han Xin discovered the user was trying to push a
> very big blob object, over 10GB. In this case, the "pre-receive" hook
> had no chance to execute, because "git-receive-pack" died early from
> OOM: the function "unpack_non_delta_entry()" in
> "builtin/unpack-objects.c" tried to allocate memory for the whole 10GB
> blob, with no luck.
>
> Han Xin is preparing another patch to resolve the OOM issue found in
> "unpack_non_delta_entry()". But we think it is reasonable to prevent
> such a big blob in a pack from reaching git-receive-pack in the first
> place, because checking objects from the pack and loose objects in the
> quarantine with a pre-receive hook would be slower.

I think index-pack handles this case correctly, at least for base
objects. In unpack_entry_data(), it will stream anything bigger than
big_file_threshold. Probably unpack-objects needs to learn the same
trick.

In general, the code in index-pack has gotten a _lot_ more attention
over the years than unpack-objects. I'd trust its security a lot more,
and it has extra performance enhancements (like multithreading and
streaming). At GitHub, we always run index-pack on incoming packs, and
never unpack-objects.

I'm tempted to say we should stop using unpack-objects entirely for
incoming packs, and then either:

  - just keep packs for incoming objects. It's not really any worse of
    a state than loose objects, and it all eventually gets rolled up by
    gc anyway. (The exception is that thin packs may duplicate base
    objects until that rollup.)

  - teach index-pack an "--unpack" option. I actually wrote some
    patches for this a while back. It's not very much code, but there
    were some rough edges and I never came back to them. I'm happy to
    share if anybody's interested.

Though I would note that for deltas, even index-pack will not stream
the contents. You generally shouldn't have very large deltas, as
clients will also avoid computing them (because that also involves
putting the whole object in memory). But the notion of "large" is not
necessarily the same between client and server. And of course somebody
can maliciously make a giant delta.

-Peff
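The first of those two options can already be approximated today with
configuration on the receiving side; a sketch, with illustrative values
rather than recommendations:

# Keep every incoming pack instead of exploding it into loose objects,
# so receive-pack always runs index-pack (which can stream large base
# objects) rather than unpack-objects.
git config receive.unpackLimit 1

# index-pack streams base objects at or above this size instead of
# holding them in memory (512m is the stock default, shown for clarity).
git config core.bigFileThreshold 512m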
On Fri, Oct 01, 2021 at 10:52:15AM +0800, Jiang Xin wrote:

> > Sure, it is called max_INPUT_object_size and we can say we are not
> > limiting the final disk size, and that might be a workable excuse
> > to check based on the obj->size here, but then its usefulness from
> > the point of view of end users, who decide to set the variable to
> > limit "some" usage, becomes dubious.
>
> Just like what I replied to Ævar, if max_input_object_size is greater
> than core.bigFileThreshold, is it safe to say the size here is almost
> the actual "file size"?

If we are storing a pack with index-pack, the on-disk size will match
exactly this input size. If we unpack it to loose, then big files don't
tend to have deltas or to compress with zlib, but that is not always
the case. I have definitely seen people try to store gigantic text
files.

If your goal is to introduce a user-facing object-size limit, then I
think the "logical" size of the uncompressed object is the only thing
that makes sense. Everything else is subject to change, and can be
gamed in weird ways.

If your goal is to avoid malicious pushers causing you to allocate too
much memory, then you might want to have some limits on the compressed
sizes you'll deal with, especially for deltas. But I don't think the
checks here do that, because I can send a small delta that reconstructs
a much larger object (which we'd eventually reconstruct in order to
compute its sha1).

-Peff
On Thu, Sep 30, 2021 at 03:42:29PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > From: Han Xin <hanxin.hx@alibaba-inc.com>
> >
> > In addition to using 'receive.maxInputSize' to limit the overall size
> > of the received packfile, a new config variable
> > 'receive.maxInputObjectSize' is added to limit the push of a single
> > object larger than this threshold.
>
> Maybe an unfair knee-jerk reaction: I think we should really be pushing
> this sort of thing into pre-receive hooks and/or the proc-receive hook,
> i.e. see 15d3af5e22e (receive-pack: add new proc-receive hook,
> 2020-08-27).

Yes, this could be done as a hook, but it's actually kind of tricky. We
have a similar max-object-size limit in our fork at GitHub, and it's
implemented inside index-pack (we don't have a match in unpack-objects,
but we always run index-pack on incoming packs).

Unlike the patches here, ours is limiting the logical size of an object
(so a 100MB blob limit is on the uncompressed size). It happens in
sha1_object(), where we have that size.

The main reason we put it there is for performance. index-pack is
already computing the size, so it's free to check it there. But it does
make some things awkward. Rather than just dying with "woah, some
object is too big", it's nice to tell the user "hey, the object
1234abcd at foo.bin is too big". To do that, we actually collect the
list of too-big objects and index them anyway (remember that we're in
the quarantine directory, so nothing is permanent until later). We
write the list to a ".large-objects" file in the quarantine directory.
And then hooks are free to produce a much nicer message (e.g., by
running something like "log --find-object" to get the path name at
which we see the blob).

It would be nice if the hook could just find the objects itself, but
there are some gotchas:

  - finding the list of pushed objects is awkward and/or expensive. You
    can use rev-list, but that's very costly, as it has to inflate all
    of the new trees. You really just want the list of what's in the
    quarantine directory, but there's no single command to get that.
    You have to run show-index on the incoming pack, plus poke through
    the loose object directories.

    It would be much easier if "cat-file --batch-all-objects" could be
    asked to only look at local objects. That would also allow some
    other optimizations to reduce the cost. For instance, even when
    iterating packs, cat-file then feeds each sha1 back to
    oid_object_info(), even though we already know the exact
    pack/offset where it can be found. Likewise, it could be taught
    some basic filtering (like "show only objects with size > X") to
    avoid dumping millions of oid/size pairs to a separate script to do
    the filtering.

    But all of that would just be approaching the speed of having
    index-pack do the check, since it has the size already there.

  - it wouldn't help index-pack at all with memory attacks by malicious
    pushers. Here somebody accidentally pushed up a big blob, and it
    caused unpack-objects to OOM. But I also remember a problem ages
    ago where some of the fsck_tree() code had an integer overflow
    vulnerability, which could only be triggered by a gigantic tree.

    In the patch we use at GitHub, we allow large blobs to make it to
    the hook level, but too-large trees, commits, and tags are
    unceremoniously aborted with an immediate die(). Real users
    accidentally try to push 2GB objects. Trees of that size are no
    accident. :)

Having index-pack enforce size limits helps a bit there. And streaming
large blobs helps protect against accidents.
But there are still OOM problems with maliciously constructed inputs,
because the delta code only works on full buffers (so it really wants
to construct the whole object before we get to the sha1_object()
limits). It would be nice if this could stream the output, but I
suspect it would require pretty major surgery to the delta code.

Instead, we've put in a crude limit by having receive-pack set
GIT_ALLOC_LIMIT in the environment, which then ensures that its child
index-pack won't allocate too much (the limit we use is much higher
than the more user-facing object limit, because the point is just to
avoid DoS attacks, and not enforce any kind of policy). It might be
nice to have a config option for setting this limit (and perhaps even
putting it at a reasonable defensive default).

So I do think there are some interesting paths forward for doing more
of this in hooks, but there's work to be done to get there.

> The latter pre-dates c08db5a2d0d (receive-pack: allow a maximum input
> size to be specified, 2016-08-24), which may or may not be relevant
> here.

Yeah, I introduced that limit. It wasn't really about object size or
memory, but just about the fact that without it, a malicious pusher can
just send you bytes forever which the server will write to disk. It
also helps with the most naive kind of large-object attacks, but it
wouldn't help with a cleverly constructed delta.

It is, unfortunately, a pretty unceremonious die(), and has caused more
than one support ticket (especially because GitHub has used 2GB for the
limit for ages, and the practical limit for repositories has been
growing).

So there. That's probably more than anybody wanted to know about push
limits. ;)

I'm happy to share any of the code from GitHub's fork in these areas.
The main reason I haven't is that some of it is just kind of ugly and
may need polishing (e.g., the whole "write to .large-objects" is really
an awful interface).

-Peff
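To make the ".large-objects" idea concrete, a hook consuming such a
file might look roughly like the sketch below. The file name, and the
fact that anything writes it at all, are specific to the fork described
above (stock git produces no such file), and GIT_QUARANTINE_PATH is
assumed to point at the quarantine directory receive-pack sets up for
its hooks:

#!/bin/sh
# pre-receive sketch: turn a list of oversized object ids into a friendlier
# error message. ".large-objects" is NOT produced by stock git; it is the
# fork-specific file described above.

# Collect the tips being pushed so "git log" can reach the new commits
# (they are not reachable from any existing ref at pre-receive time).
tips=
while read old new ref
do
	case "$new" in
	*[!0]*) tips="$tips $new" ;;	# skip all-zero ids (ref deletions)
	esac
done

large="$GIT_QUARANTINE_PATH/.large-objects"
test -n "$GIT_QUARANTINE_PATH" && test -f "$large" || exit 0

while read oid
do
	echo "error: object $oid exceeds the configured size limit; introduced by:" >&2
	# --find-object reports the commits whose diff adds (or removes) the
	# object, letting us show a path instead of a bare object id.
	git log --find-object="$oid" --name-only --format='  commit %h %s' $tips >&2
done <"$large"
exit 1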
Jeff King <peff@peff.net> writes:

> Unlike the patches here, ours is limiting the logical size of an object
> (so a 100MB blob limit is on the uncompressed size). It happens in
> sha1_object(), where we have that size.

I like it when I hear people doing things in the "right" way (the
definition of being "right" in this case is to apply the limit to the
size that is end-user facing---otherwise you cannot justify an
arbitrary limit). ;-)
diff --git a/Documentation/config/receive.txt b/Documentation/config/receive.txt
index 85d5b5a3d2..4823ead8e4 100644
--- a/Documentation/config/receive.txt
+++ b/Documentation/config/receive.txt
@@ -74,6 +74,12 @@ receive.maxInputSize::
 	accepting the pack file. If not set or set to 0, then the size
 	is unlimited.
 
+receive.maxInputObjectSize::
+	If one of the objects in the incoming pack stream is larger than
+	this limit, then git-receive-pack will error out, instead of
+	accepting the pack file. If not set or set to 0, then the size
+	is unlimited.
+
 receive.denyDeletes::
 	If set to true, git-receive-pack will deny a ref update that
 	deletes the ref. Use this to prevent such a ref deletion via a push.
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 8336466865..0e62b356c6 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -133,6 +133,7 @@ static unsigned char input_buffer[4096];
 static unsigned int input_offset, input_len;
 static off_t consumed_bytes;
 static off_t max_input_size;
+static off_t max_input_object_size;
 static unsigned deepest_delta;
 static git_hash_ctx input_ctx;
 static uint32_t input_crc32;
@@ -519,6 +520,8 @@ static void *unpack_raw_entry(struct object_entry *obj,
 		shift += 7;
 	}
 	obj->size = size;
+	if (max_input_object_size && size > max_input_object_size)
+		die(_("object exceeds maximum allowed size "));
 
 	switch (obj->type) {
 	case OBJ_REF_DELTA:
@@ -1825,6 +1828,8 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 				die(_("bad %s"), arg);
 		} else if (skip_prefix(arg, "--max-input-size=", &arg)) {
 			max_input_size = strtoumax(arg, NULL, 10);
+		} else if (skip_prefix(arg, "--max-input-object-size=", &arg)) {
+			max_input_object_size = strtoumax(arg, NULL, 10);
 		} else if (skip_prefix(arg, "--object-format=", &arg)) {
 			hash_algo = hash_algo_by_name(arg);
 			if (hash_algo == GIT_HASH_UNKNOWN)
diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c
index 2d1f97e1ca..82ff0c61ff 100644
--- a/builtin/receive-pack.c
+++ b/builtin/receive-pack.c
@@ -57,6 +57,7 @@ static int advertise_push_options;
 static int advertise_sid;
 static int unpack_limit = 100;
 static off_t max_input_size;
+static off_t max_input_object_size;
 static int report_status;
 static int report_status_v2;
 static int use_sideband;
@@ -242,6 +243,11 @@ static int receive_pack_config(const char *var, const char *value, void *cb)
 		return 0;
 	}
 
+	if (strcmp(var, "receive.maxinputobjectsize") == 0) {
+		max_input_object_size = git_config_int64(var, value);
+		return 0;
+	}
+
 	if (strcmp(var, "receive.procreceiverefs") == 0) {
 		if (!value)
 			return config_error_nonbool(var);
@@ -2237,6 +2243,9 @@ static const char *unpack(int err_fd, struct shallow_info *si)
 		if (max_input_size)
 			strvec_pushf(&child.args, "--max-input-size=%"PRIuMAX,
 				(uintmax_t)max_input_size);
+		if (max_input_object_size)
+			strvec_pushf(&child.args, "--max-input-object-size=%"PRIuMAX,
+				(uintmax_t)max_input_object_size);
 		child.no_stdout = 1;
 		child.err = err_fd;
 		child.git_cmd = 1;
@@ -2268,6 +2277,9 @@ static const char *unpack(int err_fd, struct shallow_info *si)
 		if (max_input_size)
 			strvec_pushf(&child.args, "--max-input-size=%"PRIuMAX,
 				(uintmax_t)max_input_size);
+		if (max_input_object_size)
+			strvec_pushf(&child.args, "--max-input-object-size=%"PRIuMAX,
+				(uintmax_t)max_input_object_size);
 		child.out = -1;
 		child.err = err_fd;
 		child.git_cmd = 1;
diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
index 4a9466295b..04d9fa918f 100644
--- a/builtin/unpack-objects.c
+++ b/builtin/unpack-objects.c
@@ -22,6 +22,7 @@
 static unsigned char buffer[4096];
 static unsigned int offset, len;
 static off_t consumed_bytes;
 static off_t max_input_size;
+static off_t max_input_object_size;
 static git_hash_ctx ctx;
 static struct fsck_options fsck_options = FSCK_OPTIONS_STRICT;
 static struct progress *progress;
@@ -466,6 +467,9 @@ static void unpack_one(unsigned nr)
 		shift += 7;
 	}
 
+	if (max_input_object_size && size > max_input_object_size)
+		die(_("object exceeds maximum allowed size "));
+
 	switch (type) {
 	case OBJ_COMMIT:
 	case OBJ_TREE:
@@ -568,6 +572,10 @@ int cmd_unpack_objects(int argc, const char **argv, const char *prefix)
 			max_input_size = strtoumax(arg, NULL, 10);
 			continue;
 		}
+		if (skip_prefix(arg, "--max-input-object-size=", &arg)) {
+			max_input_object_size = strtoumax(arg, NULL, 10);
+			continue;
+		}
 		usage(unpack_usage);
 	}
diff --git a/t/t5546-receive-limits.sh b/t/t5546-receive-limits.sh
index 0b0e987fdb..52c8f1f2e8 100755
--- a/t/t5546-receive-limits.sh
+++ b/t/t5546-receive-limits.sh
@@ -41,6 +41,31 @@ test_pack_input_limit () {
 		git --git-dir=dest config receive.maxInputSize 0 &&
 		git push dest HEAD
 	'
+
+	test_expect_success 'prepare destination repository (test for large object)' '
+		rm -fr dest &&
+		git --bare init dest
+	'
+
+	test_expect_success 'setting receive.maxInputObjectSize to 512 rejects push large object' '
+		git -C dest config receive.maxInputObjectSize 512 &&
+		test_must_fail git push dest HEAD
+	'
+
+	test_expect_success 'bumping limit to 2k allows push large object' '
+		git -C dest config receive.maxInputObjectSize 2k &&
+		git push dest HEAD
+	'
+
+	test_expect_success 'prepare destination repository (test for large object,again)' '
+		rm -fr dest &&
+		git --bare init dest
+	'
+
+	test_expect_success 'lifting the limit allows push' '
+		git -C dest config receive.maxInputObjectSize 0 &&
+		git push dest HEAD
+	'
 }
 
 test_expect_success "create known-size (1024 bytes) commit" '
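For reference, if this series were applied, the new knob would be set
on the server like the existing receive.maxInputSize; a sketch
(receive.maxInputObjectSize is what this patch proposes and is not part
of stock git, and the repository path is illustrative):

# Server side: refuse any push containing a single pack entry whose
# inflated payload exceeds 100 MB, as proposed by this series.
git -C /srv/repo.git config receive.maxInputObjectSize 100m

# A pusher exceeding the limit would see index-pack or unpack-objects
# fail with "object exceeds maximum allowed size".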