From patchwork Tue Dec 21 11:51:57 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Han Xin X-Patchwork-Id: 12689571 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 00582C433EF for ; Tue, 21 Dec 2021 11:54:21 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234118AbhLULyV (ORCPT ); Tue, 21 Dec 2021 06:54:21 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37962 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230245AbhLULyU (ORCPT ); Tue, 21 Dec 2021 06:54:20 -0500 Received: from mail-pg1-x535.google.com (mail-pg1-x535.google.com [IPv6:2607:f8b0:4864:20::535]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 24250C061574 for ; Tue, 21 Dec 2021 03:54:20 -0800 (PST) Received: by mail-pg1-x535.google.com with SMTP id a23so12126156pgm.4 for ; Tue, 21 Dec 2021 03:54:20 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=Wko/UTdkqZqdjoCsc+Wmy6d9EL+WyoI9mYH4NOXx1TM=; b=P4uMSxVGhk5Sp9EbhYYdVUj5KxI/k3pEsJjSdUA4+0iHpo+1Sf4bE863DzK0++NsF2 lwk08ElX5zBcssdEQvj1wGxeiGgQFTtdu+xavrUjB+6D9GHUo5U/DtvyTPczHmbynQRq tvIoxlK8he3V5/j/MI6vGKhATfPg1VCrLPT+vBUTelnOHy2YdUQ4dqbG7bPxflUchLFl P15mKzUDMEt5ofSfpiVLB2QQSL5SL9ZVW4JVTbtVlUL/rBCldQCXvzKKVsKanSIae1FD eQTNLWvkaNEcRn7ItnjLps9Th1KF0wyRRyJyVOOSJPmg2jozlI7GpMgYQKvXl6E5g8CM yBIw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=Wko/UTdkqZqdjoCsc+Wmy6d9EL+WyoI9mYH4NOXx1TM=; 
b=ivrnMX49pTSzQipfPA1pSR58+lb/3Eg/c1+H8ztEtoAfyoL8BQCqpI6up7OF0P+uZq q5Q9hHa9Bvj5y4WyozDDN52kPEDutHO/Rte55CnAIHZqar47ESsdemtXjEOLE4Weomrj sALEQoe1Mzcwa8UQPlJStZ/jldAcbTPuiGz39kknH0O+DgjqWxRKzD+X/1F89sydMf7h Kfji8d1t9KJq22p6qmDVs45RNxCYcTATJ3AaRBiXOVpJSajjAdEgUcNvzSvkB64PBvYm fzU2C2OQftOP7vEu60BWBa2FIU174eLeXGcmRV/MQe/4DpaeHs6/OknsbFBHZk7pblPk l6tw== X-Gm-Message-State: AOAM530EyzfT62Uh2mGw+qUSs99/RJPWiDUxmVe8eGepEKf2tYIx+SCb tT82w98N8edTQpz8nMRszpc= X-Google-Smtp-Source: ABdhPJwvXjSp9+wwExUlJInVVX4kyI45Cz0jl15QOixd3S9ytTZvOsGC0f9QJi9QBEY0rA+bptIAiw== X-Received: by 2002:a63:6c81:: with SMTP id h123mr2657966pgc.313.1640087659612; Tue, 21 Dec 2021 03:54:19 -0800 (PST) Received: from localhost.localdomain ([205.204.117.103]) by smtp.gmail.com with ESMTPSA id s30sm20513742pfw.57.2021.12.21.03.54.16 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Tue, 21 Dec 2021 03:54:19 -0800 (PST) From: Han Xin To: Junio C Hamano , Git List , Jeff King , Jiang Xin , Philip Oakley , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBC?= =?utf-8?b?amFybWFzb24=?= , Derrick Stolee , =?utf-8?q?Ren=C3=A9_Scharfe?= Cc: Han Xin Subject: [PATCH v7 1/5] unpack-objects.c: add dry_run mode for get_data() Date: Tue, 21 Dec 2021 19:51:57 +0800 Message-Id: <20211221115201.12120-2-chiyutianyi@gmail.com> X-Mailer: git-send-email 2.34.1.52.g80008efde6.agit.6.5.6 In-Reply-To: <20211217112629.12334-1-chiyutianyi@gmail.com> References: <20211217112629.12334-1-chiyutianyi@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Han Xin In dry_run mode, "get_data()" is used to verify the inflation of data, and the returned buffer will not be used at all and will be freed immediately. Even in dry_run mode, it is dangerous to allocate a full-size buffer for a large blob object. Therefore, only allocate a low memory footprint when calling "get_data()" in dry_run mode. 
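The buffer-size choice described above can be sketched in isolation. This is a minimal sketch and not the patch's code: pick_bufsize() and count_buffer_passes() are hypothetical names, zlib is left out entirely, and only the 8192-byte cap is taken from the diff.

```c
#include <assert.h>
#include <stddef.h>

#define DRY_RUN_BUFSIZE 8192 /* cap used by the patch in dry_run mode */

/* Hypothetical helper: the output buffer size get_data() would pick. */
static size_t pick_bufsize(size_t size, int dry_run)
{
	if (dry_run && size > DRY_RUN_BUFSIZE)
		return DRY_RUN_BUFSIZE;
	return size;
}

/*
 * Hypothetical stand-in for the inflate loop: counts how many times a
 * buffer of "bufsize" bytes is filled while "size" bytes of inflated
 * output are produced. In dry_run mode each pass reuses the same buffer.
 */
static size_t count_buffer_passes(size_t size, size_t bufsize)
{
	size_t passes = 0, remaining = size;

	assert(bufsize > 0 || size == 0);
	while (remaining) {
		size_t chunk = remaining < bufsize ? remaining : bufsize;
		remaining -= chunk;
		passes++;
	}
	return passes;
}
```

Under these assumptions, a dry run over one of the 1500000-byte blobs from the test would keep roughly 8192 bytes allocated and fill that buffer 184 times, instead of allocating the full object size.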
Suggested-by: Jiang Xin Signed-off-by: Han Xin --- builtin/unpack-objects.c | 23 +++++++++--- t/t5590-unpack-non-delta-objects.sh | 57 +++++++++++++++++++++++++++++ 2 files changed, 74 insertions(+), 6 deletions(-) create mode 100755 t/t5590-unpack-non-delta-objects.sh diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c index 4a9466295b..9104eb48da 100644 --- a/builtin/unpack-objects.c +++ b/builtin/unpack-objects.c @@ -96,15 +96,21 @@ static void use(int bytes) display_throughput(progress, consumed_bytes); } -static void *get_data(unsigned long size) +static void *get_data(size_t size, int dry_run) { git_zstream stream; - void *buf = xmallocz(size); + size_t bufsize; + void *buf; memset(&stream, 0, sizeof(stream)); + if (dry_run && size > 8192) + bufsize = 8192; + else + bufsize = size; + buf = xmallocz(bufsize); stream.next_out = buf; - stream.avail_out = size; + stream.avail_out = bufsize; stream.next_in = fill(1); stream.avail_in = len; git_inflate_init(&stream); @@ -124,6 +130,11 @@ static void *get_data(unsigned long size) } stream.next_in = fill(1); stream.avail_in = len; + if (dry_run) { + /* reuse the buffer in dry_run mode */ + stream.next_out = buf; + stream.avail_out = bufsize; + } } git_inflate_end(&stream); return buf; @@ -323,7 +334,7 @@ static void added_object(unsigned nr, enum object_type type, static void unpack_non_delta_entry(enum object_type type, unsigned long size, unsigned nr) { - void *buf = get_data(size); + void *buf = get_data(size, dry_run); if (!dry_run && buf) write_object(nr, type, buf, size); @@ -357,7 +368,7 @@ static void unpack_delta_entry(enum object_type type, unsigned long delta_size, if (type == OBJ_REF_DELTA) { oidread(&base_oid, fill(the_hash_algo->rawsz)); use(the_hash_algo->rawsz); - delta_data = get_data(delta_size); + delta_data = get_data(delta_size, dry_run); if (dry_run || !delta_data) { free(delta_data); return; @@ -396,7 +407,7 @@ static void unpack_delta_entry(enum object_type type, unsigned long 
delta_size, if (base_offset <= 0 || base_offset >= obj_list[nr].offset) die("offset value out of bound for delta base object"); - delta_data = get_data(delta_size); + delta_data = get_data(delta_size, dry_run); if (dry_run || !delta_data) { free(delta_data); return; diff --git a/t/t5590-unpack-non-delta-objects.sh b/t/t5590-unpack-non-delta-objects.sh new file mode 100755 index 0000000000..48c4fb1ba3 --- /dev/null +++ b/t/t5590-unpack-non-delta-objects.sh @@ -0,0 +1,57 @@ +#!/bin/sh +# +# Copyright (c) 2021 Han Xin +# + +test_description='Test unpack-objects with non-delta objects' + +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME + +. ./test-lib.sh + +prepare_dest () { + test_when_finished "rm -rf dest.git" && + git init --bare dest.git +} + +test_expect_success "setup repo with big blobs (1.5 MB)" ' + test-tool genrandom foo 1500000 >big-blob && + test_commit --append foo big-blob && + test-tool genrandom bar 1500000 >big-blob && + test_commit --append bar big-blob && + ( + cd .git && + find objects/?? -type f | sort + ) >expect && + PACK=$(echo main | git pack-objects --revs test) +' + +test_expect_success 'setup env: GIT_ALLOC_LIMIT to 1MB' ' + GIT_ALLOC_LIMIT=1m && + export GIT_ALLOC_LIMIT +' + +test_expect_success 'fail to unpack-objects: cannot allocate' ' + prepare_dest && + test_must_fail git -C dest.git unpack-objects err && + grep "fatal: attempting to allocate" err && + ( + cd dest.git && + find objects/?? -type f | sort + ) >actual && + test_file_not_empty actual && + ! 
test_cmp expect actual +' + +test_expect_success 'unpack-objects dry-run' ' + prepare_dest && + git -C dest.git unpack-objects -n actual && + test_must_be_empty actual +' + +test_done From patchwork Tue Dec 21 11:51:58 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Han Xin X-Patchwork-Id: 12689573 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 44880C433EF for ; Tue, 21 Dec 2021 11:54:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237306AbhLULyX (ORCPT ); Tue, 21 Dec 2021 06:54:23 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37974 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230245AbhLULyX (ORCPT ); Tue, 21 Dec 2021 06:54:23 -0500 Received: from mail-pg1-x535.google.com (mail-pg1-x535.google.com [IPv6:2607:f8b0:4864:20::535]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 068C8C06173F for ; Tue, 21 Dec 2021 03:54:23 -0800 (PST) Received: by mail-pg1-x535.google.com with SMTP id g2so9546580pgo.9 for ; Tue, 21 Dec 2021 03:54:23 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=n3urYzHYUkeKMa/G9nrDCMMil6jsiAeI2j0Zt4p2eys=; b=JJ8Tn52SqDqKZLfEdc45oWkt0xL28NPwfpoYcuFytTx7L5+Nhol1VWs1iQ35cpOB9g CIV9oyW6KHF3jazp3hEjykfKI8EEnwFxvF9R0Hna5i9Jzb0+sf6h78BigItucfs92GHO sWGL5/7qZA+tteS4uW26ltw/vdRhER0ls19ZJP9vgtmDS3ZfYSKRVWDUIS9ZlZkUvhQm HE8zVNMLFfItG3KVm4cxN1enfAL+KmFvarH9hxGR8qBtcwEtSiSgsUWsP2KabgUu9WHk CdmM2NgRtp7Smo08mbykwlVEgr9RXJk2LDj+ltZwmyxDDmv6k/g5iki/dG3jie6UIaCO Lu8A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; 
h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=n3urYzHYUkeKMa/G9nrDCMMil6jsiAeI2j0Zt4p2eys=; b=e4N0GGMve13o1DosMXKIFwb30AG0yhAS4t9xS6isBOCa5CpeN0x8LZo8gLbJCrWhNE XdFk6qYXuKAJuSG4mkUpC8QwQqkWVCxzRoHizfEpeD/0jQW9B433ymRZk7tGtpJK5uDD Zqh2ML7IaiIksz/tCH+03qHrhP08awi/wJ10j0ajuOPEein8WkvAfcx8+VQZQLwVjkkk AJs77NHcEBKqvPjc+7hxBIHWPzP8ww37DratGhQkLUFLiS38QUE/gAg31Me9bGVeLHus tUQMjRTZ9TNE0hwoALhruBUz3eREBTc1SEy9mNzHhJnX3b4pz4y4vFS6thj/cA0bxHeA 1Cpg== X-Gm-Message-State: AOAM530unVx+/BEO5eR8mGg1kDM4GJAWFcEIK/gjwOudWy8tUw/nF1Py 6+xHZQwgOuqWRbRJ486Wmd4= X-Google-Smtp-Source: ABdhPJwzplHRj9r3vjwNwYXrNWxTqqn+lo5UD7FOlRLCsCAwT+q4YO6O2TAIirzKU+YiQVoBQrpuVg== X-Received: by 2002:a63:9143:: with SMTP id l64mr360365pge.495.1640087662576; Tue, 21 Dec 2021 03:54:22 -0800 (PST) Received: from localhost.localdomain ([205.204.117.103]) by smtp.gmail.com with ESMTPSA id s30sm20513742pfw.57.2021.12.21.03.54.19 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Tue, 21 Dec 2021 03:54:22 -0800 (PST) From: Han Xin To: Junio C Hamano , Git List , Jeff King , Jiang Xin , Philip Oakley , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBC?= =?utf-8?b?amFybWFzb24=?= , Derrick Stolee , =?utf-8?q?Ren=C3=A9_Scharfe?= Cc: Han Xin Subject: [PATCH v7 2/5] object-file API: add a format_object_header() function Date: Tue, 21 Dec 2021 19:51:58 +0800 Message-Id: <20211221115201.12120-3-chiyutianyi@gmail.com> X-Mailer: git-send-email 2.34.1.52.g80008efde6.agit.6.5.6 In-Reply-To: <20211217112629.12334-1-chiyutianyi@gmail.com> References: <20211217112629.12334-1-chiyutianyi@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Ævar Arnfjörð Bjarmason Add a convenience function to wrap the xsnprintf() command that generates loose object headers. This code was copy/pasted in various parts of the codebase, let's define it in one place and re-use it from there. 
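As a rough illustration of the wrapper described above, the header can be produced as shown here. This is only a sketch under stated assumptions: sketch_format_object_header() is a hypothetical name, it takes the type as a plain string, and it uses snprintf() plus an assert() where git's xsnprintf() would die on truncation.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for format_object_header(): writes
 * "type size" into str and returns the length including the
 * trailing NUL, i.e. what snprintf() returns + 1.
 */
static int sketch_format_object_header(char *str, size_t size,
				       const char *type_name, size_t objsize)
{
	int n = snprintf(str, size, "%s %" PRIuMAX, type_name,
			 (uintmax_t)objsize);

	/* Assumption: truncation is treated as a programming error. */
	assert(n >= 0 && (size_t)n < size);
	return n + 1;
}
```

The "+ 1" matters because the NUL terminator is part of the loose object header that gets hashed and deflated.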
All except one caller of it had a valid "enum object_type" for us, it's only write_object_file_prepare() which might need to deal with "git hash-object --literally" and a potential garbage type. Let's have the primary API use an "enum object_type", and define an *_extended() function that can take an arbitrary "const char *" for the type. See [1] for the discussion that prompted this patch, i.e. new code in object-file.c that wanted to copy/paste the xsnprintf() invocation. 1. https://lore.kernel.org/git/211213.86bl1l9bfz.gmgdl@evledraar.gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason Signed-off-by: Han Xin --- builtin/index-pack.c | 3 +-- bulk-checkin.c | 4 ++-- cache.h | 21 +++++++++++++++++++++ http-push.c | 2 +- object-file.c | 14 +++++++++++--- 5 files changed, 36 insertions(+), 8 deletions(-) diff --git a/builtin/index-pack.c b/builtin/index-pack.c index c23d01de7d..4a765ddae6 100644 --- a/builtin/index-pack.c +++ b/builtin/index-pack.c @@ -449,8 +449,7 @@ static void *unpack_entry_data(off_t offset, unsigned long size, int hdrlen; if (!is_delta_type(type)) { - hdrlen = xsnprintf(hdr, sizeof(hdr), "%s %"PRIuMAX, - type_name(type),(uintmax_t)size) + 1; + hdrlen = format_object_header(hdr, sizeof(hdr), type, (uintmax_t)size); the_hash_algo->init_fn(&c); the_hash_algo->update_fn(&c, hdr, hdrlen); } else diff --git a/bulk-checkin.c b/bulk-checkin.c index 8785b2ac80..1733a1de4f 100644 --- a/bulk-checkin.c +++ b/bulk-checkin.c @@ -220,8 +220,8 @@ static int deflate_to_pack(struct bulk_checkin_state *state, if (seekback == (off_t) -1) return error("cannot find the current offset"); - header_len = xsnprintf((char *)obuf, sizeof(obuf), "%s %" PRIuMAX, - type_name(type), (uintmax_t)size) + 1; + header_len = format_object_header((char *)obuf, sizeof(obuf), + type, (uintmax_t)size); the_hash_algo->init_fn(&ctx); the_hash_algo->update_fn(&ctx, obuf, header_len); diff --git a/cache.h b/cache.h index cfba463aa9..64071a8d80 100644 --- a/cache.h +++ b/cache.h @@ -1310,6 
+1310,27 @@ enum unpack_loose_header_result unpack_loose_header(git_zstream *stream, unsigned long bufsiz, struct strbuf *hdrbuf); +/** + * format_object_header() is a thin wrapper around xsnprintf() that + * writes the initial "<type> <size>" part of the loose object + * header. It returns the size that snprintf() returns + 1. + * + * The format_object_header_extended() function allows for writing a + * type_name that's not one of the "enum object_type" types. This is + * used for "git hash-object --literally". Pass in OBJ_NONE as the + * type, and a non-NULL "type_str" to do that. + * + * format_object_header() is a convenience wrapper for + * format_object_header_extended(). + */ +int format_object_header_extended(char *str, size_t size, enum object_type type, + const char *type_str, size_t objsize); +static inline int format_object_header(char *str, size_t size, + enum object_type type, size_t objsize) +{ + return format_object_header_extended(str, size, type, NULL, objsize); +} + /** * parse_loose_header() parses the starting " \0" of an * object. If it doesn't follow that format -1 is returned. To check diff --git a/http-push.c b/http-push.c index 3309aaf004..f55e316ff4 100644 --- a/http-push.c +++ b/http-push.c @@ -363,7 +363,7 @@ static void start_put(struct transfer_request *request) git_zstream stream; unpacked = read_object_file(&request->obj->oid, &type, &len); - hdrlen = xsnprintf(hdr, sizeof(hdr), "%s %"PRIuMAX , type_name(type), (uintmax_t)len) + 1; + hdrlen = format_object_header(hdr, sizeof(hdr), type, (uintmax_t)len); /* Set it up */ git_deflate_init(&stream, zlib_compression_level); diff --git a/object-file.c b/object-file.c index eb1426f98c..6bba4766f9 100644 --- a/object-file.c +++ b/object-file.c @@ -1006,6 +1006,14 @@ void *xmmap(void *start, size_t length, return ret; } +int format_object_header_extended(char *str, size_t size, enum object_type type, + const char *typestr, size_t objsize) +{ + const char *s = type == OBJ_NONE ? 
typestr : type_name(type); + + return xsnprintf(str, size, "%s %"PRIuMAX, s, (uintmax_t)objsize) + 1; +} + /* * With an in-core object data in "map", rehash it to make sure the * object name actually matches "oid" to detect object corruption. @@ -1034,7 +1042,7 @@ int check_object_signature(struct repository *r, const struct object_id *oid, return -1; /* Generate the header */ - hdrlen = xsnprintf(hdr, sizeof(hdr), "%s %"PRIuMAX , type_name(obj_type), (uintmax_t)size) + 1; + hdrlen = format_object_header(hdr, sizeof(hdr), obj_type, size); /* Sha1.. */ r->hash_algo->init_fn(&c); @@ -1734,7 +1742,7 @@ static void write_object_file_prepare(const struct git_hash_algo *algo, git_hash_ctx c; /* Generate the header */ - *hdrlen = xsnprintf(hdr, *hdrlen, "%s %"PRIuMAX , type, (uintmax_t)len)+1; + *hdrlen = format_object_header_extended(hdr, *hdrlen, OBJ_NONE, type, len); /* Sha1.. */ algo->init_fn(&c); @@ -2006,7 +2014,7 @@ int force_object_loose(const struct object_id *oid, time_t mtime) buf = read_object(the_repository, oid, &type, &len); if (!buf) return error(_("cannot read object for %s"), oid_to_hex(oid)); - hdrlen = xsnprintf(hdr, sizeof(hdr), "%s %"PRIuMAX , type_name(type), (uintmax_t)len) + 1; + hdrlen = format_object_header(hdr, sizeof(hdr), type, len); ret = write_loose_object(oid, hdr, hdrlen, buf, len, mtime, 0); free(buf); From patchwork Tue Dec 21 11:51:59 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Han Xin X-Patchwork-Id: 12689575 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 59239C433EF for ; Tue, 21 Dec 2021 11:54:27 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237316AbhLULy0 (ORCPT ); Tue, 21 Dec 2021 06:54:26 -0500 Received: from lindbergh.monkeyblade.net 
([23.128.96.19]:37994 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237309AbhLULy0 (ORCPT ); Tue, 21 Dec 2021 06:54:26 -0500 Received: from mail-pl1-x62b.google.com (mail-pl1-x62b.google.com [IPv6:2607:f8b0:4864:20::62b]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E5F30C061574 for ; Tue, 21 Dec 2021 03:54:25 -0800 (PST) Received: by mail-pl1-x62b.google.com with SMTP id u16so4500161plg.9 for ; Tue, 21 Dec 2021 03:54:25 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=7xuPCcE6FjoGRXir7GTGrFu3V5hWaWradHSRqi+FXXs=; b=kLlir31wHoAu/AxBaCpwwcPk/OysfWAkp/L304IbKEbaAsyNluzWfhY+Aq9VsP1FdQ YKNdRIQYNzdlSjKn3suPw1aIozhXeYkrGBY8c28Lgp/gzOvvrnhG+/EFQmCYhI5JkLZW kCCPp9uSzQ29eDCksBTQM2XNLgpnQdKi3g3zGecN0bljxwPuGy3vLMXbJOGLHp6u14kj 6GyPa3edI19WdHsE+VNM5Ga5OOceJhxjDr4OlOa81TQHgaqoMWElLViRg3/yOcN5fPM/ rlebCSyqX5TbL6AE6Uf2B4lV7q0dU9BSx+0/j9lErBVD/amqCN/larekzppCJzqiqluP 0jxA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=7xuPCcE6FjoGRXir7GTGrFu3V5hWaWradHSRqi+FXXs=; b=6sIx9zHvlPGcsW9ZkOV/3/Dm689v/rUzVHY1/mrzY++Abq0V1DPwyFmbooXGpgFn4g Ix+AwkuXutrqkF7BNBip5GhiFjrpHImnJkT1dDsNpvOZ/b+emJpViq9tS+f8htZyrLMR 23HxKPmNaY9SQJJW/S09W5oIWPdm4TVF3YBqpwREzdTsiPMMMJ8zOC4PjQu4AVpLceI4 hUWjy6Zl8jyIDd4s7JYgtLZG0LpL7gMJAeJMViGpYfMwLKdffc6fcc0EgqAW35FVSBWw 0rRBVefxVBYTVl68ajxRrYHQylSWc6D4475nIdiDr9AalzLQ9tsNOSn9W4bPrZx0cjje uMDA== X-Gm-Message-State: AOAM532joKRC6Wyp7UL3VpsuspIA7rBpqoRjgdNLy4xGOsx8wnXNUiWV wOCUwEDVPOW54PL4Ua5R2aI= X-Google-Smtp-Source: ABdhPJyqcu33DkUixZmk5yMrxfsEyYOvd2tNkrPXbG6XhN0oxC5BxLw4NZp1aOaOCnzXZajcPD4iYg== X-Received: by 2002:a17:902:e541:b0:149:35bd:b260 with SMTP id 
n1-20020a170902e54100b0014935bdb260mr2039922plf.41.1640087665417; Tue, 21 Dec 2021 03:54:25 -0800 (PST) Received: from localhost.localdomain ([205.204.117.103]) by smtp.gmail.com with ESMTPSA id s30sm20513742pfw.57.2021.12.21.03.54.22 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Tue, 21 Dec 2021 03:54:24 -0800 (PST) From: Han Xin To: Junio C Hamano , Git List , Jeff King , Jiang Xin , Philip Oakley , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBC?= =?utf-8?b?amFybWFzb24=?= , Derrick Stolee , =?utf-8?q?Ren=C3=A9_Scharfe?= Cc: Han Xin Subject: [PATCH v7 3/5] object-file.c: refactor write_loose_object() to reuse in stream version Date: Tue, 21 Dec 2021 19:51:59 +0800 Message-Id: <20211221115201.12120-4-chiyutianyi@gmail.com> X-Mailer: git-send-email 2.34.1.52.g80008efde6.agit.6.5.6 In-Reply-To: <20211217112629.12334-1-chiyutianyi@gmail.com> References: <20211217112629.12334-1-chiyutianyi@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Han Xin We used to call "get_data()" in "unpack_non_delta_entry()" to read the entire contents of a blob object, no matter how big it is. This implementation may consume all the memory and cause OOM. This can be improved by feeding data to "write_stream_object_file()" as a stream instead of reading the whole object into one buffer. As this new function "write_stream_object_file()" has many similarities with "write_loose_object()", we split "write_loose_object()" into the following steps: 1. Figuring out a path for the (temp) object file. 2. Creating the tempfile. 3. Setting up zlib and writing the header. 4. Writing the object data and handling errors. 5. Optionally, doing something after the write, such as setting the file's mtime if "mtime" is given. 
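Steps 1 and 2 above can be sketched with plain string handling. This is a hedged illustration, not git's code: the *_sketch() names are hypothetical, fixed-size char buffers replace strbuf, and no file is created. The "tmp_obj_XXXXXX" template and the "directory part of the final path" rule are taken from create_tmpfile() in the diff.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Length of the directory part of filename, including the last '/'. */
static size_t directory_size_sketch(const char *filename)
{
	const char *slash = strrchr(filename, '/');

	return slash ? (size_t)(slash - filename) + 1 : 0;
}

/* Step 1: "<objdir>/<first two hex digits>/<remaining hex digits>". */
static int loose_path_sketch(char *out, size_t outsz,
			     const char *objdir, const char *hex)
{
	int n;

	assert(strlen(hex) > 2);
	n = snprintf(out, outsz, "%s/%.2s/%s", objdir, hex, hex + 2);
	return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}

/* Step 2: the mkstemp-style template lives next to the final object. */
static int tmp_template_sketch(char *out, size_t outsz, const char *filename)
{
	size_t dirlen = directory_size_sketch(filename);
	int n = snprintf(out, outsz, "%.*s%s", (int)dirlen, filename,
			 "tmp_obj_XXXXXX");

	return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}
```

Keeping the temporary file in the same directory as the final object is what lets the last step be a same-directory rename, which the existing comment in create_tmpfile() notes is safer on filesystems such as FAT, NFS and Coda.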
Helped-by: Ævar Arnfjörð Bjarmason Signed-off-by: Han Xin --- object-file.c | 100 ++++++++++++++++++++++++++++++++------------------ 1 file changed, 65 insertions(+), 35 deletions(-) diff --git a/object-file.c b/object-file.c index 6bba4766f9..e048f3d39e 100644 --- a/object-file.c +++ b/object-file.c @@ -1751,6 +1751,25 @@ static void write_object_file_prepare(const struct git_hash_algo *algo, algo->final_oid_fn(oid, &c); } +/* + * Move the just written object with proper mtime into its final resting place. + */ +static int finalize_object_file_with_mtime(const char *tmpfile, + const char *filename, + time_t mtime, + unsigned flags) +{ + struct utimbuf utb; + + if (mtime) { + utb.actime = mtime; + utb.modtime = mtime; + if (utime(tmpfile, &utb) < 0 && !(flags & HASH_SILENT)) + warning_errno(_("failed utime() on %s"), tmpfile); + } + return finalize_object_file(tmpfile, filename); +} + /* * Move the just written object into its final resting place. */ @@ -1836,7 +1855,8 @@ static inline int directory_size(const char *filename) * We want to avoid cross-directory filename renames, because those * can have problems on various filesystems (FAT, NFS, Coda). 
*/ -static int create_tmpfile(struct strbuf *tmp, const char *filename) +static int create_tmpfile(struct strbuf *tmp, const char *filename, + unsigned flags) { int fd, dirlen = directory_size(filename); @@ -1844,7 +1864,9 @@ static int create_tmpfile(struct strbuf *tmp, const char *filename) strbuf_add(tmp, filename, dirlen); strbuf_addstr(tmp, "tmp_obj_XXXXXX"); fd = git_mkstemp_mode(tmp->buf, 0444); - if (fd < 0 && dirlen && errno == ENOENT) { + do { + if (fd >= 0 || !dirlen || errno != ENOENT) + break; /* * Make sure the directory exists; note that the contents * of the buffer are undefined after mkstemp returns an @@ -1854,17 +1876,48 @@ static int create_tmpfile(struct strbuf *tmp, const char *filename) strbuf_reset(tmp); strbuf_add(tmp, filename, dirlen - 1); if (mkdir(tmp->buf, 0777) && errno != EEXIST) - return -1; + break; if (adjust_shared_perm(tmp->buf)) - return -1; + break; /* Try again */ strbuf_addstr(tmp, "/tmp_obj_XXXXXX"); fd = git_mkstemp_mode(tmp->buf, 0444); + } while (0); + + if (fd < 0 && !(flags & HASH_SILENT)) { + if (errno == EACCES) + return error(_("insufficient permission for adding an " + "object to repository database %s"), + get_object_directory()); + else + return error_errno(_("unable to create temporary file")); } + return fd; } +static void setup_stream_and_header(git_zstream *stream, + unsigned char *compressed, + unsigned long compressed_size, + git_hash_ctx *c, + char *hdr, + int hdrlen) +{ + /* Set it up */ + git_deflate_init(stream, zlib_compression_level); + stream->next_out = compressed; + stream->avail_out = compressed_size; + the_hash_algo->init_fn(c); + + /* First header.. 
*/ + stream->next_in = (unsigned char *)hdr; + stream->avail_in = hdrlen; + while (git_deflate(stream, 0) == Z_OK) + ; /* nothing */ + the_hash_algo->update_fn(c, hdr, hdrlen); +} + static int write_loose_object(const struct object_id *oid, char *hdr, int hdrlen, const void *buf, unsigned long len, time_t mtime, unsigned flags) @@ -1879,28 +1932,13 @@ static int write_loose_object(const struct object_id *oid, char *hdr, loose_object_path(the_repository, &filename, oid); - fd = create_tmpfile(&tmp_file, filename.buf); - if (fd < 0) { - if (flags & HASH_SILENT) - return -1; - else if (errno == EACCES) - return error(_("insufficient permission for adding an object to repository database %s"), get_object_directory()); - else - return error_errno(_("unable to create temporary file")); - } - - /* Set it up */ - git_deflate_init(&stream, zlib_compression_level); - stream.next_out = compressed; - stream.avail_out = sizeof(compressed); - the_hash_algo->init_fn(&c); + fd = create_tmpfile(&tmp_file, filename.buf, flags); + if (fd < 0) + return -1; - /* First header.. */ - stream.next_in = (unsigned char *)hdr; - stream.avail_in = hdrlen; - while (git_deflate(&stream, 0) == Z_OK) - ; /* nothing */ - the_hash_algo->update_fn(&c, hdr, hdrlen); + /* Set it up and write header */ + setup_stream_and_header(&stream, compressed, sizeof(compressed), + &c, hdr, hdrlen); /* Then the data itself.. 
*/ stream.next_in = (void *)buf; @@ -1929,16 +1967,8 @@ static int write_loose_object(const struct object_id *oid, char *hdr, close_loose_object(fd); - if (mtime) { - struct utimbuf utb; - utb.actime = mtime; - utb.modtime = mtime; - if (utime(tmp_file.buf, &utb) < 0 && - !(flags & HASH_SILENT)) - warning_errno(_("failed utime() on %s"), tmp_file.buf); - } - - return finalize_object_file(tmp_file.buf, filename.buf); + return finalize_object_file_with_mtime(tmp_file.buf, filename.buf, + mtime, flags); } static int freshen_loose_object(const struct object_id *oid) From patchwork Tue Dec 21 11:52:00 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Han Xin X-Patchwork-Id: 12689577 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 30B0BC433F5 for ; Tue, 21 Dec 2021 11:54:33 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237324AbhLULyc (ORCPT ); Tue, 21 Dec 2021 06:54:32 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38018 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237309AbhLULy3 (ORCPT ); Tue, 21 Dec 2021 06:54:29 -0500 Received: from mail-pj1-x1032.google.com (mail-pj1-x1032.google.com [IPv6:2607:f8b0:4864:20::1032]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E086BC061574 for ; Tue, 21 Dec 2021 03:54:28 -0800 (PST) Received: by mail-pj1-x1032.google.com with SMTP id a11-20020a17090a854b00b001b11aae38d6so2988935pjw.2 for ; Tue, 21 Dec 2021 03:54:28 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=lIqUgiIIGqudgWavCOFk/d55BKq4TNgLX/VbG+IpJ9w=; 
b=COnbZXOCakXfC7tii+LF1YJ/ZL46I8/nWjt3ztoy9c/3AhXC3edGhYlT8DfBSFr7YE /PTwQchCByJFCamlKFTMTsZAVlUEg41W47A+KIZnYhFY/uEdrKfVr+wESqdMQZc3E3GV eM2NxnAddZic7EbCChZ0ezauPosIwiwwIJ1Y5QKYniu5uPti6zZmuy0pSy4Bh4okl+e6 ULq340J/EysnCKMfeCSqhZ1uYBLMBhSFfX4v1AD9fCf8EciLvBBowje0n6nAbSI5xXoY mdHzUjcIEr1tWn9j/+guVzodQCG1QXiULQY4XeLwwUWIllf1dNESw8l74BwazIREvZPl aFEw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=lIqUgiIIGqudgWavCOFk/d55BKq4TNgLX/VbG+IpJ9w=; b=t6O9uGQU/WfcyJ5mwh1x+9w9wMrcYThgjG/fa9epPPEztH9HKzKrmiB3aE+7nUTwWW sUC+/9loq9mhxgVC5EUl4+1vJKl66+91dGtqsBQTCEEEcyIOD3BcLOEpj+iGYVSJ12Wb kOHCbHoqUMiHFdvd8Hd3af/X33Ujtu8ywBQ1z9JLD60nJghcswyOcmIwokBMXr+/sk6N a6LTlCn7EkG+TqWxQO3ojQs8FDsAcYLuYzmWukDoVs/sCDU/N+y49AGSsaPrZlzvNYxK 41WlpKH4lBpV+zYShF7EcooZcxhk0HQKwxqpRQFT3ZZ18UEt5O6kN1KcPUTUbInmvp7j TqTA== X-Gm-Message-State: AOAM5319IGw2Jn2TMkpg7bnN0s3jvGKMronYWM95b+dRSJlYAfkA5eEB VN8Btve6voHq0I4irz+aDN0= X-Google-Smtp-Source: ABdhPJxtrW13BFQcLrN3WxeYA7zGzs3dBzjGrHhIEQ6u+IVVd/tBkkStLWrDApOxn1OaxyZPZBZ0MA== X-Received: by 2002:a17:90b:1812:: with SMTP id lw18mr3621680pjb.196.1640087668482; Tue, 21 Dec 2021 03:54:28 -0800 (PST) Received: from localhost.localdomain ([205.204.117.103]) by smtp.gmail.com with ESMTPSA id s30sm20513742pfw.57.2021.12.21.03.54.25 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Tue, 21 Dec 2021 03:54:27 -0800 (PST) From: Han Xin To: Junio C Hamano , Git List , Jeff King , Jiang Xin , Philip Oakley , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBC?= =?utf-8?b?amFybWFzb24=?= , Derrick Stolee , =?utf-8?q?Ren=C3=A9_Scharfe?= Cc: Han Xin Subject: [PATCH v7 4/5] object-file.c: add "write_stream_object_file()" to support read in stream Date: Tue, 21 Dec 2021 19:52:00 +0800 Message-Id: <20211221115201.12120-5-chiyutianyi@gmail.com> X-Mailer: git-send-email 2.34.1.52.g80008efde6.agit.6.5.6 In-Reply-To: 
<20211217112629.12334-1-chiyutianyi@gmail.com> References: <20211217112629.12334-1-chiyutianyi@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Han Xin We used to call "get_data()" in "unpack_non_delta_entry()" to read the entire contents of a blob object, no matter how big it is. For a large blob this can consume all available memory and cause an OOM failure. This can be improved by feeding the data to "write_stream_object_file()" as a stream. The input stream is implemented as an interface. Unlike "write_loose_object()", the streaming version has no chance to run "write_object_file_prepare()" to calculate the oid in advance. In "write_loose_object()", we know the oid and can write the temporary file in the same directory as the final object. For an object whose oid is not yet determined, we don't know the final directory, so we have to save the temporary file in the ".git/objects/" directory instead. "freshen_packed_object()" or "freshen_loose_object()" will be called inside "write_stream_object_file()" after obtaining the "oid". 
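To make the "input stream as an interface" idea concrete, here is a small sketch of one possible producer and consumer. The struct input_stream layout mirrors the one this patch adds to object-store.h; everything else (struct mem_source, mem_read(), drain()) is hypothetical and exists only for illustration. The real consumer deflates and hashes each chunk, while this one merely counts bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as the struct added to object-store.h by this patch. */
struct input_stream {
	const void *(*read)(struct input_stream *, unsigned long *len);
	void *data;
};

/* Hypothetical backing state: an in-memory buffer served in chunks. */
struct mem_source {
	const unsigned char *buf;
	size_t len, pos, chunk;
};

/* Hand out the next chunk and report its length through *len. */
static const void *mem_read(struct input_stream *in, unsigned long *len)
{
	struct mem_source *src = in->data;
	size_t n = src->len - src->pos;
	const void *p = src->buf + src->pos;

	if (n > src->chunk)
		n = src->chunk;
	src->pos += n;
	*len = n;
	return p;
}

/* Hypothetical consumer: pulls chunks until "expected" bytes were seen. */
static size_t drain(struct input_stream *in, size_t expected, unsigned *reads)
{
	size_t total = 0;

	*reads = 0;
	while (total < expected) {
		unsigned long n = 0;
		const void *chunk = in->read(in, &n);

		(void)chunk; /* a real consumer would deflate and hash it */
		if (!n)
			break;
		total += n;
		(*reads)++;
	}
	return total;
}
```

Because the caller passes the total length up front, the consumer can tell when the last chunk has arrived, which is how the patch decides when to switch zlib to Z_FINISH.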
Helped-by: René Scharfe Helped-by: Ævar Arnfjörð Bjarmason Helped-by: Jiang Xin Signed-off-by: Han Xin --- object-file.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++ object-store.h | 9 ++++++ 2 files changed, 94 insertions(+) diff --git a/object-file.c b/object-file.c index e048f3d39e..d0573e2a61 100644 --- a/object-file.c +++ b/object-file.c @@ -1989,6 +1989,91 @@ static int freshen_packed_object(const struct object_id *oid) return 1; } +int write_stream_object_file(struct input_stream *in_stream, size_t len, + enum object_type type, time_t mtime, + unsigned flags, struct object_id *oid) +{ + int fd, ret, flush = 0; + unsigned char compressed[4096]; + git_zstream stream; + git_hash_ctx c; + struct object_id parano_oid; + static struct strbuf tmp_file = STRBUF_INIT; + static struct strbuf filename = STRBUF_INIT; + int dirlen; + char hdr[MAX_HEADER_LEN]; + int hdrlen = sizeof(hdr); + + /* Since "filename" is defined as static, it will be reused. So reset it + * first before using it. */ + strbuf_reset(&filename); + /* When oid is not determined, save tmp file to odb path. */ + strbuf_addf(&filename, "%s/", get_object_directory()); + + fd = create_tmpfile(&tmp_file, filename.buf, flags); + if (fd < 0) + return -1; + + hdrlen = format_object_header(hdr, hdrlen, type, len); + + /* Set it up and write header */ + setup_stream_and_header(&stream, compressed, sizeof(compressed), + &c, hdr, hdrlen); + + /* Then the data itself.. */ + do { + unsigned char *in0 = stream.next_in; + if (!stream.avail_in) { + const void *in = in_stream->read(in_stream, &stream.avail_in); + stream.next_in = (void *)in; + in0 = (unsigned char *)in; + /* All data has been read. 
*/ + if (len + hdrlen == stream.total_in + stream.avail_in) + flush = Z_FINISH; + } + ret = git_deflate(&stream, flush); + the_hash_algo->update_fn(&c, in0, stream.next_in - in0); + if (write_buffer(fd, compressed, stream.next_out - compressed) < 0) + die(_("unable to write loose object file")); + stream.next_out = compressed; + stream.avail_out = sizeof(compressed); + } while (ret == Z_OK || ret == Z_BUF_ERROR); + + if (ret != Z_STREAM_END) + die(_("unable to deflate new object streamingly (%d)"), ret); + ret = git_deflate_end_gently(&stream); + if (ret != Z_OK) + die(_("deflateEnd on object streamingly failed (%d)"), ret); + the_hash_algo->final_oid_fn(&parano_oid, &c); + + close_loose_object(fd); + + oidcpy(oid, &parano_oid); + + if (freshen_packed_object(oid) || freshen_loose_object(oid)) { + unlink_or_warn(tmp_file.buf); + return 0; + } + + loose_object_path(the_repository, &filename, oid); + + /* We finally know the object path, and create the missing dir. */ + dirlen = directory_size(filename.buf); + if (dirlen) { + struct strbuf dir = STRBUF_INIT; + strbuf_add(&dir, filename.buf, dirlen - 1); + + if (mkdir_in_gitdir(dir.buf) && errno != EEXIST) { + ret = error_errno(_("unable to create directory %s"), dir.buf); + strbuf_release(&dir); + return ret; + } + strbuf_release(&dir); + } + + return finalize_object_file_with_mtime(tmp_file.buf, filename.buf, mtime, flags); +} + int write_object_file_flags(const void *buf, unsigned long len, const char *type, struct object_id *oid, unsigned flags) diff --git a/object-store.h b/object-store.h index 952efb6a4b..061b0cb2ba 100644 --- a/object-store.h +++ b/object-store.h @@ -34,6 +34,11 @@ struct object_directory { char *path; }; +struct input_stream { + const void *(*read)(struct input_stream *, unsigned long *len); + void *data; +}; + KHASH_INIT(odb_path_map, const char * /* key: odb_path */, struct object_directory *, 1, fspathhash, fspatheq) @@ -232,6 +237,10 @@ static inline int write_object_file(const void *buf, unsigned 
long len, return write_object_file_flags(buf, len, type, oid, 0); } +int write_stream_object_file(struct input_stream *in_stream, size_t len, + enum object_type type, time_t mtime, + unsigned flags, struct object_id *oid); + int hash_object_file_literally(const void *buf, unsigned long len, const char *type, struct object_id *oid, unsigned flags); From patchwork Tue Dec 21 11:52:01 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Han Xin X-Patchwork-Id: 12689579 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 38512C433FE for ; Tue, 21 Dec 2021 11:54:34 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237322AbhLULyd (ORCPT ); Tue, 21 Dec 2021 06:54:33 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38032 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237320AbhLULyc (ORCPT ); Tue, 21 Dec 2021 06:54:32 -0500 Received: from mail-pg1-x536.google.com (mail-pg1-x536.google.com [IPv6:2607:f8b0:4864:20::536]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EE2E3C061574 for ; Tue, 21 Dec 2021 03:54:31 -0800 (PST) Received: by mail-pg1-x536.google.com with SMTP id a23so12126507pgm.4 for ; Tue, 21 Dec 2021 03:54:31 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=L4iN/fUSOIzfCbZx26NYnzqWE8xvQjRvnk3kusvsTq4=; b=VO8/AnISlXw35/KuBTYm5/kBuL/f1ZCMp6Du45joxz/z1a1MBhoV69LP3xbMRmfwCg TIvEsYVbxTcJh9295Le64EqcUtjlrfMG4R2jAYSuHVrhX4PQVipSTivQTdOChfjXaZQC ZdXcQKwVynh0PEC6TcbJDjfrEKBq9eZrSTgfX5yX4OR0iBU/iwjNCvoSpDFNcH54v93P eoJXostqoTn2MNwmeTMexL6MSeC870igTP21fy4HEEIXH188/1Ocy5fpgBHKKggF7jDu 
Iosf11fSW2lkULUDxZompHhx+PZQSURbKswZ661X8MeMaUoFxSENO7vDPMCBkNkf81/C AK+Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=L4iN/fUSOIzfCbZx26NYnzqWE8xvQjRvnk3kusvsTq4=; b=ndUaPSNaC8kznsG6ICugINl+D/aOKtvhdvzBIzFmNh0i+RDC4UNUflHRi5W5HIgYFE 2doRnwxdbYQpk6Tcs0agDLSSPHLbWEBgsNg3SlQE2OvURHwciJT7WogOIwYc1leDN0q3 1OAW1KNKx0BeJ+47lLYYH8KiOQNioOA3NnNoGzxBy+dScBYKVBUulyu3XIwx7bTPFN+N eICznZipTTgSr3ICDMVwGgjGOZSOHuzbiZYEtaod11gVuyU7tX3uag94lhmXKT+XK4ZQ hICcNqLZ2+hFkm3+S4d5j1GLW3F8ZyCNXno1PPi9iDmob358BJOkFRPNBK70PbJ36nHo 24aw== X-Gm-Message-State: AOAM530ARsUt7z4V9IyfLlNqSofuVTRpy2QeUk+ovkEpHp6P/wjfk6LM MHpbrr8j3d3FCWiEIjmWlLc= X-Google-Smtp-Source: ABdhPJyx5Pv+Ixqr+dyZ/aEQkm+0d/YtPZ6QBSWuhP/6ORG8OcAY0eC42KrOFSHFibA00nI0P78oxg== X-Received: by 2002:a63:f80f:: with SMTP id n15mr2601381pgh.394.1640087671404; Tue, 21 Dec 2021 03:54:31 -0800 (PST) Received: from localhost.localdomain ([205.204.117.103]) by smtp.gmail.com with ESMTPSA id s30sm20513742pfw.57.2021.12.21.03.54.28 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Tue, 21 Dec 2021 03:54:30 -0800 (PST) From: Han Xin To: Junio C Hamano , Git List , Jeff King , Jiang Xin , Philip Oakley , =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBC?= =?utf-8?b?amFybWFzb24=?= , Derrick Stolee , =?utf-8?q?Ren=C3=A9_Scharfe?= Cc: Han Xin Subject: [PATCH v7 5/5] unpack-objects: unpack_non_delta_entry() read data in a stream Date: Tue, 21 Dec 2021 19:52:01 +0800 Message-Id: <20211221115201.12120-6-chiyutianyi@gmail.com> X-Mailer: git-send-email 2.34.1.52.g80008efde6.agit.6.5.6 In-Reply-To: <20211217112629.12334-1-chiyutianyi@gmail.com> References: <20211217112629.12334-1-chiyutianyi@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Han Xin We used to call "get_data()" in "unpack_non_delta_entry()" to read the entire contents of a 
blob object, no matter how big it is. This implementation may consume all available memory and cause an OOM. By implementing a zstream version of the input_stream interface, we can use a small fixed buffer for "unpack_non_delta_entry()". However, unpacking non-delta objects from a stream instead of from an entire buffer carries a 10% performance penalty. Therefore, only unpack objects larger than "core.bigFileStreamingThreshold" via the zstream. See the following benchmarks: hyperfine \ --setup \ 'if ! test -d scalar.git; then git clone --bare https://github.com/microsoft/scalar.git; cp scalar.git/objects/pack/*.pack small.pack; fi' \ --prepare 'rm -rf dest.git && git init --bare dest.git' Summary './git -C dest.git -c core.bigfilethreshold=512m unpack-objects Helped-by: Derrick Stolee Helped-by: Jiang Xin Signed-off-by: Han Xin --- Documentation/config/core.txt | 11 +++++ builtin/unpack-objects.c | 73 ++++++++++++++++++++++++++++- cache.h | 1 + config.c | 5 ++ environment.c | 1 + t/t5590-unpack-non-delta-objects.sh | 36 +++++++++++++- 6 files changed, 125 insertions(+), 2 deletions(-) diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt index c04f62a54a..601b7a2418 100644 --- a/Documentation/config/core.txt +++ b/Documentation/config/core.txt @@ -424,6 +424,17 @@ be delta compressed, but larger binary media files won't be. + Common unit suffixes of 'k', 'm', or 'g' are supported. +core.bigFileStreamingThreshold:: + Files larger than this will be streamed out to a temporary + object file while being hashed, which will then be renamed + in-place to a loose object, particularly if the + `core.bigFileThreshold` setting dictates that they're always + written out as loose objects. ++ +Default is 128 MiB on all platforms. ++ +Common unit suffixes of 'k', 'm', or 'g' are supported. 
+ core.excludesFile:: Specifies the pathname to the file that contains patterns to describe paths that are not meant to be tracked, in addition diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c index 9104eb48da..72d8616e00 100644 --- a/builtin/unpack-objects.c +++ b/builtin/unpack-objects.c @@ -331,11 +331,82 @@ static void added_object(unsigned nr, enum object_type type, } } +struct input_zstream_data { + git_zstream *zstream; + unsigned char buf[8192]; + int status; +}; + +static const void *feed_input_zstream(struct input_stream *in_stream, + unsigned long *readlen) +{ + struct input_zstream_data *data = in_stream->data; + git_zstream *zstream = data->zstream; + void *in = fill(1); + + if (!len || data->status == Z_STREAM_END) { + *readlen = 0; + return NULL; + } + + zstream->next_out = data->buf; + zstream->avail_out = sizeof(data->buf); + zstream->next_in = in; + zstream->avail_in = len; + + data->status = git_inflate(zstream, 0); + use(len - zstream->avail_in); + *readlen = sizeof(data->buf) - zstream->avail_out; + + return data->buf; +} + +static void write_stream_blob(unsigned nr, size_t size) +{ + git_zstream zstream; + struct input_zstream_data data; + struct input_stream in_stream = { + .read = feed_input_zstream, + .data = &data, + }; + + memset(&zstream, 0, sizeof(zstream)); + memset(&data, 0, sizeof(data)); + data.zstream = &zstream; + git_inflate_init(&zstream); + + if (write_stream_object_file(&in_stream, size, OBJ_BLOB, 0, 0, + &obj_list[nr].oid)) + die(_("failed to write object in stream")); + + if (zstream.total_out != size || data.status != Z_STREAM_END) + die(_("inflate returned %d"), data.status); + git_inflate_end(&zstream); + + if (strict) { + struct blob *blob = + lookup_blob(the_repository, &obj_list[nr].oid); + if (blob) + blob->object.flags |= FLAG_WRITTEN; + else + die(_("invalid blob object from stream")); + } + obj_list[nr].obj = NULL; +} + static void unpack_non_delta_entry(enum object_type type, unsigned long size, 
unsigned nr) { - void *buf = get_data(size, dry_run); + void *buf; + + /* Write large blob in stream without allocating full buffer. */ + if (!dry_run && type == OBJ_BLOB && + size > big_file_streaming_threshold) { + write_stream_blob(nr, size); + return; + } + buf = get_data(size, dry_run); if (!dry_run && buf) write_object(nr, type, buf, size); else diff --git a/cache.h b/cache.h index 64071a8d80..8c9123cb5d 100644 --- a/cache.h +++ b/cache.h @@ -974,6 +974,7 @@ extern size_t packed_git_window_size; extern size_t packed_git_limit; extern size_t delta_base_cache_limit; extern unsigned long big_file_threshold; +extern unsigned long big_file_streaming_threshold; extern unsigned long pack_size_limit_cfg; /* diff --git a/config.c b/config.c index c5873f3a70..7b122a142a 100644 --- a/config.c +++ b/config.c @@ -1408,6 +1408,11 @@ static int git_default_core_config(const char *var, const char *value, void *cb) return 0; } + if (!strcmp(var, "core.bigfilestreamingthreshold")) { + big_file_streaming_threshold = git_config_ulong(var, value); + return 0; + } + if (!strcmp(var, "core.packedgitlimit")) { packed_git_limit = git_config_ulong(var, value); return 0; diff --git a/environment.c b/environment.c index 0d06a31024..04bba593de 100644 --- a/environment.c +++ b/environment.c @@ -47,6 +47,7 @@ size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE; size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT; size_t delta_base_cache_limit = 96 * 1024 * 1024; unsigned long big_file_threshold = 512 * 1024 * 1024; +unsigned long big_file_streaming_threshold = 128 * 1024 * 1024; int pager_use_color = 1; const char *editor_program; const char *askpass_program; diff --git a/t/t5590-unpack-non-delta-objects.sh b/t/t5590-unpack-non-delta-objects.sh index 48c4fb1ba3..8436cbf8db 100755 --- a/t/t5590-unpack-non-delta-objects.sh +++ b/t/t5590-unpack-non-delta-objects.sh @@ -13,6 +13,11 @@ export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME prepare_dest () { test_when_finished "rm -rf dest.git" 
&& git init --bare dest.git + if test -n "$1" + then + git -C dest.git config core.bigFileStreamingThreshold $1 + git -C dest.git config core.bigFileThreshold $1 + fi } test_expect_success "setup repo with big blobs (1.5 MB)" ' @@ -33,7 +38,7 @@ test_expect_success 'setup env: GIT_ALLOC_LIMIT to 1MB' ' ' test_expect_success 'fail to unpack-objects: cannot allocate' ' - prepare_dest && + prepare_dest 2m && test_must_fail git -C dest.git unpack-objects err && grep "fatal: attempting to allocate" err && ( @@ -44,6 +49,35 @@ test_expect_success 'fail to unpack-objects: cannot allocate' ' ! test_cmp expect actual ' +test_expect_success 'unpack big object in stream' ' + prepare_dest 1m && + mkdir -p dest.git/objects/05 && + git -C dest.git unpack-objects actual && + test_cmp expect actual +' + +test_expect_success 'unpack big object in stream with existing oids' ' + prepare_dest 1m && + git -C dest.git index-pack --stdin actual && + test_must_be_empty actual && + git -C dest.git unpack-objects actual && + test_must_be_empty actual +' + test_expect_success 'unpack-objects dry-run' ' prepare_dest && git -C dest.git unpack-objects -n