From patchwork Fri Jun 10 14:46:01 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Han Xin
X-Patchwork-Id: 12877673
From: Han Xin
To: avarab@gmail.com
Cc: Han Xin, chiyutianyi@gmail.com, git@vger.kernel.org, gitster@pobox.com,
 l.s.r@web.de, neerajsi@microsoft.com, newren@gmail.com,
 philipoakley@iee.email, stolee@gmail.com, worldhello.net@gmail.com,
 Neeraj Singh, Jiang Xin
Subject: [PATCH v14 1/7] unpack-objects: low memory footprint for get_data() in dry_run mode
Date: Fri, 10 Jun 2022 22:46:01 +0800
X-Mailer: git-send-email 2.36.1
Precedence: bulk
X-Mailing-List: git@vger.kernel.org

From: Han Xin

As the name implies, "get_data(size)" allocates and returns a buffer of
the given size. Allocating memory for a large blob object may cause the
system to run out of memory. In preparation for replacing the calls to
"get_data()" that unpack large blob objects in later commits, refactor
"get_data()" to reduce its memory footprint in dry_run mode.
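
To make the intended saving concrete, here is a standalone sketch of the
buffer policy (plain C, no zlib or git internals). The helper names
scratch_size(), next_avail_out() and count_passes() are hypothetical and
are not part of this series; they only mirror the 8192-byte cap and the
clamp on the remaining output that the diff below applies to the zstream
output buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: size of the output buffer to allocate for an object. */
static size_t scratch_size(int dry_run, size_t size)
{
	/* Outside dry_run the caller needs the whole object in memory. */
	return (dry_run && size > 8192) ? 8192 : size;
}

/*
 * Hypothetical: room to offer the inflater on the next pass when the
 * same buffer is reused; never more than what is left of the object.
 */
static size_t next_avail_out(size_t bufsize, size_t size, size_t total_out)
{
	size_t remaining;

	assert(total_out <= size);
	remaining = size - total_out;
	return bufsize > remaining ? remaining : bufsize;
}

/*
 * Hypothetical: number of times the buffer is filled before "size"
 * bytes of output have been produced, assuming every pass fills it.
 */
static size_t count_passes(size_t size, size_t bufsize)
{
	size_t total_out = 0, passes = 0;

	while (total_out < size) {
		assert(bufsize > 0);
		total_out += next_avail_out(bufsize, size, total_out);
		passes++;
	}
	return passes;
}
```

Under these assumptions, a 1500000-byte blob (the size used by the new
test) needs an 8192-byte output buffer in dry_run mode instead of a
1500000-byte one. The buffer is filled 184 times, the last time with
only 864 bytes of room.
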

Because in dry_run mode "get_data()" is only used to check the
integrity of the data, and the returned buffer is not used at all, we
can allocate a smaller buffer and reuse it as the zstream output. Make
the function return NULL in dry_run mode, as no caller uses the
returned buffer.

The "find [...]objects/?? -type f | wc -l" test idiom being used here
is adapted from the same "find" use added to another test in
d9545c7f465 (fast-import: implement unpack limit, 2016-04-25).

Suggested-by: Jiang Xin
Signed-off-by: Han Xin
Signed-off-by: Ævar Arnfjörð Bjarmason
---
 builtin/unpack-objects.c        | 37 ++++++++++++++++++++---------
 t/t5351-unpack-large-objects.sh | 41 +++++++++++++++++++++++++++++++++
 2 files changed, 67 insertions(+), 11 deletions(-)
 create mode 100755 t/t5351-unpack-large-objects.sh

diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
index 56d05e2725..32e8b47059 100644
--- a/builtin/unpack-objects.c
+++ b/builtin/unpack-objects.c
@@ -97,15 +97,27 @@ static void use(int bytes)
 	display_throughput(progress, consumed_bytes);
 }
 
+/*
+ * Decompress zstream from the standard input into a newly
+ * allocated buffer of specified size and return the buffer.
+ * The caller is responsible to free the returned buffer.
+ *
+ * But for dry_run mode, "get_data()" is only used to check the
+ * integrity of data, and the returned buffer is not used at all.
+ * Therefore, in dry_run mode, "get_data()" will release the small
+ * allocated buffer which is reused to hold temporary zstream output
+ * and return NULL instead of returning garbage data.
+ */
 static void *get_data(unsigned long size)
 {
 	git_zstream stream;
-	void *buf = xmallocz(size);
+	unsigned long bufsize = dry_run && size > 8192 ? 8192 : size;
+	void *buf = xmallocz(bufsize);
 
 	memset(&stream, 0, sizeof(stream));
 
 	stream.next_out = buf;
-	stream.avail_out = size;
+	stream.avail_out = bufsize;
 	stream.next_in = fill(1);
 	stream.avail_in = len;
 	git_inflate_init(&stream);
@@ -125,8 +137,17 @@ static void *get_data(unsigned long size)
 		}
 		stream.next_in = fill(1);
 		stream.avail_in = len;
+		if (dry_run) {
+			/* reuse the buffer in dry_run mode */
+			stream.next_out = buf;
+			stream.avail_out = bufsize > size - stream.total_out ?
+						   size - stream.total_out :
+						   bufsize;
+		}
 	}
 	git_inflate_end(&stream);
+	if (dry_run)
+		FREE_AND_NULL(buf);
 	return buf;
 }
 
@@ -326,10 +347,8 @@ static void unpack_non_delta_entry(enum object_type type, unsigned long size,
 {
 	void *buf = get_data(size);
 
-	if (!dry_run && buf)
+	if (buf)
 		write_object(nr, type, buf, size);
-	else
-		free(buf);
 }
 
 static int resolve_against_held(unsigned nr, const struct object_id *base,
@@ -359,10 +378,8 @@ static void unpack_delta_entry(enum object_type type, unsigned long delta_size,
 		oidread(&base_oid, fill(the_hash_algo->rawsz));
 		use(the_hash_algo->rawsz);
 		delta_data = get_data(delta_size);
-		if (dry_run || !delta_data) {
-			free(delta_data);
+		if (!delta_data)
 			return;
-		}
 		if (has_object_file(&base_oid))
 			; /* Ok we have this one */
 		else if (resolve_against_held(nr, &base_oid,
@@ -398,10 +415,8 @@ static void unpack_delta_entry(enum object_type type, unsigned long delta_size,
 			die("offset value out of bound for delta base object");
 
 		delta_data = get_data(delta_size);
-		if (dry_run || !delta_data) {
-			free(delta_data);
+		if (!delta_data)
 			return;
-		}
 		lo = 0;
 		hi = nr;
 		while (lo < hi) {
diff --git a/t/t5351-unpack-large-objects.sh b/t/t5351-unpack-large-objects.sh
new file mode 100755
index 0000000000..8d84313221
--- /dev/null
+++ b/t/t5351-unpack-large-objects.sh
@@ -0,0 +1,41 @@
+#!/bin/sh
+#
+# Copyright (c) 2022 Han Xin
+#
+
+test_description='git unpack-objects with large objects'
+
+. ./test-lib.sh
+
+prepare_dest () {
+	test_when_finished "rm -rf dest.git" &&
+	git init --bare dest.git
+}
+
+test_expect_success "create large objects (1.5 MB) and PACK" '
+	test-tool genrandom foo 1500000 >big-blob &&
+	test_commit --append foo big-blob &&
+	test-tool genrandom bar 1500000 >big-blob &&
+	test_commit --append bar big-blob &&
+	PACK=$(echo HEAD | git pack-objects --revs pack)
+'
+
+test_expect_success 'set memory limitation to 1MB' '
+	GIT_ALLOC_LIMIT=1m &&
+	export GIT_ALLOC_LIMIT
+'
+
+test_expect_success 'unpack-objects failed under memory limitation' '
+	prepare_dest &&
+	test_must_fail git -C dest.git unpack-objects err &&
+	grep "fatal: attempting to allocate" err
+'
+
+test_expect_success 'unpack-objects works with memory limitation in dry-run mode' '
+	prepare_dest &&
+	git -C dest.git unpack-objects -n

X-Patchwork-Id: 12877674
From: Han Xin
To: avarab@gmail.com
Cc: chiyutianyi@gmail.com, git@vger.kernel.org, gitster@pobox.com,
 l.s.r@web.de, neerajsi@microsoft.com, newren@gmail.com,
 philipoakley@iee.email, stolee@gmail.com, worldhello.net@gmail.com,
 Neeraj Singh
Subject: [PATCH v14 2/7] object-file.c: do fsync() and close() before post-write die()
Date: Fri, 10 Jun 2022 22:46:02 +0800
X-Mailer: git-send-email 2.36.1

From: Ævar Arnfjörð Bjarmason

Change write_loose_object() to do an fsync() and close() before the
oideq() sanity check at the end. This change re-joins code that was
split up by the die() sanity check added in 748af44c63e (sha1_file: be
paranoid when creating loose objects, 2010-02-21).

I don't think that this change matters in itself: if we called die()
it was possible that our data wouldn't fully make it to disk, but in
any case we were writing data that we'd consider corrupted. It's
possible that a subsequent "git fsck" will be less confused now.

The real reason to make this change is that in a subsequent commit
we'll split this code in write_loose_object() into a utility function;
all its callers will want the preceding sanity checks, but not the
"oideq" check. By moving the close_loose_object() earlier it'll be
easier to reason about the introduction of the utility function.

Signed-off-by: Ævar Arnfjörð Bjarmason
---
 object-file.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/object-file.c b/object-file.c
index 79eb8339b6..e4a83012ba 100644
--- a/object-file.c
+++ b/object-file.c
@@ -2012,12 +2012,12 @@ static int write_loose_object(const struct object_id *oid, char *hdr,
 		die(_("deflateEnd on object %s failed (%d)"), oid_to_hex(oid),
 		    ret);
 	the_hash_algo->final_oid_fn(&parano_oid, &c);
+	close_loose_object(fd, tmp_file.buf);
+
 	if (!oideq(oid, &parano_oid))
 		die(_("confused by unstable object source data for %s"),
 		    oid_to_hex(oid));
 
-	close_loose_object(fd, tmp_file.buf);
-
 	if (mtime) {
 		struct utimbuf utb;
 		utb.actime = mtime;

From patchwork Fri Jun 10 14:46:03 2022
X-Patchwork-Submitter: Han Xin
X-Patchwork-Id: 12877675
From: Han Xin
To: avarab@gmail.com
Cc: Han Xin, chiyutianyi@gmail.com, git@vger.kernel.org, gitster@pobox.com,
 l.s.r@web.de, neerajsi@microsoft.com, newren@gmail.com,
 philipoakley@iee.email, stolee@gmail.com, worldhello.net@gmail.com,
 Neeraj Singh, Jiang Xin
Subject: [PATCH v14 3/7] object-file.c: refactor write_loose_object() to several steps
Date: Fri, 10 Jun 2022 22:46:03 +0800
Message-Id: <9bc8002282ddbb13b707c303281d88f377ecbdbe.1654871916.git.chiyutianyi@gmail.com>
X-Mailer: git-send-email 2.36.1

From: Han Xin

When writing a large blob using "write_loose_object()", we have to pass
a buffer with the whole content of the blob, which consumes lots of
memory and may cause OOM. We will introduce a streaming version of the
function ("stream_loose_object()") in a later commit to resolve this
issue.

Before introducing that streaming function, do some refactoring on
"write_loose_object()" to reuse code for both versions.

Rewrite "write_loose_object()" as follows:

 1. Figure out a path for the (temp) object file. This step is only
    used in "write_loose_object()".

 2. Move common steps for starting to write loose objects into a new
    function "start_loose_object_common()".

 3. Compress data.

 4. Move common steps for ending zlib stream into a new function
    "end_loose_object_common()".

 5. Close fd and finalize the object file.

Helped-by: Ævar Arnfjörð Bjarmason
Helped-by: Jiang Xin
Signed-off-by: Han Xin
Signed-off-by: Ævar Arnfjörð Bjarmason
---
 object-file.c | 101 +++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 75 insertions(+), 26 deletions(-)

diff --git a/object-file.c b/object-file.c
index e4a83012ba..f4d7f8c109 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1951,6 +1951,74 @@ static int create_tmpfile(struct strbuf *tmp, const char *filename)
 	return fd;
 }
 
+/**
+ * Common steps for loose object writers to start writing loose
+ * objects:
+ *
+ * - Create tmpfile for the loose object.
+ * - Setup zlib stream for compression.
+ * - Start to feed header to zlib stream.
+ *
+ * Returns a "fd", which should later be provided to
+ * end_loose_object_common().
+ */
+static int start_loose_object_common(struct strbuf *tmp_file,
+				     const char *filename, unsigned flags,
+				     git_zstream *stream,
+				     unsigned char *buf, size_t buflen,
+				     git_hash_ctx *c,
+				     char *hdr, int hdrlen)
+{
+	int fd;
+
+	fd = create_tmpfile(tmp_file, filename);
+	if (fd < 0) {
+		if (flags & HASH_SILENT)
+			return -1;
+		else if (errno == EACCES)
+			return error(_("insufficient permission for adding "
+				       "an object to repository database %s"),
+				     get_object_directory());
+		else
+			return error_errno(
+				_("unable to create temporary file"));
+	}
+
+	/* Setup zlib stream for compression */
+	git_deflate_init(stream, zlib_compression_level);
+	stream->next_out = buf;
+	stream->avail_out = buflen;
+	the_hash_algo->init_fn(c);
+
+	/* Start to feed header to zlib stream */
+	stream->next_in = (unsigned char *)hdr;
+	stream->avail_in = hdrlen;
+	while (git_deflate(stream, 0) == Z_OK)
+		; /* nothing */
+	the_hash_algo->update_fn(c, hdr, hdrlen);
+
+	return fd;
+}
+
+/**
+ * Common steps for loose object writers to end writing loose objects:
+ *
+ * - End the compression of zlib stream.
+ * - Get the calculated oid to "oid".
+ */
+static int end_loose_object_common(git_hash_ctx *c, git_zstream *stream,
+				   struct object_id *oid)
+{
+	int ret;
+
+	ret = git_deflate_end_gently(stream);
+	if (ret != Z_OK)
+		return ret;
+	the_hash_algo->final_oid_fn(oid, c);
+
+	return Z_OK;
+}
+
 static int write_loose_object(const struct object_id *oid, char *hdr,
 			      int hdrlen, const void *buf, unsigned long len,
 			      time_t mtime, unsigned flags)
@@ -1968,28 +2036,11 @@ static int write_loose_object(const struct object_id *oid, char *hdr,
 
 	loose_object_path(the_repository, &filename, oid);
 
-	fd = create_tmpfile(&tmp_file, filename.buf);
-	if (fd < 0) {
-		if (flags & HASH_SILENT)
-			return -1;
-		else if (errno == EACCES)
-			return error(_("insufficient permission for adding an object to repository database %s"), get_object_directory());
-		else
-			return error_errno(_("unable to create temporary file"));
-	}
-
-	/* Set it up */
-	git_deflate_init(&stream, zlib_compression_level);
-	stream.next_out = compressed;
-	stream.avail_out = sizeof(compressed);
-	the_hash_algo->init_fn(&c);
-
-	/* First header.. */
-	stream.next_in = (unsigned char *)hdr;
-	stream.avail_in = hdrlen;
-	while (git_deflate(&stream, 0) == Z_OK)
-		; /* nothing */
-	the_hash_algo->update_fn(&c, hdr, hdrlen);
+	fd = start_loose_object_common(&tmp_file, filename.buf, flags,
+				       &stream, compressed, sizeof(compressed),
+				       &c, hdr, hdrlen);
+	if (fd < 0)
+		return -1;
 
 	/* Then the data itself.. */
 	stream.next_in = (void *)buf;
@@ -2007,11 +2058,9 @@ static int write_loose_object(const struct object_id *oid, char *hdr,
 	if (ret != Z_STREAM_END)
 		die(_("unable to deflate new object %s (%d)"), oid_to_hex(oid),
 		    ret);
-	ret = git_deflate_end_gently(&stream);
+	ret = end_loose_object_common(&c, &stream, &parano_oid);
 	if (ret != Z_OK)
-		die(_("deflateEnd on object %s failed (%d)"), oid_to_hex(oid),
-		    ret);
-	the_hash_algo->final_oid_fn(&parano_oid, &c);
+		die(_("deflateEnd on object %s failed (%d)"), oid_to_hex(oid), ret);
 	close_loose_object(fd, tmp_file.buf);
 
 	if (!oideq(oid, &parano_oid))

From patchwork Fri Jun 10 14:46:04 2022
X-Patchwork-Submitter: Han Xin
X-Patchwork-Id: 12877676
From: Han Xin
To: avarab@gmail.com
Cc: chiyutianyi@gmail.com, git@vger.kernel.org, gitster@pobox.com,
 l.s.r@web.de, neerajsi@microsoft.com, newren@gmail.com,
 philipoakley@iee.email, stolee@gmail.com, worldhello.net@gmail.com,
 Neeraj Singh
Subject: [PATCH v14 4/7] object-file.c: factor out deflate part of write_loose_object()
Date: Fri, 10 Jun 2022 22:46:04 +0800
Message-Id: <7c73815f188f16bb91c9b4ad981d299330dd3424.1654871916.git.chiyutianyi@gmail.com>
X-Mailer: git-send-email 2.36.1

From: Ævar Arnfjörð Bjarmason

Split out the part of write_loose_object() that deals with calling
git_deflate() into a utility function; a subsequent commit will
introduce another function that'll make use of it.

Signed-off-by: Ævar Arnfjörð Bjarmason
---
 object-file.c | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/object-file.c b/object-file.c
index f4d7f8c109..cfae54762e 100644
--- a/object-file.c
+++ b/object-file.c
@@ -2000,6 +2000,28 @@ static int start_loose_object_common(struct strbuf *tmp_file,
 	return fd;
 }
 
+/**
+ * Common steps for the inner git_deflate() loop for writing loose
+ * objects. Returns what git_deflate() returns.
+ */
+static int write_loose_object_common(git_hash_ctx *c,
+				     git_zstream *stream, const int flush,
+				     unsigned char *in0, const int fd,
+				     unsigned char *compressed,
+				     const size_t compressed_len)
+{
+	int ret;
+
+	ret = git_deflate(stream, flush ? Z_FINISH : 0);
+	the_hash_algo->update_fn(c, in0, stream->next_in - in0);
+	if (write_buffer(fd, compressed, stream->next_out - compressed) < 0)
+		die(_("unable to write loose object file"));
+	stream->next_out = compressed;
+	stream->avail_out = compressed_len;
+
+	return ret;
+}
+
 /**
  * Common steps for loose object writers to end writing loose objects:
  *
@@ -2047,12 +2069,9 @@ static int write_loose_object(const struct object_id *oid, char *hdr,
 	stream.avail_in = len;
 	do {
 		unsigned char *in0 = stream.next_in;
-		ret = git_deflate(&stream, Z_FINISH);
-		the_hash_algo->update_fn(&c, in0, stream.next_in - in0);
-		if (write_buffer(fd, compressed, stream.next_out - compressed) < 0)
-			die(_("unable to write loose object file"));
-		stream.next_out = compressed;
-		stream.avail_out = sizeof(compressed);
+
+		ret = write_loose_object_common(&c, &stream, 1, in0, fd,
+						compressed, sizeof(compressed));
 	} while (ret == Z_OK);
 
 	if (ret != Z_STREAM_END)

From patchwork Fri Jun 10 14:46:05 2022
X-Patchwork-Submitter: Han Xin
X-Patchwork-Id: 12877677
From: Han Xin
To: avarab@gmail.com
Cc: Han Xin, chiyutianyi@gmail.com, git@vger.kernel.org, gitster@pobox.com,
 l.s.r@web.de, neerajsi@microsoft.com, newren@gmail.com,
 philipoakley@iee.email, stolee@gmail.com, worldhello.net@gmail.com,
 Neeraj Singh, Jiang Xin
Subject: [PATCH v14 5/7] object-file.c: add "stream_loose_object()" to handle large object
Date: Fri, 10 Jun 2022 22:46:05 +0800
Message-Id: <28a9588f9ceda2252d8ca9c4b3912177c45cb95c.1654871916.git.chiyutianyi@gmail.com>
X-Mailer: git-send-email 2.36.1

From: Han Xin

If we want to unpack and write a loose object using
"write_loose_object()", we have to feed it a buffer of the same size as
the object, which will consume lots of memory and may cause OOM. This
can be improved by feeding data to "stream_loose_object()" in a stream.

Add a new function "stream_loose_object()", which is a streaming
version of "write_loose_object()" but with a low memory footprint. We
will use this function to unpack large blob objects in a later commit.

Another difference from "write_loose_object()" is that we have no
chance to run "write_object_file_prepare()" to calculate the oid in
advance. In "write_loose_object()", we know the oid and we can write
the temporary file in the same directory as the final object, but for
an object with an undetermined oid, we don't know the exact directory
for the object.

Still, we need to save the temporary file we're preparing
somewhere. We'll do that in the top-level ".git/objects/"
directory (or whatever "GIT_OBJECT_DIRECTORY" is set to). Once we've
streamed it we'll know the OID, and will move it to its canonical
path.

"freshen_packed_object()" or "freshen_loose_object()" will be called
inside "stream_loose_object()" after obtaining the "oid". After the
temporary file is written, we want to mark the object as recent, and we
may find that the object already exists.
We should remove the temporary file and not leave a new copy of the
object.

Helped-by: René Scharfe
Helped-by: Ævar Arnfjörð Bjarmason
Helped-by: Jiang Xin
Signed-off-by: Han Xin
Signed-off-by: Ævar Arnfjörð Bjarmason
---
 object-file.c  | 104 +++++++++++++++++++++++++++++++++++++++++++++++++
 object-store.h |   8 ++++
 2 files changed, 112 insertions(+)

diff --git a/object-file.c b/object-file.c
index cfae54762e..0b8383ad47 100644
--- a/object-file.c
+++ b/object-file.c
@@ -2118,6 +2118,110 @@ static int freshen_packed_object(const struct object_id *oid)
 	return 1;
 }
 
+int stream_loose_object(struct input_stream *in_stream, size_t len,
+			struct object_id *oid)
+{
+	int fd, ret, err = 0, flush = 0;
+	unsigned char compressed[4096];
+	git_zstream stream;
+	git_hash_ctx c;
+	struct strbuf tmp_file = STRBUF_INIT;
+	struct strbuf filename = STRBUF_INIT;
+	int dirlen;
+	char hdr[MAX_HEADER_LEN];
+	int hdrlen;
+
+	if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT))
+		prepare_loose_object_bulk_checkin();
+
+	/* Since oid is not determined, save tmp file to odb path. */
+	strbuf_addf(&filename, "%s/", get_object_directory());
+	hdrlen = format_object_header(hdr, sizeof(hdr), OBJ_BLOB, len);
+
+	/*
+	 * Common steps for write_loose_object and stream_loose_object to
+	 * start writing loose objects:
+	 *
+	 * - Create tmpfile for the loose object.
+	 * - Setup zlib stream for compression.
+	 * - Start to feed header to zlib stream.
+	 */
+	fd = start_loose_object_common(&tmp_file, filename.buf, 0,
+				       &stream, compressed, sizeof(compressed),
+				       &c, hdr, hdrlen);
+	if (fd < 0) {
+		err = -1;
+		goto cleanup;
+	}
+
+	/* Then the data itself.. */
+	do {
+		unsigned char *in0 = stream.next_in;
+
+		if (!stream.avail_in && !in_stream->is_finished) {
+			const void *in = in_stream->read(in_stream, &stream.avail_in);
+			stream.next_in = (void *)in;
+			in0 = (unsigned char *)in;
+			/* All data has been read. */
+			if (in_stream->is_finished)
+				flush = 1;
+		}
+		ret = write_loose_object_common(&c, &stream, flush, in0, fd,
+						compressed, sizeof(compressed));
+		/*
+		 * Unlike write_loose_object(), we do not have the entire
+		 * buffer. If we get Z_BUF_ERROR due to too few input bytes,
+		 * then we'll replenish them in the next input_stream->read()
+		 * call when we loop.
+		 */
+	} while (ret == Z_OK || ret == Z_BUF_ERROR);
+
+	if (stream.total_in != len + hdrlen)
+		die(_("write stream object %ld != %"PRIuMAX), stream.total_in,
+		    (uintmax_t)len + hdrlen);
+
+	/*
+	 * Common steps for write_loose_object and stream_loose_object to
+	 * end writing loose object:
+	 *
+	 * - End the compression of zlib stream.
+	 * - Get the calculated oid.
+	 */
+	if (ret != Z_STREAM_END)
+		die(_("unable to stream deflate new object (%d)"), ret);
+	ret = end_loose_object_common(&c, &stream, oid);
+	if (ret != Z_OK)
+		die(_("deflateEnd on stream object failed (%d)"), ret);
+	close_loose_object(fd, tmp_file.buf);
+
+	if (freshen_packed_object(oid) || freshen_loose_object(oid)) {
+		unlink_or_warn(tmp_file.buf);
+		goto cleanup;
+	}
+
+	loose_object_path(the_repository, &filename, oid);
+
+	/* We finally know the object path, and create the missing dir. */
+	dirlen = directory_size(filename.buf);
+	if (dirlen) {
+		struct strbuf dir = STRBUF_INIT;
+		strbuf_add(&dir, filename.buf, dirlen);
+
+		if (mkdir_in_gitdir(dir.buf) && errno != EEXIST) {
+			err = error_errno(_("unable to create directory %s"), dir.buf);
+			strbuf_release(&dir);
+			goto cleanup;
+		}
+		strbuf_release(&dir);
+	}
+
+	err = finalize_object_file(tmp_file.buf, filename.buf);
+cleanup:
+	strbuf_release(&tmp_file);
+	strbuf_release(&filename);
+	return err;
+}
+
 int write_object_file_flags(const void *buf, unsigned long len,
 			    enum object_type type, struct object_id *oid,
 			    unsigned flags)
diff --git a/object-store.h b/object-store.h
index 539ea43904..5222ee5460 100644
--- a/object-store.h
+++ b/object-store.h
@@ -46,6 +46,12 @@ struct object_directory {
 	char *path;
 };
 
+struct input_stream {
+	const void *(*read)(struct input_stream *, unsigned long *len);
+	void *data;
+	int is_finished;
+};
+
 KHASH_INIT(odb_path_map, const char * /* key: odb_path */,
 	struct object_directory *, 1, fspathhash, fspatheq)
 
@@ -269,6 +275,8 @@ static inline int write_object_file(const void *buf, unsigned long len,
 int write_object_file_literally(const void *buf, unsigned long len,
 				const char *type, struct object_id *oid,
 				unsigned flags);
+int stream_loose_object(struct input_stream *in_stream, size_t len,
+			struct object_id *oid);
 
 /*
  * Add an object file to the in-memory object store, without writing it

From patchwork Fri Jun 10 14:46:06 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Han Xin
X-Patchwork-Id: 12877682
Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3BAD4CCA47B for ; Fri, 10 Jun 2022 14:48:49 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232030AbiFJOss (ORCPT ); Fri, 10 Jun
2022 10:48:48 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51316 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1349683AbiFJOr5 (ORCPT ); Fri, 10 Jun 2022 10:47:57 -0400 Received: from mail-pg1-x530.google.com (mail-pg1-x530.google.com [IPv6:2607:f8b0:4864:20::530]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0E29F3BBDF for ; Fri, 10 Jun 2022 07:47:41 -0700 (PDT) Received: by mail-pg1-x530.google.com with SMTP id h192so18191851pgc.4 for ; Fri, 10 Jun 2022 07:47:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=gjcwYh0Ii8wc1gfx1hiFbX3M7RlEDLIVbiPqPcLqNgY=; b=SHADqNJcolagakLneWiYajtcljH8700nMHwBr1xyZB2MU5LukbYs8n+d5k7VpUX0+V eQFJWAwpGh3xVicT+2YE8Pn304I4sSxlKetvs8lvarGld7SDfBW+aWyV78eBnATMNE73 vC/B10c6B69w0l2pcGxrzooJfD5VZs58xSFbYNx3rtQW/zpEIyOIg6vUp7P+AT01ggT5 ixrQqXVHORJITWe4/Hy7XCJsJj/NIG3dK2wJttLP40QucYcXIanop8g0Dscz+0dcOk1a hgXX9DirPS4j2vSLvEBemC58r0bbAiIhKZxuraqnYmqz1+DbP1igpXxLEyXDbKXkQTsV guBg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=gjcwYh0Ii8wc1gfx1hiFbX3M7RlEDLIVbiPqPcLqNgY=; b=VJ4I8EoHM+GxrhdFVo8AdSJpYfjR62ep+6KpM++K0Nn1xl79uCytOLjB2bH7UzkkBZ 0MW8QpML/oxu40k0nq2cQSCbWJxA19DA3HE/1R2OvNte6OBLRxESyOx7CSpQXhbWU/BU NuxcQQA2DoRxWNwynqj3G6sQvdh9mG9QaogS27k3LQLYxo3WUJXY/MzIJdQPM1/TjWBT V9TZIxmb2wDn670S0AmIKJouGnmkb+/NrL3pShqm2nlBuvYqSSli2Dla6+iIY7rdCMp7 mO7Pefr5gENS9MOoajeVxjutR1wVx/HXEb7wbGPIWms455KRNKe9qM0MHyeG98R7ibWr EunA== X-Gm-Message-State: AOAM53241hlozfwQASkShCxbJe/s1nSYFYQlLBq8gMzp2JZ2ffQBAK6G Rm6KLldFi+VZVzuCkBfneG8= X-Google-Smtp-Source: ABdhPJwc3xGWRvBHqA427bkhk/aaPz4GWSRTvqOZwzcEnlfP1H/iMm4F7Tvc9ac4geHfthPQ3bFutg== X-Received: by 2002:a63:90c1:0:b0:3fc:7de1:b2b9 
with SMTP id a184-20020a6390c1000000b003fc7de1b2b9mr39741287pge.440.1654872460561; Fri, 10 Jun 2022 07:47:40 -0700 (PDT) Received: from JMHNXMC7VH.bytedance.net ([139.177.225.227]) by smtp.gmail.com with ESMTPSA id lx9-20020a17090b4b0900b001e292e30129sm1840434pjb.22.2022.06.10.07.47.35 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Fri, 10 Jun 2022 07:47:39 -0700 (PDT)
From: Han Xin
To: avarab@gmail.com
Cc: chiyutianyi@gmail.com, git@vger.kernel.org, gitster@pobox.com, l.s.r@web.de, neerajsi@microsoft.com, newren@gmail.com, philipoakley@iee.email, stolee@gmail.com, worldhello.net@gmail.com, Neeraj Singh
Subject: [PATCH v14 6/7] core doc: modernize core.bigFileThreshold documentation
Date: Fri, 10 Jun 2022 22:46:06 +0800
Message-Id:
X-Mailer: git-send-email 2.36.1
In-Reply-To:
References:
MIME-Version: 1.0
Precedence: bulk
List-ID:
X-Mailing-List: git@vger.kernel.org

From: Ævar Arnfjörð Bjarmason

The core.bigFileThreshold documentation has been largely unchanged
since 5eef828bc03 (fast-import: Stream very large blobs directly to
pack, 2010-02-01).

But since then this setting has been expanded to affect a lot more
than that description indicated. Most notably, it changes how "git
diff" treats such files, see 6bf3b813486 (diff --stat: mark any file
larger than core.bigfilethreshold binary, 2014-08-16).

In addition to that, numerous commands and APIs make use of a
streaming mode for files above this threshold.

So let's attempt to summarize 12 years of changes in behavior, which
can be seen with:

    git log --oneline -Gbig_file_thre 5eef828bc03.. -- '*.c'

To do that, turn the description into a bullet-point list. The summary
Han Xin produced in [1] helped a lot, but is a bit too detailed for
documentation aimed at users. Let's instead summarize how
user-observable behavior differs, and generally describe how we tend
to stream these files in various commands.

1. https://lore.kernel.org/git/20220120112114.47618-5-chiyutianyi@gmail.com/

Helped-by: Han Xin
Signed-off-by: Ævar Arnfjörð Bjarmason
---
 Documentation/config/core.txt | 33 ++++++++++++++++++++++++---------
 1 file changed, 24 insertions(+), 9 deletions(-)

diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
index 41e330f306..f2e75dd824 100644
--- a/Documentation/config/core.txt
+++ b/Documentation/config/core.txt
@@ -444,17 +444,32 @@ You probably do not need to adjust this value.
 Common unit suffixes of 'k', 'm', or 'g' are supported.
 
 core.bigFileThreshold::
-	Files larger than this size are stored deflated, without
-	attempting delta compression. Storing large files without
-	delta compression avoids excessive memory usage, at the
-	slight expense of increased disk usage. Additionally files
-	larger than this size are always treated as binary.
+	The size of files considered "big", which as discussed below
+	changes the behavior of numerous git commands, as well as how
+	such files are stored within the repository. The default is
+	512 MiB. Common unit suffixes of 'k', 'm', or 'g' are
+	supported.
 +
-Default is 512 MiB on all platforms. This should be reasonable
-for most projects as source code and other text files can still
-be delta compressed, but larger binary media files won't be.
+Files above the configured limit will be:
 +
-Common unit suffixes of 'k', 'm', or 'g' are supported.
+* Stored deflated in packfiles, without attempting delta compression.
++
+The default limit is primarily set with this use-case in mind. With it
+most projects will have their source code and other text files delta
+compressed, but not larger binary media files.
++
+Storing large files without delta compression avoids excessive memory
+usage, at the slight expense of increased disk usage.
++
+* Will be treated as if they were labeled "binary" (see
+  linkgit:gitattributes[5]). e.g. linkgit:git-log[1] and
+  linkgit:git-diff[1] will not compute diffs for files above this limit.
++
+* Will be generally be streamed when written, which avoids excessive
+memory usage, at the cost of some fixed overhead. Commands that make
+use of this include linkgit:git-archive[1],
+linkgit:git-fast-import[1], linkgit:git-index-pack[1] and
+linkgit:git-fsck[1].
 
 core.excludesFile::
 	Specifies the pathname to the file that contains patterns to

From patchwork Fri Jun 10 14:46:07 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Han Xin
X-Patchwork-Id: 12877683
Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 11EADC43334 for ; Fri, 10 Jun 2022 14:49:07 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1349626AbiFJOtE (ORCPT ); Fri, 10 Jun 2022 10:49:04 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55642 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1348608AbiFJOsA (ORCPT ); Fri, 10 Jun 2022 10:48:00 -0400 Received: from mail-pf1-x435.google.com (mail-pf1-x435.google.com [IPv6:2607:f8b0:4864:20::435]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DD5AE31DE3 for ; Fri, 10 Jun 2022 07:47:46 -0700 (PDT) Received: by mail-pf1-x435.google.com with SMTP id u2so24058725pfc.2 for ; Fri, 10 Jun 2022 07:47:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=bdvSoqzcUsqZjm6DpC5TzqgfLDBezMYtYyvwFUVvZgw=; b=B4Wo8iXjO+1Oug1B+DvQfFcWaMzZOSMRmn2kXaWEqtw6pX871+8i17HjdTalUc1XpG zFNdBwvWKqO49/TaZ07eBJdRBkCMI3HM7TRL7jk+3ZOZGqv9WxakLMEKbNDNxHIlAahh
miYq2YuyCaKCWvvZF79371uza+UTahapThTPr2hNgCHTDQDuRMen2GjdRHrIFbGkKQh2 8L8/Mfv+U/uY0QzIu9KotR9Pq6c95vrfjaXvbLKPe+ixbfeL8WMcjgtzydX+ZWNI52lh XhLTLf7pE7GJHWpo1uziYIOO/Jp1b42kNOl1t4X7mqmx9VWGza2HUltND/Dbpjtnhvba 2aDQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=bdvSoqzcUsqZjm6DpC5TzqgfLDBezMYtYyvwFUVvZgw=; b=ComPZhBPFmM7wJlbPcSxCLlvl8ejp83w7PjvdN8DOqlkjAjh45yX8g0YFcJPnTygmf VGCgmm7CZecqVbMTIBQEzO0aEeOyBj7/4yJAVoMs79ch1NxRBI5IVaA2wOC5O/wBgDrW eds7TcynhXm/R1U/wQCgsf7pqjWYjsT77fHht8sqBmBBmfTXl4oOx83a5nrxhllHHA6F fhmomuo3ss5WR5ksseAhEq9BLYM4O9dXOsSlaN5HTFTcoLt7jQNrR2ABTRqEVlmFbJOm HW8RB8W50pzM5FhUC2HBB+ioK/ke37oGRwvD26J08TviIAwT2Fee537e1m6AfnaXNtIH yTyg== X-Gm-Message-State: AOAM530NEzAMDjpDDfKfipJ0slWYc6yRKvg2WpRYA4714X948PJ65U4H RVsMN8n9fbX9wbfRu4BWMzU= X-Google-Smtp-Source: ABdhPJx6eOuSc3fQ+wknici99lblBp4kFZO2PoMISf2oLMmV3XfTehs02SoFSiH5fI4WNm/w9FqP2Q== X-Received: by 2002:a63:f0d:0:b0:401:9819:c6ee with SMTP id e13-20020a630f0d000000b004019819c6eemr7524533pgl.450.1654872466320; Fri, 10 Jun 2022 07:47:46 -0700 (PDT) Received: from JMHNXMC7VH.bytedance.net ([139.177.225.227]) by smtp.gmail.com with ESMTPSA id lx9-20020a17090b4b0900b001e292e30129sm1840434pjb.22.2022.06.10.07.47.41 (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256); Fri, 10 Jun 2022 07:47:45 -0700 (PDT) From: Han Xin To: avarab@gmail.com Cc: Han Xin , chiyutianyi@gmail.com, git@vger.kernel.org, gitster@pobox.com, l.s.r@web.de, neerajsi@microsoft.com, newren@gmail.com, philipoakley@iee.email, stolee@gmail.com, worldhello.net@gmail.com, Neeraj Singh , Jiang Xin Subject: [PATCH v14 7/7] unpack-objects: use stream_loose_object() to unpack large objects Date: Fri, 10 Jun 2022 22:46:07 +0800 Message-Id: X-Mailer: git-send-email 2.36.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: 
git@vger.kernel.org

From: Han Xin

Make use of the stream_loose_object() function introduced in the
preceding commit to unpack large objects. Before this we'd need to
malloc() the size of the blob before unpacking it, which could cause
OOM with very large blobs.

We could use the new streaming interface to unpack all blobs, but
doing so would be much slower, as demonstrated e.g. with this
benchmark using git-hyperfine[0]:

	rm -rf /tmp/scalar.git &&
	git clone --bare https://github.com/Microsoft/scalar.git /tmp/scalar.git &&
	mv /tmp/scalar.git/objects/pack/*.pack /tmp/scalar.git/my.pack &&
	git hyperfine \
		-r 2 --warmup 1 \
		-L rev origin/master,HEAD -L v "10,512,1k,1m" \
		-s 'make' \
		-p 'git init --bare dest.git' \
		-c 'rm -rf dest.git' \
		'./git -C dest.git -c core.bigFileThreshold={v} unpack-objects &1 | grep Maximum'

Using this test we'll always use >100MB of memory on
origin/master (around ~105MB), but max out at e.g. ~55MB if we set
core.bigFileThreshold=50m.

The relevant "Maximum resident set size" lines were manually added
below the relevant benchmark:

  '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=50m unpack-objects &1 | grep Maximum' in 'origin/master' ran
        Maximum resident set size (kbytes): 107080
    1.02 ± 0.78 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=512 unpack-objects &1 | grep Maximum' in 'origin/master'
        Maximum resident set size (kbytes): 106968
    1.09 ± 0.79 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=100m unpack-objects &1 | grep Maximum' in 'origin/master'
        Maximum resident set size (kbytes): 107032
    1.42 ± 1.07 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=100m unpack-objects &1 | grep Maximum' in 'HEAD'
        Maximum resident set size (kbytes): 107072
    1.83 ± 1.02 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=50m unpack-objects &1 | grep Maximum' in 'HEAD'
        Maximum resident set size (kbytes): 55704
    2.16 ± 1.19 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=512 unpack-objects &1 | grep Maximum' in 'HEAD'
        Maximum resident set size (kbytes): 4564

This shows that if you have enough memory, this new streaming method
gets slower the lower you set the streaming threshold, but the benefit
is more bounded memory use.

An earlier version of this patch introduced a new
"core.bigFileStreamingThreshold" instead of re-using the existing
"core.bigFileThreshold" variable[1]. As noted in a detailed overview
of its users in [2], the existing variable already has several
different meanings.

Still, we consider it good enough to simply re-use it. While it's
possible that someone might want to e.g. consider objects "small" for
the purposes of diffing but "big" for the purposes of writing them,
such use-cases are probably too obscure to worry about. We can always
split up "core.bigFileThreshold" in the future if there's a need for
that.

0. https://github.com/avar/git-hyperfine/
1. https://lore.kernel.org/git/20211210103435.83656-1-chiyutianyi@gmail.com/
2. https://lore.kernel.org/git/20220120112114.47618-5-chiyutianyi@gmail.com/

Helped-by: Ævar Arnfjörð Bjarmason
Helped-by: Derrick Stolee
Helped-by: Jiang Xin
Signed-off-by: Han Xin
Signed-off-by: Ævar Arnfjörð Bjarmason
---
 Documentation/config/core.txt   |  4 +-
 builtin/unpack-objects.c        | 69 ++++++++++++++++++++++++++++++++-
 t/t5351-unpack-large-objects.sh | 43 ++++++++++++++++++--
 3 files changed, 109 insertions(+), 7 deletions(-)

diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
index f2e75dd824..a599dcb96b 100644
--- a/Documentation/config/core.txt
+++ b/Documentation/config/core.txt
@@ -468,8 +468,8 @@ usage, at the slight expense of increased disk usage.
 * Will be generally be streamed when written, which avoids excessive
 memory usage, at the cost of some fixed overhead. Commands that make
 use of this include linkgit:git-archive[1],
-linkgit:git-fast-import[1], linkgit:git-index-pack[1] and
-linkgit:git-fsck[1].
+linkgit:git-fast-import[1], linkgit:git-index-pack[1],
+linkgit:git-unpack-objects[1] and linkgit:git-fsck[1].
 
 core.excludesFile::
 	Specifies the pathname to the file that contains patterns to
diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
index 32e8b47059..43789b8ef2 100644
--- a/builtin/unpack-objects.c
+++ b/builtin/unpack-objects.c
@@ -351,6 +351,68 @@ static void unpack_non_delta_entry(enum object_type type, unsigned long size,
 	write_object(nr, type, buf, size);
 }
 
+struct input_zstream_data {
+	git_zstream *zstream;
+	unsigned char buf[8192];
+	int status;
+};
+
+static const void *feed_input_zstream(struct input_stream *in_stream,
+				      unsigned long *readlen)
+{
+	struct input_zstream_data *data = in_stream->data;
+	git_zstream *zstream = data->zstream;
+	void *in = fill(1);
+
+	if (in_stream->is_finished) {
+		*readlen = 0;
+		return NULL;
+	}
+
+	zstream->next_out = data->buf;
+	zstream->avail_out = sizeof(data->buf);
+	zstream->next_in = in;
+	zstream->avail_in = len;
+
+	data->status = git_inflate(zstream, 0);
+
+	in_stream->is_finished = data->status != Z_OK;
+	use(len - zstream->avail_in);
+	*readlen = sizeof(data->buf) - zstream->avail_out;
+
+	return data->buf;
+}
+
+static void stream_blob(unsigned long size, unsigned nr)
+{
+	git_zstream zstream = { 0 };
+	struct input_zstream_data data = { 0 };
+	struct input_stream in_stream = {
+		.read = feed_input_zstream,
+		.data = &data,
+	};
+	struct obj_info *info = &obj_list[nr];
+
+	data.zstream = &zstream;
+	git_inflate_init(&zstream);
+
+	if (stream_loose_object(&in_stream, size, &info->oid))
+		die(_("failed to write object in stream"));
+
+	if (data.status != Z_STREAM_END)
+		die(_("inflate returned (%d)"), data.status);
+	git_inflate_end(&zstream);
+
+	if (strict) {
+		struct blob *blob = lookup_blob(the_repository, &info->oid);
+
+		if (!blob)
+			die(_("invalid blob object from stream"));
+		blob->object.flags |= FLAG_WRITTEN;
+	}
+	info->obj = NULL;
+}
+
 static int resolve_against_held(unsigned nr, const struct object_id *base,
 				void *delta_data, unsigned long delta_size)
 {
@@ -483,9 +545,14 @@ static void unpack_one(unsigned nr)
 	}
 
 	switch (type) {
+	case OBJ_BLOB:
+		if (!dry_run && size > big_file_threshold) {
+			stream_blob(size, nr);
+			return;
+		}
+		/* fallthrough */
 	case OBJ_COMMIT:
 	case OBJ_TREE:
-	case OBJ_BLOB:
 	case OBJ_TAG:
 		unpack_non_delta_entry(type, size, nr);
 		return;
diff --git a/t/t5351-unpack-large-objects.sh b/t/t5351-unpack-large-objects.sh
index 8d84313221..8ce8aa3b14 100755
--- a/t/t5351-unpack-large-objects.sh
+++ b/t/t5351-unpack-large-objects.sh
@@ -9,7 +9,8 @@ test_description='git unpack-objects with large objects'
 
 prepare_dest () {
 	test_when_finished "rm -rf dest.git" &&
-	git init --bare dest.git
+	git init --bare dest.git &&
+	git -C dest.git config core.bigFileThreshold "$1"
 }
 
 test_expect_success "create large objects (1.5 MB) and PACK" '
@@ -17,7 +18,10 @@ test_expect_success "create large objects (1.5 MB) and PACK" '
 	test_commit --append foo big-blob &&
 	test-tool genrandom bar 1500000 >big-blob &&
 	test_commit --append bar big-blob &&
-	PACK=$(echo HEAD | git pack-objects --revs pack)
+	PACK=$(echo HEAD | git pack-objects --revs pack) &&
+	git verify-pack -v pack-$PACK.pack >out &&
+	sed -n -e "s/^\([0-9a-f][0-9a-f]*\).*\(commit\|tree\|blob\).*/\1/p" \ obj-list
 '
 
 test_expect_success 'set memory limitation to 1MB' '
@@ -26,16 +30,47 @@ test_expect_success 'set memory limitation to 1MB' '
 '
 
 test_expect_success 'unpack-objects failed under memory limitation' '
-	prepare_dest &&
+	prepare_dest 2m &&
 	test_must_fail git -C dest.git unpack-objects err &&
 	grep "fatal: attempting to allocate" err
 '
 
 test_expect_success 'unpack-objects works with memory limitation in dry-run mode' '
-	prepare_dest &&
+	prepare_dest 2m &&
 	git -C dest.git unpack-objects -n current &&
+	cmp obj-list current
+'
+
+test_expect_success 'do not unpack existing large objects' '
+	prepare_dest 1m &&
+	git -C dest.git index-pack --stdin