From patchwork Tue Feb 23 02:25:03 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Taylor Blau X-Patchwork-Id: 12099733 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-13.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1D0D3C433E0 for ; Tue, 23 Feb 2021 02:26:00 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id E1F826146D for ; Tue, 23 Feb 2021 02:25:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230206AbhBWCZs (ORCPT ); Mon, 22 Feb 2021 21:25:48 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43750 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229886AbhBWCZr (ORCPT ); Mon, 22 Feb 2021 21:25:47 -0500 Received: from mail-qk1-x732.google.com (mail-qk1-x732.google.com [IPv6:2607:f8b0:4864:20::732]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C0D7EC06174A for ; Mon, 22 Feb 2021 18:25:06 -0800 (PST) Received: by mail-qk1-x732.google.com with SMTP id 204so9920478qke.11 for ; Mon, 22 Feb 2021 18:25:06 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ttaylorr-com.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=gnooYuzFE9U5xs3bVfru21QnB8P7E+qaJoxRTCQ7obA=; b=dWES6mahqyTQf2YSEHo0aCnwBmFudH/K4LwwzjmB52/bH235pTfdokTDWQFyfiMoKN 4pNiEZYTjOAnkjbjwClVCuToGmCT6U0dhJ+FUR+Vy1Z6mK3C1wnX1wJftwdDgcbpQoV0 Y99jHUngea75Qq4Gexls2BG1YOOXgpNZxX/QkAbHMTTQNQxZPLKH4igd/2pX7u3DKz3c d1YDwb61lmnhqj0ahGZbFZQZnyEl5DPD97Dl58A7193RZqLsU0Eaj7TW+vLmt0XoaJxt QHRBj6t5S+BxXEaY7sr8uUdltbwaxI17H9xykjiC3itYe/VTcupcTXqB1x3SVbkA81h9 5MkA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=gnooYuzFE9U5xs3bVfru21QnB8P7E+qaJoxRTCQ7obA=; b=rjz3j3KUtXfynPX8Vh/iWC1LbcHVCss1g/U7pP8zW8KYLhO4O7F8/6rtOqP7alWL3M jcf8rqAWrV2aCdehxdePa7IJbiRcGlNhAVnCVR8Ft5p/4KIIbe4za+i2nLIZNQk3sluD yduoh2WjeTw+niURrmlIWN5X9V2BIBm+hkgCcSZ0MmV8q0EPIkRieOaLjv4i/GZLlGgh 1ndKfxCTSgtQ5Wa7sKPIMAu8d3qmXBcCBEu1Vn8gkgB6AFvAMD1q1lhZx4nI8dVWtRvf DvrOttREbHUCGAbW4LufQZg9Avv2pMw2RSL6J+rk7WLRYdT2KV2YabVJmU0GT9mZNdK/ N8cQ== X-Gm-Message-State: AOAM533bZcjov7E6o29SvejKBrZWEMh0V6e2rccclznZ7dRilKjk/xRq x4Fn6tcVoB5pIM24CAiyjhOkjLlZKeRNHr97 X-Google-Smtp-Source: ABdhPJyI14f7c7sXuD61DFw0/JCoesj4AkKJwgRaI6X/8CBahyE/ZWRDuGtsb6iFxqANVK3pIwFyrw== X-Received: by 2002:a37:aad7:: with SMTP id t206mr23757724qke.407.1614047105724; Mon, 22 Feb 2021 18:25:05 -0800 (PST) Received: from localhost ([2605:9480:22e:ff10:29df:918f:599f:2c96]) by smtp.gmail.com with ESMTPSA id s29sm13457662qks.31.2021.02.22.18.25.05 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 22 Feb 2021 18:25:05 -0800 (PST) Date: Mon, 22 Feb 2021 21:25:03 -0500 From: Taylor Blau To: git@vger.kernel.org Cc: peff@peff.net, dstolee@microsoft.com, gitster@pobox.com Subject: [PATCH v4 1/8] packfile: introduce 'find_kept_pack_entry()' Message-ID: References: MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Future callers will want a function to fill a 'struct pack_entry' for a given object id but _only_ from its position in any kept pack(s). In particular, an new 'git repack' mode which ensures the resulting packs form a geometric progress by object count will mark packs that it does not want to repack as "kept in-core", and it will want to halt a reachability traversal as soon as it visits an object in any of the kept packs. But, it does not want to halt the traversal at non-kept, or .keep packs. The obvious alternative is 'find_pack_entry()', but this doesn't quite suffice since it only returns the first pack it finds, which may or may not be kept (and the mru cache makes it unpredictable which one you'll get if there are options). Short of that, you could walk over all packs looking for the object in each one, but it scales with the number of packs, which may be prohibitive. Introduce 'find_kept_pack_entry()', a function which is like 'find_pack_entry()', but only fills in objects in the kept packs. Handle packs which have .keep files, as well as in-core kept packs separately, since certain callers will want to distinguish one from the other. (Though on-disk and in-core kept packs share the adjective "kept", it is best to think of the two sets as independent.) There is a gotcha when looking up objects that are duplicated in kept and non-kept packs, particularly when the MIDX stores the non-kept version and the caller asked for kept objects only. This could be resolved by teaching the MIDX to resolve duplicates by always favoring the kept pack (if one exists), but this breaks an assumption in existing MIDXs, and so it would require a format change. The benefit to changing the MIDX in this way is marginal, so we instead have a more thorough check here which is explained with a comment. Callers will be added in subsequent patches. Co-authored-by: Jeff King Signed-off-by: Jeff King Signed-off-by: Taylor Blau --- packfile.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++----- packfile.h | 5 +++++ 2 files changed, 64 insertions(+), 5 deletions(-) diff --git a/packfile.c b/packfile.c index 1fec12ac5f..7f84f221ce 100644 --- a/packfile.c +++ b/packfile.c @@ -2042,7 +2042,10 @@ static int fill_pack_entry(const struct object_id *oid, return 1; } -int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e) +static int find_one_pack_entry(struct repository *r, + const struct object_id *oid, + struct pack_entry *e, + int kept_only) { struct list_head *pos; struct multi_pack_index *m; @@ -2052,26 +2055,77 @@ int find_pack_entry(struct repository *r, const struct object_id *oid, struct pa return 0; for (m = r->objects->multi_pack_index; m; m = m->next) { - if (fill_midx_entry(r, oid, e, m)) + if (!fill_midx_entry(r, oid, e, m)) + continue; + + if (!kept_only) + return 1; + + if (((kept_only & ON_DISK_KEEP_PACKS) && e->p->pack_keep) || + ((kept_only & IN_CORE_KEEP_PACKS) && e->p->pack_keep_in_core)) return 1; } list_for_each(pos, &r->objects->packed_git_mru) { struct packed_git *p = list_entry(pos, struct packed_git, mru); - if (!p->multi_pack_index && fill_pack_entry(oid, e, p)) { - list_move(&p->mru, &r->objects->packed_git_mru); - return 1; + if (p->multi_pack_index && !kept_only) { + /* + * If this pack is covered by the MIDX, we'd have found + * the object already in the loop above if it was here, + * so don't bother looking. + * + * The exception is if we are looking only at kept + * packs. An object can be present in two packs covered + * by the MIDX, one kept and one not-kept. And as the + * MIDX points to only one copy of each object, it might + * have returned only the non-kept version above. We + * have to check again to be thorough. + */ + continue; + } + if (!kept_only || + (((kept_only & ON_DISK_KEEP_PACKS) && p->pack_keep) || + ((kept_only & IN_CORE_KEEP_PACKS) && p->pack_keep_in_core))) { + if (fill_pack_entry(oid, e, p)) { + list_move(&p->mru, &r->objects->packed_git_mru); + return 1; + } } } return 0; } +int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e) +{ + return find_one_pack_entry(r, oid, e, 0); +} + +int find_kept_pack_entry(struct repository *r, + const struct object_id *oid, + unsigned flags, + struct pack_entry *e) +{ + /* + * Load all packs, including midx packs, since our "kept" strategy + * relies on that. We're relying on the side effect of it setting up + * r->objects->packed_git, which is a little ugly. + */ + get_all_packs(r); + return find_one_pack_entry(r, oid, e, flags); +} + int has_object_pack(const struct object_id *oid) { struct pack_entry e; return find_pack_entry(the_repository, oid, &e); } +int has_object_kept_pack(const struct object_id *oid, unsigned flags) +{ + struct pack_entry e; + return find_kept_pack_entry(the_repository, oid, flags, &e); +} + int has_pack_index(const unsigned char *sha1) { struct stat st; diff --git a/packfile.h b/packfile.h index 4cfec9e8d3..3ae117a8ae 100644 --- a/packfile.h +++ b/packfile.h @@ -162,13 +162,18 @@ int packed_object_info(struct repository *r, void mark_bad_packed_object(struct packed_git *p, const unsigned char *sha1); const struct packed_git *has_packed_and_bad(struct repository *r, const unsigned char *sha1); +#define ON_DISK_KEEP_PACKS 1 +#define IN_CORE_KEEP_PACKS 2 + /* * Iff a pack file in the given repository contains the object named by sha1, * return true and store its location to e. */ int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e); +int find_kept_pack_entry(struct repository *r, const struct object_id *oid, unsigned flags, struct pack_entry *e); int has_object_pack(const struct object_id *oid); +int has_object_kept_pack(const struct object_id *oid, unsigned flags); int has_pack_index(const unsigned char *sha1); From patchwork Tue Feb 23 02:25:07 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Taylor Blau X-Patchwork-Id: 12099737 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-13.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 429CBC433E6 for ; Tue, 23 Feb 2021 02:26:00 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 01EE164E41 for ; Tue, 23 Feb 2021 02:25:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231431AbhBWCZv (ORCPT ); Mon, 22 Feb 2021 21:25:51 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43764 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229886AbhBWCZu (ORCPT ); Mon, 22 Feb 2021 21:25:50 -0500 Received: from mail-qt1-x832.google.com (mail-qt1-x832.google.com [IPv6:2607:f8b0:4864:20::832]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2CABEC061786 for ; Mon, 22 Feb 2021 18:25:10 -0800 (PST) Received: by mail-qt1-x832.google.com with SMTP id w6so3445053qti.6 for ; Mon, 22 Feb 2021 18:25:10 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ttaylorr-com.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=Syl7pgPRDV3Zgw4+z7AOzr7mqoMDXf9fvjaN5CZKjXo=; b=dNJ0AYHfGTGnf+jPM5MoClBB0Tr2i0gqZCh02x/8E0P437MD290QbCqUMLO41L4S3C BZATGQIDi3mCmYQfAZsYk5QPuy5mBu+fNpeOXbYtv3nWVemAhpNo/7sPUF6GxuNKFVCW mjatyXdy7oVUUap4hUNHwT9mQZKDDYe13MYcAi0hRxNpvYO12l81cU/oYHn2PX9A3adD d/CUNhoJ+Kb6Mb2+R50y4G6r/eOAPBJBNsIxIzuu5DIcupvl5MMKqILqd9mK1rc3J+E8 BRExsjzDDz69caoM6shK2DSnQCYO6QU8E99rg0295fxjLeztVrPM3Lwq78vzly/85uFB p/ow== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=Syl7pgPRDV3Zgw4+z7AOzr7mqoMDXf9fvjaN5CZKjXo=; b=EJ68jMdero3Tu9QEcnXBZPZ+4CzZ6mbUYuEoZHSI5B13r7yfH5PEFH9CKbIKuV7ZDb Lro+8tFKiFCGu1XHeCYHa9oHlB6glAKvU6Q5zMZB1VBDWiHVKI6EtxeKNBKzRrsD1o66 EXT6lef2WMX48pl3M0mezdtyWLLdodIV+Skw+k8F+qN8EgxCKgjbEYyhwahdgKNbhkQA GTAO4JKRXlWsMBYxTEPMMQgVbxeNKFk01Yav39nYm7uoyyujOM5zEKXttHrx447a7QZW P9H/PePMTwIglJSf9V/E0t9ZyJRdnMMD8/jzhpyDt5C/u0vqjhduegGmnw8RGRbMqRw5 CuUw== X-Gm-Message-State: AOAM533mYSzVq/QvD+ezvVfYY7uY8oeS2B7uhA7A3COFUN4AhuZXGgBh m0ipekK9YXsWrlI3fJsjPFDbrm3UqdMHh6rs X-Google-Smtp-Source: ABdhPJw5Lj0Miv5G6NCLVmizwE2RZYh9f9GMnuzoFqmixQCiPSmrtZPvXo+/M53HSUaf6WlVgWRlfw== X-Received: by 2002:a05:622a:347:: with SMTP id r7mr21547484qtw.279.1614047108959; Mon, 22 Feb 2021 18:25:08 -0800 (PST) Received: from localhost ([2605:9480:22e:ff10:29df:918f:599f:2c96]) by smtp.gmail.com with ESMTPSA id s60sm12602996qtd.16.2021.02.22.18.25.08 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 22 Feb 2021 18:25:08 -0800 (PST) Date: Mon, 22 Feb 2021 21:25:07 -0500 From: Taylor Blau To: git@vger.kernel.org Cc: peff@peff.net, dstolee@microsoft.com, gitster@pobox.com Subject: [PATCH v4 2/8] revision: learn '--no-kept-objects' Message-ID: References: MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org A future caller will want to be able to perform a reachability traversal which terminates when visiting an object found in a kept pack. The closest existing option is '--honor-pack-keep', but this isn't quite what we want. Instead of halting the traversal midway through, a full traversal is always performed, and the results are only trimmed afterwords. Besides needing to introduce a new flag (since culling results post-facto can be different than halting the traversal as it's happening), there is an additional wrinkle handling the distinction in-core and on-disk kept packs. That is: what kinds of kept pack should stop the traversal? Introduce '--no-kept-objects[=]' to specify which kinds of kept packs, if any, should stop a traversal. This can be useful for callers that want to perform a reachability analysis, but want to leave certain packs alone (for e.g., when doing a geometric repack that has some "large" packs which are kept in-core that it wants to leave alone). Note that this option is not guaranteed to produce exactly the set of objects that aren't in kept packs, since it's possible the traversal order may end up in a situation where a non-kept ancestor was "cut off" by a kept object (at which point we would stop traversing). But, we don't care about absolute correctness here, since this will eventually be used as a purely additive guide in an upcoming new repack mode. Explicitly avoid documenting this new flag, since it is only used internally. In theory we could avoid even adding it rev-list, but being able to spell this option out on the command-line makes some special cases easier to test without promising to keep it behaving consistently forever. Those tricky cases are exercised in t6114. Signed-off-by: Taylor Blau --- revision.c | 15 ++++++++++ revision.h | 4 +++ t/t6114-keep-packs.sh | 69 +++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 88 insertions(+) create mode 100755 t/t6114-keep-packs.sh diff --git a/revision.c b/revision.c index b78733f508..dca2b8c801 100644 --- a/revision.c +++ b/revision.c @@ -2336,6 +2336,16 @@ static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg revs->unpacked = 1; } else if (starts_with(arg, "--unpacked=")) { die(_("--unpacked= no longer supported")); + } else if (!strcmp(arg, "--no-kept-objects")) { + revs->no_kept_objects = 1; + revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS; + revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS; + } else if (skip_prefix(arg, "--no-kept-objects=", &optarg)) { + revs->no_kept_objects = 1; + if (!strcmp(optarg, "in-core")) + revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS; + if (!strcmp(optarg, "on-disk")) + revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS; } else if (!strcmp(arg, "-r")) { revs->diff = 1; revs->diffopt.flags.recursive = 1; @@ -3795,6 +3805,11 @@ enum commit_action get_commit_action(struct rev_info *revs, struct commit *commi return commit_ignore; if (revs->unpacked && has_object_pack(&commit->object.oid)) return commit_ignore; + if (revs->no_kept_objects) { + if (has_object_kept_pack(&commit->object.oid, + revs->keep_pack_cache_flags)) + return commit_ignore; + } if (commit->object.flags & UNINTERESTING) return commit_ignore; if (revs->line_level_traverse && !want_ancestry(revs)) { diff --git a/revision.h b/revision.h index e6be3c845e..a20a530d52 100644 --- a/revision.h +++ b/revision.h @@ -148,6 +148,7 @@ struct rev_info { edge_hint_aggressive:1, limited:1, unpacked:1, + no_kept_objects:1, boundary:2, count:1, left_right:1, @@ -317,6 +318,9 @@ struct rev_info { * This is loaded from the commit-graph being used. */ struct bloom_filter_settings *bloom_filter_settings; + + /* misc. flags related to '--no-kept-objects' */ + unsigned keep_pack_cache_flags; }; int ref_excluded(struct string_list *, const char *path); diff --git a/t/t6114-keep-packs.sh b/t/t6114-keep-packs.sh new file mode 100755 index 0000000000..9239d8aa46 --- /dev/null +++ b/t/t6114-keep-packs.sh @@ -0,0 +1,69 @@ +#!/bin/sh + +test_description='rev-list with .keep packs' +. ./test-lib.sh + +test_expect_success 'setup' ' + test_commit loose && + test_commit packed && + test_commit kept && + + KEPT_PACK=$(git pack-objects --revs .git/objects/pack/pack <<-EOF + refs/tags/kept + ^refs/tags/packed + EOF + ) && + MISC_PACK=$(git pack-objects --revs .git/objects/pack/pack <<-EOF + refs/tags/packed + ^refs/tags/loose + EOF + ) && + + touch .git/objects/pack/pack-$KEPT_PACK.keep +' + +rev_list_objects () { + git rev-list "$@" >out && + sort out +} + +idx_objects () { + git show-index <$1 >expect-idx && + cut -d" " -f2 kept && + rev_list_objects --objects --all --no-object-names --no-kept-objects >no-kept && + + idx_objects .git/objects/pack/pack-$KEPT_PACK.idx >expect && + comm -3 kept no-kept >actual && + + test_cmp expect actual +' + +test_expect_success '--no-kept-objects excludes kept non-MIDX object' ' + test_config core.multiPackIndex true && + + # Create a pack with just the commit object in pack, and do not mark it + # as kept (even though it appears in $KEPT_PACK, which does have a .keep + # file). + MIDX_PACK=$(git pack-objects .git/objects/pack/pack <<-EOF + $(git rev-parse kept) + EOF + ) && + + # Write a MIDX containing all packs, but use the version of the commit + # at "kept" in a non-kept pack by touching $MIDX_PACK. + touch .git/objects/pack/pack-$MIDX_PACK.pack && + git multi-pack-index write && + + rev_list_objects --objects --no-object-names --no-kept-objects HEAD >actual && + ( + idx_objects .git/objects/pack/pack-$MISC_PACK.idx && + git rev-list --objects --no-object-names refs/tags/loose + ) | sort >expect && + test_cmp expect actual +' + +test_done From patchwork Tue Feb 23 02:25:10 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Taylor Blau X-Patchwork-Id: 12099739 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-13.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 227A2C433DB for ; Tue, 23 Feb 2021 02:26:21 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id DA34B64E31 for ; Tue, 23 Feb 2021 02:26:20 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231534AbhBWC0E (ORCPT ); Mon, 22 Feb 2021 21:26:04 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43780 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229886AbhBWCZy (ORCPT ); Mon, 22 Feb 2021 21:25:54 -0500 Received: from mail-qv1-xf30.google.com (mail-qv1-xf30.google.com [IPv6:2607:f8b0:4864:20::f30]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A978FC06178A for ; Mon, 22 Feb 2021 18:25:13 -0800 (PST) Received: by mail-qv1-xf30.google.com with SMTP id l19so2488324qvz.2 for ; Mon, 22 Feb 2021 18:25:13 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ttaylorr-com.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=bGezaKpkys+uYjGATSDpW/8yRcGhgfjSrf7hsUh4kC0=; b=0NDz9Hr/L+izjMkyQfLkG6OctkLGNJf7dP7O6mc0TNAveuo4z7nmYJvTR6qHXMgd86 cYuvZ34+V+gvKEYV9q84VwRWxRwPHVB+chCUDffzJypTvus+prM8d1u7qGOqsfU0EM3m WbHg03KoNa+EyqcVe/9aVycLlNV6v53DmTUyOVwDmNhyyF/KORJNRVFn+33P8JIxjERb 3n3YT2kJSYiwuvuYIns21YJBvUEEDSBWsyKuJSA+oiBnj2wZIfx90Zy0ZB8OUUPzvwPm /QoG5Tbm3ZXw/2d9noorfr/SCye3CtfPv4pMZyiPuMVmfVDC7UAc+OwDANnEWA4CyZ/p u5Wg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=bGezaKpkys+uYjGATSDpW/8yRcGhgfjSrf7hsUh4kC0=; b=TU9vYfeUcvB8ERJfoQB+BPDVYM1Frs7wG8cHqvZ/YpTcoN01Kn+EDvYyEb3wfqDPLt BEiXSlQnYJeB4HS/FvLL4C4G+LO+J9Xd31Q/M+Ye3kGtSU0ZC3k9Z1Qh/PcIUIzDvRxr NmWdvBRtDlZl+gJ3+vwv384HNdBh32CH3l+ULJHj1ncX3hA3PbTidr66okQqut47Nz11 WbifUy57MPKFUm9+s0PxGER1uutmU81VXN9tHFOyiWEMODI5lRDS+LIOitE9Mnd4R2VS Csnl6T+A86MHoIn5ckZUFc9WbDs3XX3lspnIxWrqD8jv2o88ydUpr5cKzJnibiHMvcBF MGgg== X-Gm-Message-State: AOAM532zyEvLGc+TJuR40MVyTD007E1jKHywTM6SQTcWlUbeRv5Lskd+ FpkC6x48//pNac5x/lyTKlgIdrOehSMAMYPR X-Google-Smtp-Source: ABdhPJxKuLo8HS6cV6V4kiCEbI4Zqj3t7HUrdjh05im9ZsEg9FvNrFtRmixccp8laK92DV3O9evu5w== X-Received: by 2002:a0c:b7a1:: with SMTP id l33mr21050256qve.17.1614047112276; Mon, 22 Feb 2021 18:25:12 -0800 (PST) Received: from localhost ([2605:9480:22e:ff10:29df:918f:599f:2c96]) by smtp.gmail.com with ESMTPSA id q8sm13778728qkm.38.2021.02.22.18.25.11 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 22 Feb 2021 18:25:11 -0800 (PST) Date: Mon, 22 Feb 2021 21:25:10 -0500 From: Taylor Blau To: git@vger.kernel.org Cc: peff@peff.net, dstolee@microsoft.com, gitster@pobox.com Subject: [PATCH v4 3/8] builtin/pack-objects.c: add '--stdin-packs' option Message-ID: <649cf9020bfdae9f48e3efbfbc52429cefd31432.1614047097.git.me@ttaylorr.com> References: MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org In an upcoming commit, 'git repack' will want to create a pack comprised of all of the objects in some packs (the included packs) excluding any objects in some other packs (the excluded packs). This caller could iterate those packs themselves and feed the objects it finds to 'git pack-objects' directly over stdin, but this approach has a few downsides: - It requires every caller that wants to drive 'git pack-objects' in this way to implement pack iteration themselves. This forces the caller to think about details like what order objects are fed to pack-objects, which callers would likely rather not do. - If the set of objects in included packs is large, it requires sending a lot of data over a pipe, which is inefficient. - The caller is forced to keep track of the excluded objects, too, and make sure that it doesn't send any objects that appear in both included and excluded packs. But the biggest downside is the lack of a reachability traversal. Because the caller passes in a list of objects directly, those objects don't get a namehash assigned to them, which can have a negative impact on the delta selection process, causing 'git pack-objects' to fail to find good deltas even when they exist. The caller could formulate a reachability traversal themselves, but the only way to drive 'git pack-objects' in this way is to do a full traversal, and then remove objects in the excluded packs after the traversal is complete. This can be detrimental to callers who care about performance, especially in repositories with many objects. Introduce 'git pack-objects --stdin-packs' which remedies these four concerns. 'git pack-objects --stdin-packs' expects a list of pack names on stdin, where 'pack-xyz.pack' denotes that pack as included, and '^pack-xyz.pack' denotes it as excluded. The resulting pack includes all objects that are present in at least one included pack, and aren't present in any excluded pack. To address the delta selection problem, 'git pack-objects --stdin-packs' works as follows. First, it assembles a list of objects that it is going to pack, as above. Then, a reachability traversal is started, whose tips are any commits mentioned in included packs. Upon visiting an object, we find its corresponding object_entry in the to_pack list, and set its namehash parameter appropriately. To avoid the traversal visiting more objects than it needs to, the traversal is halted upon encountering an object which can be found in an excluded pack (by marking the excluded packs as kept in-core, and passing --no-kept-objects=in-core to the revision machinery). This can cause the traversal to halt early, for example if an object in an included pack is an ancestor of ones in excluded packs. But stopping early is OK, since filling in the namehash fields of objects in the to_pack list is only additive (i.e., having it helps the delta selection process, but leaving it blank doesn't impact the correctness of the resulting pack). Even still, it is unlikely that this hurts us much in practice, since the 'git repack --geometric' caller (which is introduced in a later commit) marks small packs as included, and large ones as excluded. During ordinary use, the small packs usually represent pushes after a large repack, and so are unlikely to be ancestors of objects that already exist in the repository. (I found it convenient while developing this patch to have 'git pack-objects' report the number of objects which were visited and got their namehash fields filled in during traversal. This is also included in the below patch via trace2 data lines). Suggested-by: Jeff King Signed-off-by: Taylor Blau --- Documentation/git-pack-objects.txt | 10 ++ builtin/pack-objects.c | 202 ++++++++++++++++++++++++++++- t/t5300-pack-object.sh | 97 ++++++++++++++ 3 files changed, 307 insertions(+), 2 deletions(-) diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt index 54d715ead1..df533c3b19 100644 --- a/Documentation/git-pack-objects.txt +++ b/Documentation/git-pack-objects.txt @@ -85,6 +85,16 @@ base-name:: reference was included in the resulting packfile. This can be useful to send new tags to native Git clients. +--stdin-packs:: + Read the basenames of packfiles (e.g., `pack-1234abcd.pack`) + from the standard input, instead of object names or revision + arguments. The resulting pack contains all objects listed in the + included packs (those not beginning with `^`), excluding any + objects listed in the excluded packs (beginning with `^`). ++ +Incompatible with `--revs`, or options that imply `--revs` (such as +`--all`), with the exception of `--unpacked`, which is compatible. + --window=:: --depth=:: These two options affect how the objects contained in diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 6d62aaf59a..6ee8e40665 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -2986,6 +2986,190 @@ static int git_pack_config(const char *k, const char *v, void *cb) return git_default_config(k, v, cb); } +/* Counters for trace2 output when in --stdin-packs mode. */ +static int stdin_packs_found_nr; +static int stdin_packs_hints_nr; + +static int add_object_entry_from_pack(const struct object_id *oid, + struct packed_git *p, + uint32_t pos, + void *_data) +{ + struct rev_info *revs = _data; + struct object_info oi = OBJECT_INFO_INIT; + off_t ofs; + enum object_type type; + + display_progress(progress_state, ++nr_seen); + + if (have_duplicate_entry(oid, 0)) + return 0; + + ofs = nth_packed_object_offset(p, pos); + if (!want_object_in_pack(oid, 0, &p, &ofs)) + return 0; + + oi.typep = &type; + if (packed_object_info(the_repository, p, ofs, &oi) < 0) + die(_("could not get type of object %s in pack %s"), + oid_to_hex(oid), p->pack_name); + else if (type == OBJ_COMMIT) { + /* + * commits in included packs are used as starting points for the + * subsequent revision walk + */ + add_pending_oid(revs, NULL, oid, 0); + } + + stdin_packs_found_nr++; + + create_object_entry(oid, type, 0, 0, 0, p, ofs); + + return 0; +} + +static void show_commit_pack_hint(struct commit *commit, void *_data) +{ + /* nothing to do; commits don't have a namehash */ +} + +static void show_object_pack_hint(struct object *object, const char *name, + void *_data) +{ + struct object_entry *oe = packlist_find(&to_pack, &object->oid); + if (!oe) + return; + + /* + * Our 'to_pack' list was constructed by iterating all objects packed in + * included packs, and so doesn't have a non-zero hash field that you + * would typically pick up during a reachability traversal. + * + * Make a best-effort attempt to fill in the ->hash and ->no_try_delta + * here using a now in order to perhaps improve the delta selection + * process. + */ + oe->hash = pack_name_hash(name); + oe->no_try_delta = name && no_try_delta(name); + + stdin_packs_hints_nr++; +} + +static int pack_mtime_cmp(const void *_a, const void *_b) +{ + struct packed_git *a = ((const struct string_list_item*)_a)->util; + struct packed_git *b = ((const struct string_list_item*)_b)->util; + + /* + * order packs by descending mtime so that objects are laid out + * roughly as newest-to-oldest + */ + if (a->mtime < b->mtime) + return 1; + else if (b->mtime < a->mtime) + return -1; + else + return 0; +} + +static void read_packs_list_from_stdin(void) +{ + struct strbuf buf = STRBUF_INIT; + struct string_list include_packs = STRING_LIST_INIT_DUP; + struct string_list exclude_packs = STRING_LIST_INIT_DUP; + struct string_list_item *item = NULL; + + struct packed_git *p; + struct rev_info revs; + + repo_init_revisions(the_repository, &revs, NULL); + /* + * Use a revision walk to fill in the namehash of objects in the include + * packs. To save time, we'll avoid traversing through objects that are + * in excluded packs. + * + * That may cause us to avoid populating all of the namehash fields of + * all included objects, but our goal is best-effort, since this is only + * an optimization during delta selection. + */ + revs.no_kept_objects = 1; + revs.keep_pack_cache_flags |= IN_CORE_KEEP_PACKS; + revs.blob_objects = 1; + revs.tree_objects = 1; + revs.tag_objects = 1; + + while (strbuf_getline(&buf, stdin) != EOF) { + if (!buf.len) + continue; + + if (*buf.buf == '^') + string_list_append(&exclude_packs, buf.buf + 1); + else + string_list_append(&include_packs, buf.buf); + + strbuf_reset(&buf); + } + + string_list_sort(&include_packs); + string_list_sort(&exclude_packs); + + for (p = get_all_packs(the_repository); p; p = p->next) { + const char *pack_name = pack_basename(p); + + item = string_list_lookup(&include_packs, pack_name); + if (!item) + item = string_list_lookup(&exclude_packs, pack_name); + + if (item) + item->util = p; + } + + /* + * First handle all of the excluded packs, marking them as kept in-core + * so that later calls to add_object_entry() discards any objects that + * are also found in excluded packs. + */ + for_each_string_list_item(item, &exclude_packs) { + struct packed_git *p = item->util; + if (!p) + die(_("could not find pack '%s'"), item->string); + p->pack_keep_in_core = 1; + } + + /* + * Order packs by ascending mtime; use QSORT directly to access the + * string_list_item's ->util pointer, which string_list_sort() does not + * provide. + */ + QSORT(include_packs.items, include_packs.nr, pack_mtime_cmp); + + for_each_string_list_item(item, &include_packs) { + struct packed_git *p = item->util; + if (!p) + die(_("could not find pack '%s'"), item->string); + for_each_object_in_pack(p, + add_object_entry_from_pack, + &revs, + FOR_EACH_OBJECT_PACK_ORDER); + } + + if (prepare_revision_walk(&revs)) + die(_("revision walk setup failed")); + traverse_commit_list(&revs, + show_commit_pack_hint, + show_object_pack_hint, + NULL); + + trace2_data_intmax("pack-objects", the_repository, "stdin_packs_found", + stdin_packs_found_nr); + trace2_data_intmax("pack-objects", the_repository, "stdin_packs_hints", + stdin_packs_hints_nr); + + strbuf_release(&buf); + string_list_clear(&include_packs, 0); + string_list_clear(&exclude_packs, 0); +} + static void read_object_list_from_stdin(void) { char line[GIT_MAX_HEXSZ + 1 + PATH_MAX + 2]; @@ -3489,6 +3673,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) struct strvec rp = STRVEC_INIT; int rev_list_unpacked = 0, rev_list_all = 0, rev_list_reflog = 0; int rev_list_index = 0; + int stdin_packs = 0; struct string_list keep_pack_list = STRING_LIST_INIT_NODUP; struct option pack_objects_options[] = { OPT_SET_INT('q', "quiet", &progress, @@ -3539,6 +3724,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) OPT_SET_INT_F(0, "indexed-objects", &rev_list_index, N_("include objects referred to by the index"), 1, PARSE_OPT_NONEG), + OPT_BOOL(0, "stdin-packs", &stdin_packs, + N_("read packs from stdin")), OPT_BOOL(0, "stdout", &pack_to_stdout, N_("output pack to stdout")), OPT_BOOL(0, "include-tag", &include_tag, @@ -3645,7 +3832,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) use_internal_rev_list = 1; strvec_push(&rp, "--indexed-objects"); } - if (rev_list_unpacked) { + if (rev_list_unpacked && !stdin_packs) { use_internal_rev_list = 1; strvec_push(&rp, "--unpacked"); } @@ -3690,8 +3877,13 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) if (filter_options.choice) { if (!pack_to_stdout) die(_("cannot use --filter without --stdout")); + if (stdin_packs) + die(_("cannot use --filter with --stdin-packs")); } + if (stdin_packs && use_internal_rev_list) + die(_("cannot use internal rev list with --stdin-packs")); + /* * "soft" reasons not to use bitmaps - for on-disk repack by default we want * @@ -3750,7 +3942,13 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) if (progress) progress_state = start_progress(_("Enumerating objects"), 0); - if (!use_internal_rev_list) + if (stdin_packs) { + /* avoids adding objects in excluded packs */ + ignore_packed_keep_in_core = 1; + read_packs_list_from_stdin(); + if (rev_list_unpacked) + add_unreachable_loose_objects(); + } else if (!use_internal_rev_list) read_object_list_from_stdin(); else { get_object_list(rp.nr, rp.v); diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh index 392201cabd..7138a54595 100755 --- a/t/t5300-pack-object.sh +++ b/t/t5300-pack-object.sh @@ -532,4 +532,101 @@ test_expect_success 'prefetch objects' ' test_line_count = 1 donelines ' +test_expect_success 'setup for --stdin-packs tests' ' + git init stdin-packs && + ( + cd stdin-packs && + + test_commit A && + test_commit B && + test_commit C && + + for id in A B C + do + git pack-objects .git/objects/pack/pack-$id \ + --incremental --revs <<-EOF + refs/tags/$id + EOF + done && + + ls -la .git/objects/pack + ) +' + +test_expect_success '--stdin-packs with excluded packs' ' + ( + cd stdin-packs && + + PACK_A="$(basename .git/objects/pack/pack-A-*.pack)" && + PACK_B="$(basename .git/objects/pack/pack-B-*.pack)" && + PACK_C="$(basename .git/objects/pack/pack-C-*.pack)" && + + git pack-objects test --stdin-packs <<-EOF && + $PACK_A + ^$PACK_B + $PACK_C + EOF + + ( + git show-index <$(ls .git/objects/pack/pack-A-*.idx) && + git show-index <$(ls .git/objects/pack/pack-C-*.idx) + ) >expect.raw && + git show-index <$(ls test-*.idx) >actual.raw && + + cut -d" " -f2 expect && + cut -d" " -f2 actual && + test_cmp expect actual + ) +' + +test_expect_success '--stdin-packs is incompatible with --filter' ' + ( + cd stdin-packs && + test_must_fail git pack-objects --stdin-packs --stdout \ + --filter=blob:none err && + test_i18ngrep "cannot use --filter with --stdin-packs" err + ) +' + +test_expect_success '--stdin-packs is incompatible with --revs' ' + ( + cd stdin-packs && + test_must_fail git pack-objects --stdin-packs --revs out \ + err && + test_i18ngrep "cannot use internal rev list with --stdin-packs" err + ) +' + +test_expect_success '--stdin-packs with loose objects' ' + ( + cd stdin-packs && + + PACK_A="$(basename .git/objects/pack/pack-A-*.pack)" && + PACK_B="$(basename .git/objects/pack/pack-B-*.pack)" && + PACK_C="$(basename .git/objects/pack/pack-C-*.pack)" && + + test_commit D && # loose + + git pack-objects test2 --stdin-packs --unpacked <<-EOF && + $PACK_A + ^$PACK_B + $PACK_C + EOF + + ( + git show-index <$(ls .git/objects/pack/pack-A-*.idx) && + git show-index <$(ls .git/objects/pack/pack-C-*.idx) && + git rev-list --objects --no-object-names \ + refs/tags/C..refs/tags/D + + ) >expect.raw && + ls -la . && + git show-index <$(ls test2-*.idx) >actual.raw && + + cut -d" " -f2 expect && + cut -d" " -f2 actual && + test_cmp expect actual + ) +' + test_done From patchwork Tue Feb 23 02:25:13 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Taylor Blau X-Patchwork-Id: 12099741 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-13.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4F44AC433E0 for ; Tue, 23 Feb 2021 02:26:24 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 1A08064E31 for ; Tue, 23 Feb 2021 02:26:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231542AbhBWC0G (ORCPT ); Mon, 22 Feb 2021 21:26:06 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43794 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231483AbhBWCZ5 (ORCPT ); Mon, 22 Feb 2021 21:25:57 -0500 Received: from mail-qt1-x833.google.com (mail-qt1-x833.google.com [IPv6:2607:f8b0:4864:20::833]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 949CFC06178B for ; Mon, 22 Feb 2021 18:25:16 -0800 (PST) Received: by mail-qt1-x833.google.com with SMTP id c1so10912761qtc.1 for ; Mon, 22 Feb 2021 18:25:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ttaylorr-com.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=D7a2htNh5TmV1PuPKkr4mM1Sv6UdleXGHbTmlSyJ708=; b=MlrecnNLiUXBrRTqbqsy7puDKBy/wEaT5mdFdbdichXPQS3ZHTY7xuMRJcjVlbUJbQ jzLqkZs3f4EhCKLpyWzLkHYaTjPveu5h/SqarJSjqzu3gpkcpoYnFK78QindxkpkFKbU VoNNzndznoZDUbdbi7ZhZK5utmPqc8WEtnCm3yK+CLGf3RsW0+OxD2Vg3gSPOkNojth9 P0Z0XSuzqGQvQDqBh+J3jzfkOv7r/KCLQlTbnwLq6B5gllJ1BwSFSeZk6dEmNfXyTbWk NG48LdTmLt3Db3uciT2ElBObKryCdoc+bt3VHDh7AwxX2Ze55GCdmyB1gJ473tGNU76F 1IoA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=D7a2htNh5TmV1PuPKkr4mM1Sv6UdleXGHbTmlSyJ708=; b=Z/nww4tcAYeA36WfuysprBuMUysFsU6JvJocXtCICobT0dAff8sXdaU5OFFE1C5YfW ofmLfneE3gxClkU0gYxZfN/C6/tysoihr+09BIaEiF1kNlZQo3z9HRB2NtQvhptZefWQ XqvU5b7BuCDY6Tav3HofOcbIniwDiMtb3Ybe+9PdqKEXFHu6miN2gU4JPaJxjsADSgFG 0PqWO5SbXWyqTKJsVqYF7ONXbHJyuOAEPsF4yD+I4WE9KCCC7NgaZQGv5gBTNNoWpwNy 7uYq+a6+UGcvh0WwDt8hLGFMpiKEJJK959sTmhKWw56pUJ4XirjIfPV/ziESFdHJiMey FI+g== X-Gm-Message-State: AOAM533MpyVeWZs9lXLye7Ji++Hoec5hYz6prgc8GBTK5zHlt+5bWjbu y5DeyWB9++CBDLH9eQ8BSlH8HThghHRSXfk7 X-Google-Smtp-Source: ABdhPJw8u2nILnoSADTNn6oDyhj7gPTNK6fUZCF3XRXOIsMOKh8kbcFIJ9DMGiaXLxg/U8KPY4xYSw== X-Received: by 2002:ac8:7b4b:: with SMTP id m11mr23267418qtu.208.1614047115646; Mon, 22 Feb 2021 18:25:15 -0800 (PST) Received: from localhost ([2605:9480:22e:ff10:29df:918f:599f:2c96]) by smtp.gmail.com with ESMTPSA id o1sm12334283qtq.81.2021.02.22.18.25.15 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 22 Feb 2021 18:25:15 -0800 (PST) Date: Mon, 22 Feb 2021 21:25:13 -0500 From: Taylor Blau To: git@vger.kernel.org Cc: peff@peff.net, dstolee@microsoft.com, gitster@pobox.com Subject: [PATCH v4 4/8] p5303: add missing &&-chains Message-ID: <6de9f0c52bbde21b478fa5e45743b7e687001744.1614047097.git.me@ttaylorr.com> References: MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Jeff King These are in a helper function, so the usual chain-lint doesn't notice them. This function is still not perfect, as it has some git invocations on the left-hand-side of the pipe, but it's primary purpose is timing, not finding bugs or correctness issues. Signed-off-by: Jeff King Signed-off-by: Taylor Blau --- t/perf/p5303-many-packs.sh | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/t/perf/p5303-many-packs.sh b/t/perf/p5303-many-packs.sh index ce0c42cc9f..d90d714923 100755 --- a/t/perf/p5303-many-packs.sh +++ b/t/perf/p5303-many-packs.sh @@ -28,11 +28,11 @@ repack_into_n () { push @commits, $_ if $. % 5 == 1; } print reverse @commits; - ' "$1" >pushes + ' "$1" >pushes && # create base packfile head -n 1 pushes | - git pack-objects --delta-base-offset --revs staging/pack + git pack-objects --delta-base-offset --revs staging/pack && # and then incrementals between each pair of commits last= && From patchwork Tue Feb 23 02:25:17 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Taylor Blau X-Patchwork-Id: 12099745 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-13.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4C20EC433E6 for ; Tue, 23 Feb 2021 02:26:28 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 1EF5C64E31 for ; Tue, 23 Feb 2021 02:26:28 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231567AbhBWC01 (ORCPT ); Mon, 22 Feb 2021 21:26:27 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43886 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231483AbhBWC0X (ORCPT ); Mon, 22 Feb 2021 21:26:23 -0500 Received: from mail-qk1-x734.google.com (mail-qk1-x734.google.com [IPv6:2607:f8b0:4864:20::734]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EE698C06178C for ; Mon, 22 Feb 2021 18:25:19 -0800 (PST) Received: by mail-qk1-x734.google.com with SMTP id q85so14848244qke.8 for ; Mon, 22 Feb 2021 18:25:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ttaylorr-com.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=rWSKPw4BeEPa0eE4eDcAGTacttRzk18BiUNG1gfxjLI=; b=BBqzr85pD0z20YjW5NqSIa7S0xJy5zA2kaR/YDiOgUNXYbJcLX3M2VA9p7b16N34xd 0b+zyv1LALIbQKqoFkXFBVV+YEiCwlvaRENotn4qUw48CbQZvpYJPsEW+oHARZvARnBm ZMWOHCefKYupekyzgXwiYO4hikqjk61mtgNBl4D42+mnwTCXyCdcPC9NhWlb4igiS6MP MS1CyupyvKVYYWAlIgB5EGIl/pDev8Ry6ztuUl95lV3kiep7ZCC0BQcsQpI49SHnS5zI qQYC/2V2ynuYTbdM9R0A2WmwdUXFZIFPOLt6Shqp4t4QfYNJ1qZ9mFyWbWdkaBUcd64c V/vw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=rWSKPw4BeEPa0eE4eDcAGTacttRzk18BiUNG1gfxjLI=; b=ku2rlDhZshz820hoNDxmL+FgfJTVEuu5l6Fxk8zmTIsCZAXqeySYnsAyWktpD5Zyiy YA5dDboGtrfdBKjQY3LfNHFrCeKtHiZJf+aqetiJ/ER7peQHtmdIRPJZUNbajelZoH2P hAHcj8ZKZxUPvk0+JiQ64A8NbsVFuvB0ok1WIgSe8i+hrmU4DizqE+c7InBIGmXdWs3P s7vYC8D/mpJNUbqP3XopozFrITmUjwG5Ow7V0UzbJX73CkFsYVPe+jQO01HfXAtNl0cA S0LeJ6HLpm8ffroUWi9S5jABuE88P65PIrGHUmDtTgdG3PuYpgOy7tKZG+flkIp8DAn0 adVg== X-Gm-Message-State: AOAM533WHE71iAEErAyjJF+k7lCvg0DpcPPDutvYz+L3eLpNkxxYm2+a EibYC6N9P0SxLSxmOIxDN9B2iXehKUSvL0Ru X-Google-Smtp-Source: ABdhPJzklvBZDn6snQy5ZgK5RIBVK/OvMA+CS4qCGN8Uymje+Uwsn/RjvJLMNvyxF4sGOo3N9ral3A== X-Received: by 2002:a05:620a:22e1:: with SMTP id p1mr5055402qki.238.1614047118939; Mon, 22 Feb 2021 18:25:18 -0800 (PST) Received: from localhost ([2605:9480:22e:ff10:29df:918f:599f:2c96]) by smtp.gmail.com with ESMTPSA id j20sm4592510qtl.36.2021.02.22.18.25.18 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 22 Feb 2021 18:25:18 -0800 (PST) Date: Mon, 22 Feb 2021 21:25:17 -0500 From: Taylor Blau To: git@vger.kernel.org Cc: peff@peff.net, dstolee@microsoft.com, gitster@pobox.com Subject: [PATCH v4 5/8] p5303: measure time to repack with keep Message-ID: <94e4f3ee3af3181c3805ee397d043e343038005a.1614047097.git.me@ttaylorr.com> References: MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Jeff King Add two new tests to measure repack performance. Both tests split the repository into synthetic "pushes", and then leave the remaining objects in a big base pack. The first new test marks an empty pack as "kept" and then passes --honor-pack-keep to avoid including objects in it. That doesn't change the resulting pack, but it does let us compare to the normal repack case to see how much overhead we add to check whether objects are kept or not. The other test is of --stdin-packs, which gives us a sense of how that number scales based on the number of packs we provide as input. In each of those tests, the empty pack isn't considered, but the residual pack (objects that were left over and not included in one of the synthetic push packs) is marked as kept. (Note that in the single-pack case of the --stdin-packs test, there is nothing do since there are no non-excluded packs). Here are some timings on a recent clone of the kernel: 5303.5: repack (1) 57.26(54.59+10.84) 5303.6: repack with kept (1) 57.33(54.80+10.51) in the 50-pack case, things start to slow down: 5303.11: repack (50) 71.54(88.57+4.84) 5303.12: repack with kept (50) 85.12(102.05+4.94) and by the time we hit 1,000 packs, things are substantially worse, even though the resulting pack produced is the same: 5303.17: repack (1000) 216.87(490.79+14.57) 5303.18: repack with kept (1000) 665.63(938.87+15.76) That's because the code paths around handling .keep files are known to scale badly; they look in every single pack file to find each object. Our solution to that was to notice that most repos don't have keep files, and to make that case a fast path. But as soon as you add a single .keep, that part of pack-objects slows down again (even if we have fewer objects total to look at). Likewise, the scaling is pretty extreme on --stdin-packs (but each subsequent test is also being asked to do more work): 5303.7: repack with --stdin-packs (1) 0.01(0.01+0.00) 5303.13: repack with --stdin-packs (50) 3.53(12.07+0.24) 5303.19: repack with --stdin-packs (1000) 195.83(371.82+8.10) Signed-off-by: Jeff King Signed-off-by: Taylor Blau --- t/perf/p5303-many-packs.sh | 34 ++++++++++++++++++++++++++++++++-- 1 file changed, 32 insertions(+), 2 deletions(-) diff --git a/t/perf/p5303-many-packs.sh b/t/perf/p5303-many-packs.sh index d90d714923..35c0cbdf49 100755 --- a/t/perf/p5303-many-packs.sh +++ b/t/perf/p5303-many-packs.sh @@ -31,8 +31,15 @@ repack_into_n () { ' "$1" >pushes && # create base packfile - head -n 1 pushes | - git pack-objects --delta-base-offset --revs staging/pack && + base_pack=$( + head -n 1 pushes | + git pack-objects --delta-base-offset --revs staging/pack + ) && + test_export base_pack && + + # create an empty packfile + empty_pack=$(git pack-objects staging/pack stdin.packs + # and install the whole thing rm -f .git/objects/pack/* && mv staging/* .git/objects/pack/ @@ -91,6 +104,23 @@ do --reflog --indexed-objects --delta-base-offset \ --stdout /dev/null ' + + test_perf "repack with kept ($nr_packs)" ' + git pack-objects --keep-true-parents \ + --keep-pack=pack-$empty_pack.pack \ + --honor-pack-keep --non-empty --all \ + --reflog --indexed-objects --delta-base-offset \ + --stdout /dev/null + ' + + test_perf "repack with --stdin-packs ($nr_packs)" ' + git pack-objects \ + --keep-true-parents \ + --stdin-packs \ + --non-empty \ + --delta-base-offset \ + --stdout /dev/null + ' done # Measure pack loading with 10,000 packs. From patchwork Tue Feb 23 02:25:20 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Taylor Blau X-Patchwork-Id: 12099743 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-13.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6F495C433DB for ; Tue, 23 Feb 2021 02:26:26 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 3F9A164E41 for ; Tue, 23 Feb 2021 02:26:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231561AbhBWC0Z (ORCPT ); Mon, 22 Feb 2021 21:26:25 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43888 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231543AbhBWC0X (ORCPT ); Mon, 22 Feb 2021 21:26:23 -0500 Received: from mail-qk1-x72e.google.com (mail-qk1-x72e.google.com [IPv6:2607:f8b0:4864:20::72e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5CA85C061793 for ; Mon, 22 Feb 2021 18:25:23 -0800 (PST) Received: by mail-qk1-x72e.google.com with SMTP id v206so14876760qkb.3 for ; Mon, 22 Feb 2021 18:25:23 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ttaylorr-com.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=c5LrHyS6VOi9YEihFiTzV6206Yy2BvjhlxUxDUojloM=; b=mTh0EttW7cbQEnXQFVtCovtPCaDIGD4qiz8m0Y6a37XQBe84Sw2L35LKBUKsu6ahvk /bcgj4VLAxdW5qInlOeNWMXhpJvFq4M2NEuiy/HfYk3fMAxUs+8wdGa6mMUxTQw5utVy CT8GO3kJOCSPK+CJDeKnHHeJ64W8eXn79Rrb1LcOjanp/KQqlP7TiAmTbV1iaOZFR1pt l5E3rDMCYqSNhcMV11DX4O5/9hQKX1SwxFFGO29LZnsr1CRApRLEFs7Qowee/1Lja82i wppKhQbuZ+YgKMQnyGAvO5LU6lyU3sg5M/mqVtIW5evasNEaJ5WhzEfxguoURczaeB36 GYhg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=c5LrHyS6VOi9YEihFiTzV6206Yy2BvjhlxUxDUojloM=; b=TGPuukXkg9oNw80pT3T9AtkA+IRw8zaHtpfUt+mdfh0qX8EVUt8birAjKYy9+lb8s5 Px873x9Jv7kpNb7Gc3WHLSVk6G3FSeIHCR6zeZfk9U9xPJ/1XI76d628CwmYr6UjVudW QqyCom+NRLd4YKnldnimnaurM6kFXUbLcGLAoXU2ppRJUkis2BmlddpsijasSscWBqWY 7jOk2dGNMBvyrsaVSvPfoT3GwnbV+yER+cMoWdDjI5+vfwixKALLaEV8zSYtw+BrNIz7 viwSKBy6KuIy6p/gfwzisom6Z+bzBOJmVFIyYlQ36ORb76zxFs5oNDho83WLgNIVyQ5b u28g== X-Gm-Message-State: AOAM531+y0DlDcd8ea7VgPJZ36D/Tx2MmBPeXKbOxOCc6z0SXhF2h9Fn wPwGqbkSORlDYrj7qxKjeJAGS5k08zFOP3rd X-Google-Smtp-Source: ABdhPJz0kiNykpMCP8alIHaoS7IKemIzrJQvPfRK2wQ/yHUBZI9+zDlEXXeMnfbhU+ZJWVco7M9/MQ== X-Received: by 2002:ae9:f812:: with SMTP id x18mr22570922qkh.377.1614047122289; Mon, 22 Feb 2021 18:25:22 -0800 (PST) Received: from localhost ([2605:9480:22e:ff10:29df:918f:599f:2c96]) by smtp.gmail.com with ESMTPSA id b20sm12470587qto.45.2021.02.22.18.25.21 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 22 Feb 2021 18:25:21 -0800 (PST) Date: Mon, 22 Feb 2021 21:25:20 -0500 From: Taylor Blau To: git@vger.kernel.org Cc: peff@peff.net, dstolee@microsoft.com, gitster@pobox.com Subject: [PATCH v4 6/8] builtin/pack-objects.c: rewrite honor-pack-keep logic Message-ID: References: MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Jeff King Now that we have find_kept_pack_entry(), we don't have to manually keep hunting through every pack to find a possible "kept" duplicate of the object. This should be faster, assuming only a portion of your total packs are actually kept. Note that we have to re-order the logic a bit here; we can deal with the disqualifying situations first (e.g., finding the object in a non-local pack with --local), then "kept" situation(s), and then just fall back to other "--local" conditions. Here are the results from p5303 (measurements again taken on the kernel): Test HEAD^ HEAD ----------------------------------------------------------------------------------------------- 5303.5: repack (1) 57.26(54.59+10.84) 57.34(54.66+10.88) +0.1% 5303.6: repack with kept (1) 57.33(54.80+10.51) 57.38(54.83+10.49) +0.1% 5303.11: repack (50) 71.54(88.57+4.84) 71.70(88.99+4.74) +0.2% 5303.12: repack with kept (50) 85.12(102.05+4.94) 72.58(89.61+4.78) -14.7% 5303.17: repack (1000) 216.87(490.79+14.57) 217.19(491.72+14.25) +0.1% 5303.18: repack with kept (1000) 665.63(938.87+15.76) 246.12(520.07+14.93) -63.0% and the --stdin-packs timings: 5303.7: repack with --stdin-packs (1) 0.01(0.01+0.00) 0.00(0.00+0.00) -100.0% 5303.13: repack with --stdin-packs (50) 3.53(12.07+0.24) 3.43(11.75+0.24) -2.8% 5303.19: repack with --stdin-packs (1000) 195.83(371.82+8.10) 130.50(307.15+7.66) -33.4% So our repack with an empty .keep pack is roughly as fast as one without a .keep pack up to 50 packs. But the --stdin-packs case scales a little better, too. Notably, it is faster than a repack of the same size and a kept pack. It looks at fewer objects, of course, but the penalty for looking at many packs isn't as costly. Signed-off-by: Jeff King Signed-off-by: Taylor Blau --- builtin/pack-objects.c | 131 ++++++++++++++++++++++++----------------- 1 file changed, 78 insertions(+), 53 deletions(-) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 6ee8e40665..8cb32763b7 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -1188,7 +1188,8 @@ static int have_duplicate_entry(const struct object_id *oid, return 1; } -static int want_found_object(int exclude, struct packed_git *p) +static int want_found_object(const struct object_id *oid, int exclude, + struct packed_git *p) { if (exclude) return 1; @@ -1204,27 +1205,82 @@ static int want_found_object(int exclude, struct packed_git *p) * make sure no copy of this object appears in _any_ pack that makes us * to omit the object, so we need to check all the packs. * - * We can however first check whether these options can possible matter; + * We can however first check whether these options can possibly matter; * if they do not matter we know we want the object in generated pack. * Otherwise, we signal "-1" at the end to tell the caller that we do * not know either way, and it needs to check more packs. */ - if (!ignore_packed_keep_on_disk && - !ignore_packed_keep_in_core && - (!local || !have_non_local_packs)) - return 1; + /* + * Objects in packs borrowed from elsewhere are discarded regardless of + * if they appear in other packs that weren't borrowed. + */ if (local && !p->pack_local) return 0; - if (p->pack_local && - ((ignore_packed_keep_on_disk && p->pack_keep) || - (ignore_packed_keep_in_core && p->pack_keep_in_core))) - return 0; + + /* + * Then handle .keep first, as we have a fast(er) path there. + */ + if (ignore_packed_keep_on_disk || ignore_packed_keep_in_core) { + /* + * Set the flags for the kept-pack cache to be the ones we want + * to ignore. + * + * That is, if we are ignoring objects in on-disk keep packs, + * then we want to search through the on-disk keep and ignore + * the in-core ones. + */ + unsigned flags = 0; + if (ignore_packed_keep_on_disk) + flags |= ON_DISK_KEEP_PACKS; + if (ignore_packed_keep_in_core) + flags |= IN_CORE_KEEP_PACKS; + + if (ignore_packed_keep_on_disk && p->pack_keep) + return 0; + if (ignore_packed_keep_in_core && p->pack_keep_in_core) + return 0; + if (has_object_kept_pack(oid, flags)) + return 0; + } + + /* + * At this point we know definitively that either we don't care about + * keep-packs, or the object is not in one. Keep checking other + * conditions... + */ + if (!local || !have_non_local_packs) + return 1; /* we don't know yet; keep looking for more packs */ return -1; } +static int want_object_in_pack_one(struct packed_git *p, + const struct object_id *oid, + int exclude, + struct packed_git **found_pack, + off_t *found_offset) +{ + off_t offset; + + if (p == *found_pack) + offset = *found_offset; + else + offset = find_pack_entry_one(oid->hash, p); + + if (offset) { + if (!*found_pack) { + if (!is_pack_valid(p)) + return -1; + *found_offset = offset; + *found_pack = p; + } + return want_found_object(oid, exclude, p); + } + return -1; +} + /* * Check whether we want the object in the pack (e.g., we do not want * objects found in non-local stores if the "--local" option was used). @@ -1252,7 +1308,7 @@ static int want_object_in_pack(const struct object_id *oid, * are present we will determine the answer right now. */ if (*found_pack) { - want = want_found_object(exclude, *found_pack); + want = want_found_object(oid, exclude, *found_pack); if (want != -1) return want; } @@ -1260,53 +1316,22 @@ static int want_object_in_pack(const struct object_id *oid, for (m = get_multi_pack_index(the_repository); m; m = m->next) { struct pack_entry e; if (fill_midx_entry(the_repository, oid, &e, m)) { - struct packed_git *p = e.p; - off_t offset; - - if (p == *found_pack) - offset = *found_offset; - else - offset = find_pack_entry_one(oid->hash, p); - - if (offset) { - if (!*found_pack) { - if (!is_pack_valid(p)) - continue; - *found_offset = offset; - *found_pack = p; - } - want = want_found_object(exclude, p); - if (want != -1) - return want; - } - } - } - - list_for_each(pos, get_packed_git_mru(the_repository)) { - struct packed_git *p = list_entry(pos, struct packed_git, mru); - off_t offset; - - if (p == *found_pack) - offset = *found_offset; - else - offset = find_pack_entry_one(oid->hash, p); - - if (offset) { - if (!*found_pack) { - if (!is_pack_valid(p)) - continue; - *found_offset = offset; - *found_pack = p; - } - want = want_found_object(exclude, p); - if (!exclude && want > 0) - list_move(&p->mru, - get_packed_git_mru(the_repository)); + want = want_object_in_pack_one(e.p, oid, exclude, found_pack, found_offset); if (want != -1) return want; } } + list_for_each(pos, get_packed_git_mru(the_repository)) { + struct packed_git *p = list_entry(pos, struct packed_git, mru); + want = want_object_in_pack_one(p, oid, exclude, found_pack, found_offset); + if (!exclude && want > 0) + list_move(&p->mru, + get_packed_git_mru(the_repository)); + if (want != -1) + return want; + } + if (uri_protocols.nr) { struct configured_exclusion *ex = oidmap_get(&configured_exclusions, oid); From patchwork Tue Feb 23 02:25:23 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Taylor Blau X-Patchwork-Id: 12099759 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-13.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E94B8C433E0 for ; Tue, 23 Feb 2021 02:26:40 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id BC99564E41 for ; Tue, 23 Feb 2021 02:26:40 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231573AbhBWC0i (ORCPT ); Mon, 22 Feb 2021 21:26:38 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43908 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231569AbhBWC02 (ORCPT ); Mon, 22 Feb 2021 21:26:28 -0500 Received: from mail-qk1-x730.google.com (mail-qk1-x730.google.com [IPv6:2607:f8b0:4864:20::730]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BC0BFC061794 for ; Mon, 22 Feb 2021 18:25:26 -0800 (PST) Received: by mail-qk1-x730.google.com with SMTP id f17so14868999qkl.5 for ; Mon, 22 Feb 2021 18:25:26 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ttaylorr-com.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=W7BAgY6QH8/c+NOhD4ioZpf2YtEHxcDQW5EvyimR7YQ=; b=jqjbjbqJgwuFYyIxonXcGfl47CRwhrnESVzd6eb0O0272I+V2XpZiMWNiAN2KUr6lU 2yiQwWOQnvjubf2KNv6wFEZxw70ndVNBRfYOo0U60pnP7Y99hRCD/OK3tswHmNVv0Np9 eZEz188Hr0iRSFRhpDyBWuSDyoLTEblLuqKrb3lIfbsO03bjED7vOKMvRB1/zZXSO2Q3 yTkCZJqhhafFMPfEmkDLwxyGbIwRwL8k8qTI2qsTYu60wrXv/lkijPUxncDI89ymo89N xrgqep4YDEGqzB+uXNVH673kDZKRA2YFboK5r9vleK4/IBK9bSlP7rfm/qKAGHlcOAiZ nmYA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=W7BAgY6QH8/c+NOhD4ioZpf2YtEHxcDQW5EvyimR7YQ=; b=EF87d2oOFJog22pECcX138YslgtPOgxlTiZtWJUFzgcmYzW08DlJbac/6g+xV14fML FCDV/ZdVJPT4FYbHn+z1JNWxwj2g72TSPpIF3Ou/tlF45W2AEzCFv/UDKEUHQiDk31p6 AQ4bp8fFvqfTAAw46Icv1ZrfgVB3bVyVTBw5TQHXi77tbyEIH+R2RiQG+ixEFVUQG+zf /5J0lsZCewZLedFrokVDHXBCknkwq4glKja10PVDlUWa1if6B5m/5cKkzdq1CQBjjyob VlhSLEi7Jc84mfw8w+wHU87jyozCr3dZ7hgvRvVPudtZKJREyJkRVi+A1XysT7+PfMNx 8Vgg== X-Gm-Message-State: AOAM5319t1OF6AmSGkwSxD0Oto4Vt8qKeq6l5l+M2CzhsPaleo+V6Bx4 UHfuocdd99MUOxQaMRGntwOmZQHeCBYoG2jj X-Google-Smtp-Source: ABdhPJw9V9UBm9afFvaCnzthFFZ9wjB+WVKjY+F/UNHhoTs4zExrsKSaiWCSXh8ObZ798oKOWAk3Gw== X-Received: by 2002:a37:5088:: with SMTP id e130mr24533226qkb.321.1614047125620; Mon, 22 Feb 2021 18:25:25 -0800 (PST) Received: from localhost ([2605:9480:22e:ff10:29df:918f:599f:2c96]) by smtp.gmail.com with ESMTPSA id a206sm14042801qkc.7.2021.02.22.18.25.25 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 22 Feb 2021 18:25:25 -0800 (PST) Date: Mon, 22 Feb 2021 21:25:23 -0500 From: Taylor Blau To: git@vger.kernel.org Cc: peff@peff.net, dstolee@microsoft.com, gitster@pobox.com Subject: [PATCH v4 7/8] packfile: add kept-pack cache for find_kept_pack_entry() Message-ID: References: MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Jeff King In a recent patch we added a function 'find_kept_pack_entry()' to look for an object only among kept packs. While this function avoids doing any lookup work in non-kept packs, it is still linear in the number of packs, since we have to traverse the linked list of packs once per object. Let's cache a reduced version of that list to save us time. Note that this cache will last the lifetime of the program. We could invalidate it on reprepare_packed_git(), but there's not much point in being rigorous here: - we might already fail to notice new .keep packs showing up after the program starts. We only reprepare_packed_git() when we fail to find an object. But adding a new pack won't cause that to happen. Somebody repacking could add a new pack and delete an old one, but most of the time we'd have a descriptor or mmap open to the old pack anyway, so we might not even notice. - in pack-objects we already cache the .keep state at startup, since 56dfeb6263 (pack-objects: compute local/ignore_pack_keep early, 2016-07-29). So this is just extending that concept further. - we don't have to worry about any packed_git being removed; we always keep the old structs around, even after reprepare_packed_git() We do defensively invalidate the cache in case the set of kept packs being asked for changes (e.g., only in-core kept packs were cached, but suddenly the caller also wants on-disk kept packs, too). In theory we could build all three caches and switch between them, but it's not necessary, since this patch (and series) never changes the set of kept packs that it wants to inspect from the cache. So that "optimization" is more about being defensive in the face of future changes than it is about asking for multiple kinds of kept packs in this patch. Here are p5303 results (as always, measured against the kernel): Test HEAD^ HEAD ----------------------------------------------------------------------------------------------- 5303.5: repack (1) 57.34(54.66+10.88) 56.98(54.36+10.98) -0.6% 5303.6: repack with kept (1) 57.38(54.83+10.49) 57.17(54.97+10.26) -0.4% 5303.11: repack (50) 71.70(88.99+4.74) 71.62(88.48+5.08) -0.1% 5303.12: repack with kept (50) 72.58(89.61+4.78) 71.56(88.80+4.59) -1.4% 5303.17: repack (1000) 217.19(491.72+14.25) 217.31(490.82+14.53) +0.1% 5303.18: repack with kept (1000) 246.12(520.07+14.93) 217.08(490.37+15.10) -11.8% and the --stdin-packs case, which scales a little bit better (although not by that much even at 1,000 packs): 5303.7: repack with --stdin-packs (1) 0.00(0.00+0.00) 0.00(0.00+0.00) = 5303.13: repack with --stdin-packs (50) 3.43(11.75+0.24) 3.43(11.69+0.30) +0.0% 5303.19: repack with --stdin-packs (1000) 130.50(307.15+7.66) 125.13(301.36+8.04) -4.1% Signed-off-by: Jeff King Signed-off-by: Taylor Blau --- object-store.h | 5 +++ packfile.c | 99 ++++++++++++++++++++++++++++---------------------- 2 files changed, 61 insertions(+), 43 deletions(-) diff --git a/object-store.h b/object-store.h index 541dab0858..ec32c23dcb 100644 --- a/object-store.h +++ b/object-store.h @@ -153,6 +153,11 @@ struct raw_object_store { /* A most-recently-used ordered version of the packed_git list. */ struct list_head packed_git_mru; + struct { + struct packed_git **packs; + unsigned flags; + } kept_pack_cache; + /* * A map of packfiles to packed_git structs for tracking which * packs have been loaded already. diff --git a/packfile.c b/packfile.c index 7f84f221ce..57d5b436fb 100644 --- a/packfile.c +++ b/packfile.c @@ -2042,10 +2042,7 @@ static int fill_pack_entry(const struct object_id *oid, return 1; } -static int find_one_pack_entry(struct repository *r, - const struct object_id *oid, - struct pack_entry *e, - int kept_only) +int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e) { struct list_head *pos; struct multi_pack_index *m; @@ -2055,49 +2052,63 @@ static int find_one_pack_entry(struct repository *r, return 0; for (m = r->objects->multi_pack_index; m; m = m->next) { - if (!fill_midx_entry(r, oid, e, m)) - continue; - - if (!kept_only) - return 1; - - if (((kept_only & ON_DISK_KEEP_PACKS) && e->p->pack_keep) || - ((kept_only & IN_CORE_KEEP_PACKS) && e->p->pack_keep_in_core)) + if (fill_midx_entry(r, oid, e, m)) return 1; } list_for_each(pos, &r->objects->packed_git_mru) { struct packed_git *p = list_entry(pos, struct packed_git, mru); - if (p->multi_pack_index && !kept_only) { - /* - * If this pack is covered by the MIDX, we'd have found - * the object already in the loop above if it was here, - * so don't bother looking. - * - * The exception is if we are looking only at kept - * packs. An object can be present in two packs covered - * by the MIDX, one kept and one not-kept. And as the - * MIDX points to only one copy of each object, it might - * have returned only the non-kept version above. We - * have to check again to be thorough. - */ - continue; - } - if (!kept_only || - (((kept_only & ON_DISK_KEEP_PACKS) && p->pack_keep) || - ((kept_only & IN_CORE_KEEP_PACKS) && p->pack_keep_in_core))) { - if (fill_pack_entry(oid, e, p)) { - list_move(&p->mru, &r->objects->packed_git_mru); - return 1; - } + if (!p->multi_pack_index && fill_pack_entry(oid, e, p)) { + list_move(&p->mru, &r->objects->packed_git_mru); + return 1; } } return 0; } -int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e) +static void maybe_invalidate_kept_pack_cache(struct repository *r, + unsigned flags) { - return find_one_pack_entry(r, oid, e, 0); + if (!r->objects->kept_pack_cache.packs) + return; + if (r->objects->kept_pack_cache.flags == flags) + return; + FREE_AND_NULL(r->objects->kept_pack_cache.packs); + r->objects->kept_pack_cache.flags = 0; +} + +static struct packed_git **kept_pack_cache(struct repository *r, unsigned flags) +{ + maybe_invalidate_kept_pack_cache(r, flags); + + if (!r->objects->kept_pack_cache.packs) { + struct packed_git **packs = NULL; + size_t nr = 0, alloc = 0; + struct packed_git *p; + + /* + * We want "all" packs here, because we need to cover ones that + * are used by a midx, as well. We need to look in every one of + * them (instead of the midx itself) to cover duplicates. It's + * possible that an object is found in two packs that the midx + * covers, one kept and one not kept, but the midx returns only + * the non-kept version. + */ + for (p = get_all_packs(r); p; p = p->next) { + if ((p->pack_keep && (flags & ON_DISK_KEEP_PACKS)) || + (p->pack_keep_in_core && (flags & IN_CORE_KEEP_PACKS))) { + ALLOC_GROW(packs, nr + 1, alloc); + packs[nr++] = p; + } + } + ALLOC_GROW(packs, nr + 1, alloc); + packs[nr] = NULL; + + r->objects->kept_pack_cache.packs = packs; + r->objects->kept_pack_cache.flags = flags; + } + + return r->objects->kept_pack_cache.packs; } int find_kept_pack_entry(struct repository *r, @@ -2105,13 +2116,15 @@ int find_kept_pack_entry(struct repository *r, unsigned flags, struct pack_entry *e) { - /* - * Load all packs, including midx packs, since our "kept" strategy - * relies on that. We're relying on the side effect of it setting up - * r->objects->packed_git, which is a little ugly. - */ - get_all_packs(r); - return find_one_pack_entry(r, oid, e, flags); + struct packed_git **cache; + + for (cache = kept_pack_cache(r, flags); *cache; cache++) { + struct packed_git *p = *cache; + if (fill_pack_entry(oid, e, p)) + return 1; + } + + return 0; } int has_object_pack(const struct object_id *oid) From patchwork Tue Feb 23 02:25:27 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Taylor Blau X-Patchwork-Id: 12099757 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-13.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E11C8C433E6 for ; Tue, 23 Feb 2021 02:26:38 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id AE3FA64E41 for ; Tue, 23 Feb 2021 02:26:38 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231579AbhBWC0h (ORCPT ); Mon, 22 Feb 2021 21:26:37 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43910 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231483AbhBWC02 (ORCPT ); Mon, 22 Feb 2021 21:26:28 -0500 Received: from mail-qk1-x730.google.com (mail-qk1-x730.google.com [IPv6:2607:f8b0:4864:20::730]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 53DA2C061797 for ; Mon, 22 Feb 2021 18:25:30 -0800 (PST) Received: by mail-qk1-x730.google.com with SMTP id x124so3752438qkc.1 for ; Mon, 22 Feb 2021 18:25:30 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ttaylorr-com.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=rF9FlwQyNBepQYC+OaIfBBzdYexTa11Lr09h3+qpdKE=; b=05v7vSgaek6HMKe7E9JIATypQnRCPoxZDb0DidvG6wSjH3etlzqJXvS4ohSUejQ6oW urnU0zCjB7vlADcemBMotVaBS7jjX+8YTnj9iaFcsYK3cCYx8k6x+iAi3mOlVpynkG7d Db+++zfJn2lw+Y9seh8BklajkgikK2n14BZYR+/fZ161bP/qkOxUTP5zCgjDdMVSeSZE UZcDCEaIG+eSEb4gxVPp7Ki0gEDKPLsDkZU6kPH76GT4mZakGS2cuM4ny3Nrlq80jDA6 n6NG1FZMPZ0Uyt1wpaoTjbBHZ4C+S2UhCmCz8pna4aPvoyp0/cUg/dhikEjQjSAk6aak O9Pg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=rF9FlwQyNBepQYC+OaIfBBzdYexTa11Lr09h3+qpdKE=; b=DeiF4Bo7Gls2gEJFnb/ii7W4Ib8Qs65L2Le0pYPM0g5FeubDf28dw98iJ5X+AZFfvk zwTzeEdTO4NOiEVi+Orr+8Zt7Pu6d0Kgr2hV5yU1kz17qeMat1WoFakVaUR9EKWFHssR 2tbtHT27D8nwhlfTNumIbJLyJLrYCN1QrQogRFPLTBrC8zqTo0f8cgH9N45Y3IIkpf8p gVLvdhwD9GbXy2H9VrZwyrYJAFvbNmf1+eXro5rcH4O4AcVVCFcj11RMR4dHLG0Kf5ro DAlpfJ5MaqX9s7pvBUMMZ7LuaGzNjMmTAudHzRm0RfhGr1qobiUMGdK5wPqotPYcUzXG tnEw== X-Gm-Message-State: AOAM533bteu0C6LIWFtlbDiUubfdQwKN/2SkHClV9A2ut9mN7tgLFHCf surrBLZCxQhQhtcvNQk4xLZYDm1xFrAu92J0 X-Google-Smtp-Source: ABdhPJwlHRO9v+lnjzN/2zPD/mCIgJWB5jiTl2P1fkLUqyLoo5U5x59/NadI0ZoHcBT2zJLYz7pNZw== X-Received: by 2002:a37:a945:: with SMTP id s66mr24175472qke.272.1614047129006; Mon, 22 Feb 2021 18:25:29 -0800 (PST) Received: from localhost ([2605:9480:22e:ff10:29df:918f:599f:2c96]) by smtp.gmail.com with ESMTPSA id o6sm12251718qtr.20.2021.02.22.18.25.28 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 22 Feb 2021 18:25:28 -0800 (PST) Date: Mon, 22 Feb 2021 21:25:27 -0500 From: Taylor Blau To: git@vger.kernel.org Cc: peff@peff.net, dstolee@microsoft.com, gitster@pobox.com Subject: [PATCH v4 8/8] builtin/repack.c: add '--geometric' option Message-ID: <51f57d5da23244ebde27ad6c14cbf4b63da3317d.1614047097.git.me@ttaylorr.com> References: MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Often it is useful to both: - have relatively few packfiles in a repository, and - avoid having so few packfiles in a repository that we repack its entire contents regularly This patch implements a '--geometric=' option in 'git repack'. This allows the caller to specify that they would like each pack to be at least a factor times as large as the previous largest pack (by object count). Concretely, say that a repository has 'n' packfiles, labeled P1, P2, ..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'. With a geometric factor of 'r', it should be that: objects(Pi) > r*objects(P(i-1)) for all i in [1, n], where the packs are sorted by objects(P1) <= objects(P2) <= ... <= objects(Pn). Since finding a true optimal repacking is NP-hard, we approximate it along two directions: 1. We assume that there is a cutoff of packs _before starting the repack_ where everything to the right of that cut-off already forms a geometric progression (or no cutoff exists and everything must be repacked). 2. We assume that everything smaller than the cutoff count must be repacked. This forms our base assumption, but it can also cause even the "heavy" packs to get repacked, for e.g., if we have 6 packs containing the following number of objects: 1, 1, 1, 2, 4, 32 then we would place the cutoff between '1, 1' and '1, 2, 4, 32', rolling up the first two packs into a pack with 2 objects. That breaks our progression and leaves us: 2, 1, 2, 4, 32 ^ (where the '^' indicates the position of our split). To restore a progression, we move the split forward (towards larger packs) joining each pack into our new pack until a geometric progression is restored. Here, that looks like: 2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32 ^ ^ ^ ^ This has the advantage of not repacking the heavy-side of packs too often while also only creating one new pack at a time. Another wrinkle is that we assume that loose, indexed, and reflog'd objects are insignificant, and lump them into any new pack that we create. This can lead to non-idempotent results. Suggested-by: Derrick Stolee Signed-off-by: Taylor Blau --- Documentation/git-repack.txt | 23 +++++ builtin/repack.c | 187 ++++++++++++++++++++++++++++++++++- t/t7703-repack-geometric.sh | 137 +++++++++++++++++++++++++ 3 files changed, 343 insertions(+), 4 deletions(-) create mode 100755 t/t7703-repack-geometric.sh diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt index 92f146d27d..136da9fa0b 100644 --- a/Documentation/git-repack.txt +++ b/Documentation/git-repack.txt @@ -165,6 +165,29 @@ depth is 4095. Pass the `--delta-islands` option to `git-pack-objects`, see linkgit:git-pack-objects[1]. +-g=:: +--geometric=:: + Arrange resulting pack structure so that each successive pack + contains at least `` times the number of objects as the + next-largest pack. ++ +`git repack` ensures this by determining a "cut" of packfiles that need +to be repacked into one in order to ensure a geometric progression. It +picks the smallest set of packfiles such that as many of the larger +packfiles (by count of objects contained in that pack) may be left +intact. ++ +Unlike other repack modes, the set of objects to pack is determined +uniquely by the set of packs being "rolled-up"; in other words, the +packs determined to need to be combined in order to restore a geometric +progression. ++ +When `--unpacked` is specified, loose objects are implicitly included in +this "roll-up", without respect to their reachability. This is subject +to change in the future. This option (implying a drastically different +repack mode) is not guaranteed to work with all other combinations of +option to `git repack`). + Configuration ------------- diff --git a/builtin/repack.c b/builtin/repack.c index 01440de2d5..bcf280b10d 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -297,6 +297,124 @@ static void repack_promisor_objects(const struct pack_objects_args *args, #define ALL_INTO_ONE 1 #define LOOSEN_UNREACHABLE 2 +struct pack_geometry { + struct packed_git **pack; + uint32_t pack_nr, pack_alloc; + uint32_t split; +}; + +static uint32_t geometry_pack_weight(struct packed_git *p) +{ + if (open_pack_index(p)) + die(_("cannot open index for %s"), p->pack_name); + return p->num_objects; +} + +static int geometry_cmp(const void *va, const void *vb) +{ + uint32_t aw = geometry_pack_weight(*(struct packed_git **)va), + bw = geometry_pack_weight(*(struct packed_git **)vb); + + if (aw < bw) + return -1; + if (aw > bw) + return 1; + return 0; +} + +static void init_pack_geometry(struct pack_geometry **geometry_p) +{ + struct packed_git *p; + struct pack_geometry *geometry; + + *geometry_p = xcalloc(1, sizeof(struct pack_geometry)); + geometry = *geometry_p; + + for (p = get_all_packs(the_repository); p; p = p->next) { + if (!pack_kept_objects && p->pack_keep) + continue; + + ALLOC_GROW(geometry->pack, + geometry->pack_nr + 1, + geometry->pack_alloc); + + geometry->pack[geometry->pack_nr] = p; + geometry->pack_nr++; + } + + QSORT(geometry->pack, geometry->pack_nr, geometry_cmp); +} + +static void split_pack_geometry(struct pack_geometry *geometry, int factor) +{ + uint32_t i; + uint32_t split; + off_t total_size = 0; + + if (geometry->pack_nr <= 1) { + geometry->split = geometry->pack_nr; + return; + } + + split = geometry->pack_nr - 1; + + /* + * First, count the number of packs (in descending order of size) which + * already form a geometric progression. + */ + for (i = geometry->pack_nr - 1; i > 0; i--) { + struct packed_git *ours = geometry->pack[i]; + struct packed_git *prev = geometry->pack[i - 1]; + if (geometry_pack_weight(ours) >= factor * geometry_pack_weight(prev)) + split--; + else + break; + } + + if (split) { + /* + * Move the split one to the right, since the top element in the + * last-compared pair can't be in the progression. Only do this + * when we split in the middle of the array (otherwise if we got + * to the end, then the split is in the right place). + */ + split++; + } + + /* + * Then, anything to the left of 'split' must be in a new pack. But, + * creating that new pack may cause packs in the heavy half to no longer + * form a geometric progression. + * + * Compute an expected size of the new pack, and then determine how many + * packs in the heavy half need to be joined into it (if any) to restore + * the geometric progression. + */ + for (i = 0; i < split; i++) + total_size += geometry_pack_weight(geometry->pack[i]); + for (i = split; i < geometry->pack_nr; i++) { + struct packed_git *ours = geometry->pack[i]; + if (geometry_pack_weight(ours) < factor * total_size) { + split++; + total_size += geometry_pack_weight(ours); + } else + break; + } + + geometry->split = split; +} + +static void clear_pack_geometry(struct pack_geometry *geometry) +{ + if (!geometry) + return; + + free(geometry->pack); + geometry->pack_nr = 0; + geometry->pack_alloc = 0; + geometry->split = 0; +} + int cmd_repack(int argc, const char **argv, const char *prefix) { struct child_process cmd = CHILD_PROCESS_INIT; @@ -304,6 +422,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix) struct string_list names = STRING_LIST_INIT_DUP; struct string_list rollback = STRING_LIST_INIT_NODUP; struct string_list existing_packs = STRING_LIST_INIT_DUP; + struct pack_geometry *geometry = NULL; struct strbuf line = STRBUF_INIT; int i, ext, ret; FILE *out; @@ -316,6 +435,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix) struct string_list keep_pack_list = STRING_LIST_INIT_NODUP; int no_update_server_info = 0; struct pack_objects_args po_args = {NULL}; + int geometric_factor = 0; struct option builtin_repack_options[] = { OPT_BIT('a', NULL, &pack_everything, @@ -356,6 +476,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix) N_("repack objects in packs marked with .keep")), OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"), N_("do not repack this pack")), + OPT_INTEGER('g', "geometric", &geometric_factor, + N_("find a geometric progression with factor ")), OPT_END() }; @@ -382,6 +504,13 @@ int cmd_repack(int argc, const char **argv, const char *prefix) if (write_bitmaps && !(pack_everything & ALL_INTO_ONE)) die(_(incremental_bitmap_conflict_error)); + if (geometric_factor) { + if (pack_everything) + die(_("--geometric is incompatible with -A, -a")); + init_pack_geometry(&geometry); + split_pack_geometry(geometry, geometric_factor); + } + packdir = mkpathdup("%s/pack", get_object_directory()); packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid()); @@ -396,9 +525,19 @@ int cmd_repack(int argc, const char **argv, const char *prefix) strvec_pushf(&cmd.args, "--keep-pack=%s", keep_pack_list.items[i].string); strvec_push(&cmd.args, "--non-empty"); - strvec_push(&cmd.args, "--all"); - strvec_push(&cmd.args, "--reflog"); - strvec_push(&cmd.args, "--indexed-objects"); + if (!geometry) { + /* + * 'git pack-objects' will up all objects loose or packed + * (either rolling them up or leaving them alone), so don't pass + * these options. + * + * The implementation of 'git pack-objects --stdin-packs' + * makes them redundant (and the two are incompatible). + */ + strvec_push(&cmd.args, "--all"); + strvec_push(&cmd.args, "--reflog"); + strvec_push(&cmd.args, "--indexed-objects"); + } if (has_promisor_remote()) strvec_push(&cmd.args, "--exclude-promisor-objects"); if (write_bitmaps > 0) @@ -429,17 +568,37 @@ int cmd_repack(int argc, const char **argv, const char *prefix) strvec_push(&cmd.env_array, "GIT_REF_PARANOIA=1"); } } + } else if (geometry) { + strvec_push(&cmd.args, "--stdin-packs"); + strvec_push(&cmd.args, "--unpacked"); } else { strvec_push(&cmd.args, "--unpacked"); strvec_push(&cmd.args, "--incremental"); } - cmd.no_stdin = 1; + if (geometry) + cmd.in = -1; + else + cmd.no_stdin = 1; ret = start_command(&cmd); if (ret) return ret; + if (geometry) { + FILE *in = xfdopen(cmd.in, "w"); + /* + * The resulting pack should contain all objects in packs that + * are going to be rolled up, but exclude objects in packs which + * are being left alone. + */ + for (i = 0; i < geometry->split; i++) + fprintf(in, "%s\n", pack_basename(geometry->pack[i])); + for (i = geometry->split; i < geometry->pack_nr; i++) + fprintf(in, "^%s\n", pack_basename(geometry->pack[i])); + fclose(in); + } + out = xfdopen(cmd.out, "r"); while (strbuf_getline_lf(&line, out) != EOF) { if (line.len != the_hash_algo->hexsz) @@ -507,6 +666,25 @@ int cmd_repack(int argc, const char **argv, const char *prefix) if (!string_list_has_string(&names, sha1)) remove_redundant_pack(packdir, item->string); } + + if (geometry) { + struct strbuf buf = STRBUF_INIT; + + uint32_t i; + for (i = 0; i < geometry->split; i++) { + struct packed_git *p = geometry->pack[i]; + if (string_list_has_string(&names, + hash_to_hex(p->hash))) + continue; + + strbuf_reset(&buf); + strbuf_addstr(&buf, pack_basename(p)); + strbuf_strip_suffix(&buf, ".pack"); + + remove_redundant_pack(packdir, buf.buf); + } + strbuf_release(&buf); + } if (!po_args.quiet && isatty(2)) opts |= PRUNE_PACKED_VERBOSE; prune_packed_objects(opts); @@ -528,6 +706,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix) string_list_clear(&names, 0); string_list_clear(&rollback, 0); string_list_clear(&existing_packs, 0); + clear_pack_geometry(geometry); strbuf_release(&line); return 0; diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh new file mode 100755 index 0000000000..96917fc163 --- /dev/null +++ b/t/t7703-repack-geometric.sh @@ -0,0 +1,137 @@ +#!/bin/sh + +test_description='git repack --geometric works correctly' + +. ./test-lib.sh + +GIT_TEST_MULTI_PACK_INDEX=0 + +objdir=.git/objects +midx=$objdir/pack/multi-pack-index + +test_expect_success '--geometric with no packs' ' + git init geometric && + test_when_finished "rm -fr geometric" && + ( + cd geometric && + + git repack --geometric 2 >out && + test_i18ngrep "Nothing new to pack" out + ) +' + +test_expect_success '--geometric with an intact progression' ' + git init geometric && + test_when_finished "rm -fr geometric" && + ( + cd geometric && + + # These packs already form a geometric progression. + test_commit_bulk --start=1 1 && # 3 objects + test_commit_bulk --start=2 2 && # 6 objects + test_commit_bulk --start=4 4 && # 12 objects + + find $objdir/pack -name "*.pack" | sort >expect && + git repack --geometric 2 -d && + find $objdir/pack -name "*.pack" | sort >actual && + + test_cmp expect actual + ) +' + +test_expect_success '--geometric with small-pack rollup' ' + git init geometric && + test_when_finished "rm -fr geometric" && + ( + cd geometric && + + test_commit_bulk --start=1 1 && # 3 objects + test_commit_bulk --start=2 1 && # 3 objects + find $objdir/pack -name "*.pack" | sort >small && + test_commit_bulk --start=3 4 && # 12 objects + test_commit_bulk --start=7 8 && # 24 objects + find $objdir/pack -name "*.pack" | sort >before && + + git repack --geometric 2 -d && + + # Three packs in total; two of the existing large ones, and one + # new one. + find $objdir/pack -name "*.pack" | sort >after && + test_line_count = 3 after && + comm -3 small before | tr -d "\t" >large && + grep -qFf large after + ) +' + +test_expect_success '--geometric with small- and large-pack rollup' ' + git init geometric && + test_when_finished "rm -fr geometric" && + ( + cd geometric && + + # size(small1) + size(small2) > size(medium) / 2 + test_commit_bulk --start=1 1 && # 3 objects + test_commit_bulk --start=2 1 && # 3 objects + test_commit_bulk --start=2 3 && # 7 objects + test_commit_bulk --start=6 9 && # 27 objects && + + find $objdir/pack -name "*.pack" | sort >before && + + git repack --geometric 2 -d && + + find $objdir/pack -name "*.pack" | sort >after && + comm -12 before after >untouched && + + # Two packs in total; the largest pack from before running "git + # repack", and one new one. + test_line_count = 1 untouched && + test_line_count = 2 after + ) +' + +test_expect_success '--geometric ignores kept packs' ' + git init geometric && + test_when_finished "rm -fr geometric" && + ( + cd geometric && + + test_commit kept && # 3 objects + test_commit pack && # 3 objects + + KEPT=$(git pack-objects --revs $objdir/pack/pack <<-EOF + refs/tags/kept + EOF + ) && + PACK=$(git pack-objects --revs $objdir/pack/pack <<-EOF + refs/tags/pack + ^refs/tags/kept + EOF + ) && + + # neither pack contains more than twice the number of objects in + # the other, so they should be combined. but, marking one as + # .kept on disk will "freeze" it, so the pack structure should + # remain unchanged. + touch $objdir/pack/pack-$KEPT.keep && + + find $objdir/pack -name "*.pack" | sort >before && + git repack --geometric 2 -d && + find $objdir/pack -name "*.pack" | sort >after && + + # both packs should still exist + test_path_is_file $objdir/pack/pack-$KEPT.pack && + test_path_is_file $objdir/pack/pack-$PACK.pack && + + # and no new packs should be created + test_cmp before after && + + # Passing --pack-kept-objects causes packs with a .keep file to + # be repacked, too. + git repack --geometric 2 -d --pack-kept-objects && + + find $objdir/pack -name "*.pack" >after && + test_line_count = 1 after + ) +' + +test_done