From patchwork Wed Apr 24 15:14:17 2019
X-Patchwork-Submitter: Derrick Stolee
X-Patchwork-Id: 10915013
From: Derrick Stolee
To: git@vger.kernel.org
Cc: peff@peff.net, jrnieder@gmail.com, avarab@gmail.com, gitster@pobox.com, Derrick Stolee
Subject: [PATCH v5 00/11] Create 'expire' and 'repack' verbs for git-multi-pack-index
Date: Wed, 24 Apr 2019 11:14:17 -0400
Message-Id: <20190424151428.170316-1-dstolee@microsoft.com>

The multi-pack-index provides a fast way to find an object among a large
list of pack-files. It stores a single pack-reference for each object id,
so duplicate objects are ignored. Among a list of pack-files storing the
same object, the most-recently modified one is used.

Create new subcommands for the multi-pack-index builtin.

* 'git multi-pack-index expire': If we have a pack-file indexed by the
  multi-pack-index, but all objects in that pack are duplicated in
  more-recently modified packs, then delete that pack (and any others
  like it). Delete the reference to that pack in the multi-pack-index.

* 'git multi-pack-index repack --batch-size=<size>': Starting from the
  oldest pack-files covered by the multi-pack-index, find those whose
  "expected size" is below the batch size until we have a collection of
  packs whose expected sizes add up to at least the batch size. We
  compute the expected size by multiplying the number of referenced
  objects by the pack-size and dividing by the total number of objects
  in the pack. If the batch-size is zero, then select all packs. Create
  a new pack containing all objects that the multi-pack-index references
  in those packs.

This allows us to create a new pattern for repacking objects: run
'repack', then, after enough time has passed that all Git commands that
started before the last 'repack' are finished, run 'expire'.

This approach has some advantages over the existing "repack everything"
model:

1. Incremental. We can repack a small batch of objects at a time, instead
   of repacking all reachable objects. We can also limit ourselves to the
   objects that do not appear in newer pack-files.

2. Highly Available. By adding a new pack-file (and not deleting the old
   pack-files) we do not interrupt concurrent Git commands, and do not
   suffer performance degradation.
   By expiring only pack-files that have no referenced objects, we know
   that Git commands that are doing normal object lookups* will not be
   interrupted.

* Note: if someone concurrently runs a Git command that uses
* get_all_packs(), then that command could try to read the pack-files
* and pack-indexes that we are deleting during an expire command. Such
* commands are usually related to object maintenance (e.g. fsck, gc,
* pack-objects) or are related to less-often-used features (e.g.
* fast-import, http-backend, server-info).

We **are using** this approach in VFS for Git to do background
maintenance of the "shared object cache", which is a Git alternate
directory filled with packfiles containing commits and trees. We
currently download pack-files on an hourly basis to keep up-to-date with
the central server. The cache servers supply packs on an hourly and daily
basis, so most of the hourly packs become useless after a new daily pack
is downloaded. The 'expire' command would clear out most of those packs,
but many will remain with fewer than 100 referenced objects. The 'repack'
command (with a batch size of 1-3 GB, probably) can condense the
remaining packs in commands that run for 1-3 minutes at a time. Since the
daily packs range from 100 to 250 MB, we will also combine and condense
those packs.

Updates in V5:

* Fixed the error in PATCH 7 due to a missing line that existed in
  PATCH 8. Thanks, Josh Steadmon!

* The 'repack' subcommand now computes the "expected size" of a pack
  instead of relying on the total size of the pack. This matters for the
  way VFS for Git uses prefetch packs: some packs are not being repacked
  because the pack size is larger than the batch size, even though only
  a few of their objects are referenced.

* The 'repack' subcommand now allows a batch size of zero to mean "create
  one pack containing all objects in the multi-pack-index".
  A new commit adds a test that hits the boundary cases here; it comes
  after the 'expire' subcommand so that we can show the
  repack-then-expire cycle safely replacing the packs.

Junio: It appears that there are some conflicts with the trace2 changes
in master. These are not new to the updates in this version. I saw how
you resolved these conflicts, and replaying that resolution should work
for you.

Thanks,
-Stolee

Derrick Stolee (11):
  repack: refactor pack deletion for future use
  Docs: rearrange subcommands for multi-pack-index
  multi-pack-index: prepare for 'expire' subcommand
  midx: simplify computation of pack name lengths
  midx: refactor permutation logic and pack sorting
  multi-pack-index: implement 'expire' subcommand
  multi-pack-index: prepare 'repack' subcommand
  midx: implement midx_repack()
  multi-pack-index: test expire while adding packs
  midx: add test that 'expire' respects .keep files
  t5319-multi-pack-index.sh: test batch size zero

 Documentation/git-multi-pack-index.txt |  32 +-
 builtin/multi-pack-index.c             |  14 +-
 builtin/repack.c                       |  14 +-
 midx.c                                 | 440 +++++++++++++++++++------
 midx.h                                 |   2 +
 packfile.c                             |  28 ++
 packfile.h                             |   7 +
 t/t5319-multi-pack-index.sh            | 184 +++++++++++
 8 files changed, 602 insertions(+), 119 deletions(-)

base-commit: 26aa9fc81d4c7f6c3b456a29da0b7ec72e5c6595
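
For readers who want to see the selection rules from the cover letter in
one place, here is a minimal Python sketch of the "expected size"
computation, the batch selection for 'repack', and the candidate test for
'expire'. It is an illustration only, not the code in midx.c: the Pack
record, its field names, and the function names are all hypothetical, and
details such as .keep handling are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Pack:
    """Hypothetical summary of one pack-file covered by the multi-pack-index."""
    name: str
    mtime: int               # modification time; smaller means older
    size_bytes: int          # on-disk size of the pack-file
    total_objects: int       # number of objects stored in the pack
    referenced_objects: int  # objects the multi-pack-index points to in this pack


def expected_size(pack):
    """Referenced objects times pack size, divided by total objects."""
    if pack.total_objects == 0:
        return 0
    return pack.referenced_objects * pack.size_bytes // pack.total_objects


def select_batch(packs, batch_size):
    """Choose packs for 'repack', walking from oldest to newest."""
    ordered = sorted(packs, key=lambda p: p.mtime)
    if batch_size == 0:
        # A batch size of zero means "select all packs".
        return ordered
    selected = []
    total = 0
    for pack in ordered:
        if total >= batch_size:
            break  # the batch already adds up to the batch size
        size = expected_size(pack)
        if size >= batch_size:
            continue  # only packs whose expected size is below the batch size
        selected.append(pack)
        total += size
    return selected


def expirable(packs):
    """Packs with no referenced objects are candidates for 'expire'."""
    return [p for p in packs if p.referenced_objects == 0]
```

With this model, a large pack that has only a few referenced objects has
a small expected size and can still be picked up by 'repack', which is
the V5 change described above.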