From: "Derrick Stolee via GitGitGadget"
Date: Wed, 10 Mar 2021 19:30:43 +0000
Subject: [PATCH v2 00/20] Sparse Index: Design, Format, Tests
To: git@vger.kernel.org
Cc: newren@gmail.com, gitster@pobox.com, pclouds@gmail.com, jrnieder@gmail.com, Martin Ågren, Derrick Stolee, SZEDER Gábor, Derrick Stolee

Here is the first full patch series submission coming out of the
sparse-index RFC [1].

[1] https://lore.kernel.org/git/pull.847.git.1611596533.gitgitgadget@gmail.com/

I won't waste too much space here, because PATCH 1 includes a sizeable
design document that describes the feature, the reasoning behind it, and my
plan for getting this implemented widely throughout the codebase.

There are some new things here that were not in the RFC:

 * Design doc and format updates. (Patch 1)
 * Performance test script. (Patches 2 and 20)

Notably missing in this series from the RFC:

 * The mega-patch inserting ensure_full_index() throughout the codebase.
   That will be a follow-up series to this one.
 * The integrations with git status and git add to demonstrate the improved
   performance. Those will also appear in their own series later.

I plan to keep my latest work in this area in my 'sparse-index/wip' branch
[2]. It includes all of the work from the RFC right now, updated with the
work from this series.

[2] https://github.com/derrickstolee/git/tree/sparse-index/wip


Updates in V2
=============

 * Various typos and awkward grammar are fixed.
 * Cleaned up unnecessary commands in p2000-sparse-operations.sh
 * Added a comment to the sparse_index member of struct index_state.
 * Used tree_type, commit_type, and blob_type in test-read-cache.c.

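To illustrate the last item: the --table output of test-read-cache labels
each index entry with an object type name. The following is a minimal
standalone sketch of that classification, not the code from the series.
The DEMO_ constants and demo_ functions are hypothetical stand-ins for
Git's S_ISSPARSEDIR() and S_ISGITLINK() checks and for the tree_type,
commit_type, and blob_type strings, and the sketch assumes that a sparse
directory entry carries the plain directory mode 040000.

```c
#include <stdio.h>

/*
 * Hypothetical stand-ins for Git's mode constants. The DEMO_ names are
 * illustrative only and do not appear in the series.
 */
#define DEMO_IFMT      0170000u
#define DEMO_IFDIR     0040000u  /* assumed mode of a sparse directory entry */
#define DEMO_IFGITLINK 0160000u  /* submodule (gitlink) entry */

/* Return the object type name that an index entry of this mode refers to. */
const char *demo_entry_type(unsigned int mode)
{
	if (mode == DEMO_IFDIR)
		return "tree";
	if ((mode & DEMO_IFMT) == DEMO_IFGITLINK)
		return "commit";
	return "blob";
}

/* Format one table row as "mode type oid<TAB>name" into buf. */
int demo_format_entry(char *buf, size_t len, unsigned int mode,
		      const char *oid_hex, const char *name)
{
	return snprintf(buf, len, "%06o %s %s\t%s",
			mode & 0177777u, demo_entry_type(mode),
			oid_hex, name);
}
```

With these helpers, an entry of mode 040000 is reported as a tree, an entry
of mode 0160000 as a commit, and a regular file such as 0100644 as a blob.
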
Thanks,
-Stolee

Derrick Stolee (20):
  sparse-index: design doc and format update
  t/perf: add performance test for sparse operations
  t1092: clean up script quoting
  sparse-index: add guard to ensure full index
  sparse-index: implement ensure_full_index()
  t1092: compare sparse-checkout to sparse-index
  test-read-cache: print cache entries with --table
  test-tool: don't force full index
  unpack-trees: ensure full index
  sparse-checkout: hold pattern list in index
  sparse-index: convert from full to sparse
  submodule: sparse-index should not collapse links
  unpack-trees: allow sparse directories
  sparse-index: check index conversion happens
  sparse-index: create extension for compatibility
  sparse-checkout: toggle sparse index from builtin
  sparse-checkout: disable sparse-index
  cache-tree: integrate with sparse directory entries
  sparse-index: loose integration with cache_tree_verify()
  p2000: add sparse-index repos

 Documentation/config/extensions.txt      |   8 +
 Documentation/git-sparse-checkout.txt    |  14 ++
 Documentation/technical/index-format.txt |   7 +
 Documentation/technical/sparse-index.txt | 173 ++++++++++++++
 Makefile                                 |   1 +
 builtin/sparse-checkout.c                |  44 +++-
 cache-tree.c                             |  40 ++++
 cache.h                                  |  18 +-
 read-cache.c                             |  35 ++-
 repo-settings.c                          |  15 ++
 repository.c                             |  11 +-
 repository.h                             |   3 +
 setup.c                                  |   3 +
 sparse-index.c                           | 290 +++++++++++++++++++++++
 sparse-index.h                           |  11 +
 t/README                                 |   3 +
 t/helper/test-read-cache.c               |  66 +++++-
 t/perf/p2000-sparse-operations.sh        | 102 ++++++++
 t/t1091-sparse-checkout-builtin.sh       |  13 +
 t/t1092-sparse-checkout-compatibility.sh | 136 +++++++++--
 unpack-trees.c                           |  16 +-
 21 files changed, 969 insertions(+), 40 deletions(-)
 create mode 100644 Documentation/technical/sparse-index.txt
 create mode 100644 sparse-index.c
 create mode 100644 sparse-index.h
 create mode 100755 t/perf/p2000-sparse-operations.sh


base-commit: 966e671106b2fd38301e7c344c754fd118d0bb07
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-883%2Fderrickstolee%2Fsparse-index%2Fformat-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-883/derrickstolee/sparse-index/format-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/883

Range-diff vs v1:

  1:  daa9a6bcefbc !  1:  2fe413fdac80 sparse-index: design doc and format update
     @@ Documentation/technical/sparse-index.txt (new)
      +If we need to discover the details for paths within that directory, we
      +can parse trees to find that list.
      +
     -+This addition of sparse-directory entries violates expectations about the
     ++At time of writing, sparse-directory entries violate expectations about the
      +index format and its in-memory data structure. There are many consumers in
      +the codebase that expect to iterate through all of the index entries and
      +see only files. In addition, they expect to see all files at `HEAD`. One
     @@ Documentation/technical/sparse-index.txt (new)
      +* `git merge`
      +* `git rebase`
      +
     ++Hopefully, commands such as `git merge` and `git rebase` can benefit
     ++instead from merge algorithms that do not use the index as a data
     ++structure, such as the merge-ORT strategy. As these topics mature, we
     ++may enalbe the ORT strategy by default for repositories using the
     ++sparse-index feature.
     ++
      +Along with `git status` and `git add`, these commands cover the majority
      +of users' interactions with the working directory. In addition, we can
      +integrate with these commands:
  2:  a8c6322a3dbe !  2:  540ab5495065 t/perf: add performance test for sparse operations
     @@ t/perf/p2000-sparse-operations.sh (new)
      +	# Remove submodules from the example repo, because our
      +	# duplication of the entire repo creates an unlikly data shape.
      +	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
     -+	rm -f .gitmodules &&
     -+	git add .gitmodules &&
     ++	git rm -f .gitmodules &&
      +	for module in $(awk "{print \$2}" modules)
      +	do
      +		git rm $module || return 1
      +	done &&
     -+	git add . &&
      +	git commit -m "remove submodules" &&
      +
      +	echo bogus >a &&
  3:  6e783c88821e =  3:  5cbedb377b37 t1092: clean up script quoting
  4:  01da4c48a1fa =  4:  6e21f776e883 sparse-index: add guard to ensure full index
  5:  2b83989fbcd3 !  5:  399ddb0bad56 sparse-index: implement ensure_full_index()
     @@ cache.h: struct index_state {
       		 updated_skipworktree : 1,
      -		 fsmonitor_has_run_once : 1;
      +		 fsmonitor_has_run_once : 1,
     ++
     ++		 /*
     ++		  * sparse_index == 1 when sparse-directory
     ++		  * entries exist. Requires sparse-checkout
     ++		  * in cone mode.
     ++		  */
      +		 sparse_index : 1;
       	struct hashmap name_hash;
       	struct hashmap dir_hash;
  6:  c9910a37579c =  6:  eac2db5efc22 t1092: compare sparse-checkout to sparse-index
  7:  3d92df7a0cf9 !  7:  e9c82d2eda82 test-read-cache: print cache entries with --table
     @@ Commit message

       ## t/helper/test-read-cache.c ##
      @@
     + #include "test-tool.h"
       #include "cache.h"
       #include "config.h"
     -
     ++#include "blob.h"
     ++#include "commit.h"
     ++#include "tree.h"
     ++
      +static void print_cache_entry(struct cache_entry *ce)
      +{
     -+	printf("%06o ", ce->ce_mode & 0777777);
     ++	const char *type;
     ++	printf("%06o ", ce->ce_mode & 0177777);
      +
      +	if (S_ISSPARSEDIR(ce->ce_mode))
     -+		printf("tree ");
     ++		type = tree_type;
      +	else if (S_ISGITLINK(ce->ce_mode))
     -+		printf("commit ");
     ++		type = commit_type;
      +	else
     -+		printf("blob ");
     ++		type = blob_type;
      +
     -+	printf("%s\t%s\n",
     ++	printf("%s %s\t%s\n",
     ++	       type,
      +	       oid_to_hex(&ce->oid),
      +	       ce->name);
      +}
      +
     -+static void print_cache(struct index_state *cache)
     ++static void print_cache(struct index_state *istate)
      +{
      +	int i;
     -+	for (i = 0; i < the_index.cache_nr; i++)
     -+		print_cache_entry(the_index.cache[i]);
     ++	for (i = 0; i < istate->cache_nr; i++)
     ++		print_cache_entry(istate->cache[i]);
      +}
     -+
      +
       int cmd__read_cache(int argc, const char **argv)
       {
       	struct repository *r = the_repository;
  8:  94373e2bfbbc !  8:  243541fc5820 test-tool: don't force full index
     @@ Commit message

       ## t/helper/test-read-cache.c ##
      @@
     - #include "test-tool.h"
     - #include "cache.h"
     - #include "config.h"
     + #include "blob.h"
     + #include "commit.h"
     + #include "tree.h"
      +#include "sparse-index.h"

       static void print_cache_entry(struct cache_entry *ce)
  9:  e71f033c2871 =  9:  48f65093b3da unpack-trees: ensure full index
 10:  f86d3dc154d1 ! 10:  83aac8b7a1ec sparse-checkout: hold pattern list in index
     @@ Commit message
          pattern set, we need access to that in-memory copy. Place a pointer to
          a 'struct pattern_list' in the index so we can access this on-demand.
          This will be used in the next change which uses the sparse-checkout
     -    definition to filter out directories that are outsie the sparse cone.
     +    definition to filter out directories that are outside the sparse cone.

          Signed-off-by: Derrick Stolee
 11:  a2d77c23a0cb ! 11:  f6db0c27a285 sparse-index: convert from full to sparse
     @@ read-cache.c: int verify_path(const char *path, unsigned mode)
       				return 0;
      +			/*
      +			 * allow terminating directory separators for
     -+			 * sparse directory enries.
     ++			 * sparse directory entries.
      +			 */
      +			if (c == '\0')
      +				return S_ISDIR(mode);
     @@ sparse-index.c
      +		struct cache_entry *ce = istate->cache[i];
      +
      +		/*
     -+		 * Detect if this is a normal entry oustide of any subtree
     ++		 * Detect if this is a normal entry outside of any subtree
      +		 * entry.
      +		 */
      +		base = ce->name + ct_pathlen;
 12:  4405a9115c3b = 12:  f2a3e7298798 submodule: sparse-index should not collapse links
 13:  fda23f07e6a2 ! 13:  6f1ebe6ccc08 unpack-trees: allow sparse directories
     @@ Commit message
          is possible to have a directory in a sparse index as long as that entry
          is itself marked with the skip-worktree bit.

     -    The negation of the 'pos' variable must be conditioned to only when it
     -    starts as negative. This is identical behavior as before when the index
     -    is full.
     +    The 'pos' variable is assigned a negative value if an exact match is not
     +    found. Since a directory name can be an exact match, it is no longer an
     +    error to have a nonnegative 'pos' value.

          Signed-off-by: Derrick Stolee
 14:  7d4627574bb8 = 14:  3fa684b315fb sparse-index: check index conversion happens
 15:  564503f78784 ! 15:  d74576d677f6 sparse-index: create extension for compatibility
     @@ Commit message

          We _could_ add a new index version that explicitly adds these
          capabilities, but there are nuances to index formats 2, 3, and 4 that
     -    are still valuable to select as options. For now, create a repo
     -    extension, "extensions.sparseIndex", that specifies that the tool
     -    reading this repository must understand sparse directory entries.
     +    are still valuable to select as options. Until we add index format
     +    version 5, create a repo extension, "extensions.sparseIndex", that
     +    specifies that the tool reading this repository must understand sparse
     +    directory entries.

          This change only encodes the extension and enables it when
          GIT_TEST_SPARSE_INDEX=1. Later, we will add a more user-friendly CLI
     @@ Documentation/config/extensions.txt: extensions.objectFormat::
      +	When combined with `core.sparseCheckout=true` and
      +	`core.sparseCheckoutCone=true`, the index may contain entries
      +	corresponding to directories outside of the sparse-checkout
     -+	definition. Versions of Git that do not understand this extension
     -+	do not expect directory entries in the index.
     ++	definition in lieu of containing each path under such directories.
     ++	Versions of Git that do not understand this extension do not
     ++	expect directory entries in the index.

       ## cache.h ##
      @@ cache.h: struct repository_format {
 16:  6d6b230e3318 ! 16:  e530ca5f668d sparse-checkout: toggle sparse index from builtin
     @@ Documentation/git-sparse-checkout.txt: To avoid interfering with other worktrees
      +a sparse index until they are properly integrated with the feature.
      ++
      +**WARNING:** Using a sparse index requires modifying the index in a way
     -+that is not completely understood by other tools. Enabling sparse index
     -+enables the `extensions.spareseIndex` config value, which might cause
     -+other tools to stop working with your repository. If you have trouble with
     -+this compatibility, then run `git sparse-checkout sparse-index disable` to
     -+remove this config and rewrite your index to not be sparse.
     ++that is not completely understood by external tools. If you have trouble
     ++with this compatibility, then run `git sparse-checkout sparse-index disable`
     ++to rewrite your index to not be sparse. Older versions of Git will not
     ++understand the `sparseIndex` repository extension and may fail to interact
     ++with your repository until it is disabled.

       'set'::
       	Write a set of patterns to the sparse-checkout file, as given as
 17:  bcf960ef2362 = 17:  42d0da9c5def sparse-checkout: disable sparse-index
 18:  e6afec58674e = 18:  6bb0976a6295 cache-tree: integrate with sparse directory entries
 19:  2be4981fe698 = 19:  07f34e80609a sparse-index: loose integration with cache_tree_verify()
 20:  a738b0ba8ab4 = 20:  41e3b56b9c17 p2000: add sparse-index repos

Reviewed-by: Elijah Newren