Message ID | pull.847.git.1611596533.gitgitgadget@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | Sparse Index | expand |
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > This RFC proposes an update to the index formats to allow "sparse directory > entries". These entries correspond to directories that are completely > excluded from the sparse checkout definition. We can detect that a directory > is excluded when using "cone mode" patterns. Yay. > Since having directory entries is a radical departure from the existing > index format, a new extension "extensions.sparseIndex" is added. Using a > sparse index should cause incompatible tools to fail because they do not > understand this extension. Safety is good, but because the index is purely a local matter, we do not have to be so careful as updating the network protocols or pack/object formats. I think the use of "extensions.*" mechanism to render the repository that uses the new feature unusable by older Git is safe enough, but it may be too draconian. For example, when things go wrong, don't you want to "fetch"/"clone" from it into another repository to first save the objects and refs? You do not need a version of the index file you understand in order to do that. The index format has a mechanism to make older versions of Git bail when it encounters a file that uses newer feature that they do not understand. Perhaps using it is sufficient instead?
On 1/25/21 3:10 PM, Junio C Hamano wrote: > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > >> This RFC proposes an update to the index formats to allow "sparse directory >> entries". These entries correspond to directories that are completely >> excluded from the sparse checkout definition. We can detect that a directory >> is excluded when using "cone mode" patterns. > > Yay. > >> Since having directory entries is a radical departure from the existing >> index format, a new extension "extensions.sparseIndex" is added. Using a >> sparse index should cause incompatible tools to fail because they do not >> understand this extension. > > Safety is good, but because the index is purely a local matter, we > do not have to be so careful as updating the network protocols or > pack/object formats. > > I think the use of "extensions.*" mechanism to render the repository > that uses the new feature unusable by older Git is safe enough, but > it may be too draconian. For example, when things go wrong, don't > you want to "fetch"/"clone" from it into another repository to first > save the objects and refs? You do not need a version of the index > file you understand in order to do that. > > The index format has a mechanism to make older versions of Git bail > when it encounters a file that uses newer feature that they do not > understand. Perhaps using it is sufficient instead? There are interesting subtleties with the differences between index formats 2, 3, and 4 that are worth keeping around. Perhaps the extension could be a mechanism for allowing sparse directories in those versions, but then a future "index version 5" includes sparse directories without the extension. I could spend time working on such an index v5 in parallel with the updates to Git commands to make them sparse aware. The logic from patches 1-14 in this series will be required before that could begin. Thanks, -Stolee
On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > > This is based on ds/more-index-cleanups also available as GitGitGadget PR > #839. > > The sparse checkout feature allows users to specify a "populated set" that > is smaller than the full list of files at HEAD. Files outside the sparse > checkout definition are not present in the working directory, but are still > present in the index (and marked with the CE_SKIP_WORKTREE bit). > > This means that the working directory has size O(populated), and commands > like "git status" or "git checkout" operate using an O(populated) number of > filesystem operations. However, parsing the index still operates on the > scale of O(HEAD). > > This can be particularly jarring if you are merging a small repository with > a large monorepo for the purpose of simplifying dependency management. Even > if users have nothing more in their working directory than they had before, > they suddenly see a significant increase in their "git status" or "git add" > times. In these cases, simply parsing the index can be a huge portion of the > command time. > > This RFC proposes an update to the index formats to allow "sparse directory > entries". These entries correspond to directories that are completely > excluded from the sparse checkout definition. We can detect that a directory > is excluded when using "cone mode" patterns. > > Since having directory entries is a radical departure from the existing > index format, a new extension "extensions.sparseIndex" is added. Using a > sparse index should cause incompatible tools to fail because they do not > understand this extension. > > The index is a critical data structure, so making such a drastic change must > be handled carefully. This RFC does only enough adjustments to demonstrate > performance improvements for "git status" and "git add." Other commands > should operate identically to before, since the other commands will expand a > sparse index into a full index by parsing trees. > > WARNING: I'm getting a failure on the FreeBSD build with my sparse-checkout > tests. I'm not sure what is causing these failures, but I will explore while > we discuss the possibility of the feature as a whole. > > Here is an outline for this RFC: > > * Patches 1-14: create and test the sparse index format. This is just > enough to start writing the format, but all Git commands become slower > for using it. This is because everything is guarded to expand to a full > index before actually operating on the cache entries. > > * Patch 15: This massive patch is actually a bunch of patches squashed > together. I have a branch that adds "ensure_full_index()" guards in each > necessary file along with some commentary about how the index is being > used. This patch is presented here as one big dump because that > commentary isn't particularly interesting if the RFC leads to a very > different approach. > > * Patches 16-27: These changes make enough code "sparse aware" such that > "git status" and "git add" start operating in time O(populated) instead > of O(HEAD). > > Performance numbers are given in patch 27, but repeated somewhat here. The > test environment I use has ~2.1 million paths at HEAD, but only 68,000 > populated paths given the sparse-checkout I'm using. The sparse index has > about 2,000 sparse directory entries. > > 1. Use the full index. The index size is ~186 MB. > 2. Use the sparse index. The index size is ~5.5 MB. > 3. Use a commit where HEAD matches the populated set. The full index size > is ~5.3MB. > > The third benchmark is included as a theoretical optimum for a repository of > the same object database. > > First, a clean 'git status' improves from 3.1s to 240ms. > > Benchmark #1: full index (git status) Time (mean ± σ): 3.167 s ± 0.036 s > [User: 2.006 s, System: 1.078 s] Range (min … max): 3.100 s … 3.208 s 10 > runs > > Benchmark #2: sparse index (git status) Time (mean ± σ): 239.5 ms ± 8.1 ms > [User: 189.4 ms, System: 226.8 ms] Range (min … max): 226.0 ms … 251.9 ms 13 > runs > > Benchmark #3: small tree (git status) Time (mean ± σ): 195.3 ms ± 4.5 ms > [User: 116.5 ms, System: 84.4 ms] Range (min … max): 188.8 ms … 202.8 ms 15 > runs > > The performance numbers for 'git add .' are much closer to optimal: > > Benchmark #1: full index (git add .) Time (mean ± σ): 3.076 s ± 0.022 s > [User: 2.065 s, System: 0.943 s] Range (min … max): 3.044 s … 3.116 s 10 > runs > > Benchmark #2: sparse index (git add .) Time (mean ± σ): 218.0 ms ± 6.6 ms > [User: 195.7 ms, System: 206.6 ms] Range (min … max): 209.8 ms … 228.2 ms 13 > runs > > Benchmark #3: small tree (git add .) Time (mean ± σ): 217.6 ms ± 5.4 ms > [User: 131.9 ms, System: 86.7 ms] Range (min … max): 212.1 ms … 228.4 ms 14 > runs > > I expect that making a sparse index work optimally through the most common > Git commands will take a year of effort. During this process, I expect to > add a lot of testing infrastructure around the sparse-checkout feature, > especially in corner cases. (This RFC focuses on the happy paths of > operating only within the sparse cone, but that will change in the future.) > > If this general approach is acceptable, then I would follow it with a > sequence of patch submissions that follow this approach: > > 1. Basics of the format. (Patches 1-14) > 2. Add additional guards around index interactions (Patch 15, but split > appropriately.) > 3. Speed up "git status" and "git add" (Patches 16-27) > > After those three items that are represented in this RFC, the work starts to > parallelize a bit. My basic ideas for moving forward from this point are to > do these basic steps: > > * Add new index API abstractions where appropriate, make them sparse-aware. > * Add new tests around sparse-checkout corner cases. Ensure the sparse > index works correctly. > * For a given builtin, add extra testing for sparse-checkouts then it them > sparse-aware. > > Here are some specific questions I'm hoping to answer in this RFC period: > > 1. Are these sparse directory entries an appropriate way to extend the > index format? > 2. Is extensions.sparseIndex a good way to signal that these entries can > exist? > 3. Is git sparse-checkout init --cone --sparse-index an appropriate way to > toggle the format? > 4. Are there specific areas that I should target to harden the index API > before I submit this work? > 5. Does anyone have a good idea how to test a large portion of the test > suite with sparse-index enabled? The problem I see is that most tests > don't use sparse-checkout, so the sparse index is identical to the full > index. Would it be interesting to enable the test setup to add such > "extra" directories during the test setup? > > Thanks, -Stolee Thanks for working on this. It's very exciting seeing this idea come alive. I had lots of little nitpicks and questions and whatnot here and there, but that almost feels like a diversion. Overall, I think you divided up the series in a very logical and easy to follow fashion, and actually achieved quite a bit already. I suspect I have partially answered some of your questions above among all my comments, and left others unanswered or worse, just re-asked the same question(s) myself. Feel free to ping again with the next round and I'll see if I can dodge your questions again...er, I mean, try to think of something helpful to say. :-) > > Derrick Stolee (27): > sparse-index: add guard to ensure full index > sparse-index: implement ensure_full_index() > t1092: compare sparse-checkout to sparse-index > test-read-cache: print cache entries with --table > test-tool: read-cache --table --no-stat > test-tool: don't force full index > unpack-trees: ensure full index > sparse-checkout: hold pattern list in index > sparse-index: convert from full to sparse > submodule: sparse-index should not collapse links > unpack-trees: allow sparse directories > sparse-index: check index conversion happens > sparse-index: create extension for compatibility > sparse-checkout: toggle sparse index from builtin > [RFC-VERSION] *: ensure full index > unpack-trees: make sparse aware > dir.c: accept a directory as part of cone-mode patterns > status: use sparse-index throughout > status: skip sparse-checkout percentage with sparse-index > sparse-index: expand_to_path() trivial implementation > sparse-index: expand_to_path no-op if path exists > add: allow operating on a sparse-only index > submodule: die_path_inside_submodule is sparse aware > dir: use expand_to_path in add_patterns() > fsmonitor: disable if index is sparse > pathspec: stop calling ensure_full_index > cache-tree: integrate with sparse directory entries > > Documentation/config/extensions.txt | 7 + > Documentation/git-sparse-checkout.txt | 14 + > Makefile | 1 + > apply.c | 10 +- > blame.c | 7 +- > builtin/add.c | 3 + > builtin/checkout-index.c | 5 +- > builtin/commit.c | 3 + > builtin/grep.c | 2 + > builtin/ls-files.c | 9 +- > builtin/merge-index.c | 2 + > builtin/mv.c | 2 + > builtin/rm.c | 2 + > builtin/sparse-checkout.c | 35 ++- > builtin/update-index.c | 2 + > cache-tree.c | 21 ++ > cache.h | 15 +- > diff.c | 2 + > dir.c | 19 +- > dir.h | 2 +- > entry.c | 2 + > fsmonitor.c | 18 +- > merge-recursive.c | 22 +- > name-hash.c | 10 + > pathspec.c | 4 +- > pathspec.h | 4 +- > preload-index.c | 2 + > read-cache.c | 51 +++- > repo-settings.c | 15 + > repository.c | 11 +- > repository.h | 3 + > rerere.c | 2 + > resolve-undo.c | 6 + > setup.c | 3 + > sha1-name.c | 3 + > sparse-index.c | 360 +++++++++++++++++++++++ > sparse-index.h | 23 ++ > split-index.c | 2 + > submodule.c | 22 +- > submodule.h | 6 +- > t/helper/test-read-cache.c | 77 ++++- > t/t1092-sparse-checkout-compatibility.sh | 130 +++++++- > tree.c | 2 + > unpack-trees.c | 40 ++- > wt-status.c | 21 +- > wt-status.h | 1 + > 46 files changed, 942 insertions(+), 61 deletions(-) > create mode 100644 sparse-index.c > create mode 100644 sparse-index.h > > > base-commit: 2271fe7848aa11b30e5313d95d9caebc2937fce5 > Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-847%2Fderrickstolee%2Fsparse-index%2Frfc-v1 > Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-847/derrickstolee/sparse-index/rfc-v1 > Pull-Request: https://github.com/gitgitgadget/git/pull/847 > -- > gitgitgadget