mbox series

[v2,0/8] Sparse-index: integrate with status

Message ID pull.932.v2.git.1619213665.gitgitgadget@gmail.com (mailing list archive)
Headers show
Series Sparse-index: integrate with status | expand

Message

Philippe Blain via GitGitGadget April 23, 2021, 9:34 p.m. UTC
This is the first "payoff" series in the sparse-index work. It makes 'git
status' very fast when a sparse-index is enabled on a repository with
cone-mode sparse-checkout (and a small populated set).

This is based on ds/sparse-index-protections AND mt/add-rm-sparse-checkout.
The latter branch is needed because it changes the behavior of 'git add'
around sparse entries, which changes the expectations of a test added in
patch 1.

The approach here is to audit the places where ensure_full_index() pops up
while doing normal commands with pathspecs within the sparse-checkout
definition. Each of these are checked and tested. In the end, the
sparse-index is integrated with these features:

 * git status
 * FS Monitor index extension.

The performance tests in p2000-sparse-operations.sh improve by 95% or more,
even when compared with the full-index cases, not just the sparse-index
cases that previously had extra overhead.

Hopefully this is the first example of how ds/sparse-index-protections has
done the basic work to do these conversions safely, making them look easier
than they seemed when starting this adventure.

Thanks, -Stolee


Updates in V2
=============

 * Based on the feedback, it is clear that 'git add' will require much more
   careful testing and thought. I'm splitting it out of this series and it
   will return with a follow-up.
 * Test cases are improved, both in coverage and organization.
 * The previous "unpack-trees: make sparse aware" patch is split into three
   now.
 * Stale messages based on an old implementation of the "protections" topic
   are now fixed.
 * Performance tests were re-run.

Derrick Stolee (8):
  t1092: add tests for status/add and sparse files
  unpack-trees: preserve cache_bottom
  unpack-trees: compare sparse directories correctly
  unpack-trees: stop recursing into sparse directories
  dir.c: accept a directory as part of cone-mode patterns
  status: skip sparse-checkout percentage with sparse-index
  status: use sparse-index throughout
  fsmonitor: test with sparse index

 builtin/commit.c                         |  3 ++
 dir.c                                    | 11 +++++
 read-cache.c                             | 10 +++-
 t/t1092-sparse-checkout-compatibility.sh | 61 ++++++++++++++++++++++--
 t/t7519-status-fsmonitor.sh              | 48 +++++++++++++++++++
 unpack-trees.c                           | 25 ++++++++--
 wt-status.c                              | 14 ++++--
 wt-status.h                              |  1 +
 8 files changed, 161 insertions(+), 12 deletions(-)


base-commit: f723f370c89ad61f4f40aabfd3540b1ce19c00e5
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-932%2Fderrickstolee%2Fsparse-index%2Fstatus-and-add-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-932/derrickstolee/sparse-index/status-and-add-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/932

Range-diff vs v1:

  1:  b2cb5401eff8 !  1:  3bac9edae7d8 t1092: add tests for status/add and sparse files
     @@ Commit message
          Untracked files are fine: adding new files both with 'git add .' and
          'git add folder1/' works just as in a full checkout. This may not be
          entirely desirable, but we are not intending to change behavior at the
     -    moment, only document it.
     +    moment, only document it. A future change could alter the behavior to
     +    be more sensible, and this test could be modified to satisfy the new
     +    expected behavior.
      
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
     @@ t/t1092-sparse-checkout-compatibility.sh: test_expect_success 'add, commit, chec
      +	# This "git add folder1/a" is completely ignored
      +	# by the sparse-checkout repos. It causes the
      +	# full repo to have a different staged environment.
     -+	test_must_fail git -C sparse-checkout add folder1/a &&
     -+	test_must_fail git -C sparse-index add folder1/a &&
     ++	#
     ++	# This is not a desirable behavior, but this test
     ++	# ensures that the sparse-index is not the cause
     ++	# of a behavior change.
     ++	test_sparse_match test_must_fail git add folder1/a &&
     ++	test_sparse_match test_must_fail git add --refresh folder1/a &&
      +	git -C full-checkout checkout HEAD -- folder1/a &&
     -+	test_sparse_match git status --porcelain=v2 &&
     ++	test_all_match git status --porcelain=v2 &&
      +
      +	test_all_match git add . &&
      +	test_all_match git status --porcelain=v2 &&
  -:  ------------ >  2:  19344394379d unpack-trees: preserve cache_bottom
  -:  ------------ >  3:  24e71d8c0622 unpack-trees: compare sparse directories correctly
  2:  0a3892d2ec9e !  4:  d3c8948d0a33 unpack-trees: make sparse aware
     @@ Metadata
      Author: Derrick Stolee <dstolee@microsoft.com>
      
       ## Commit message ##
     -    unpack-trees: make sparse aware
     +    unpack-trees: stop recursing into sparse directories
      
     -    As a first step to integrate 'git status' and 'git add' with the sparse
     -    index, we must start integrating unpack_trees() with sparse directory
     -    entries. These changes are currently impossible to trigger because
     -    unpack_trees() calls ensure_full_index() if command_requires_full_index
     -    is true. This is the case for all commands at the moment. As we expand
     -    more commands to be sparse-aware, we might find that more changes are
     -    required to unpack_trees(). The current changes will suffice for
     -    'status' and 'add'.
     +    When walking trees using traverse_trees_recursive() and
     +    unpack_callback(), we must not attempt to walk into a sparse directory
     +    entry. There are no index entries within that directory to compare to
     +    the tree object at that position, so skip over the entries of that tree.
      
     -    unpack_trees() calls the traverse_trees() API using unpack_callback()
     -    to decide if we should recurse into a subtree. We must add new abilities
     -    to skip a subtree if it corresponds to a sparse directory entry.
     -
     -    It is important to be careful about the trailing directory separator
     -    that exists in the sparse directory entries but not in the subtree
     -    paths.
     +    This code is used in many places, so the only way to test it is to start
     +    removing the command_requres_full_index option from one builtin at a
     +    time and carefully test that its use of unpack_trees() behaves correctly
     +    with a sparse-index. Such tests will be added by later changes.
      
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
     - ## dir.h ##
     -@@ dir.h: static inline int ce_path_match(struct index_state *istate,
     - 				char *seen)
     - {
     - 	return match_pathspec(istate, pathspec, ce->name, ce_namelen(ce), 0, seen,
     --			      S_ISDIR(ce->ce_mode) || S_ISGITLINK(ce->ce_mode));
     -+			      S_ISSPARSEDIR(ce->ce_mode) || S_ISDIR(ce->ce_mode) || S_ISGITLINK(ce->ce_mode));
     - }
     - 
     - static inline int dir_path_match(struct index_state *istate,
     -
     - ## preload-index.c ##
     -@@ preload-index.c: static void *preload_thread(void *_data)
     - 			continue;
     - 		if (S_ISGITLINK(ce->ce_mode))
     - 			continue;
     -+		if (S_ISSPARSEDIR(ce->ce_mode))
     -+			continue;
     - 		if (ce_uptodate(ce))
     - 			continue;
     - 		if (ce_skip_worktree(ce))
     -
     - ## read-cache.c ##
     -@@ read-cache.c: int refresh_index(struct index_state *istate, unsigned int flags,
     - 		if (ignore_skip_worktree && ce_skip_worktree(ce))
     - 			continue;
     - 
     -+		if (istate->sparse_index && S_ISSPARSEDIR(ce->ce_mode))
     -+			continue;
     -+
     - 		if (pathspec && !ce_path_match(istate, ce, pathspec, seen))
     - 			filtered = 1;
     - 
     -
       ## unpack-trees.c ##
     -@@ unpack-trees.c: static void mark_ce_used(struct cache_entry *ce, struct unpack_trees_options *o)
     - {
     - 	ce->ce_flags |= CE_UNPACKED;
     - 
     -+	/*
     -+	 * If this is a sparse directory, don't advance cache_bottom.
     -+	 * That will be advanced later using the cache-tree data.
     -+	 */
     -+	if (S_ISSPARSEDIR(ce->ce_mode))
     -+		return;
     -+
     - 	if (o->cache_bottom < o->src_index->cache_nr &&
     - 	    o->src_index->cache[o->cache_bottom] == ce) {
     - 		int bottom = o->cache_bottom;
     -@@ unpack-trees.c: static int do_compare_entry(const struct cache_entry *ce,
     - 	ce_len -= pathlen;
     - 	ce_name = ce->name + pathlen;
     - 
     -+	/* remove directory separator if a sparse directory entry */
     -+	if (S_ISSPARSEDIR(ce->ce_mode))
     -+		ce_len--;
     - 	return df_name_compare(ce_name, ce_len, S_IFREG, name, namelen, mode);
     - }
     - 
     -@@ unpack-trees.c: static int compare_entry(const struct cache_entry *ce, const struct traverse_inf
     - 	if (cmp)
     - 		return cmp;
     - 
     -+	/* If ce is a sparse directory, then allow equality here. */
     -+	if (S_ISSPARSEDIR(ce->ce_mode))
     -+		return 0;
     -+
     - 	/*
     - 	 * Even if the beginning compared identically, the ce should
     - 	 * compare as bigger than a directory leading up to it!
      @@ unpack-trees.c: static int unpack_callback(int n, unsigned long mask, unsigned long dirmask, str
       	struct cache_entry *src[MAX_UNPACK_TREES + 1] = { NULL, };
       	struct unpack_trees_options *o = info->data;
       	const struct name_entry *p = names;
     -+	unsigned recurse = 1;
     ++	unsigned unpack_tree = 1;
       
       	/* Find first entry with a real name (we could use "mask" too) */
       	while (!p->mode)
     @@ unpack-trees.c: static int unpack_callback(int n, unsigned long mask, unsigned l
       				src[0] = ce;
      +
      +				if (S_ISSPARSEDIR(ce->ce_mode))
     -+					recurse = 0;
     ++					unpack_tree = 0;
       			}
       			break;
       		}
       	}
       
      -	if (unpack_nondirectories(n, mask, dirmask, src, names, info) < 0)
     -+	if (recurse &&
     ++	if (unpack_tree &&
      +	    unpack_nondirectories(n, mask, dirmask, src, names, info) < 0)
       		return -1;
       
     @@ unpack-trees.c: static int unpack_callback(int n, unsigned long mask, unsigned l
       		}
       
      -		if (traverse_trees_recursive(n, dirmask, mask & ~dirmask,
     -+		if (recurse &&
     ++		if (unpack_tree &&
      +		    traverse_trees_recursive(n, dirmask, mask & ~dirmask,
       					     names, info) < 0)
       			return -1;
  3:  28ca717e6526 !  5:  fd96b71968b6 dir.c: accept a directory as part of cone-mode patterns
     @@ dir.c: enum pattern_match_result path_matches_pattern_list(
       	strbuf_addch(&parent_pathname, '/');
       	strbuf_add(&parent_pathname, pathname, pathlen);
       
     -+	/* Directory requests should be added as if they are a file */
     -+	if (parent_pathname.len > 1 &&
     ++	/*
     ++	 * Directory entries are matched if and only if a file
     ++	 * contained immediately within them is matched. For the
     ++	 * case of a directory entry, modify the path to create
     ++	 * a fake filename within this directory, allowing us to
     ++	 * use the file-base matching logic in an equivalent way.
     ++	 */
     ++	if (parent_pathname.len > 0 &&
      +	    parent_pathname.buf[parent_pathname.len - 1] == '/')
      +		strbuf_add(&parent_pathname, "-", 1);
      +
  4:  e86f874dd412 =  6:  1f4ba56e7416 status: skip sparse-checkout percentage with sparse-index
  5:  d7d4cad8be0b !  7:  3d09368c0541 status: use sparse-index throughout
     @@ Commit message
          implementation details are already integrated with sparse-checkout, so
          modify command_requires_full_index to be zero for cmd_status().
      
     -    By running the debugger for 'git status -uno' after that change, we find
     -    two instances of ensure_full_index() that were added for extra safety,
     -    but can be removed without issue.
     +    In refresh_index(), we loop through the index entries to refresh their
     +    stat() information. However, sparse directories have no stat()
     +    information to populate. Ignore these entries.
      
     -    In refresh_index(), we loop through the index entries. The
     -    refresh_cache_ent() method copies the sparse directories into the
     -    refreshed index without issue.
     +    This allows 'git status' to no longer expand a sparse index to a full
     +    one. This is further tested by dropping the "-uno" option and adding an
     +    untracked file into the worktree.
      
     -    The loop within run_diff_files() skips things that are in stage 0 and
     -    have skip-worktree enabled, so seems safe to disable ensure_full_index()
     -    here.
     -
     -    This allows some cases of 'git status' to no longer expand a sparse
     -    index to a full one, giving the following performance improvements for
     -    p2000-sparse-checkout-operations.sh:
     +    The performance test p2000-sparse-checkout-operations.sh demonstrates
     +    these improvements:
      
          Test                                  HEAD~1           HEAD
          -----------------------------------------------------------------------------
     -    2000.2: git status (full-index-v3)    0.38(0.36+0.07)  0.37(0.31+0.10) -2.6%
     -    2000.3: git status (full-index-v4)    0.38(0.29+0.12)  0.37(0.30+0.11) -2.6%
     -    2000.4: git status (sparse-index-v3)  2.43(2.33+0.14)  0.04(0.05+0.04) -98.4%
     -    2000.5: git status (sparse-index-v4)  2.44(2.35+0.13)  0.05(0.04+0.07) -98.0%
     +    2000.2: git status (full-index-v3)    0.31(0.30+0.05)  0.31(0.29+0.06) +0.0%
     +    2000.3: git status (full-index-v4)    0.31(0.29+0.07)  0.34(0.30+0.08) +9.7%
     +    2000.4: git status (sparse-index-v3)  2.35(2.28+0.10)  0.04(0.04+0.05) -98.3%
     +    2000.5: git status (sparse-index-v4)  2.35(2.24+0.15)  0.05(0.04+0.06) -97.9%
      
          Note that since HEAD~1 was expanding the sparse index by parsing trees,
          it was artificially slower than the full index case. Thus, the 98%
     -    improvement is misleading, and instead we should celebrate the 0.37s to
     -    0.05s improvement of 82%. This is more indicative of the peformance
     +    improvement is misleading, and instead we should celebrate the 0.34s to
     +    0.05s improvement of 85%. This is more indicative of the peformance
          gains we are expecting by using a sparse index.
      
          Note: we are dropping the assignment of core.fsmonitor here. This is not
     @@ read-cache.c: int refresh_index(struct index_state *istate, unsigned int flags,
       	trace2_region_enter("index", "refresh", NULL);
      -	/* TODO: audit for interaction with sparse-index. */
      -	ensure_full_index(istate);
     ++
       	for (i = 0; i < istate->cache_nr; i++) {
       		struct cache_entry *ce, *new_entry;
       		int cache_errno = 0;
     +@@ read-cache.c: int refresh_index(struct index_state *istate, unsigned int flags,
     + 		if (ignore_skip_worktree && ce_skip_worktree(ce))
     + 			continue;
     + 
     ++		/*
     ++		 * If this entry is a sparse directory, then there isn't
     ++		 * any stat() information to update. Ignore the entry.
     ++		 */
     ++		if (S_ISSPARSEDIR(ce->ce_mode))
     ++			continue;
     ++
     + 		if (pathspec && !ce_path_match(istate, ce, pathspec, seen))
     + 			filtered = 1;
     + 
      
       ## t/t1092-sparse-checkout-compatibility.sh ##
      @@ t/t1092-sparse-checkout-compatibility.sh: test_expect_success 'sparse-index is expanded and converted back' '
     @@ t/t1092-sparse-checkout-compatibility.sh: test_expect_success 'sparse-index is e
      +	init_repos &&
      +
      +	rm -f trace2.txt &&
     ++	echo >>sparse-index/untracked.txt &&
       	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
      -		git -C sparse-index -c core.fsmonitor="" status -uno &&
      -	test_region index ensure_full_index trace2.txt
     -+		git -C sparse-index status -uno &&
     ++		git -C sparse-index status &&
      +	test_region ! index ensure_full_index trace2.txt
       '
       
  6:  434306541613 <  -:  ------------ dir: use expand_to_path() for sparse directories
  7:  f1a9ce4ef0e5 <  -:  ------------ add: allow operating on a sparse-only index
  8:  6d7f30f2b90a <  -:  ------------ pathspec: stop calling ensure_full_index
  9:  75199bbe8ca1 <  -:  ------------ t7519: add sparse directories to FS monitor tests
 10:  9d1183ddd280 !  8:  1fd033a6ebb2 fsmonitor: test with sparse index
     @@ t/t7519-status-fsmonitor.sh: test_expect_success 'status succeeds after staging/
       	)
       '
       
     -+test_expect_success 'status succeeds with sparse index' '
     -+	test_config core.fsmonitor "$TEST_DIRECTORY/t7519/fsmonitor-all" &&
     ++# Usage:
     ++# check_sparse_index_behavior [!]
     ++# If "!" is supplied, then we verify that we do not call ensure_full_index
     ++# during a call to 'git status'. Otherwise, we verify that we _do_ call it.
     ++check_sparse_index_behavior () {
      +	git status --porcelain=v2 >expect &&
      +	git sparse-checkout init --cone --sparse-index &&
     ++	git sparse-checkout set dir1 dir2 &&
      +	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
      +		git status --porcelain=v2 >actual &&
     -+	test_region ! index ensure_full_index trace2.txt &&
     ++	test_region $1 index ensure_full_index trace2.txt &&
      +	test_cmp expect actual &&
      +	rm trace2.txt &&
     ++	git sparse-checkout disable
     ++}
     ++
     ++test_expect_success 'status succeeds with sparse index' '
     ++	git reset --hard &&
     ++
     ++	test_config core.fsmonitor "$TEST_DIRECTORY/t7519/fsmonitor-all" &&
     ++	check_sparse_index_behavior ! &&
      +
      +	write_script .git/hooks/fsmonitor-test<<-\EOF &&
      +		printf "last_update_token\0"
      +	EOF
      +	git config core.fsmonitor .git/hooks/fsmonitor-test &&
     -+	git status --porcelain=v2 >expect &&
     -+	git sparse-checkout init --cone --sparse-index &&
     -+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
     -+		git status --porcelain=v2 >actual &&
     -+	test_region ! index ensure_full_index trace2.txt &&
     -+	test_cmp expect actual &&
     -+	rm trace2.txt &&
     ++	check_sparse_index_behavior ! &&
      +
      +	write_script .git/hooks/fsmonitor-test<<-\EOF &&
      +		printf "last_update_token\0"
      +		printf "dir1/modified\0"
      +	EOF
     -+	git config core.fsmonitor .git/hooks/fsmonitor-test &&
     -+	git status --porcelain=v2 >expect &&
     -+	git sparse-checkout init --cone --sparse-index &&
     -+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
     -+		git status --porcelain=v2 >actual &&
     -+	test_region ! index ensure_full_index trace2.txt &&
     -+	test_cmp expect actual &&
     ++	check_sparse_index_behavior ! &&
      +
     ++	cp -r dir1 dir1a &&
     ++	git add dir1a &&
     ++	git commit -m "add dir1a" &&
     ++
     ++	# This one modifies outside the sparse-checkout definition
     ++	# and hence we expect to expand the sparse-index.
      +	write_script .git/hooks/fsmonitor-test<<-\EOF &&
      +		printf "last_update_token\0"
      +		printf "dir1a/modified\0"
      +	EOF
     -+	git config core.fsmonitor .git/hooks/fsmonitor-test &&
     -+	git status --porcelain=v2 >expect &&
     -+	git sparse-checkout init --cone --sparse-index &&
     -+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
     -+		git status --porcelain=v2 >actual &&
     -+	test_region index ensure_full_index trace2.txt &&
     -+	test_cmp expect actual
     ++	check_sparse_index_behavior
      +'
      +
       test_done

Comments

Elijah Newren May 13, 2021, 4:12 a.m. UTC | #1
On Fri, Apr 23, 2021 at 2:34 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> This is the first "payoff" series in the sparse-index work. It makes 'git
> status' very fast when a sparse-index is enabled on a repository with
> cone-mode sparse-checkout (and a small populated set).
>
> This is based on ds/sparse-index-protections AND mt/add-rm-sparse-checkout.
> The latter branch is needed because it changes the behavior of 'git add'
> around sparse entries, which changes the expectations of a test added in
> patch 1.
>
> The approach here is to audit the places where ensure_full_index() pops up
> while doing normal commands with pathspecs within the sparse-checkout
> definition. Each of these are checked and tested. In the end, the
> sparse-index is integrated with these features:
>
>  * git status
>  * FS Monitor index extension.
>
> The performance tests in p2000-sparse-operations.sh improve by 95% or more,
> even when compared with the full-index cases, not just the sparse-index
> cases that previously had extra overhead.
>
> Hopefully this is the first example of how ds/sparse-index-protections has
> done the basic work to do these conversions safely, making them look easier
> than they seemed when starting this adventure.
>
> Thanks, -Stolee
>
>
> Updates in V2
> =============
>
>  * Based on the feedback, it is clear that 'git add' will require much more
>    careful testing and thought. I'm splitting it out of this series and it
>    will return with a follow-up.
>  * Test cases are improved, both in coverage and organization.
>  * The previous "unpack-trees: make sparse aware" patch is split into three
>    now.
>  * Stale messages based on an old implementation of the "protections" topic
>    are now fixed.
>  * Performance tests were re-run.

I read through the topic, both my old comments, the range-diff, and
the new patches where the range-diff wasn't enough.  I tried to spot
issues, and was hoping to find problems you alluded to in your recent
comments at https://lore.kernel.org/git/05932ebc-04ac-b3c5-a460-5d37d8604fd9@gmail.com/,
but I failed to spot them.  I hope it has to do with the cache bottom
stuff that I just don't understand, because otherwise I just missed
the problems in my review.  I can say that in v2 you fixed the issues
I did spot in my review of v1.

I'll look forward to v3 to see what it was I missed.  If I somehow
don't respond soon (in a week at the latest), do feel free to ping me;
sorry for somehow having this one slip through the cracks.
Derrick Stolee May 14, 2021, 6:28 p.m. UTC | #2
On 5/13/2021 12:12 AM, Elijah Newren wrote:
> On Fri, Apr 23, 2021 at 2:34 PM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> This is the first "payoff" series in the sparse-index work. It makes 'git
>> status' very fast when a sparse-index is enabled on a repository with
>> cone-mode sparse-checkout (and a small populated set).
> 
> I read through the topic, both my old comments, the range-diff, and
> the new patches where the range-diff wasn't enough.  I tried to spot
> issues, and was hoping to find problems you alluded to in your recent
> comments at https://lore.kernel.org/git/05932ebc-04ac-b3c5-a460-5d37d8604fd9@gmail.com/,
> but I failed to spot them.  I hope it has to do with the cache bottom
> stuff that I just don't understand, because otherwise I just missed
> the problems in my review.  I can say that in v2 you fixed the issues
> I did spot in my review of v1.
> 
> I'll look forward to v3 to see what it was I missed.  If I somehow
> don't respond soon (in a week at the latest), do feel free to ping me;
> sorry for somehow having this one slip through the cracks.

v3 is on the way. The changes related to issues I found in my
deeper testing are more about what wasn't previously tested in
my test script as opposed to things actually being wrong in
the patch series. (There is one case where some new code was
incorrect, but it wasn't being tested because of the test repo's
data shape.)

Thanks,
-Stolee