[v2,08/10] sparse-index: complete partial expansion

Message ID	ed640e3645ba4f60f06bd335b9ac7bf350dd81f9.1652982759.git.gitgitgadget@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <git-owner@kernel.org> Message-Id: <ed640e3645ba4f60f06bd335b9ac7bf350dd81f9.1652982759.git.gitgitgadget@gmail.com> In-Reply-To: <pull.1208.v2.git.1652982758.gitgitgadget@gmail.com> References: <pull.1208.git.1652724693.gitgitgadget@gmail.com> <pull.1208.v2.git.1652982758.gitgitgadget@gmail.com> Date: Thu, 19 May 2022 17:52:36 +0000 Subject: [PATCH v2 08/10] sparse-index: complete partial expansion Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: gitster@pobox.com, vdye@github.com, shaoxuan.yuan02@gmail.com, Derrick Stolee <derrickstolee@github.com>, Derrick Stolee <dstolee@microsoft.com> Precedence: bulk From: Derrick Stolee <dstolee@microsoft.com>
Series	Sparse index: integrate with sparse-checkout \| expand [v2,00/10] Sparse index: integrate with sparse-checkout [v2,01/10] t1092: refactor 'sparse-index contents' test [v2,02/10] t1092: stress test 'git sparse-checkout set' [v2,03/10] sparse-index: create expand_to_pattern_list() [v2,04/10] sparse-index: introduce partially-sparse indexes [v2,05/10] cache-tree: implement cache_tree_find_path() [v2,06/10] sparse-checkout: --no-sparse-index needs a full index [v2,07/10] sparse-index: partially expand directories [v2,08/10] sparse-index: complete partial expansion [v2,09/10] p2000: add test for 'git sparse-checkout [add\|set]' [v2,10/10] sparse-checkout: integrate with sparse index

Derrick Stolee May 19, 2022, 5:52 p.m. UTC

From: Derrick Stolee <dstolee@microsoft.com>

To complete the implementation of expand_to_pattern_list(), we need to
detect when a sparse directory entry should remain sparse. This avoids a
full expansion, so we now need to use the PARTIALLY_SPARSE mode to
indicate this state.

There still are no callers to this method, but we will add one in the
next change.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 sparse-index.c | 41 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 37 insertions(+), 4 deletions(-)

Junio C Hamano May 21, 2022, 7:45 a.m. UTC | #1

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +	if (pl && !pl->use_cone_patterns) {
>  		pl = NULL;
> +	} else {
> +		/*
> +		 * We might contract file entries into sparse-directory
> +		 * entries, and for that we will need the cache tree to
> +		 * be recomputed.
> +		 */
> +		cache_tree_free(&istate->cache_tree);
> +
> +		/*
> +		 * If there is a problem creating the cache tree, then we
> +		 * need to expand to a full index since we cannot satisfy
> +		 * the current request as a sparse index.
> +		 */
> +		if (cache_tree_update(istate, WRITE_TREE_MISSING_OK))
> +			pl = NULL;
> +	}

So, if the current index has some irrelevant (i.e. do not match the
pattern list) subtrees in collapsed form, presense of an unmerged
entry, presumably inside the cone(s) we are interested in, makes us
lose the pattern list here, and we end up expanding everything?

I suspect that is a situation that is not so uncommon.  Working
inside a narrow cone of a wide tree, performing a merge would
hopefully allow many subtrees that are outside of the cones of our
interest merged without getting expanded at all (e.g. only the other
side touched these subtrees we are not working on, so their version
will become the merge result), while changes to some paths in the
cone of our interest may result in true conflicts represented as
cache entries at higher stages, needing conflict resolution
concluded with "git add".  Having to expand these subtrees that we
managed to merge while still collapsed, only because we have
conflicts in some other parts of the tree, feels somewhat sad.

By the way, why are we passing the "--missing-ok" option to "git
write-tree" here?

> @@ -330,11 +352,22 @@ void expand_to_pattern_list(struct index_state *istate,
>  		struct cache_entry *ce = istate->cache[i];
>  		struct tree *tree;
>  		struct pathspec ps;
> +		int dtype;
>  
>  		if (!S_ISSPARSEDIR(ce->ce_mode)) {
>  			set_index_entry(full, full->cache_nr++, ce);
>  			continue;
>  		}
> +
> +		/* We now have a sparse directory entry. Should we expand? */
> +		if (pl &&
> +		    path_matches_pattern_list(ce->name, ce->ce_namelen,
> +					      NULL, &dtype,
> +					      pl, istate) == NOT_MATCHED) {
> +			set_index_entry(full, full->cache_nr++, ce);
> +			continue;
> +		}
> +
>  		if (!(ce->ce_flags & CE_SKIP_WORKTREE))
>  			warning(_("index entry is a directory, but not sparse (%08x)"),
>  				ce->ce_flags);
> @@ -360,7 +393,7 @@ void expand_to_pattern_list(struct index_state *istate,
>  	/* Copy back into original index. */
>  	memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash));
>  	memcpy(&istate->dir_hash, &full->dir_hash, sizeof(full->dir_hash));
> -	istate->sparse_index = 0;
> +	istate->sparse_index = pl ? PARTIALLY_SPARSE : COMPLETELY_FULL;
>  	free(istate->cache);
>  	istate->cache = full->cache;
>  	istate->cache_nr = full->cache_nr;
> @@ -374,7 +407,7 @@ void expand_to_pattern_list(struct index_state *istate,
>  
>  	/* Clear and recompute the cache-tree */
>  	cache_tree_free(&istate->cache_tree);
> -	cache_tree_update(istate, 0);
> +	cache_tree_update(istate, WRITE_TREE_MISSING_OK);

The same question here.  We didn't say "missing trees are OK".  What
made it OK in this change?

Derrick Stolee May 23, 2022, 1:13 p.m. UTC | #2

On 5/21/2022 3:45 AM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> +	if (pl && !pl->use_cone_patterns) {
>>  		pl = NULL;
>> +	} else {
>> +		/*
>> +		 * We might contract file entries into sparse-directory
>> +		 * entries, and for that we will need the cache tree to
>> +		 * be recomputed.
>> +		 */
>> +		cache_tree_free(&istate->cache_tree);
>> +
>> +		/*
>> +		 * If there is a problem creating the cache tree, then we
>> +		 * need to expand to a full index since we cannot satisfy
>> +		 * the current request as a sparse index.
>> +		 */
>> +		if (cache_tree_update(istate, WRITE_TREE_MISSING_OK))
>> +			pl = NULL;
>> +	}
> 
> So, if the current index has some irrelevant (i.e. do not match the
> pattern list) subtrees in collapsed form, presense of an unmerged
> entry, presumably inside the cone(s) we are interested in, makes us
> lose the pattern list here, and we end up expanding everything?
> 
> I suspect that is a situation that is not so uncommon.  Working
> inside a narrow cone of a wide tree, performing a merge would
> hopefully allow many subtrees that are outside of the cones of our
> interest merged without getting expanded at all (e.g. only the other
> side touched these subtrees we are not working on, so their version
> will become the merge result), while changes to some paths in the
> cone of our interest may result in true conflicts represented as
> cache entries at higher stages, needing conflict resolution
> concluded with "git add".  Having to expand these subtrees that we
> managed to merge while still collapsed, only because we have
> conflicts in some other parts of the tree, feels somewhat sad.

You are correct that conflicts outside of the sparse-checkout cone will
cause index expansion. That happens during the 'git merge' command, but
the index will continue to fail to collapse as long as those conflicts
still exist in the index.

When there are conflicts like this during the merge, then the index
expansion is not as large of a portion of the command as normal, because
the conflict resolution also takes some time to compute. The commands
afterwards do take longer purely because of the expanded index.

However, this state is also not as common as you might think. If a user
has a sparse-checkout cone specified, then they are unlikely to change
files outside of the sparse-checkout cone. They would not be the reason
that those files have a conflict. The conflicts would exist only if they
are merging branches that had conflicts outside of the cone. Typically,
any merge of external changes like this are of the form of "git pull" or
"git rebase", in which case the conflicts are still "local" to the
developer's changes.

You are right that there is additional work that could be done here,
specifically allowing the cache tree to partially succeed and use the
successfully generated trees to create sparse directory entries where
possible. This was not viable before because we lacked the "partially
expanded" index state. This series establishes the necessary vocabulary to
do such an improvement later.
> By the way, why are we passing the "--missing-ok" option to "git
> write-tree" here?
> 
>> +	cache_tree_update(istate, WRITE_TREE_MISSING_OK);
> 
> The same question here.  We didn't say "missing trees are OK".  What
> made it OK in this change?

Both of these additions of WRITE_TREE_MISSING_OK are not needed. I
think I added them in an earlier version, thinking they were needed
due to something in the Scalar functional tests. I confirmed just now
that they are not needed for that. I will remove them.

Thanks,
-Stolee

Derrick Stolee May 23, 2022, 1:18 p.m. UTC | #3

On 5/23/2022 9:13 AM, Derrick Stolee wrote:
> On 5/21/2022 3:45 AM, Junio C Hamano wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

>> By the way, why are we passing the "--missing-ok" option to "git
>> write-tree" here?
>>
>>> +	cache_tree_update(istate, WRITE_TREE_MISSING_OK);
>>
>> The same question here.  We didn't say "missing trees are OK".  What
>> made it OK in this change?
>  
> Both of these additions of WRITE_TREE_MISSING_OK are not needed. I
> think I added them in an earlier version, thinking they were needed
> due to something in the Scalar functional tests. I confirmed just now
> that they are not needed for that. I will remove them.

As I went to remove these from sparse-index.c, I found another that
exists, but for good reason. See 8a96b9d0a (sparse-index: use
WRITE_TREE_MISSING_OK, 2021-09-08), which explains that that instance
needs the flag because it could be used during 'git add'.

It still isn't necessary in the instances being added here, so I will
remove them.

Thanks,
-Stolee

Junio C Hamano May 23, 2022, 6:01 p.m. UTC | #4

Derrick Stolee <derrickstolee@github.com> writes:

> As I went to remove these from sparse-index.c, I found another that
> exists, but for good reason. See 8a96b9d0a (sparse-index: use
> WRITE_TREE_MISSING_OK, 2021-09-08), which explains that that instance
> needs the flag because it could be used during 'git add'.

Hmph, re-reading that explanation, it appears to me that it is
working around a lack of another option that could be useful in such
a situation, namely, "update the cache tree, and if a tree object is
missing then create one locally; do not bother asking any promisor
remote".  I am reasonably sure that back when I invented missing-ok
there wasn't any such thing as "promisor remote" or "lazy fetch",
and it sounds like those who added "promissor remote" broke the
cache-tree subsystem by introducing calls to fetch from elsewhere?

How does a normal "new object creation" codepath work when promisor
remotes exist?  Does write_object_file() first ask if any promisor
remote have that object and fetch before creating one locally?  That
sounds absurd (i.e. if we have data to create locally why ask others
if they have one already?).  Why should cache-tree behave any
differently?

All of the above are mere observations on changes that have nothing
to do with the change in this series, but cache_tree_update()
knowing about promisor remotes feels serious breakage to me.

> It still isn't necessary in the instances being added here, so I will
> remove them.

OK.

Junio C Hamano May 23, 2022, 10:48 p.m. UTC | #5

Derrick Stolee <derrickstolee@github.com> writes:

>> I suspect that is a situation that is not so uncommon.  Working
>> inside a narrow cone of a wide tree, performing a merge would
>> hopefully allow many subtrees that are outside of the cones of our
>> interest merged without getting expanded at all (e.g. only the other
>> side touched these subtrees we are not working on, so their version
>> will become the merge result), while changes to some paths in the
>> cone of our interest may result in true conflicts represented as
>> cache entries at higher stages, needing conflict resolution
>> concluded with "git add".  Having to expand these subtrees that we
>> managed to merge while still collapsed, only because we have
>> conflicts in some other parts of the tree, feels somewhat sad.
>
> You are correct that conflicts outside of the sparse-checkout cone will
> cause index expansion. That happens during the 'git merge' command, but
> the index will continue to fail to collapse as long as those conflicts
> still exist in the index.
>
> When there are conflicts like this during the merge, then the index
> expansion is not as large of a portion of the command as normal, because
> the conflict resolution also takes some time to compute. The commands
> afterwards do take longer purely because of the expanded index.

I was imagining a situation more like "tech-writers only have
Documentation/ inside the cone of interest, attempt a pull from
somebody else, have conflicts inside Documentation/, but everything
else could be resolved cleanly without expanding the index".  If the
puller's tree is based on the pristine upstream release tag, and the
pullee's tree is based on a slightly newer version of upstream
snapshot, everything that happened outside Documentation/ in their
trees would fast-forward, so such a merge wouldn't have to expand
directories like "builtin/" or "contrib/" in the index and instead
can merge at the tree level, right?

On the other hand, ...

> However, this state is also not as common as you might think. If a user
> has a sparse-checkout cone specified, then they are unlikely to change
> files outside of the sparse-checkout cone. They would not be the reason
> that those files have a conflict. The conflicts would exist only if they
> are merging branches that had conflicts outside of the cone. Typically,
> any merge of external changes like this are of the form of "git pull" or
> "git rebase", in which case the conflicts are still "local" to the
> developer's changes.

... you seem to be talking about the opposite case (e.g. in the
above paragraph), where a conflict happens outside the cone of
interest of the person who is making a merge.  So, I am a bit
puzzled.

> You are right that there is additional work that could be done here,
> specifically allowing the cache tree to partially succeed and use the
> successfully generated trees to create sparse directory entries where
> possible. This was not viable before because we lacked the "partially
> expanded" index state. This series establishes the necessary vocabulary to
> do such an improvement later.
>> By the way, why are we passing the "--missing-ok" option to "git
>> write-tree" here?
>> 
>>> +	cache_tree_update(istate, WRITE_TREE_MISSING_OK);
>> 
>> The same question here.  We didn't say "missing trees are OK".  What
>> made it OK in this change?
>  
> Both of these additions of WRITE_TREE_MISSING_OK are not needed. I
> think I added them in an earlier version, thinking they were needed
> due to something in the Scalar functional tests. I confirmed just now
> that they are not needed for that. I will remove them.
>
> Thanks,
> -Stolee

Derrick Stolee May 25, 2022, 2:26 p.m. UTC | #6

On 5/23/2022 6:48 PM, Junio C Hamano wrote:
> Derrick Stolee <derrickstolee@github.com> writes:
> 
>>> I suspect that is a situation that is not so uncommon.  Working
>>> inside a narrow cone of a wide tree, performing a merge would
>>> hopefully allow many subtrees that are outside of the cones of our
>>> interest merged without getting expanded at all (e.g. only the other
>>> side touched these subtrees we are not working on, so their version
>>> will become the merge result), while changes to some paths in the
>>> cone of our interest may result in true conflicts represented as
>>> cache entries at higher stages, needing conflict resolution
>>> concluded with "git add".  Having to expand these subtrees that we
>>> managed to merge while still collapsed, only because we have
>>> conflicts in some other parts of the tree, feels somewhat sad.
>>
>> You are correct that conflicts outside of the sparse-checkout cone will
>> cause index expansion. That happens during the 'git merge' command, but
>> the index will continue to fail to collapse as long as those conflicts
>> still exist in the index.
>>
>> When there are conflicts like this during the merge, then the index
>> expansion is not as large of a portion of the command as normal, because
>> the conflict resolution also takes some time to compute. The commands
>> afterwards do take longer purely because of the expanded index.
> 
> I was imagining a situation more like "tech-writers only have
> Documentation/ inside the cone of interest, attempt a pull from
> somebody else, have conflicts inside Documentation/, but everything
> else could be resolved cleanly without expanding the index".  If the
> puller's tree is based on the pristine upstream release tag, and the
> pullee's tree is based on a slightly newer version of upstream
> snapshot, everything that happened outside Documentation/ in their
> trees would fast-forward, so such a merge wouldn't have to expand
> directories like "builtin/" or "contrib/" in the index and instead
> can merge at the tree level, right?
> 
> On the other hand, ...
> 
>> However, this state is also not as common as you might think. If a user
>> has a sparse-checkout cone specified, then they are unlikely to change
>> files outside of the sparse-checkout cone. They would not be the reason
>> that those files have a conflict. The conflicts would exist only if they
>> are merging branches that had conflicts outside of the cone. Typically,
>> any merge of external changes like this are of the form of "git pull" or
>> "git rebase", in which case the conflicts are still "local" to the
>> developer's changes.
> 
> ... you seem to be talking about the opposite case (e.g. in the
> above paragraph), where a conflict happens outside the cone of
> interest of the person who is making a merge.  So, I am a bit
> puzzled.

Hm. We must be getting mixed up with each other. Let me try again
from the beginning.

When the conflict happens inside of the sparse-checkout cone,
then the sparse index is not expanded. This is checked by the
test 'sparse-index is not expanded: merge conflict in cone' in
t1092.

Most merges get to "fast forward" changes that are outside of
the sparse-checkout cone because the only interesting changes
that could lead to a conflict are those created by the current
user (and hence within the sparse-checkout cone).

So, the typical case of a tech writer only editing "Documentation/"
should only see conflicts within "Documentation/".

The case where we would see conflicts outside of the cone include
cases where long-lived branches are being merged by someone with
a small cone. I can imagine an automated process using an empty
sparse-checkout cone to occasionally merge a "deployed" and a
"develop" branch, and it gets conflicts when someone ships a
hotfix directly to "deployed" without first going through "develop".
Any conflict is likely to cause index expansion in this case.

Let's re-introduce the patch section we are talking about:

+	if (pl && !pl->use_cone_patterns) {
 		pl = NULL;
+	} else {
+		/*
+		 * We might contract file entries into sparse-directory
+		 * entries, and for that we will need the cache tree to
+		 * be recomputed.
+		 */
+		cache_tree_free(&istate->cache_tree);
+
+		/*
+		 * If there is a problem creating the cache tree, then we
+		 * need to expand to a full index since we cannot satisfy
+		 * the current request as a sparse index.
+		 */
+		if (cache_tree_update(istate, WRITE_TREE_MISSING_OK))
+			pl = NULL;
+	}

If a user is in a conflict state and modifies their sparse-checkout
cone, then we will hit this "recompute the cache-tree" state, fail,
and cause full index expansion. I think that combination (have a
conflict AND modify sparse-checkout cone) is rare enough to not
optimize for (yet).

Does that make the situation more clear?

Thanks,
-Stolee

Junio C Hamano May 25, 2022, 4:32 p.m. UTC | #7

Derrick Stolee <derrickstolee@github.com> writes:

> If a user is in a conflict state and modifies their sparse-checkout
> cone, then we will hit this "recompute the cache-tree" state, fail,
> and cause full index expansion. I think that combination (have a
> conflict AND modify sparse-checkout cone) is rare enough to not
> optimize for (yet).
>
> Does that make the situation more clear?

Yes.  "AND modify sparse-checkout cone" part was what I was missing.
I didn't see it because it wasn't there at all in the code that was
removed or added---it comes from the fact that the user is running
the "sparse-checkout" command in the first place, and that was what
I missed.

Thanks.

[v2,08/10] sparse-index: complete partial expansion

Commit Message

Comments

Patch