Message ID | ed640e3645ba4f60f06bd335b9ac7bf350dd81f9.1652982759.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Sparse index: integrate with sparse-checkout | expand |
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > + if (pl && !pl->use_cone_patterns) { > pl = NULL; > + } else { > + /* > + * We might contract file entries into sparse-directory > + * entries, and for that we will need the cache tree to > + * be recomputed. > + */ > + cache_tree_free(&istate->cache_tree); > + > + /* > + * If there is a problem creating the cache tree, then we > + * need to expand to a full index since we cannot satisfy > + * the current request as a sparse index. > + */ > + if (cache_tree_update(istate, WRITE_TREE_MISSING_OK)) > + pl = NULL; > + } So, if the current index has some irrelevant (i.e. do not match the pattern list) subtrees in collapsed form, presense of an unmerged entry, presumably inside the cone(s) we are interested in, makes us lose the pattern list here, and we end up expanding everything? I suspect that is a situation that is not so uncommon. Working inside a narrow cone of a wide tree, performing a merge would hopefully allow many subtrees that are outside of the cones of our interest merged without getting expanded at all (e.g. only the other side touched these subtrees we are not working on, so their version will become the merge result), while changes to some paths in the cone of our interest may result in true conflicts represented as cache entries at higher stages, needing conflict resolution concluded with "git add". Having to expand these subtrees that we managed to merge while still collapsed, only because we have conflicts in some other parts of the tree, feels somewhat sad. By the way, why are we passing the "--missing-ok" option to "git write-tree" here? > @@ -330,11 +352,22 @@ void expand_to_pattern_list(struct index_state *istate, > struct cache_entry *ce = istate->cache[i]; > struct tree *tree; > struct pathspec ps; > + int dtype; > > if (!S_ISSPARSEDIR(ce->ce_mode)) { > set_index_entry(full, full->cache_nr++, ce); > continue; > } > + > + /* We now have a sparse directory entry. Should we expand? */ > + if (pl && > + path_matches_pattern_list(ce->name, ce->ce_namelen, > + NULL, &dtype, > + pl, istate) == NOT_MATCHED) { > + set_index_entry(full, full->cache_nr++, ce); > + continue; > + } > + > if (!(ce->ce_flags & CE_SKIP_WORKTREE)) > warning(_("index entry is a directory, but not sparse (%08x)"), > ce->ce_flags); > @@ -360,7 +393,7 @@ void expand_to_pattern_list(struct index_state *istate, > /* Copy back into original index. */ > memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash)); > memcpy(&istate->dir_hash, &full->dir_hash, sizeof(full->dir_hash)); > - istate->sparse_index = 0; > + istate->sparse_index = pl ? PARTIALLY_SPARSE : COMPLETELY_FULL; > free(istate->cache); > istate->cache = full->cache; > istate->cache_nr = full->cache_nr; > @@ -374,7 +407,7 @@ void expand_to_pattern_list(struct index_state *istate, > > /* Clear and recompute the cache-tree */ > cache_tree_free(&istate->cache_tree); > - cache_tree_update(istate, 0); > + cache_tree_update(istate, WRITE_TREE_MISSING_OK); The same question here. We didn't say "missing trees are OK". What made it OK in this change?
On 5/21/2022 3:45 AM, Junio C Hamano wrote: > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > >> + if (pl && !pl->use_cone_patterns) { >> pl = NULL; >> + } else { >> + /* >> + * We might contract file entries into sparse-directory >> + * entries, and for that we will need the cache tree to >> + * be recomputed. >> + */ >> + cache_tree_free(&istate->cache_tree); >> + >> + /* >> + * If there is a problem creating the cache tree, then we >> + * need to expand to a full index since we cannot satisfy >> + * the current request as a sparse index. >> + */ >> + if (cache_tree_update(istate, WRITE_TREE_MISSING_OK)) >> + pl = NULL; >> + } > > So, if the current index has some irrelevant (i.e. do not match the > pattern list) subtrees in collapsed form, presense of an unmerged > entry, presumably inside the cone(s) we are interested in, makes us > lose the pattern list here, and we end up expanding everything? > > I suspect that is a situation that is not so uncommon. Working > inside a narrow cone of a wide tree, performing a merge would > hopefully allow many subtrees that are outside of the cones of our > interest merged without getting expanded at all (e.g. only the other > side touched these subtrees we are not working on, so their version > will become the merge result), while changes to some paths in the > cone of our interest may result in true conflicts represented as > cache entries at higher stages, needing conflict resolution > concluded with "git add". Having to expand these subtrees that we > managed to merge while still collapsed, only because we have > conflicts in some other parts of the tree, feels somewhat sad. You are correct that conflicts outside of the sparse-checkout cone will cause index expansion. That happens during the 'git merge' command, but the index will continue to fail to collapse as long as those conflicts still exist in the index. When there are conflicts like this during the merge, then the index expansion is not as large of a portion of the command as normal, because the conflict resolution also takes some time to compute. The commands afterwards do take longer purely because of the expanded index. However, this state is also not as common as you might think. If a user has a sparse-checkout cone specified, then they are unlikely to change files outside of the sparse-checkout cone. They would not be the reason that those files have a conflict. The conflicts would exist only if they are merging branches that had conflicts outside of the cone. Typically, any merge of external changes like this are of the form of "git pull" or "git rebase", in which case the conflicts are still "local" to the developer's changes. You are right that there is additional work that could be done here, specifically allowing the cache tree to partially succeed and use the successfully generated trees to create sparse directory entries where possible. This was not viable before because we lacked the "partially expanded" index state. This series establishes the necessary vocabulary to do such an improvement later. > By the way, why are we passing the "--missing-ok" option to "git > write-tree" here? > >> + cache_tree_update(istate, WRITE_TREE_MISSING_OK); > > The same question here. We didn't say "missing trees are OK". What > made it OK in this change? Both of these additions of WRITE_TREE_MISSING_OK are not needed. I think I added them in an earlier version, thinking they were needed due to something in the Scalar functional tests. I confirmed just now that they are not needed for that. I will remove them. Thanks, -Stolee
On 5/23/2022 9:13 AM, Derrick Stolee wrote: > On 5/21/2022 3:45 AM, Junio C Hamano wrote: >> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: >> By the way, why are we passing the "--missing-ok" option to "git >> write-tree" here? >> >>> + cache_tree_update(istate, WRITE_TREE_MISSING_OK); >> >> The same question here. We didn't say "missing trees are OK". What >> made it OK in this change? > > Both of these additions of WRITE_TREE_MISSING_OK are not needed. I > think I added them in an earlier version, thinking they were needed > due to something in the Scalar functional tests. I confirmed just now > that they are not needed for that. I will remove them. As I went to remove these from sparse-index.c, I found another that exists, but for good reason. See 8a96b9d0a (sparse-index: use WRITE_TREE_MISSING_OK, 2021-09-08), which explains that that instance needs the flag because it could be used during 'git add'. It still isn't necessary in the instances being added here, so I will remove them. Thanks, -Stolee
Derrick Stolee <derrickstolee@github.com> writes: > As I went to remove these from sparse-index.c, I found another that > exists, but for good reason. See 8a96b9d0a (sparse-index: use > WRITE_TREE_MISSING_OK, 2021-09-08), which explains that that instance > needs the flag because it could be used during 'git add'. Hmph, re-reading that explanation, it appears to me that it is working around a lack of another option that could be useful in such a situation, namely, "update the cache tree, and if a tree object is missing then create one locally; do not bother asking any promisor remote". I am reasonably sure that back when I invented missing-ok there wasn't any such thing as "promisor remote" or "lazy fetch", and it sounds like those who added "promissor remote" broke the cache-tree subsystem by introducing calls to fetch from elsewhere? How does a normal "new object creation" codepath work when promisor remotes exist? Does write_object_file() first ask if any promisor remote have that object and fetch before creating one locally? That sounds absurd (i.e. if we have data to create locally why ask others if they have one already?). Why should cache-tree behave any differently? All of the above are mere observations on changes that have nothing to do with the change in this series, but cache_tree_update() knowing about promisor remotes feels serious breakage to me. > It still isn't necessary in the instances being added here, so I will > remove them. OK.
Derrick Stolee <derrickstolee@github.com> writes: >> I suspect that is a situation that is not so uncommon. Working >> inside a narrow cone of a wide tree, performing a merge would >> hopefully allow many subtrees that are outside of the cones of our >> interest merged without getting expanded at all (e.g. only the other >> side touched these subtrees we are not working on, so their version >> will become the merge result), while changes to some paths in the >> cone of our interest may result in true conflicts represented as >> cache entries at higher stages, needing conflict resolution >> concluded with "git add". Having to expand these subtrees that we >> managed to merge while still collapsed, only because we have >> conflicts in some other parts of the tree, feels somewhat sad. > > You are correct that conflicts outside of the sparse-checkout cone will > cause index expansion. That happens during the 'git merge' command, but > the index will continue to fail to collapse as long as those conflicts > still exist in the index. > > When there are conflicts like this during the merge, then the index > expansion is not as large of a portion of the command as normal, because > the conflict resolution also takes some time to compute. The commands > afterwards do take longer purely because of the expanded index. I was imagining a situation more like "tech-writers only have Documentation/ inside the cone of interest, attempt a pull from somebody else, have conflicts inside Documentation/, but everything else could be resolved cleanly without expanding the index". If the puller's tree is based on the pristine upstream release tag, and the pullee's tree is based on a slightly newer version of upstream snapshot, everything that happened outside Documentation/ in their trees would fast-forward, so such a merge wouldn't have to expand directories like "builtin/" or "contrib/" in the index and instead can merge at the tree level, right? On the other hand, ... > However, this state is also not as common as you might think. If a user > has a sparse-checkout cone specified, then they are unlikely to change > files outside of the sparse-checkout cone. They would not be the reason > that those files have a conflict. The conflicts would exist only if they > are merging branches that had conflicts outside of the cone. Typically, > any merge of external changes like this are of the form of "git pull" or > "git rebase", in which case the conflicts are still "local" to the > developer's changes. ... you seem to be talking about the opposite case (e.g. in the above paragraph), where a conflict happens outside the cone of interest of the person who is making a merge. So, I am a bit puzzled. > You are right that there is additional work that could be done here, > specifically allowing the cache tree to partially succeed and use the > successfully generated trees to create sparse directory entries where > possible. This was not viable before because we lacked the "partially > expanded" index state. This series establishes the necessary vocabulary to > do such an improvement later. >> By the way, why are we passing the "--missing-ok" option to "git >> write-tree" here? >> >>> + cache_tree_update(istate, WRITE_TREE_MISSING_OK); >> >> The same question here. We didn't say "missing trees are OK". What >> made it OK in this change? > > Both of these additions of WRITE_TREE_MISSING_OK are not needed. I > think I added them in an earlier version, thinking they were needed > due to something in the Scalar functional tests. I confirmed just now > that they are not needed for that. I will remove them. > > Thanks, > -Stolee
On 5/23/2022 6:48 PM, Junio C Hamano wrote: > Derrick Stolee <derrickstolee@github.com> writes: > >>> I suspect that is a situation that is not so uncommon. Working >>> inside a narrow cone of a wide tree, performing a merge would >>> hopefully allow many subtrees that are outside of the cones of our >>> interest merged without getting expanded at all (e.g. only the other >>> side touched these subtrees we are not working on, so their version >>> will become the merge result), while changes to some paths in the >>> cone of our interest may result in true conflicts represented as >>> cache entries at higher stages, needing conflict resolution >>> concluded with "git add". Having to expand these subtrees that we >>> managed to merge while still collapsed, only because we have >>> conflicts in some other parts of the tree, feels somewhat sad. >> >> You are correct that conflicts outside of the sparse-checkout cone will >> cause index expansion. That happens during the 'git merge' command, but >> the index will continue to fail to collapse as long as those conflicts >> still exist in the index. >> >> When there are conflicts like this during the merge, then the index >> expansion is not as large of a portion of the command as normal, because >> the conflict resolution also takes some time to compute. The commands >> afterwards do take longer purely because of the expanded index. > > I was imagining a situation more like "tech-writers only have > Documentation/ inside the cone of interest, attempt a pull from > somebody else, have conflicts inside Documentation/, but everything > else could be resolved cleanly without expanding the index". If the > puller's tree is based on the pristine upstream release tag, and the > pullee's tree is based on a slightly newer version of upstream > snapshot, everything that happened outside Documentation/ in their > trees would fast-forward, so such a merge wouldn't have to expand > directories like "builtin/" or "contrib/" in the index and instead > can merge at the tree level, right? > > On the other hand, ... > >> However, this state is also not as common as you might think. If a user >> has a sparse-checkout cone specified, then they are unlikely to change >> files outside of the sparse-checkout cone. They would not be the reason >> that those files have a conflict. The conflicts would exist only if they >> are merging branches that had conflicts outside of the cone. Typically, >> any merge of external changes like this are of the form of "git pull" or >> "git rebase", in which case the conflicts are still "local" to the >> developer's changes. > > ... you seem to be talking about the opposite case (e.g. in the > above paragraph), where a conflict happens outside the cone of > interest of the person who is making a merge. So, I am a bit > puzzled. Hm. We must be getting mixed up with each other. Let me try again from the beginning. When the conflict happens inside of the sparse-checkout cone, then the sparse index is not expanded. This is checked by the test 'sparse-index is not expanded: merge conflict in cone' in t1092. Most merges get to "fast forward" changes that are outside of the sparse-checkout cone because the only interesting changes that could lead to a conflict are those created by the current user (and hence within the sparse-checkout cone). So, the typical case of a tech writer only editing "Documentation/" should only see conflicts within "Documentation/". The case where we would see conflicts outside of the cone include cases where long-lived branches are being merged by someone with a small cone. I can imagine an automated process using an empty sparse-checkout cone to occasionally merge a "deployed" and a "develop" branch, and it gets conflicts when someone ships a hotfix directly to "deployed" without first going through "develop". Any conflict is likely to cause index expansion in this case. Let's re-introduce the patch section we are talking about: + if (pl && !pl->use_cone_patterns) { pl = NULL; + } else { + /* + * We might contract file entries into sparse-directory + * entries, and for that we will need the cache tree to + * be recomputed. + */ + cache_tree_free(&istate->cache_tree); + + /* + * If there is a problem creating the cache tree, then we + * need to expand to a full index since we cannot satisfy + * the current request as a sparse index. + */ + if (cache_tree_update(istate, WRITE_TREE_MISSING_OK)) + pl = NULL; + } If a user is in a conflict state and modifies their sparse-checkout cone, then we will hit this "recompute the cache-tree" state, fail, and cause full index expansion. I think that combination (have a conflict AND modify sparse-checkout cone) is rare enough to not optimize for (yet). Does that make the situation more clear? Thanks, -Stolee
Derrick Stolee <derrickstolee@github.com> writes: > If a user is in a conflict state and modifies their sparse-checkout > cone, then we will hit this "recompute the cache-tree" state, fail, > and cause full index expansion. I think that combination (have a > conflict AND modify sparse-checkout cone) is rare enough to not > optimize for (yet). > > Does that make the situation more clear? Yes. "AND modify sparse-checkout cone" part was what I was missing. I didn't see it because it wasn't there at all in the code that was removed or added---it comes from the fact that the user is running the "sparse-checkout" command in the first place, and that was what I missed. Thanks.
diff --git a/sparse-index.c b/sparse-index.c index 73b82e5017b..a65169030a2 100644 --- a/sparse-index.c +++ b/sparse-index.c @@ -297,8 +297,24 @@ void expand_to_pattern_list(struct index_state *istate, * continue. A NULL pattern set indicates a full expansion to a * full index. */ - if (pl && !pl->use_cone_patterns) + if (pl && !pl->use_cone_patterns) { pl = NULL; + } else { + /* + * We might contract file entries into sparse-directory + * entries, and for that we will need the cache tree to + * be recomputed. + */ + cache_tree_free(&istate->cache_tree); + + /* + * If there is a problem creating the cache tree, then we + * need to expand to a full index since we cannot satisfy + * the current request as a sparse index. + */ + if (cache_tree_update(istate, WRITE_TREE_MISSING_OK)) + pl = NULL; + } if (!istate->repo) istate->repo = the_repository; @@ -317,8 +333,14 @@ void expand_to_pattern_list(struct index_state *istate, full = xcalloc(1, sizeof(struct index_state)); memcpy(full, istate, sizeof(struct index_state)); + /* + * This slightly-misnamed 'full' index might still be sparse if we + * are only modifying the list of sparse directories. This hinges + * on whether we have a non-NULL pattern list. + */ + full->sparse_index = pl ? PARTIALLY_SPARSE : COMPLETELY_FULL; + /* then change the necessary things */ - full->sparse_index = 0; full->cache_alloc = (3 * istate->cache_alloc) / 2; full->cache_nr = 0; ALLOC_ARRAY(full->cache, full->cache_alloc); @@ -330,11 +352,22 @@ void expand_to_pattern_list(struct index_state *istate, struct cache_entry *ce = istate->cache[i]; struct tree *tree; struct pathspec ps; + int dtype; if (!S_ISSPARSEDIR(ce->ce_mode)) { set_index_entry(full, full->cache_nr++, ce); continue; } + + /* We now have a sparse directory entry. Should we expand? */ + if (pl && + path_matches_pattern_list(ce->name, ce->ce_namelen, + NULL, &dtype, + pl, istate) == NOT_MATCHED) { + set_index_entry(full, full->cache_nr++, ce); + continue; + } + if (!(ce->ce_flags & CE_SKIP_WORKTREE)) warning(_("index entry is a directory, but not sparse (%08x)"), ce->ce_flags); @@ -360,7 +393,7 @@ void expand_to_pattern_list(struct index_state *istate, /* Copy back into original index. */ memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash)); memcpy(&istate->dir_hash, &full->dir_hash, sizeof(full->dir_hash)); - istate->sparse_index = 0; + istate->sparse_index = pl ? PARTIALLY_SPARSE : COMPLETELY_FULL; free(istate->cache); istate->cache = full->cache; istate->cache_nr = full->cache_nr; @@ -374,7 +407,7 @@ void expand_to_pattern_list(struct index_state *istate, /* Clear and recompute the cache-tree */ cache_tree_free(&istate->cache_tree); - cache_tree_update(istate, 0); + cache_tree_update(istate, WRITE_TREE_MISSING_OK); trace2_region_leave("index", pl ? "expand_to_pattern_list" : "ensure_full_index",