Message ID | bfa0164cc3c167e383cdb5405526202432ae624e.1615929436.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Sparse Index: API protections | expand |
On Tue, Mar 16, 2021 at 2:17 PM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > > From: Derrick Stolee <dstolee@microsoft.com> > > Before we iterate over all cache entries, ensure that the index is not > sparse. This loop in checkout_all() might be safe to iterate over a > sparse index, but let's put this protection here until it can be > carefully tested. > > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > builtin/checkout-index.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/builtin/checkout-index.c b/builtin/checkout-index.c > index 023e49e271c2..ba31a036f193 100644 > --- a/builtin/checkout-index.c > +++ b/builtin/checkout-index.c > @@ -119,6 +119,7 @@ static void checkout_all(const char *prefix, int prefix_length) > int i, errs = 0; > struct cache_entry *last_ce = NULL; > > + ensure_full_index(&the_index); > for (i = 0; i < active_nr ; i++) { > struct cache_entry *ce = active_cache[i]; > if (ce_stage(ce) != checkout_stage With the caveat in the commit message, this change looks okay, but checkout-index may be buggy regardless of the presence of ensure_full_index(). If ensure_full_index() really is needed here because it needs to operate on all SKIP_WORKTREE paths and not just leading directories, that's because it's writing all those SKIP_WORKTREE entries to the working tree. When it writes them to the working tree, is it clearing the SKIP_WORKTREE bit? If not, we're in a bit of a pickle... Might be nice to add a /* TODO: audit if this is needed; if it is, we may have other bugs... */ or something like that. But then again, perhaps you're considering all uses of ensure_full_index() to be need-to-be-reaudited codepaths? If so, and we determine we really do need one and want to keep it indefinitely, will we mark those with a comment about why it's considered correct? I just want a way to know what still needs to be audited and what doesn't without doing a lot of history spelunking...
On 3/17/2021 1:50 PM, Elijah Newren wrote: > On Tue, Mar 16, 2021 at 2:17 PM Derrick Stolee via GitGitGadget > <gitgitgadget@gmail.com> wrote: > With the caveat in the commit message, this change looks okay, but > checkout-index may be buggy regardless of the presence of > ensure_full_index(). If ensure_full_index() really is needed here > because it needs to operate on all SKIP_WORKTREE paths and not just > leading directories, that's because it's writing all those > SKIP_WORKTREE entries to the working tree. When it writes them to the > working tree, is it clearing the SKIP_WORKTREE bit? If not, we're in > a bit of a pickle... Perhaps I'm unclear in my intentions with this series: _every_ insertion of ensure_full_index() is intended to be audited with tests in the future. Some might need behavior change, and others will not. In this series, I'm just putting in the protections so we don't accidentally trigger unexpected behavior. Since tests take time to write and review, I was hoping that these insertions were minimal enough to get us to a safe place where we can remove the guards carefully. So with that in mind... > Might be nice to add a > /* TODO: audit if this is needed; if it is, we may have other bugs... */ > or something like that. But then again, perhaps you're considering > all uses of ensure_full_index() to be need-to-be-reaudited codepaths? > If so, and we determine we really do need one and want to keep it > indefinitely, will we mark those with a comment about why it's > considered correct? > > I just want a way to know what still needs to be audited and what > doesn't without doing a lot of history spelunking... ...every insertion "needs to be audited" in the future. That's a big part of the next "phases" in the implementation plan. As you suggest, it might be a good idea to add a comment to every insertion, to mark it as un-audited, such as: /* TODO: test if ensure_full_index() is necessary */ We can come back later to delete the comment if it truly is necessary (and add tests to guarantee correct behavior). We can also remove the comment _and_ the call by modifying the loop behavior to do the right thing in some cases. Thanks, -Stolee
On Wed, Mar 17, 2021 at 1:05 PM Derrick Stolee <stolee@gmail.com> wrote: > > On 3/17/2021 1:50 PM, Elijah Newren wrote: > > On Tue, Mar 16, 2021 at 2:17 PM Derrick Stolee via GitGitGadget > > <gitgitgadget@gmail.com> wrote: > > With the caveat in the commit message, this change looks okay, but > > checkout-index may be buggy regardless of the presence of > > ensure_full_index(). If ensure_full_index() really is needed here > > because it needs to operate on all SKIP_WORKTREE paths and not just > > leading directories, that's because it's writing all those > > SKIP_WORKTREE entries to the working tree. When it writes them to the > > working tree, is it clearing the SKIP_WORKTREE bit? If not, we're in > > a bit of a pickle... > > Perhaps I'm unclear in my intentions with this series: _every_ > insertion of ensure_full_index() is intended to be audited with > tests in the future. Some might need behavior change, and others > will not. In this series, I'm just putting in the protections so > we don't accidentally trigger unexpected behavior. I think this may be part of my qualms -- what do you mean by not accidentally triggering unexpected behavior? In particular, does your statement imply that whatever behavior you get after putting in ensure_full_index() is "expected"? I think I'm reading that implication into it, and objecting that the behavior with the ensure_full_index() still isn't expected. You've only removed a certain class of unexpected behavior, namely code that wasn't written to expect tree entries that suddenly gets them. You haven't handled the class of "user wants to work with a subset of files, why are all these unrelated files being munged/updated/computed/shown/etc." unexpected behavior. I'm worrying that expectations are being set up such that working with just a small section of the code will be unusably hard. There may be several commands/flags where it could make sense to operate on either (a) all files in the repo or (b) just on files within your sparse paths. If, though, folks interpret operate-on-all-files as the "normal" mode (and history suggests they will), then people start adding all kinds of --no-do-this-sparsely flags to each command, and then users who want sparse operation have to remember to type such a flag with each and every command they ever run -- despite having taken at least three steps already to get a sparse-index. I believe the extended discussions (for _months_!) on just grep & rm, plus watching a --sparse patch being floated just in the last day for ls-files suggest to me that this is a _very_ likely outcome and I'm worried about it. > Since tests take time to write and review, I was hoping that these > insertions were minimal enough to get us to a safe place where we > can remove the guards carefully. > > So with that in mind... > > > Might be nice to add a > > /* TODO: audit if this is needed; if it is, we may have other bugs... */ > > or something like that. But then again, perhaps you're considering > > all uses of ensure_full_index() to be need-to-be-reaudited codepaths? > > If so, and we determine we really do need one and want to keep it > > indefinitely, will we mark those with a comment about why it's > > considered correct? > > > > I just want a way to know what still needs to be audited and what > > doesn't without doing a lot of history spelunking... > > ...every insertion "needs to be audited" in the future. That's a > big part of the next "phases" in the implementation plan. > > As you suggest, it might be a good idea to add a comment to every > insertion, to mark it as un-audited, such as: > > /* TODO: test if ensure_full_index() is necessary */ > > We can come back later to delete the comment if it truly is > necessary (and add tests to guarantee correct behavior). We can > also remove the comment _and_ the call by modifying the loop > behavior to do the right thing in some cases. If it's "needs to be audited for both performance reasons (can we operate on fewer entries as an invisible doesn't-change-results optimization) and correctness reasons (should we operate on fewer entries and given a modified result within a sparse-index because users would expect that, but maybe provide a special flag for the users who want to operate on all files in the repo)" and there's also an agreement that either audited or unaudited ones will be marked (or both), then great, I'm happy. If not, can we discuss which part of my performance/correctness/marking we aren't in agreement on? Thanks, Elijah
On 3/17/2021 5:10 PM, Elijah Newren wrote: > On Wed, Mar 17, 2021 at 1:05 PM Derrick Stolee <stolee@gmail.com> wrote: >> >> On 3/17/2021 1:50 PM, Elijah Newren wrote: >>> On Tue, Mar 16, 2021 at 2:17 PM Derrick Stolee via GitGitGadget >>> <gitgitgadget@gmail.com> wrote: >>> With the caveat in the commit message, this change looks okay, but >>> checkout-index may be buggy regardless of the presence of >>> ensure_full_index(). If ensure_full_index() really is needed here >>> because it needs to operate on all SKIP_WORKTREE paths and not just >>> leading directories, that's because it's writing all those >>> SKIP_WORKTREE entries to the working tree. When it writes them to the >>> working tree, is it clearing the SKIP_WORKTREE bit? If not, we're in >>> a bit of a pickle... >> >> Perhaps I'm unclear in my intentions with this series: _every_ >> insertion of ensure_full_index() is intended to be audited with >> tests in the future. Some might need behavior change, and others >> will not. In this series, I'm just putting in the protections so >> we don't accidentally trigger unexpected behavior. > > I think this may be part of my qualms -- what do you mean by not > accidentally triggering unexpected behavior? In particular, does your > statement imply that whatever behavior you get after putting in > ensure_full_index() is "expected"? I think I'm reading that > implication into it, and objecting that the behavior with the > ensure_full_index() still isn't expected. You've only removed a > certain class of unexpected behavior, namely code that wasn't written > to expect tree entries that suddenly gets them. You haven't handled > the class of "user wants to work with a subset of files, why are all > these unrelated files being munged/updated/computed/shown/etc." > unexpected behavior. My intention is to ensure that (at this moment) choosing to use the on-disk sparse-index format does not alter Git's end-to-end behavior. I want to avoid as much as possible a state where enabling the sparse-index can start changing how Git commands behave, perhaps in destructive ways. By adding these checks, we ensure the in-memory data structure matches whatever a full index would have created, and then the behavior matches what Git would do there. It might not be the "correct" behavior, but it is _consistent_. > I'm worrying that expectations are being set up such that working with > just a small section of the code will be unusably hard. There may be > several commands/flags where it could make sense to operate on either > (a) all files in the repo or (b) just on files within your sparse > paths. If, though, folks interpret operate-on-all-files as the > "normal" mode (and history suggests they will), then people start > adding all kinds of --no-do-this-sparsely flags to each command, and > then users who want sparse operation have to remember to type such a > flag with each and every command they ever run -- despite having taken > at least three steps already to get a sparse-index. > > I believe the extended discussions (for _months_!) on just grep & rm, > plus watching a --sparse patch being floated just in the last day for > ls-files suggest to me that this is a _very_ likely outcome and I'm > worried about it. It's these behavior changes that I would like to delay as much as possible and focus on the format and making commands fast that don't need a change in behavior. (Yes, there will be exceptions, like when "git add" specifically adds a file that is in a directory that should be out of the cone, but the user added it anyway. Atypical behavior like that can be slow for now.) >> Since tests take time to write and review, I was hoping that these >> insertions were minimal enough to get us to a safe place where we >> can remove the guards carefully. >> >> So with that in mind... >> >>> Might be nice to add a >>> /* TODO: audit if this is needed; if it is, we may have other bugs... */ >>> or something like that. But then again, perhaps you're considering >>> all uses of ensure_full_index() to be need-to-be-reaudited codepaths? >>> If so, and we determine we really do need one and want to keep it >>> indefinitely, will we mark those with a comment about why it's >>> considered correct? >>> >>> I just want a way to know what still needs to be audited and what >>> doesn't without doing a lot of history spelunking... >> >> ...every insertion "needs to be audited" in the future. That's a >> big part of the next "phases" in the implementation plan. >> >> As you suggest, it might be a good idea to add a comment to every >> insertion, to mark it as un-audited, such as: >> >> /* TODO: test if ensure_full_index() is necessary */ >> >> We can come back later to delete the comment if it truly is >> necessary (and add tests to guarantee correct behavior). We can >> also remove the comment _and_ the call by modifying the loop >> behavior to do the right thing in some cases. > > If it's "needs to be audited for both performance reasons (can we > operate on fewer entries as an invisible doesn't-change-results > optimization) and correctness reasons (should we operate on fewer > entries and given a modified result within a sparse-index because > users would expect that, but maybe provide a special flag for the > users who want to operate on all files in the repo)" and there's also > an agreement that either audited or unaudited ones will be marked (or > both), then great, I'm happy. If not, can we discuss which part of my > performance/correctness/marking we aren't in agreement on? I will mark all of the ones I'm inserting. My hope is to eventually remove it entirely except for when disabling the sparse-index. That is likely too far out to really hope for, but it is the direction I am trying to go. As I indicate that we should carefully test each of these instances where ensure_full_index() _might_ be necessary before removing them, it is even more important to test the scenarios where the behavior changes from a full index with sparse-checkout. Preferably, we just change the behavior under sparse-checkout and then the sparse-index can match that (see "test_sparse_match" in t1092). Thanks, -Stolee
On Wed, Mar 17, 2021 at 2:33 PM Derrick Stolee <stolee@gmail.com> wrote: > > On 3/17/2021 5:10 PM, Elijah Newren wrote: > > On Wed, Mar 17, 2021 at 1:05 PM Derrick Stolee <stolee@gmail.com> wrote: > >> > >> On 3/17/2021 1:50 PM, Elijah Newren wrote: > >>> On Tue, Mar 16, 2021 at 2:17 PM Derrick Stolee via GitGitGadget > >>> <gitgitgadget@gmail.com> wrote: > >>> With the caveat in the commit message, this change looks okay, but > >>> checkout-index may be buggy regardless of the presence of > >>> ensure_full_index(). If ensure_full_index() really is needed here > >>> because it needs to operate on all SKIP_WORKTREE paths and not just > >>> leading directories, that's because it's writing all those > >>> SKIP_WORKTREE entries to the working tree. When it writes them to the > >>> working tree, is it clearing the SKIP_WORKTREE bit? If not, we're in > >>> a bit of a pickle... > >> > >> Perhaps I'm unclear in my intentions with this series: _every_ > >> insertion of ensure_full_index() is intended to be audited with > >> tests in the future. Some might need behavior change, and others > >> will not. In this series, I'm just putting in the protections so > >> we don't accidentally trigger unexpected behavior. > > > > I think this may be part of my qualms -- what do you mean by not > > accidentally triggering unexpected behavior? In particular, does your > > statement imply that whatever behavior you get after putting in > > ensure_full_index() is "expected"? I think I'm reading that > > implication into it, and objecting that the behavior with the > > ensure_full_index() still isn't expected. You've only removed a > > certain class of unexpected behavior, namely code that wasn't written > > to expect tree entries that suddenly gets them. You haven't handled > > the class of "user wants to work with a subset of files, why are all > > these unrelated files being munged/updated/computed/shown/etc." > > unexpected behavior. > > My intention is to ensure that (at this moment) choosing to use > the on-disk sparse-index format does not alter Git's end-to-end > behavior. > > I want to avoid as much as possible a state where enabling the > sparse-index can start changing how Git commands behave, perhaps > in destructive ways. > > By adding these checks, we ensure the in-memory data structure > matches whatever a full index would have created, and then the > behavior matches what Git would do there. It might not be the > "correct" behavior, but it is _consistent_. That sounds good. Could this be included in Documentation/technical/sparse-index.txt? There's only an oblique reference to it when talking about grep and rm, which incidentally were already updated by Matheus. Perhaps also a reference to the warning in Documentation/git-sparse-checkout.txt would be worthwhile. > > I'm worrying that expectations are being set up such that working with > > just a small section of the code will be unusably hard. There may be > > several commands/flags where it could make sense to operate on either > > (a) all files in the repo or (b) just on files within your sparse > > paths. If, though, folks interpret operate-on-all-files as the > > "normal" mode (and history suggests they will), then people start > > adding all kinds of --no-do-this-sparsely flags to each command, and > > then users who want sparse operation have to remember to type such a > > flag with each and every command they ever run -- despite having taken > > at least three steps already to get a sparse-index. > > > > I believe the extended discussions (for _months_!) on just grep & rm, > > plus watching a --sparse patch being floated just in the last day for > > ls-files suggest to me that this is a _very_ likely outcome and I'm > > worried about it. > > It's these behavior changes that I would like to delay as much as > possible and focus on the format and making commands fast that don't > need a change in behavior. Delaying sounds great, just so long as that delay doesn't also cement the behavior and confuse consistency for correctness. I still think we don't have correct behavior for sparse-checkouts in many cases (I mean, when "git reset --hard" throws errors about not removing files and then removes them, we've obviously got some problems), but we had a decade long cementing of sorts with SKIP_WORKTREE and now a year-or-two long cementing since sparse-checkout was introduced and we never went through and cleaned up the commands. We should at some point, especially since we put the huge scary warning in Documenation/git-sparse-checkout.txt expressly for this purpose. (I would have started this sooner, but trying to feed merge-ort and keep up with patch review already keeps me at less time on non-git projects than I think is expected for me. Once merge-ort is done...) > (Yes, there will be exceptions, like when "git add" specifically > adds a file that is in a directory that should be out of the cone, > but the user added it anyway. Atypical behavior like that can be > slow for now.) I agree that there will be exceptions where we can't make the behavior be fast, but I disagree with that specific example. "git add" should just give a warning message and not add any file outside the cone, for both sparse-index and sparse-checkout. That can be done quickly. I know you want to delay the discussion of behavior fixes for specific commands somewhat, but this particular discussion has already been ongoing for about 4-12 months now (started at [1] and [2]) and it's reached a point where I've put my Reviewed-by on it; see [3] for the git-add piece specifically. [1] https://lore.kernel.org/git/9f2135f90ffea7f4ccb226f506bf554deab324cc.1605205427.git.matheus.bernardino@usp.br/ [2] https://lore.kernel.org/git/0b9b4c4b414a571877163667694afa3053bf8890.1585027716.git.matheus.bernardino@usp.br/ [3] https://lore.kernel.org/git/66d5c71182274c78e1fcfe84e77deb17e4f0d7e6.1615588109.git.matheus.bernardino@usp.br/ > >> Since tests take time to write and review, I was hoping that these > >> insertions were minimal enough to get us to a safe place where we > >> can remove the guards carefully. > >> > >> So with that in mind... > >> > >>> Might be nice to add a > >>> /* TODO: audit if this is needed; if it is, we may have other bugs... */ > >>> or something like that. But then again, perhaps you're considering > >>> all uses of ensure_full_index() to be need-to-be-reaudited codepaths? > >>> If so, and we determine we really do need one and want to keep it > >>> indefinitely, will we mark those with a comment about why it's > >>> considered correct? > >>> > >>> I just want a way to know what still needs to be audited and what > >>> doesn't without doing a lot of history spelunking... > >> > >> ...every insertion "needs to be audited" in the future. That's a > >> big part of the next "phases" in the implementation plan. > >> > >> As you suggest, it might be a good idea to add a comment to every > >> insertion, to mark it as un-audited, such as: > >> > >> /* TODO: test if ensure_full_index() is necessary */ > >> > >> We can come back later to delete the comment if it truly is > >> necessary (and add tests to guarantee correct behavior). We can > >> also remove the comment _and_ the call by modifying the loop > >> behavior to do the right thing in some cases. > > > > If it's "needs to be audited for both performance reasons (can we > > operate on fewer entries as an invisible doesn't-change-results > > optimization) and correctness reasons (should we operate on fewer > > entries and given a modified result within a sparse-index because > > users would expect that, but maybe provide a special flag for the > > users who want to operate on all files in the repo)" and there's also > > an agreement that either audited or unaudited ones will be marked (or > > both), then great, I'm happy. If not, can we discuss which part of my > > performance/correctness/marking we aren't in agreement on? > > I will mark all of the ones I'm inserting. My hope is to eventually > remove it entirely except for when disabling the sparse-index. That > is likely too far out to really hope for, but it is the direction I > am trying to go. > > As I indicate that we should carefully test each of these instances > where ensure_full_index() _might_ be necessary before removing them, > it is even more important to test the scenarios where the behavior > changes from a full index with sparse-checkout. Preferably, we just > change the behavior under sparse-checkout and then the sparse-index > can match that (see "test_sparse_match" in t1092). Makes sense. I agree that it'd be nice to have the two generally match, though I think we should be open to there being special cases that differ. The only one I can think of right now is `git ls-files` (list the entries in the index) -- since there are tree entries in a sparse-index, ls-files would naturally show the tree entries. However, we can discuss that and any other cases -- if there are any -- later.
On 3/17/2021 6:36 PM, Elijah Newren wrote: > On Wed, Mar 17, 2021 at 2:33 PM Derrick Stolee <stolee@gmail.com> wrote: >> >> On 3/17/2021 5:10 PM, Elijah Newren wrote: >>> On Wed, Mar 17, 2021 at 1:05 PM Derrick Stolee <stolee@gmail.com> wrote: >>>> >>>> On 3/17/2021 1:50 PM, Elijah Newren wrote: >>>>> On Tue, Mar 16, 2021 at 2:17 PM Derrick Stolee via GitGitGadget >>>>> <gitgitgadget@gmail.com> wrote: >>>>> With the caveat in the commit message, this change looks okay, but >>>>> checkout-index may be buggy regardless of the presence of >>>>> ensure_full_index(). If ensure_full_index() really is needed here >>>>> because it needs to operate on all SKIP_WORKTREE paths and not just >>>>> leading directories, that's because it's writing all those >>>>> SKIP_WORKTREE entries to the working tree. When it writes them to the >>>>> working tree, is it clearing the SKIP_WORKTREE bit? If not, we're in >>>>> a bit of a pickle... >>>> >>>> Perhaps I'm unclear in my intentions with this series: _every_ >>>> insertion of ensure_full_index() is intended to be audited with >>>> tests in the future. Some might need behavior change, and others >>>> will not. In this series, I'm just putting in the protections so >>>> we don't accidentally trigger unexpected behavior. >>> >>> I think this may be part of my qualms -- what do you mean by not >>> accidentally triggering unexpected behavior? In particular, does your >>> statement imply that whatever behavior you get after putting in >>> ensure_full_index() is "expected"? I think I'm reading that >>> implication into it, and objecting that the behavior with the >>> ensure_full_index() still isn't expected. You've only removed a >>> certain class of unexpected behavior, namely code that wasn't written >>> to expect tree entries that suddenly gets them. You haven't handled >>> the class of "user wants to work with a subset of files, why are all >>> these unrelated files being munged/updated/computed/shown/etc." >>> unexpected behavior. >> >> My intention is to ensure that (at this moment) choosing to use >> the on-disk sparse-index format does not alter Git's end-to-end >> behavior. >> >> I want to avoid as much as possible a state where enabling the >> sparse-index can start changing how Git commands behave, perhaps >> in destructive ways. >> >> By adding these checks, we ensure the in-memory data structure >> matches whatever a full index would have created, and then the >> behavior matches what Git would do there. It might not be the >> "correct" behavior, but it is _consistent_. > > That sounds good. Could this be included in > Documentation/technical/sparse-index.txt? There's only an oblique > reference to it when talking about grep and rm, which incidentally > were already updated by Matheus. Perhaps also a reference to the > warning in Documentation/git-sparse-checkout.txt would be worthwhile. As I was thinking about how to make this more clear, I realized that an update to that doc early in this series would be wise. Thanks. >>> I'm worrying that expectations are being set up such that working with >>> just a small section of the code will be unusably hard. There may be >>> several commands/flags where it could make sense to operate on either >>> (a) all files in the repo or (b) just on files within your sparse >>> paths. If, though, folks interpret operate-on-all-files as the >>> "normal" mode (and history suggests they will), then people start >>> adding all kinds of --no-do-this-sparsely flags to each command, and >>> then users who want sparse operation have to remember to type such a >>> flag with each and every command they ever run -- despite having taken >>> at least three steps already to get a sparse-index. >>> >>> I believe the extended discussions (for _months_!) on just grep & rm, >>> plus watching a --sparse patch being floated just in the last day for >>> ls-files suggest to me that this is a _very_ likely outcome and I'm >>> worried about it. >> >> It's these behavior changes that I would like to delay as much as >> possible and focus on the format and making commands fast that don't >> need a change in behavior. > > Delaying sounds great, just so long as that delay doesn't also cement > the behavior and confuse consistency for correctness. > > I still think we don't have correct behavior for sparse-checkouts in > many cases (I mean, when "git reset --hard" throws errors about not > removing files and then removes them, we've obviously got some > problems), but we had a decade long cementing of sorts with > SKIP_WORKTREE and now a year-or-two long cementing since > sparse-checkout was introduced and we never went through and cleaned > up the commands. We should at some point, especially since we put the > huge scary warning in Documenation/git-sparse-checkout.txt expressly > for this purpose. > > (I would have started this sooner, but trying to feed merge-ort and > keep up with patch review already keeps me at less time on non-git > projects than I think is expected for me. Once merge-ort is done...) > >> (Yes, there will be exceptions, like when "git add" specifically >> adds a file that is in a directory that should be out of the cone, >> but the user added it anyway. Atypical behavior like that can be >> slow for now.) > > I agree that there will be exceptions where we can't make the behavior > be fast, but I disagree with that specific example. "git add" should > just give a warning message and not add any file outside the cone, for > both sparse-index and sparse-checkout. That can be done quickly. That would be great! I would love behavior changes that make the performance work I'm interested in easier. > I know you want to delay the discussion of behavior fixes for specific > commands somewhat, but this particular discussion has already been > ongoing for about 4-12 months now (started at [1] and [2]) and it's > reached a point where I've put my Reviewed-by on it; see [3] for the > git-add piece specifically. > > [1] https://lore.kernel.org/git/9f2135f90ffea7f4ccb226f506bf554deab324cc.1605205427.git.matheus.bernardino@usp.br/ > [2] https://lore.kernel.org/git/0b9b4c4b414a571877163667694afa3053bf8890.1585027716.git.matheus.bernardino@usp.br/ > [3] https://lore.kernel.org/git/66d5c71182274c78e1fcfe84e77deb17e4f0d7e6.1615588109.git.matheus.bernardino@usp.br/ I'm in full support of these ideas to change the behavior, when possible. I would love to see that those changes make it really easy to integrate the sparse-index into the commands. >>>> Since tests take time to write and review, I was hoping that these >>>> insertions were minimal enough to get us to a safe place where we >>>> can remove the guards carefully. >>>> >>>> So with that in mind... >>>> >>>>> Might be nice to add a >>>>> /* TODO: audit if this is needed; if it is, we may have other bugs... */ >>>>> or something like that. But then again, perhaps you're considering >>>>> all uses of ensure_full_index() to be need-to-be-reaudited codepaths? >>>>> If so, and we determine we really do need one and want to keep it >>>>> indefinitely, will we mark those with a comment about why it's >>>>> considered correct? >>>>> >>>>> I just want a way to know what still needs to be audited and what >>>>> doesn't without doing a lot of history spelunking... >>>> >>>> ...every insertion "needs to be audited" in the future. That's a >>>> big part of the next "phases" in the implementation plan. >>>> >>>> As you suggest, it might be a good idea to add a comment to every >>>> insertion, to mark it as un-audited, such as: >>>> >>>> /* TODO: test if ensure_full_index() is necessary */ >>>> >>>> We can come back later to delete the comment if it truly is >>>> necessary (and add tests to guarantee correct behavior). We can >>>> also remove the comment _and_ the call by modifying the loop >>>> behavior to do the right thing in some cases. >>> >>> If it's "needs to be audited for both performance reasons (can we >>> operate on fewer entries as an invisible doesn't-change-results >>> optimization) and correctness reasons (should we operate on fewer >>> entries and given a modified result within a sparse-index because >>> users would expect that, but maybe provide a special flag for the >>> users who want to operate on all files in the repo)" and there's also >>> an agreement that either audited or unaudited ones will be marked (or >>> both), then great, I'm happy. If not, can we discuss which part of my >>> performance/correctness/marking we aren't in agreement on? >> >> I will mark all of the ones I'm inserting. My hope is to eventually >> remove it entirely except for when disabling the sparse-index. That >> is likely too far out to really hope for, but it is the direction I >> am trying to go. >> >> As I indicate that we should carefully test each of these instances >> where ensure_full_index() _might_ be necessary before removing them, >> it is even more important to test the scenarios where the behavior >> changes from a full index with sparse-checkout. Preferably, we just >> change the behavior under sparse-checkout and then the sparse-index >> can match that (see "test_sparse_match" in t1092). > > Makes sense. I agree that it'd be nice to have the two generally > match, though I think we should be open to there being special cases > that differ. The only one I can think of right now is `git ls-files` > (list the entries in the index) -- since there are tree entries in a > sparse-index, ls-files would naturally show the tree entries. > However, we can discuss that and any other cases -- if there are any > -- later. Here, I at least want to make it very clear that the change is happening and include updates to the documentation. This is a case where it is unclear what the "promise" is to _external_ tools that depend on ls-files (somehow). What is their intended use of ls-files, and how do those expectations change when in a sparse-checkout? It's questions like this where I at least want to be making a clear change that motivates why a change in behavior is merited, then test that behavior to demonstrate that we expect that to be the same in perpetuity. Thanks, -Stolee
diff --git a/builtin/checkout-index.c b/builtin/checkout-index.c index 023e49e271c2..ba31a036f193 100644 --- a/builtin/checkout-index.c +++ b/builtin/checkout-index.c @@ -119,6 +119,7 @@ static void checkout_all(const char *prefix, int prefix_length) int i, errs = 0; struct cache_entry *last_ce = NULL; + ensure_full_index(&the_index); for (i = 0; i < active_nr ; i++) { struct cache_entry *ce = active_cache[i]; if (ce_stage(ce) != checkout_stage