Message ID | 34a61a0d03868c43d68a04bca8d86dd98de2aa28.1615588109.git.matheus.bernardino@usp.br (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | add/rm: honor sparse checkout and warn on sparse paths | expand |
Matheus Tavares <matheus.bernardino@usp.br> writes: > refresh_index() optionally takes a seen[] array to mark the pathspec > items that had matches in the index. This is used by `git add --refresh` > to find out if there was any pathspec without matches, and display an > error accordingly. It smells a bit iffy in that this allows a path to "match" a pattern but yet not result in seen[] to record that the pattern participated in the match. If the series is working towards certain operations not to touch paths that are outside the sparse checkout, shouldn't it be making these paths not to match the pattern, and by doing so it would automatically cause the pattern to be considered "not yet matching any path" when the matcher attempts to match the pattern to such a path?
Junio C Hamano <gitster@pobox.com> writes: > Matheus Tavares <matheus.bernardino@usp.br> writes: > >> refresh_index() optionally takes a seen[] array to mark the pathspec >> items that had matches in the index. This is used by `git add --refresh` >> to find out if there was any pathspec without matches, and display an >> error accordingly. > > It smells a bit iffy in that this allows a path to "match" a pattern > but yet not result in seen[] to record that the pattern participated > in the match. If the series is working towards certain operations > not to touch paths that are outside the sparse checkout, shouldn't > it be making these paths not to match the pattern, and by doing so > it would automatically cause the pattern to be considered "not yet > matching any path" when the matcher attempts to match the pattern to > such a path? In other words, the change makes me wonder why we are not adding a flag that says "do we or do we not want to match paths outside the sparse checkout cone?", with which the seen[] would automatically record the right thing.
On Thu, Mar 18, 2021 at 9:00 PM Junio C Hamano <gitster@pobox.com> wrote: > > Junio C Hamano <gitster@pobox.com> writes: > > > Matheus Tavares <matheus.bernardino@usp.br> writes: > > > >> refresh_index() optionally takes a seen[] array to mark the pathspec > >> items that had matches in the index. This is used by `git add --refresh` > >> to find out if there was any pathspec without matches, and display an > >> error accordingly. > > > > It smells a bit iffy in that this allows a path to "match" a pattern > > but yet not result in seen[] to record that the pattern participated > > in the match. If the series is working towards certain operations > > not to touch paths that are outside the sparse checkout, shouldn't > > it be making these paths not to match the pattern, and by doing so > > it would automatically cause the pattern to be considered "not yet > > matching any path" when the matcher attempts to match the pattern to > > such a path? > > In other words, the change makes me wonder why we are not adding a > flag that says "do we or do we not want to match paths outside the > sparse checkout cone?", with which the seen[] would automatically > record the right thing. Yeah, makes sense. I didn't want to make the flag skip the sparse paths unconditionally (i.e. without matching them) because then we would also skip the ce_stage() checkings below and the later ce_mark_uptodate(). And I wasn't sure whether this could cause any unwanted side effects. But thinking more carefully about this now, unmerged paths should never have the SKIP_WORKTREE bit set anyway, right? What about the CE_UPTODATE mark, would it be safe to skip it? I'm not very familiar with this code, but I'll try to investigate more later.
Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes: >> In other words, the change makes me wonder why we are not adding a >> flag that says "do we or do we not want to match paths outside the >> sparse checkout cone?", with which the seen[] would automatically >> record the right thing. > > Yeah, makes sense. I didn't want to make the flag skip the sparse > paths unconditionally (i.e. without matching them) because then we > would also skip the ce_stage() checkings below and the later > ce_mark_uptodate(). And I wasn't sure whether this could cause any > unwanted side effects. > > But thinking more carefully about this now, unmerged paths should > never have the SKIP_WORKTREE bit set anyway, right? What about the > CE_UPTODATE mark, would it be safe to skip it? I'm not very familiar > with this code, but I'll try to investigate more later. I do not offhand know the answers to these questions you ask in the last paragraph myself, and it could turn out to be that the "let processing go as before, matching what those excluded paths by the matcher, but making the caller responsible for filtering out the paths excluded by sparcity out of the result from the matcher, but tweak seen[]" approach, however ugly, is the most practical, after your investigation. But I'd prefer to see how bad the situation is first ;-). Thanks.
On Fri, Mar 19, 2021 at 1:05 PM Junio C Hamano <gitster@pobox.com> wrote: > > Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes: > > >> In other words, the change makes me wonder why we are not adding a > >> flag that says "do we or do we not want to match paths outside the > >> sparse checkout cone?", with which the seen[] would automatically > >> record the right thing. > > > > Yeah, makes sense. I didn't want to make the flag skip the sparse > > paths unconditionally (i.e. without matching them) because then we > > would also skip the ce_stage() checkings below and the later > > ce_mark_uptodate(). And I wasn't sure whether this could cause any > > unwanted side effects. > > > > But thinking more carefully about this now, unmerged paths should > > never have the SKIP_WORKTREE bit set anyway, right? What about the > > CE_UPTODATE mark, would it be safe to skip it? I'm not very familiar > > with this code, but I'll try to investigate more later. Sorry I haven't given any update on this yet. From what I could see so far, it seems OK to ignore the skip_worktree entries in refresh_index() when it is called from `git add --refresh`. But because we would no longer mark the skip_worktree entries that match the pathspec with CE_UPTODATE, do_write_index() would start checking if they are racy clean (this is only done when `!ce_uptodate(ce)`), which could add some lstat() overhead. However, this made me think what happens today if we do have a racy clean entry with the skip_worktree bit set... `git add --refresh` will probably update the index without noticing that the entry is racy clean (because it won't check CE_UPTODATE entries, and skip_worktree entries will have this bit set in refresh_index()). Thus the entries' size won't be truncated to zero when writing the index, and the entry will appear unchanged even if we later unset the skip_worktree bit. But can we have a "racy clean skip_worktree entry"? Yes, this seems possible e.g. if the following sequence happens fast enough for mtime to be the same before and after the update: echo x >file git update-index --refresh --skip-worktree file echo y>file Here is a more complete example which artificially creates a "racy clean skip_worktree entry", runs `git add --refresh`, and shows that the racy clean entry was not detected: # Setup echo sparse >sparse echo dense >dense git add . git commit -m files # Emulate a racy clean situation touch -d yesterday date touch -r date sparse git update-index --refresh --skip-worktree sparse touch -r date .git/index echo xparse >sparse touch -r date sparse # `git add` will now write a new index without checking if # `sparse` is racy clean nor truncating its size touch -r date dense git add --refresh . git update-index --no-skip-worktree sparse git status <doesn't show that `sparse` was modified> This situation feels rather uncommon, but perhaps the same could happen with `git sparse-checkout set` instead of `git update-index --refresh --skip-worktree`? IDK. This made me think whether refresh_index() should really mark skip_worktree entries with CE_UPTODATE, in the first place. Any thoughts?
Hi, On Tue, Mar 30, 2021 at 11:51 AM Matheus Tavares Bernardino <matheus.bernardino@usp.br> wrote: > > On Fri, Mar 19, 2021 at 1:05 PM Junio C Hamano <gitster@pobox.com> wrote: > > > > Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes: > > > > >> In other words, the change makes me wonder why we are not adding a > > >> flag that says "do we or do we not want to match paths outside the > > >> sparse checkout cone?", with which the seen[] would automatically > > >> record the right thing. > > > > > > Yeah, makes sense. I didn't want to make the flag skip the sparse > > > paths unconditionally (i.e. without matching them) because then we > > > would also skip the ce_stage() checkings below and the later > > > ce_mark_uptodate(). And I wasn't sure whether this could cause any > > > unwanted side effects. > > > > > > But thinking more carefully about this now, unmerged paths should > > > never have the SKIP_WORKTREE bit set anyway, right? What about the > > > CE_UPTODATE mark, would it be safe to skip it? I'm not very familiar > > > with this code, but I'll try to investigate more later. > > Sorry I haven't given any update on this yet. From what I could see so > far, it seems OK to ignore the skip_worktree entries in > refresh_index() when it is called from `git add --refresh`. But > because we would no longer mark the skip_worktree entries that match > the pathspec with CE_UPTODATE, do_write_index() would start checking > if they are racy clean (this is only done when `!ce_uptodate(ce)`), > which could add some lstat() overhead. > > However, this made me think what happens today if we do have a racy > clean entry with the skip_worktree bit set... `git add --refresh` will > probably update the index without noticing that the entry is racy > clean (because it won't check CE_UPTODATE entries, and skip_worktree > entries will have this bit set in refresh_index()). Thus the entries' > size won't be truncated to zero when writing the index, and the entry > will appear unchanged even if we later unset the skip_worktree bit. > > But can we have a "racy clean skip_worktree entry"? Yes, this seems > possible e.g. if the following sequence happens fast enough for mtime > to be the same before and after the update: > > echo x >file > git update-index --refresh --skip-worktree file > echo y>file > > Here is a more complete example which artificially creates a "racy > clean skip_worktree entry", runs `git add --refresh`, and shows that > the racy clean entry was not detected: > > # Setup > echo sparse >sparse > echo dense >dense > git add . > git commit -m files > > # Emulate a racy clean situation > touch -d yesterday date > touch -r date sparse > git update-index --refresh --skip-worktree sparse > touch -r date .git/index > echo xparse >sparse > touch -r date sparse > > # `git add` will now write a new index without checking if > # `sparse` is racy clean nor truncating its size > touch -r date dense > git add --refresh . > > git update-index --no-skip-worktree sparse > git status > <doesn't show that `sparse` was modified> > > This situation feels rather uncommon, but perhaps the same could > happen with `git sparse-checkout set` instead of `git update-index > --refresh --skip-worktree`? IDK. This made me think whether > refresh_index() should really mark skip_worktree entries with > CE_UPTODATE, in the first place. > > Any thoughts? Wow, that's a weird one. Nice digging. I don't understand the racily clean stuff that well, or the CE_UPTODATE handling...but based on my meager understanding of these things I'd say that marking skip_worktree entries with CE_UPTODATE seems wrong and I'd agree with your hunch that we shouldn't be updating it for those files. If an entry is skip_worktree, it's assumed to not be present in the working tree and that we'll treat it as "unchanged". So, when the filehappens to be present despite that bit being set, checking the files' stat information or pretending to have done so so we can claim it is up to date seems wrong to me. In fact, I'd say the mere recognition that the file is present in the working tree despite being SKIP_WORKTREE to me implies it's not "up-to-date" for at least one definition of that term. I suspect that if someone flips the skip-worktree bit on and off for a file without removing it from the working tree, not marking skip_worktree entries with CE_UPTODATE as you mention above might force us to incur the cost of an extra stat on said file when we run "git status" later. If so, I think that's a totally reasonable cost, so if that was your worry, I'd say that this is shifting the cost where it belongs. But, like I said, I'm not very familiar with the racily clean code or CE_UPTODATE, so it's possible I said something above that doesn't even make sense. Does my reasoning help at all?
diff --git a/cache.h b/cache.h index 6fda8091f1..0b6074d49e 100644 --- a/cache.h +++ b/cache.h @@ -879,13 +879,14 @@ int match_stat_data_racy(const struct index_state *istate, void fill_stat_cache_info(struct index_state *istate, struct cache_entry *ce, struct stat *st); -#define REFRESH_REALLY 0x0001 /* ignore_valid */ -#define REFRESH_UNMERGED 0x0002 /* allow unmerged */ -#define REFRESH_QUIET 0x0004 /* be quiet about it */ -#define REFRESH_IGNORE_MISSING 0x0008 /* ignore non-existent */ -#define REFRESH_IGNORE_SUBMODULES 0x0010 /* ignore submodules */ -#define REFRESH_IN_PORCELAIN 0x0020 /* user friendly output, not "needs update" */ -#define REFRESH_PROGRESS 0x0040 /* show progress bar if stderr is tty */ +#define REFRESH_REALLY (1 << 0) /* ignore_valid */ +#define REFRESH_UNMERGED (1 << 1) /* allow unmerged */ +#define REFRESH_QUIET (1 << 2) /* be quiet about it */ +#define REFRESH_IGNORE_MISSING (1 << 3) /* ignore non-existent */ +#define REFRESH_IGNORE_SUBMODULES (1 << 4) /* ignore submodules */ +#define REFRESH_IN_PORCELAIN (1 << 5) /* user friendly output, not "needs update" */ +#define REFRESH_PROGRESS (1 << 6) /* show progress bar if stderr is tty */ +#define REFRESH_DONT_MARK_SPARSE_MATCHES (1 << 7) /* don't mark sparse entries' matches on seen[] */ int refresh_index(struct index_state *, unsigned int flags, const struct pathspec *pathspec, char *seen, const char *header_msg); /* * Refresh the index and write it to disk. diff --git a/read-cache.c b/read-cache.c index 1e9a50c6c7..0bf7c64eff 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1514,6 +1514,7 @@ int refresh_index(struct index_state *istate, unsigned int flags, int quiet = (flags & REFRESH_QUIET) != 0; int not_new = (flags & REFRESH_IGNORE_MISSING) != 0; int ignore_submodules = (flags & REFRESH_IGNORE_SUBMODULES) != 0; + int no_sparse_on_seen = (flags & REFRESH_DONT_MARK_SPARSE_MATCHES) != 0; int first = 1; int in_porcelain = (flags & REFRESH_IN_PORCELAIN); unsigned int options = (CE_MATCH_REFRESH | @@ -1552,12 +1553,14 @@ int refresh_index(struct index_state *istate, unsigned int flags, int filtered = 0; int t2_did_lstat = 0; int t2_did_scan = 0; + char *cur_seen; ce = istate->cache[i]; if (ignore_submodules && S_ISGITLINK(ce->ce_mode)) continue; - if (pathspec && !ce_path_match(istate, ce, pathspec, seen)) + cur_seen = no_sparse_on_seen && ce_skip_worktree(ce) ? NULL : seen; + if (pathspec && !ce_path_match(istate, ce, pathspec, cur_seen)) filtered = 1; if (ce_stage(ce)) {
refresh_index() optionally takes a seen[] array to mark the pathspec items that had matches in the index. This is used by `git add --refresh` to find out if there was any pathspec without matches, and display an error accordingly. In the following patch, `git add` will also learn to warn about pathspecs that match no eligible path for updating, but do match sparse entries. For that, we will need a seen[] array marked exclusively with matches from dense entries. To avoid having to call ce_path_match() again for these entries after refresh_index() returns, add a flag that implements this restriction inside the function itself. Note that refresh_index() does not update sparse entries, regardless of passing the flag or not. The flag only controls whether matches with these entries should appear in the seen[] array. While we are here, also realign the REFRESH_* flags and convert the hex values to the more natural bit shift format, which makes it easier to spot holes. Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br> --- cache.h | 15 ++++++++------- read-cache.c | 5 ++++- 2 files changed, 12 insertions(+), 8 deletions(-)