Message ID | pull.1439.v4.git.1671204678.gitgitgadget@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | Optionally skip hashing index on write | expand |
On Fri, Dec 16 2022, Derrick Stolee via GitGitGadget wrote: Thanks for the update! > Updates since v3 > ================ > > * Patch 1 uses an "int" instead of "unsigned int". This was a carry-over > from Kevin's patch in microsoft/git, but our use of it in patch 2 is > different and thus is better with a signed int. > * Patch 2 uses a fallback to the_repository if istate->repo is NULL. This > allows the setting to be applied to more cases where istate->repo is not > set. > * Patch 2 also uses 'start' in more places, since it already computed the > beginning of the hash. > * Patch 4 slightly modifies its use of prepare_repo_settings() due to Patch > 2's fallback to the_repository. It's good that it's "int" now instead of "unsigned int", but just doing that search-replacement I think misses the implied (which I really should have made more explicit) question in my v3 feedback (https://lore.kernel.org/git/221215.861qp03agm.gmgdl@evledraar.gmail.com/): What should we do about the -1 state then? With your 2/4 here we'll accept e.g.: ./git -c index.skipHash=12345 status As well as: ./git -c index.skipHash=blahblah status But with the migration to repo-settings.c we start refusing the latter of those, as you inherit a stricture in repo-settings.c. Aside: I think this series would be much more readable if it were just squashed into one patch. 1/4 introduces code, but no way to reach it, with 2/4 we have config, 3/4 adds a test 4/4 tweaks how to read/parse/interpret/combine the config. But as it is split up the individual steps should make sense. The 2/4 here should really just use "bool", not "maybe_bool" to begin with, no? And part of this in 4/4 is inheriting a non-stricture in repo-settings.c, but for this new config variable that we're introducing as only a boolean from day one can't we just die() on anything that's not a boolean? On other implied feedback, in https://lore.kernel.org/git/221215.865yec3b1j.gmgdl@evledraar.gmail.com/ I noted the switching between istate->repo & the_repository, and that you could hit a BUG() (when uncommenting a line in my testing patch on top) if we didn't have istate->repo: There was a commit message update for that: > 2: 00738c81a12 ! 2: aae047cbc9f read-cache: add index.skipHash config option > @@ Commit message > > [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201 > > + We load this config from the repository config given by istate->repo, > + with a fallback to the_repository if it is not set. But I think that really sweeps a potential issue under the rug. What are these cases where we don't get istate->repo, are those expected? Is preferring istate->repo always known to be redundant now, and just done for good measure, or are there subtle cases (e.g. when reading submodules) where we pick the wrong repo's config? In that context I'd really like to see some testing of submodules in the added tests, i.e. does this act as we'd like it to when you set the config as "true" in the parent, but "false" in the submodule, & the other way around? That's a case that should stress this "the_repository" v.s. "istate->repo". > While older Git versions will not recognize the null hash as a special > case, the file format itself is still being met in terms of its > structure. Using this null hash will still allow Git operations to > @@ read-cache.c: static int verify_hdr(const struct cache_header *hdr, unsigned lon > the_hash_algo->update_fn(&c, hdr, size - the_hash_algo->rawsz); > the_hash_algo->final_fn(hash, &c); > - if (!hasheq(hash, (unsigned char *)hdr + size - the_hash_algo->rawsz)) > -+ if (!hasheq(hash, end - the_hash_algo->rawsz)) > ++ if (!hasheq(hash, start)) > return error(_("bad index file sha1 signature")); > return 0; Thanks, good to have this suggested simplification. > @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempfile *tempfile, > + int ieot_entries = 1; > + struct index_entry_offset_table *ieot = NULL; > + int nr, nr_threads; > ++ struct repository *r = istate->repo ? istate->repo : the_repository; > > f = hashfd(tempfile->fd, tempfile->filename.buf); > > -+ git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash); > ++ repo_config_get_maybe_bool(r, "index.skiphash", &f->skip_hash); > + > for (i = removed = extended = 0; i < entries; i++) { Re the above this looks good, but did we introduce an untested behavior change here or not? > if (cache[i]->ce_flags & CE_REMOVE) > 3: 86370af1351 = 3: 27fbe52e748 test-lib-functions: add helper for trailing hash > 4: 6490bd445eb ! 4: 075921514f2 features: feature.manyFiles implies fast index writes > @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf > > f = hashfd(tempfile->fd, tempfile->filename.buf); > > -- git_config_get_maybe_bool("index.skiphash", (int *)&f->skip_hash); > -+ if (istate->repo) { > -+ prepare_repo_settings(istate->repo); > -+ f->skip_hash = istate->repo->settings.index_skip_hash; > -+ } > +- repo_config_get_maybe_bool(r, "index.skiphash", &f->skip_hash); > ++ prepare_repo_settings(r); > ++ f->skip_hash = r->settings.index_skip_hash; > > for (i = removed = extended = 0; i < entries; i++) { > if (cache[i]->ce_flags & CE_REMOVE) I really don't care per-se where we read the config, as long as it's doing what we expect from the UX point of view. But in your https://lore.kernel.org/git/9e754af8-ecd3-6aed-60e8-2fc09a6a8759@github.com/ you noted "There's no reason to load the config inside repo-settings.c unless it's part of something like feature.manyFiles.". I think that's true, but on the flip side of that: Is there a reason to move the reading of such localized config (only needed in read-cache.c, as 2/4 shows) to repo-settings.c, just because it's now relying on the "manyfiles"? I think the unstated reason for that is that while we read the "manyfiles" config we didn't stick it in the "struct repo_settings", therefore the reading of it is conveniently done in repo-settings.c, which the 4/4 here does. But I think it makes more sense to just stick the "manyfiles" in that struct, not stick "skip_hash" there, and then just check the "manyfiles" in 4/4, but read "index.skiphash" locally. Maybe not worth the churn at this point, but I do wonder if we're going in the wrong direction here, i.e. if all config that happens to rely on features must be added to repo-settings.c, or whether we should reserve that for truly "global" (or mostly global) config.
On 12/16/2022 10:43 AM, Ævar Arnfjörð Bjarmason wrote: > But as it is split up the individual steps should make sense. The 2/4 > here should really just use "bool", not "maybe_bool" to begin with, no? > And part of this in 4/4 is inheriting a non-stricture in > repo-settings.c, but for this new config variable that we're introducing > as only a boolean from day one can't we just die() on anything that's > not a boolean? > > On other implied feedback, in > https://lore.kernel.org/git/221215.865yec3b1j.gmgdl@evledraar.gmail.com/ > I noted the switching between istate->repo & the_repository, and that > you could hit a BUG() (when uncommenting a line in my testing patch on > top) if we didn't have istate->repo: > > There was a commit message update for that: > >> 2: 00738c81a12 ! 2: aae047cbc9f read-cache: add index.skipHash config option >> @@ Commit message >> >> [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201 >> >> + We load this config from the repository config given by istate->repo, >> + with a fallback to the_repository if it is not set. > > But I think that really sweeps a potential issue under the rug. What are > these cases where we don't get istate->repo, are those expected? Is > preferring istate->repo always known to be redundant now, and just done > for good measure, or are there subtle cases (e.g. when reading > submodules) where we pick the wrong repo's config? After investigating some of the failures from creating a BUG() statement when istate->repo is NULL I see several problems, and they are not related to submodules for the most part. The first issues I've found are due to code paths that operate on the_index without actually initializing it with a do_read_index(), or otherwise initialize an index using a memset() instead of a common initializer. This looks to be a frequent enough problem that solving it would require a significant effort. It's not a quick fix. > In that context I'd really like to see some testing of submodules in the > added tests, i.e. does this act as we'd like it to when you set the > config as "true" in the parent, but "false" in the submodule, & the > other way around? That's a case that should stress this "the_repository" > v.s. "istate->repo". I can add a test for this, though we do not do that for every new config option. Further, we can only rely on commands like 'git add' that correctly initialize the index from a do_read_index(). It should be sufficient to use the index-local repository (when available) and trust that the repo_config_...() method is doing the right thing. > I really don't care per-se where we read the config, as long as it's > doing what we expect from the UX point of view. > > But in your > https://lore.kernel.org/git/9e754af8-ecd3-6aed-60e8-2fc09a6a8759@github.com/ > you noted "There's no reason to load the config inside repo-settings.c > unless it's part of something like feature.manyFiles.". > > I think that's true, but on the flip side of that: Is there a reason to > move the reading of such localized config (only needed in read-cache.c, > as 2/4 shows) to repo-settings.c, just because it's now relying on the > "manyfiles"? Yes, because it makes it clear what options are part of these meta-config keys. It is better to centralize the parsing instead of having several different ad-hoc parsing strategies. Thanks, -Stolee
Derrick Stolee <derrickstolee@github.com> writes: > After investigating some of the failures from creating a BUG() statement > when istate->repo is NULL I see several problems, and they are not related > to submodules for the most part. > > The first issues I've found are due to code paths that operate on the_index > without actually initializing it with a do_read_index(), or otherwise > initialize an index using a memset() instead of a common initializer. This > looks to be a frequent enough problem that solving it would require a > significant effort. It's not a quick fix. Thanks. It is not entirely unexpected X-<, considering how the connection from the in-core repository and the in-core index has been developed and evolved. So in short, istate->repo is not yet ready to be used, until the problems you identified are resolved? We probably should start paying down such technical debts. We've punted on them too many times, saying "yes this is klunky but what we have is good enough for adding this feature", I suspect? Thanks.
On 1/6/2023 5:45 PM, Junio C Hamano wrote: > Derrick Stolee <derrickstolee@github.com> writes: > >> After investigating some of the failures from creating a BUG() statement >> when istate->repo is NULL I see several problems, and they are not related >> to submodules for the most part. >> >> The first issues I've found are due to code paths that operate on the_index >> without actually initializing it with a do_read_index(), or otherwise >> initialize an index using a memset() instead of a common initializer. This >> looks to be a frequent enough problem that solving it would require a >> significant effort. It's not a quick fix. > > Thanks. It is not entirely unexpected X-<, considering how the > connection from the in-core repository and the in-core index has > been developed and evolved. So in short, istate->repo is not yet > ready to be used, until the problems you identified are resolved? > > We probably should start paying down such technical debts. We've > punted on them too many times, saying "yes this is klunky but what > we have is good enough for adding this feature", I suspect? Yes, I'll make note to prioritize this soon. Thanks, -Stolee
On Fri, Jan 06 2023, Derrick Stolee wrote: > On 1/6/2023 5:45 PM, Junio C Hamano wrote: >> Derrick Stolee <derrickstolee@github.com> writes: >> >>> After investigating some of the failures from creating a BUG() statement >>> when istate->repo is NULL I see several problems, and they are not related >>> to submodules for the most part. >>> >>> The first issues I've found are due to code paths that operate on the_index >>> without actually initializing it with a do_read_index(), or otherwise >>> initialize an index using a memset() instead of a common initializer. This >>> looks to be a frequent enough problem that solving it would require a >>> significant effort. It's not a quick fix. >> >> Thanks. It is not entirely unexpected X-<, considering how the >> connection from the in-core repository and the in-core index has >> been developed and evolved. So in short, istate->repo is not yet >> ready to be used, until the problems you identified are resolved? >> >> We probably should start paying down such technical debts. We've >> punted on them too many times, saying "yes this is klunky but what >> we have is good enough for adding this feature", I suspect? > > Yes, I'll make note to prioritize this soon. I noted in passing in [1] that I had those patches locally, if I'm understanding you correctly and you're planning to work on changes that'll make "istate->repo" always non-NULL. I've rebased those on top of your "ds/omit-trailing-hash-in-index". I'm CI-ing those now & hoping to submit them soon (I've had it working for a while, but there was some interaction with your patches). Preview at [2]. 1. https://lore.kernel.org/git/221215.865yec3b1j.gmgdl@evledraar.gmail.com/ 2. https://github.com/avar/git/compare/avar-ds/omit-trailing-hash-in-index...avar/do-not-lazy-populate-istate-repo
On 1/9/2023 12:15 PM, Ævar Arnfjörð Bjarmason wrote: > On Fri, Jan 06 2023, Derrick Stolee wrote: >> On 1/6/2023 5:45 PM, Junio C Hamano wrote: >>> We probably should start paying down such technical debts. We've >>> punted on them too many times, saying "yes this is klunky but what >>> we have is good enough for adding this feature", I suspect? >> >> Yes, I'll make note to prioritize this soon. > > I noted in passing in [1] that I had those patches locally, if I'm > understanding you correctly and you're planning to work on changes > that'll make "istate->repo" always non-NULL. > > I've rebased those on top of your "ds/omit-trailing-hash-in-index". I'm > CI-ing those now & hoping to submit them soon (I've had it working for a > while, but there was some interaction with your patches). Preview at > [2]. > > 1. https://lore.kernel.org/git/221215.865yec3b1j.gmgdl@evledraar.gmail.com/ > 2. https://github.com/avar/git/compare/avar-ds/omit-trailing-hash-in-index...avar/do-not-lazy-populate-istate-repo Thanks for doing this so I don't need to. Some quick pre-submission feedback: your "treewide" patch [1] is far too big and doing too much all at once. It's difficult to know why you're doing things when you're doing them, especially the choices for the validate_cache_repo() calls. In particular, I noticed that you used "the_repository" in some cases where a local "struct repository *" would be better (even if it is currently pointing to the_repository as in builtin/sparse-checkout.c in update_working_directory()). These would be easier to catch in smaller diffs. [1] https://github.com/avar/git/commit/7732a41800dfc8f5dbf909560615d6048d583ed9 Looking forward to your series. -Stolee
On Mon, Jan 09 2023, Derrick Stolee wrote: > On 1/9/2023 12:15 PM, Ævar Arnfjörð Bjarmason wrote: >> On Fri, Jan 06 2023, Derrick Stolee wrote: >>> On 1/6/2023 5:45 PM, Junio C Hamano wrote: > >>>> We probably should start paying down such technical debts. We've >>>> punted on them too many times, saying "yes this is klunky but what >>>> we have is good enough for adding this feature", I suspect? >>> >>> Yes, I'll make note to prioritize this soon. >> >> I noted in passing in [1] that I had those patches locally, if I'm >> understanding you correctly and you're planning to work on changes >> that'll make "istate->repo" always non-NULL. >> >> I've rebased those on top of your "ds/omit-trailing-hash-in-index". I'm >> CI-ing those now & hoping to submit them soon (I've had it working for a >> while, but there was some interaction with your patches). Preview at >> [2]. >> >> 1. https://lore.kernel.org/git/221215.865yec3b1j.gmgdl@evledraar.gmail.com/ >> 2. https://github.com/avar/git/compare/avar-ds/omit-trailing-hash-in-index...avar/do-not-lazy-populate-istate-repo > > Thanks for doing this so I don't need to. > > Some quick pre-submission feedback: your "treewide" patch [1] is far > too big and doing too much all at once. It's difficult to know why > you're doing things when you're doing them, especially the choices > for the validate_cache_repo() calls. Yeah, I don't like it either :) I'm mulling over whether to keep it at all. I.e. the fix to pass the data is relatively small, but most of that patch is validation paranoia. It was helpful to write that to validate it for my own sanity, but maybe it's not worth keeping it. Ultimately it's just making things BUG(...) during testing that would otherwise be a NULL-pointer dereference, so I'm currently leaning towards (especially with this early feedback) just skipping it. Now that we have a SANITIZE=address CI job the error feedback if/when that would happen is arguably worse than the BUG(...) output. > In particular, I noticed that you used "the_repository" in some cases > where a local "struct repository *" would be better (even if it is > currently pointing to the_repository as in builtin/sparse-checkout.c > in update_working_directory()). These would be easier to catch in > smaller diffs. Yes, definitely. I was careful to use an existing "struct repository *" in favor of "the_repository", but clearly not careful enough! I'll fix that, and look carefully over the rest before submission in case I've missed other similar cases. > [1] https://github.com/avar/git/commit/7732a41800dfc8f5dbf909560615d6048d583ed9 > > Looking forward to your series. Thanks, since I sent the above CI passed, so it'll be sooner than later, but I'll massage it a bit per the above before submission.