Message ID | 20240514011437.3779151-2-gitster@pobox.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Fix use of uninitialized hash algorithms | expand |
On Mon, May 13, 2024 at 06:14:33PM -0700, Junio C Hamano wrote: [snip] > diff --git a/repository.c b/repository.c > index 15c10015b0..f912ee9a7c 100644 > --- a/repository.c > +++ b/repository.c > @@ -1,5 +1,6 @@ > #include "git-compat-util.h" > #include "abspath.h" > +#include "environment.h" > #include "repository.h" > #include "object-store-ll.h" > #include "config.h" > @@ -19,6 +20,27 @@ > static struct repository the_repo; > struct repository *the_repository = &the_repo; > > +static void set_default_hash_algo(struct repository *repo) > +{ > + const char *hash_name; > + int algo; > + > + hash_name = getenv(GIT_DEFAULT_HASH_ENVIRONMENT); > + if (!hash_name) > + return; > + algo = hash_algo_by_name(hash_name); > + > + /* > + * NEEDSWORK: after all, falling back to SHA-1 by assigning > + * GIT_HASH_SHA1 to algo here, instead of returning, may give > + * us better behaviour. > + */ > + if (algo == GIT_HASH_UNKNOWN) > + return; > + > + repo_set_hash_algo(repo, algo); > +} The problem with reusing "GIT_DEFAULT_HASH" is that we unconditionally set it in our test suite in "test-lib.sh". This will have the effect that we will never hit segfaults in our tests because we always end up setting up the default hash, whereas our users now will. I would propose to revert this back to the first iteration you had, where the workaround only enables the SHA1 fallback. No users have yet complained about the inability to pick the hash algo outside of a repo, indicating that it's not widely used. And when they complain, there is more motivation to fix this properly by adding a `--object-hash=` switch to the respective commands so that a user can pick the desired object hash. Patrick
Patrick Steinhardt <ps@pks.im> writes: > The problem with reusing "GIT_DEFAULT_HASH" is that we unconditionally > set it in our test suite in "test-lib.sh". This will have the effect > that we will never hit segfaults in our tests because we always end up > setting up the default hash, whereas our users now will. I tried the version you posted with only SHA-1 fallback and it failed with SHA256 CI jobs, and eventually came up with using the existing environment variable. The variable is the perfect fit in the longer term for our purpose. It is what the end-users will set, not for papering over remaining bugs from the "no longer there is a fallback default" change, but for telling Git that they want to use sha256 repositories. With GIT_DEFAULT_HASH set to sha256, we will be testing a configuration that is very close to those end-users. There probably are four cases we need to check: - In a repository, invocations of commands that should honor the hash function that is in use by the repository (i.e. this is true for most commands and their invocations). - In a repository, invocations of commands that should ignore the repository settings and always use SHA-1 (i.e. proposed fix to patch-id). - Outside a repository, invocations with GIT_DEFAULT_HASH set should probably parallel the above two, as if GIT_DEFAULT_HASH came from the repository. - Outside a repository, invocations without GIT_DEFAULT_HASH set should all default to SHA-1??? The last combination cannot be tested UNLESS we unset GIT_DEFAULT_HASH that is given by test-lib.sh, but that can easily be arranged, now we have one primary/central place that tests out-of-repository invocations of various commands in t1517. And using the existing environment variable has an added benefit that we do not have to add an extra option to random commands.
Junio C Hamano <gitster@pobox.com> writes: > Partially revert c8aed5e8 (repository: stop setting SHA1 as the > default object hash, 2024-05-07), to keep end-user systems still > broken when we have gap in our test coverage but yet give them an > escape hatch to set the GIT_DEFAULT_HASH environment variable to > "sha1" in order to revert to the previous behaviour. This variable > has been in use for using SHA-256 hash by default, and it should be > a better fit than inventing a new and test-only knob. Having done all of this, I actually am very tempted to add the "always default to SHA-1" back as a fallback position to the set_default_hash_algo() function. We know we are going to get the right hash algorithm when working in the repository, so the only case the default matters in practice is when working outside the repository. We already have such a custom code for "git diff --no-index", and we are adding a few more back in here, but they can disappear if we had code to set the fallback default when GIT_DEFAULT_HASH does not exist here. The "always use SHA-1 regardless of the hash used by the repository" code like "patch-id" should not depend on such a fallback default but should have its own code to explicitly set it. As the user can tweak what algorithm they want if the wanted to, it does not sound too bad to have a fallback default when the user did not choose.
On Tue, May 14, 2024 at 10:19:01AM -0700, Junio C Hamano wrote: > Junio C Hamano <gitster@pobox.com> writes: > > > Partially revert c8aed5e8 (repository: stop setting SHA1 as the > > default object hash, 2024-05-07), to keep end-user systems still > > broken when we have gap in our test coverage but yet give them an > > escape hatch to set the GIT_DEFAULT_HASH environment variable to > > "sha1" in order to revert to the previous behaviour. This variable > > has been in use for using SHA-256 hash by default, and it should be > > a better fit than inventing a new and test-only knob. > > Having done all of this, I actually am very tempted to add the > "always default to SHA-1" back as a fallback position to the > set_default_hash_algo() function. We know we are going to get the > right hash algorithm when working in the repository, so the only > case the default matters in practice is when working outside the > repository. > > We already have such a custom code for "git diff --no-index", and we > are adding a few more back in here, but they can disappear if we had > code to set the fallback default when GIT_DEFAULT_HASH does not > exist here. The "always use SHA-1 regardless of the hash used by > the repository" code like "patch-id" should not depend on such a > fallback default but should have its own code to explicitly set it. > > As the user can tweak what algorithm they want if the wanted to, it > does not sound too bad to have a fallback default when the user did > not choose. I think that this is going into the wrong direction. The patches that we have added surface real issues. If we now unconditionally add back the fallback to SHA-1, then we only punt those issues further down the road to the point in time where we drop `the_hash_algo` and/or `the_repository`. All the issues that were surfaced until now are technical debt, and the proposed fixes have been documented with a "TODO" item that they need more work. In some cases it's as simple as adding in an option to the respective command that lets the user pick the hash algorithm, and I do not think that GIT_DEFAULT_HASH is a proper replacement for that. In other cases like for git-patch-id(1) the issue runs deeper and we need to refactor the whole subsystem to stop relying on `the_hash_algo` altogether. Patrick
Junio C Hamano <gitster@pobox.com> writes: > Having done all of this, I actually am very tempted to add the > "always default to SHA-1" back as a fallback position to the > set_default_hash_algo() function. We know we are going to get the > right hash algorithm when working in the repository, so the only > case the default matters in practice is when working outside the > repository. Not really. It does not add anything to help either real world or our tests. The current test setting is already bad enough in that, unlike in the real world settings, even tests with the SHA-1 algorithm has GIT_DEFAULT_HASH environment variable set, which means that such a "if the environment variable is not set, further fall back to SHA-1" does not do anything. Unless we change t/test-lib.sh not to set GIT_DEFAULT_HASH tweaking the fallback default in repository.c:set_default_hash_algo() based on GIT_DEFAULT_HASH would not be a workable solution. I wanted to arrange things so that the end-user exectuion by default has an extra fallback (perhaps to SHA-1, or GIT_DEFAULT_HASH) to avoid disrupting their real-world use, which we can disable in our tests to expose code paths that still rely on the "default" set when in-core repository struct gets initialized, but that is not possible without changing the way t/test-lib.sh uses GIT_DEFAULT_HASH, it seems. So the arrangement unfortunately has to be "we have no default, and bugs will break the real-world uses as well as tests the same way. The real-world users have to export an extra 'workaround' environment variable to force "default" to SHA-1 (or GIT_DEFAULT_HASH) --- which may be "workable" but very far from being intuitive. They can set GIT_DEFAULT_HASH but to make it effective everywhere, including the "default" given by set_default_hash_algo(), they need to set this other "workaround" thing. > We already have such a custom code for "git diff --no-index", and we > are adding a few more back in here, but they can disappear if we had > code to set the fallback default when GIT_DEFAULT_HASH does not > exist here. While I think a manual setting of the_hash_algo in "diff --no-index" code path should not hardcode "SHA-1" but instead use the hash specified by the GIT_DEFAULT_HASH environment to be consistent with the use of "git" by the same parent process that had that variable exported to the environment, that should not be done globally in repository.c:set_default_hash_algo().
diff --git a/environment.h b/environment.h index 05fd94d7be..deaa29408f 100644 --- a/environment.h +++ b/environment.h @@ -56,6 +56,7 @@ const char *getenv_safe(struct strvec *argv, const char *name); #define GIT_OPTIONAL_LOCKS_ENVIRONMENT "GIT_OPTIONAL_LOCKS" #define GIT_TEXT_DOMAIN_DIR_ENVIRONMENT "GIT_TEXTDOMAINDIR" #define GIT_ATTR_SOURCE_ENVIRONMENT "GIT_ATTR_SOURCE" +#define GIT_DEFAULT_HASH_ENVIRONMENT "GIT_DEFAULT_HASH" /* * Environment variable used in handshaking the wire protocol. diff --git a/repository.c b/repository.c index 15c10015b0..f912ee9a7c 100644 --- a/repository.c +++ b/repository.c @@ -1,5 +1,6 @@ #include "git-compat-util.h" #include "abspath.h" +#include "environment.h" #include "repository.h" #include "object-store-ll.h" #include "config.h" @@ -19,6 +20,27 @@ static struct repository the_repo; struct repository *the_repository = &the_repo; +static void set_default_hash_algo(struct repository *repo) +{ + const char *hash_name; + int algo; + + hash_name = getenv(GIT_DEFAULT_HASH_ENVIRONMENT); + if (!hash_name) + return; + algo = hash_algo_by_name(hash_name); + + /* + * NEEDSWORK: after all, falling back to SHA-1 by assigning + * GIT_HASH_SHA1 to algo here, instead of returning, may give + * us better behaviour. + */ + if (algo == GIT_HASH_UNKNOWN) + return; + + repo_set_hash_algo(repo, algo); +} + void initialize_repository(struct repository *repo) { repo->objects = raw_object_store_new(); @@ -26,6 +48,24 @@ void initialize_repository(struct repository *repo) repo->parsed_objects = parsed_object_pool_new(); ALLOC_ARRAY(repo->index, 1); index_state_init(repo->index, repo); + + /* + * When a command runs inside a repository, it learns what + * hash algorithm is in use from the repository, but some + * commands are designed to work outside a repository, yet + * they want to access the_hash_algo, if only for the length + * of the hashed value to see if their input looks like a + * plausible hash value. + * + * We are in the process of identifying the codepaths and + * giving them an appropriate default individually; any + * unconverted codepath that tries to access the_hash_algo + * will thus fail. The end-users however have an escape hatch + * to set GIT_DEFAULT_HASH environment variable to "sha1" get + * back the old behaviour of defaulting to SHA-1. + */ + if (repo == the_repository) + set_default_hash_algo(repo); } static void expand_base_dir(char **out, const char *in, diff --git a/setup.c b/setup.c index 7c996659bd..d084703465 100644 --- a/setup.c +++ b/setup.c @@ -1840,8 +1840,6 @@ int daemonize(void) #define TEST_FILEMODE 1 #endif -#define GIT_DEFAULT_HASH_ENVIRONMENT "GIT_DEFAULT_HASH" - static void copy_templates_1(struct strbuf *path, struct strbuf *template_path, DIR *dir) {
Partially revert c8aed5e8 (repository: stop setting SHA1 as the default object hash, 2024-05-07), to keep end-user systems still broken when we have gap in our test coverage but yet give them an escape hatch to set the GIT_DEFAULT_HASH environment variable to "sha1" in order to revert to the previous behaviour. This variable has been in use for using SHA-256 hash by default, and it should be a better fit than inventing a new and test-only knob. Signed-off-by: Junio C Hamano <gitster@pobox.com> --- environment.h | 1 + repository.c | 40 ++++++++++++++++++++++++++++++++++++++++ setup.c | 2 -- 3 files changed, 41 insertions(+), 2 deletions(-)