diff mbox series

[v3,2/5] config: read protected config with `git_protected_config()`

Message ID 7499a2809615d42eaf3649e1c33f38d099d27c1a.1653685761.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series config: introduce discovery.bare and protected config | expand

Commit Message

Glen Choo May 27, 2022, 9:09 p.m. UTC
From: Glen Choo <chooglen@google.com>

Protected config is read using `read_very_early_config()`, which has
several downsides:

- Every call to `read_very_early_config()` parses global and
  system-level config files anew, but this can be optimized by just
  parsing them once [1].
- Protected variables should respect "-c" because we can reasonably
  assume that it comes from the user. But, `read_very_early_config()`
  can't use "-c" because it is called so early that it does not have
  access to command line arguments.

Introduce `git_protected_config()`, which reads protected config and
caches the values in `the_repository.protected_config`. Then, refactor
`safe.directory` to use `git_protected_config()`.

This implementation can still be improved, however:

- `git_protected_config()` iterates through every variable in
  `the_repository.protected_config`, which may still be too expensive to
  be called in every "git" invocation. There exist constant time lookup
  functions for non-protected config (repo_config_get_*()), but for
  simplicity, this commit does not implement similar functions for
  protected config.

- Protected config is stored in `the_repository` so that we don't need
  to statically allocate it. But this might be confusing since protected
  config ignores repository config by definition.

[1] While `git_protected_config()` should save on file I/O, I wasn't
able to measure a meaningful difference between that and
`read_very_early_config()` on my machine (which has an SSD).

Signed-off-by: Glen Choo <chooglen@google.com>
---
 config.c     | 35 +++++++++++++++++++++++++++++++++++
 config.h     |  8 ++++++++
 repository.c |  5 +++++
 repository.h |  8 ++++++++
 setup.c      |  2 +-
 5 files changed, 57 insertions(+), 1 deletion(-)

Comments

Junio C Hamano May 28, 2022, 12:28 a.m. UTC | #1
"Glen Choo via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Glen Choo <chooglen@google.com>
>
> Protected config is read using `read_very_early_config()`, which has
> several downsides:
>
> - Every call to `read_very_early_config()` parses global and
>   system-level config files anew, but this can be optimized by just
>   parsing them once [1].
> - Protected variables should respect "-c" because we can reasonably
>   assume that it comes from the user. But, `read_very_early_config()`
>   can't use "-c" because it is called so early that it does not have
>   access to command line arguments.

Now we are talking about protected "variable".  Is that a synonym
for "config", or are there some distinctions between them?

> - Protected config is stored in `the_repository` so that we don't need
>   to statically allocate it. But this might be confusing since protected
>   config ignores repository config by definition.

Yes, it indeed is.  Is it because we were over-eager when we
introduced the "struct repository *repo" parameter to many functions
and the configuration system wants you to have some repository, even
when you know you are not reading from any repository?  

I am wondering if it is a cleaner solution *not* to hang the
protected config as a configset in the_repository, but keep the
configset as a separate global variable, perhaps static to config.c
and is meant to be only accessed via git_protected_config() and the
like.

> @@ -295,6 +295,11 @@ void repo_clear(struct repository *repo)
>  		FREE_AND_NULL(repo->remote_state);
>  	}
>  
> +	if (repo->protected_config) {
> +		git_configset_clear(repo->protected_config);
> +		FREE_AND_NULL(repo->protected_config);
> +	}
> +

This becomes necessary only because each repository instance has
protected_config, even though we need only one instance, no matter
how many repositories we are accessing in this single invocation of
Git, no?

How should "git config -l" interact with "protected config" and
"protected variables", by the way?  Should a user be able to tell
which ones are coming from protected scope?  Should we gain, next to
--global, --system, etc., --protected option to list only the
protected config/variable?

This is another thing that I find iffy on terminology.  Should a
random variable, like user.name, be a "protected config", if it is
found in $HOME/.gitconfig?  If it comes from there, surely we can
trust its value, but unlike things like safe.directory, there is no
code that wants to enforce that we pay attention only to user.name
that came from trusted scopes.  Should such a variable be called
"protected variable"?

Thanks.
Glen Choo May 31, 2022, 5:43 p.m. UTC | #2
Junio C Hamano <gitster@pobox.com> writes:

>> Protected config is read using `read_very_early_config()`, which has
>> several downsides:
>>
>> - Every call to `read_very_early_config()` parses global and
>>   system-level config files anew, but this can be optimized by just
>>   parsing them once [1].
>> - Protected variables should respect "-c" because we can reasonably
>>   assume that it comes from the user. But, `read_very_early_config()`
>>   can't use "-c" because it is called so early that it does not have
>>   access to command line arguments.
>
> Now we are talking about protected "variable".  Is that a synonym
> for "config", or are there some distinctions between them?

Sorry, that's an old term I was toying with (this somehow snuck through
my proofreading). I just meant "variable that is only read from
protected config", aka a "protected config only variable".

A goal in this version was to introduce as little jargon as possible, so
- "protected config" refers to the set of config sources, and
- "protected config only" refers to config variables/settings that are
  only read from protected config.

>> - Protected config is stored in `the_repository` so that we don't need
>>   to statically allocate it. But this might be confusing since protected
>>   config ignores repository config by definition.
>
> Yes, it indeed is.  Is it because we were over-eager when we
> introduced the "struct repository *repo" parameter to many functions
> and the configuration system wants you to have some repository, even
> when you know you are not reading from any repository?  

Ah no, I was just trying to avoid yet-another global variable (since
IIRC we want to move towards a more lib-like Git), and the_repository
was a convenient global variable to (ab)use.

> I am wondering if it is a cleaner solution *not* to hang the
> protected config as a configset in the_repository, but keep the
> configset as a separate global variable, perhaps static to config.c
> and is meant to be only accessed via git_protected_config() and the
> like.

I think your suggestion to use a global variable is better, as much as I
want to avoid another global variable. Protected config would affect any
repositories that we work with in-core, so using a global sounds ok.

environment.c might be a better place since we already make a concerted
effort to put global config variables there instead of config.c.

As an aside, I wonder how we could get rid of all of the globals in
environment.c in the long term. Maybe we would have yet-another all
encompassing global, the_environment, and then figure out which
variables belong to the repository and which belong to the environment.

>> @@ -295,6 +295,11 @@ void repo_clear(struct repository *repo)
>>  		FREE_AND_NULL(repo->remote_state);
>>  	}
>>  
>> +	if (repo->protected_config) {
>> +		git_configset_clear(repo->protected_config);
>> +		FREE_AND_NULL(repo->protected_config);
>> +	}
>> +
>
> This becomes necessary only because each repository instance has
> protected_config, even though we need only one instance, no matter
> how many repositories we are accessing in this single invocation of
> Git, no?

Yes.

> How should "git config -l" interact with "protected config" and
> "protected variables", by the way?  Should a user be able to tell
> which ones are coming from protected scope?  Should we gain, next to
> --global, --system, etc., --protected option to list only the
> protected config/variable?

I'll have to think about this some more. My initial thoughts are that we
should do this if we formalize 'protected' as a scope-like concept, but
I don't see the lack of "--protected" as a significant hindrance to
users because they can use "--global" and "--system" (albeit in two
invocations instead of one).

> This is another thing that I find iffy on terminology.  Should a
> random variable, like user.name, be a "protected config", if it is
> found in $HOME/.gitconfig?  If it comes from there, surely we can
> trust its value, but unlike things like safe.directory, there is no
> code that wants to enforce that we pay attention only to user.name
> that came from trusted scopes.  Should such a variable be called
> "protected variable"?

Ah.. I think it would be best to pretend that the "Protected variable"
typo never happened. That term was destined to be confusing and
meaningless.

Instead, we can use "protected config" to refer to the config and
"protected config only" to refer to variables. Since "protected config"
is defined as (global + system + CLI) config, then yes, we would say
that it is "protected config". But since we do not enforce that
"user.name" _must_ come from only protected config, it is not "protected
config only".
Junio C Hamano June 1, 2022, 3:58 p.m. UTC | #3
Glen Choo <chooglen@google.com> writes:

> A goal in this version was to introduce as little jargon as possible, so
> - "protected config" refers to the set of config sources, and
> - "protected config only" refers to config variables/settings that are
>   only read from protected config.

OK.  Let's have such a clear pair of definitions somewhere in the
doc or at least in a proposed log message.

>
>>> - Protected config is stored in `the_repository` so that we don't need
>>>   to statically allocate it. But this might be confusing since protected
>>>   config ignores repository config by definition.
>>
>> Yes, it indeed is.  Is it because we were over-eager when we
>> introduced the "struct repository *repo" parameter to many functions
>> and the configuration system wants you to have some repository, even
>> when you know you are not reading from any repository?  
>
> Ah no, I was just trying to avoid yet-another global variable (since
> IIRC we want to move towards a more lib-like Git), and the_repository
> was a convenient global variable to (ab)use.

If this does not have to be known only inside config.c, until we
introduce a more global bag of things, which may have the current
the_repository as one of its components, I do not think it hurts to
have a file-scope static there.  Then, perhaps git_configset_get*()
helper functions can recognize cs==NULL as a sign that the caller
wants to grab from the "protected config", or something?  If we do
not want to expose the underying global variable to the public, that
is.

> As an aside, I wonder how we could get rid of all of the globals in
> environment.c in the long term. Maybe we would have yet-another all
> encompassing global, the_environment, and then figure out which
> variables belong to the repository and which belong to the environment.

I think we are on the same page, we'd probably need something called
the_world ;-)

> Instead, we can use "protected config" to refer to the config and
> "protected config only" to refer to variables. Since "protected config"
> is defined as (global + system + CLI) config, then yes, we would say
> that it is "protected config". But since we do not enforce that
> "user.name" _must_ come from only protected config, it is not "protected
> config only".

Very clear.  Thanks.
Derrick Stolee June 2, 2022, 12:56 p.m. UTC | #4
On 5/27/2022 5:09 PM, Glen Choo via GitGitGadget wrote:
> From: Glen Choo <chooglen@google.com>
> 
> Protected config is read using `read_very_early_config()`, which has
> several downsides:
> 
> - Every call to `read_very_early_config()` parses global and
>   system-level config files anew, but this can be optimized by just
>   parsing them once [1].
> - Protected variables should respect "-c" because we can reasonably
>   assume that it comes from the user. But, `read_very_early_config()`
>   can't use "-c" because it is called so early that it does not have
>   access to command line arguments.
> 
> Introduce `git_protected_config()`, which reads protected config and
> caches the values in `the_repository.protected_config`. Then, refactor
> `safe.directory` to use `git_protected_config()`.
> 
> This implementation can still be improved, however:
> 
> - `git_protected_config()` iterates through every variable in
>   `the_repository.protected_config`, which may still be too expensive to
>   be called in every "git" invocation. There exist constant time lookup
>   functions for non-protected config (repo_config_get_*()), but for
>   simplicity, this commit does not implement similar functions for
>   protected config.

I originally thought that we should jump to that "right" solution, but
the existing logic in ensure_valid_ownership() uses the iterator method,
mostly because it uses a multi-valued string. There are helpers that
allow iterating over a specific multi-valued key, but there is no reason
to complicate the current patch with that amount of refactoring. That
can be handled as a completely separate topic.
 
> - Protected config is stored in `the_repository` so that we don't need
>   to statically allocate it. But this might be confusing since protected
>   config ignores repository config by definition.

I agree with Junio's suggestion of keeping this as a static global in
config.c, accessible only by the public methods from config.h. A future
where we have "the_world" might be nice for inventory on all these
globals. Definitely not something to hold up this series.

> +/* Read protected config into the_repository->protected_config. */
> +static void read_protected_config(void)
> +{
> +	char *xdg_config = NULL, *user_config = NULL, *system_config = NULL;
> +
> +	CALLOC_ARRAY(the_repository->protected_config, 1);
> +	git_configset_init(the_repository->protected_config);
> +
> +	system_config = git_system_config();
> +	git_global_config(&user_config, &xdg_config);
> +
> +	git_configset_add_file(the_repository->protected_config, system_config);
> +	git_configset_add_file(the_repository->protected_config, xdg_config);
> +	git_configset_add_file(the_repository->protected_config, user_config);
> +
> +	free(system_config);
> +	free(xdg_config);
> +	free(user_config);
> +}

This loads the config from three files, including the xdg_config, which
I wasn't thinking about before.

This implementation does not use the -c config yet, which you listed as
a downside of read_very_early_config(). I see that you include that in
your patch 4, but the commit message for this patch could list that as a
step that will be handled by a later change.

(You could also do that as patch 3 and add a test near the existing
safe.directory tests instead of waiting for discovery.bare.)

> +
> +/* Ensure that the_repository->protected_config has been initialized. */
> +static void git_protected_config_check_init(void)
> +{
> +	if (the_repository->protected_config &&
> +	    the_repository->protected_config->hash_initialized)
> +		return;
> +	read_protected_config();
> +}
> +
> +void git_protected_config(config_fn_t fn, void *data)
> +{
> +	git_protected_config_check_init();
> +	configset_iter(the_repository->protected_config, fn, data);
> +}

These two methods are clearly correct.

..._check_init() is an OK name. I've seen us use "prepare_...()" in
other areas as a way of making sure that we have the proper state
(see prepare_packed_git() and the like), so maybe a rename here to
match would be worthwhile. Feel free to ignore.

> +	if (repo->protected_config) {
> +		git_configset_clear(repo->protected_config);
> +		FREE_AND_NULL(repo->protected_config);
> +	}

This will have no equivalent when protected_config is left as a
static global, but that is fine. It only goes out of scope with
the end of the process, anyway.

> @@ -1128,7 +1128,7 @@ static int ensure_valid_ownership(const char *path)
>  	    is_path_owned_by_current_user(path))
>  		return 1;
>  
> -	read_very_early_config(safe_directory_cb, &data);
> +	git_protected_config(safe_directory_cb, &data);

Nice to have a very simple conversion here.

Thanks,
-Stolee
diff mbox series

Patch

diff --git a/config.c b/config.c
index fa471dbdb89..c30bb7c5d09 100644
--- a/config.c
+++ b/config.c
@@ -2614,6 +2614,41 @@  int repo_config_get_pathname(struct repository *repo,
 	return ret;
 }
 
+/* Read protected config into the_repository->protected_config. */
+static void read_protected_config(void)
+{
+	char *xdg_config = NULL, *user_config = NULL, *system_config = NULL;
+
+	CALLOC_ARRAY(the_repository->protected_config, 1);
+	git_configset_init(the_repository->protected_config);
+
+	system_config = git_system_config();
+	git_global_config(&user_config, &xdg_config);
+
+	git_configset_add_file(the_repository->protected_config, system_config);
+	git_configset_add_file(the_repository->protected_config, xdg_config);
+	git_configset_add_file(the_repository->protected_config, user_config);
+
+	free(system_config);
+	free(xdg_config);
+	free(user_config);
+}
+
+/* Ensure that the_repository->protected_config has been initialized. */
+static void git_protected_config_check_init(void)
+{
+	if (the_repository->protected_config &&
+	    the_repository->protected_config->hash_initialized)
+		return;
+	read_protected_config();
+}
+
+void git_protected_config(config_fn_t fn, void *data)
+{
+	git_protected_config_check_init();
+	configset_iter(the_repository->protected_config, fn, data);
+}
+
 /* Functions used historically to read configuration from 'the_repository' */
 void git_config(config_fn_t fn, void *data)
 {
diff --git a/config.h b/config.h
index 7654f61c634..411965f52b5 100644
--- a/config.h
+++ b/config.h
@@ -505,6 +505,14 @@  int repo_config_get_maybe_bool(struct repository *repo,
 int repo_config_get_pathname(struct repository *repo,
 			     const char *key, const char **dest);
 
+/*
+ * Functions for reading protected config. By definition, protected
+ * config ignores repository config, so it is unnecessary to read
+ * protected config from any `struct repository` other than
+ * the_repository.
+ */
+void git_protected_config(config_fn_t fn, void *data);
+
 /**
  * Querying For Specific Variables
  * -------------------------------
diff --git a/repository.c b/repository.c
index 5d166b692c8..ec319a5e09a 100644
--- a/repository.c
+++ b/repository.c
@@ -295,6 +295,11 @@  void repo_clear(struct repository *repo)
 		FREE_AND_NULL(repo->remote_state);
 	}
 
+	if (repo->protected_config) {
+		git_configset_clear(repo->protected_config);
+		FREE_AND_NULL(repo->protected_config);
+	}
+
 	repo_clear_path_cache(&repo->cached_paths);
 }
 
diff --git a/repository.h b/repository.h
index 6cc661e5a43..24251aac553 100644
--- a/repository.h
+++ b/repository.h
@@ -126,6 +126,14 @@  struct repository {
 
 	struct repo_settings settings;
 
+	/*
+	 * Config that comes from trusted sources, namely
+	 * - system config files (e.g. /etc/gitconfig)
+	 * - global config files (e.g. $HOME/.gitconfig,
+	 *   $XDG_CONFIG_HOME/git)
+	 */
+	struct config_set *protected_config;
+
 	/* Subsystems */
 	/*
 	 * Repository's config which contains key-value pairs from the usual
diff --git a/setup.c b/setup.c
index f818dd858c6..847d47f9195 100644
--- a/setup.c
+++ b/setup.c
@@ -1128,7 +1128,7 @@  static int ensure_valid_ownership(const char *path)
 	    is_path_owned_by_current_user(path))
 		return 1;
 
-	read_very_early_config(safe_directory_cb, &data);
+	git_protected_config(safe_directory_cb, &data);
 
 	return data.is_safe;
 }