
[WIP,v2,2/2] config: include file if remote URL matches a glob

Message ID 3c0d51ee39b8e149b5be57b8cd3f8cd403fe49c9.1635527389.git.jonathantanmy@google.com (mailing list archive)
State Superseded
Series Conditional config includes based on remote URL

Commit Message

Jonathan Tan Oct. 29, 2021, 5:31 p.m. UTC
This is a feature that supports config file inclusion conditional on
whether the repo has a remote with a URL that matches a glob.

Similar to my previous work on remote-suggested hooks [1], the main
motivation is to allow remote repo administrators to provide recommended
configs in a way that can be consumed more easily (e.g. through a
package installable by a package manager).

NEEDSWORK: The way this works is that if we see such an include, we
shunt all subsequent configs into a stash (while looking for URLs), then
process the stash. In particular, this means that more memory is needed,
and the nature of error reporting changes (currently, if a callback
returns nonzero for a variable, processing halts immediately, but with
this patch, all the config might be read from disk before the callback
even sees the variable). I'll need to expand on this and write a
documentation section.

One alternative is to rerun the config parsing mechanism upon noticing
the first URL-conditional include in order to find all URLs. This would
require the config files to be read from disk twice, though.

[1] https://lore.kernel.org/git/cover.1623881977.git.jonathantanmy@google.com/

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 config.c          | 132 +++++++++++++++++++++++++++++++++++++++++-----
 t/t1300-config.sh |  60 +++++++++++++++++++++
 2 files changed, 180 insertions(+), 12 deletions(-)

Comments

Emily Shaffer Nov. 5, 2021, 8:24 p.m. UTC | #1
On Fri, Oct 29, 2021 at 10:31:10AM -0700, Jonathan Tan wrote:
> 
> This is a feature that supports config file inclusion conditional on
> whether the repo has a remote with a URL that matches a glob.
> 
> Similar to my previous work on remote-suggested hooks [1], the main
> motivation is to allow remote repo administrators to provide recommended
> configs in a way that can be consumed more easily (e.g. through a
> package installable by a package manager).

To expand a little more on this:

At Google we ship /etc/gitconfig, as well as /usr/share/git-core/. Our
/etc/gitconfig looks basically like:

 [include]
   path = /usr/share/git-core/gitconfig
   path = /usr/share/git-core/some-specific-config
   path = /usr/share/git-core/other-specific-config

Jonathan's WIP allows us to append lines to /etc/gitconfig sort of like

 [includeIf "hasRemoteUrl:https://internal-google/big-project"]
   path = /usr/share/big-project/gitconfig

That's approach #1 to shipping a config, which we might use for a
project that makes up a significant portion of our userbase. We ship
(and own) the /etc/gitconfig; BigProject team ships and owns their own
gitconfig; everybody internally who works on BigProject, whether it's
just once to fix a small thing or every day as their main job, gets the
relevant configs for BigProject.

Approach #2 I think is also still a useful one, and maybe more
interesting outside of Google:

When I run 'sudo apt install big-oss-project-devkit', a few things
happen:
1. /usr/share/big-oss-project/gitconfig appears
2. `git config --global \
		'includeIf.hasRemoteUrl:https://github/big-oss-project/*.path' \
		'/usr/share/big-oss-project/gitconfig'` is run
3. whatever other special tools, scripts, etc. are installed

That way regardless of which project I'm working on -
big-oss-project/translation, big-oss-project/docs,
big-oss-project/big-oss-project - I still get configs and style checkers
and whatever else.

With this approach #2, it's still possible for someone to do a drive-by
contribution without ever running 'apt install big-oss-project-devkit',
so it's not quite as strong a recommendation as the former
"remote-suggested-hooks" topic. Users would still want to take a look at
the README for big-oss-project to learn they're supposed to be
installing that package ahead of time. But it's still a one-shot setup
for nice things like partial clone filters, maybe sparsity filters,
maybe config-based hooks, etc., especially if big-oss-project already
was shipping some project-specific tooling (like maybe a special
debugger or a docker image or I don't know).

The nice thing about 'hasRemoteUrl' in this case is that we don't need
to know the location of the user's big-oss-project/ checkout on disk. We
can set that config globally and they can checkout big-oss-project as
many times and as many places as they wish. It wouldn't be possible to
ship configs via a package manager or other automated script without it.

> 
> NEEDSWORK: The way this works is that if we see such an include, we
> shunt all subsequent configs into a stash (while looking for URLs), then
> process the stash. In particular, this means that more memory is needed,
> and the nature of error reporting changes (currently, if a callback
> returns nonzero for a variable, processing halts immediately, but with
> this patch, all the config might be read from disk before the callback
> even sees the variable). I'll need to expand on this and write a
> documentation section.

Hm. I'm not so sure about making another structure for storing config
into memory, because we already do that during the regular config parse
(to make things like git_config_get_string() fast). Can we not re-walk
the in-memory config at the end of the normal parse, rather than reading
from disk twice?

I think git_config()/repo_config() callback even does that for you for free...?

2304 void repo_config(struct repository *repo, config_fn_t fn, void
*data)
2305 {
2306         git_config_check_init(repo);
2307         configset_iter(repo->config, fn, data);
2308 }

> 
> One alternative is to rerun the config parsing mechanism upon noticing
> the first URL-conditional include in order to find all URLs. This would
> require the config files to be read from disk twice, though.

What's the easiest way to "try it and see", to add tooling and find out
whether the config files would be reopened during the second parse?
Because I suspect that we won't actually reopen those files, due to the
config cache.

So couldn't we do something like....

pass #1:
 if (include)
   if (not hasRemoteUrl)
     open up path & parse
 put config into in-memory cache normally
pass #2: (and this pass would need to be added to repo_config() probably)
 if (include)
   if (hasRemoteUrl)
     open up path & parse
     insert in-order into in-memory cache
 don't touch existing configs otherwise

I think it's in practice similar to the approach you're using (getting
around the weird ordering with a cache in memory), but we could reuse
the existing config cache rather than creating a new and different one.

 - Emily
Ævar Arnfjörð Bjarmason Nov. 6, 2021, 4:41 a.m. UTC | #2
On Fri, Nov 05 2021, Emily Shaffer wrote:

> On Fri, Oct 29, 2021 at 10:31:10AM -0700, Jonathan Tan wrote:
> [...]
>> 
>> One alternative is to rerun the config parsing mechanism upon noticing
>> the first URL-conditional include in order to find all URLs. This would
>> require the config files to be read from disk twice, though.
>
> What's the easiest way to "try it and see", to add tooling and find out
> whether the config files would be reopened during the second parse?
> Because I suspect that we won't actually reopen those files, due to the
> config cache.

strace -f?

> So couldn't we do something like....
>
> pass #1:
>  if (include)
>    if (not hasRemoteUrl)
>      open up path & parse
>  put config into in-memory cache normally
> pass #2: (and this pass would need to be added to repo_config() probably)
>  if (include)
>    if (hasRemoteUrl)
>      open up path & parse
>      insert in-order into in-memory cache
>  don't touch existing configs otherwise
>
> I think it's in practice similar to the approach you're using (getting
> around the weird ordering with a cache in memory), but we could reuse
> the existing config cache rather than creating a new and different one.

I don't know enough to say if this two-step approach is better (although
I'm slightly biased in that direction, since it seems simpler), but this
just seems like premature optimization.

I.e. let's just read the files twice, they'll be in the OS's FS cache,
which is unlikely to be a bottleneck for the amount of files involved.

That being said we do have exactly this cache already. See [1] and
3c8687a73ee (add `config_set` API for caching config-like files,
2014-07-28).

But I think that was added due to *very* frequent re-parsing of the
entire config every time someone needed a config variable, not due to
the I/O overhead (but I may be wrong).

So if we've got 100 config variables we need and 10 config files then
10*100 is probably starting to hurt, but if for whatever reason we
needed 2*10 here that's probably no big deal, and in any case would only
happen if this new include mechanism was in play.

1. https://lore.kernel.org/git/1404631162-18556-1-git-send-email-tanayabh@gmail.com/
Jonathan Tan Nov. 9, 2021, 12:22 a.m. UTC | #3
> To expand a little more on this:

[snip]

> The nice thing about 'hasRemoteUrl' in this case is that we don't need
> to know the location of the user's big-oss-project/ checkout on disk. We
> can set that config globally and they can checkout big-oss-project as
> many times and as many places as they wish. It wouldn't be possible to
> ship configs via a package manager or other automated script without it.

Ah, thanks for the elaboration!

> > NEEDSWORK: The way this works is that if we see such an include, we
> > shunt all subsequent configs into a stash (while looking for URLs), then
> > process the stash. In particular, this means that more memory is needed,
> > and the nature of error reporting changes (currently, if a callback
> > returns nonzero for a variable, processing halts immediately, but with
> > this patch, all the config might be read from disk before the callback
> > even sees the variable). I'll need to expand on this and write a
> > documentation section.
> 
> Hm. I'm not so sure about making another structure for storing config
> into memory, because we already do that during the regular config parse
> (to make things like git_config_get_string() fast). Can we not re-walk
> the in-memory config at the end of the normal parse, rather than reading
> from disk twice?
> 
> I think git_config()/repo_config() callback even does that for you for free...?

The main thing is that we wouldn't know if an entry would have been
overridden by a value from an includeif.hasremoteurl or not.

> What's the easiest way to "try it and see", to add tooling and find out
> whether the config files would be reopened during the second parse?
> Because I suspect that we won't actually reopen those files, due to the
> config cache.

As Ævar says [1], strace should work. The hard part is implementing the
recursive config parse, but it looks like the way to go, so I'll try it
and see how it goes.

[1] https://lore.kernel.org/git/211106.8635o9hogz.gmgdl@evledraar.gmail.com/

> So couldn't we do something like....
> 
> pass #1:
>  if (include)
>    if (not hasRemoteUrl)
>      open up path & parse
>  put config into in-memory cache normally
> pass #2: (and this pass would need to be added to repo_config() probably)
>  if (include)
>    if (hasRemoteUrl)
>      open up path & parse
>      insert in-order into in-memory cache
>  don't touch existing configs otherwise
>
> I think it's in practice similar to the approach you're using (getting
> around the weird ordering with a cache in memory), but we could reuse
> the existing config cache rather than creating a new and different one.

What do you mean by "insert in-order"? If you mean figuring out which
variables would be overridden (and for multi-valued variables, what
order to put all the values in), I think that's the hard part.

Another thing is that at the point where we read the config
(config_with_options()), we have a callback, so we would need to make
sure that we're writing to the in-memory cache in the first place (as
opposed to passing a callback that does something else). That might be
doable by changing the API, but in any case, I'll try the recursive
config parse first.
Jonathan Tan Nov. 9, 2021, 12:25 a.m. UTC | #4
> > What's the easiest way to "try it and see", to add tooling and find out
> > whether the config files would be reopened during the second parse?
> > Because I suspect that we won't actually reopen those files, due to the
> > config cache.
> 
> strace -f?

Thanks - this might work.

> > So couldn't we do something like....
> >
> > pass #1:
> >  if (include)
> >    if (not hasRemoteUrl)
> >      open up path & parse
> >  put config into in-memory cache normally
> > pass #2: (and this pass would need to be added to repo_config() probably)
> >  if (include)
> >    if (hasRemoteUrl)
> >      open up path & parse
> >      insert in-order into in-memory cache
> >  don't touch existing configs otherwise
> >
> > I think it's in practice similar to the approach you're using (getting
> > around the weird ordering with a cache in memory), but we could reuse
> > the existing config cache rather than creating a new and different one.
> 
> I don't know enough to say if this two-step approach is better (although
> I'm slightly biased in that direction, since it seems simpler), but this
> just seems like premature optimization.
> 
> I.e. let's just read the files twice, they'll be in the OS's FS cache,
> which is unlikely to be a bottleneck for the amount of files involved.

OK - let me try this.

> That being said we do have exactly this cache already. See [1] and
> 3c8687a73ee (add `config_set` API for caching config-like files,
> 2014-07-28).
> 
> But I think that was added due to *very* frequent re-parsing of the
> entire config every time someone needed a config variable, not due to
> the I/O overhead (but I may be wrong).
> 
> So if we've got 100 config variables we need and 10 config files then
> 10*100 is probably starting to hurt, but if for whatever reason we
> needed 2*10 here that's probably no big deal, and in any case would only
> happen if this new include mechanism was in play.
> 
> 1. https://lore.kernel.org/git/1404631162-18556-1-git-send-email-tanayabh@gmail.com/ 

This might not work for the reasons I described in my reply to Emily
[1]. I'll try the read-twice version first.

[1] https://lore.kernel.org/git/20211109002255.1110653-1-jonathantanmy@google.com/

Patch

diff --git a/config.c b/config.c
index 94ad5ce913..63a37e0a5d 100644
--- a/config.c
+++ b/config.c
@@ -120,13 +120,30 @@  static long config_buf_ftell(struct config_source *conf)
 	return conf->u.buf.pos;
 }
 
+struct stashed_var {
+	char *var;
+	char *value;
+	int depth;
+
+	char *url;
+};
+
 struct config_include_data {
 	int depth;
 	config_fn_t fn;
 	void *data;
 	const struct config_options *opts;
+
+	/*
+	 * All remote URLs discovered when reading all config files.
+	 */
+	struct string_list remote_urls;
+
+	struct stashed_var *stashed;
+	size_t stashed_nr, stashed_alloc;
+	int current_stash_depth;
 };
-#define CONFIG_INCLUDE_INIT { 0 }
+#define CONFIG_INCLUDE_INIT { .remote_urls = STRING_LIST_INIT_DUP }
 
 static int git_config_include(const char *var, const char *value, void *data);
 
@@ -316,28 +333,110 @@  static int include_condition_is_true(const struct config_options *opts,
 	return 0;
 }
 
+static int execute_stashed(struct config_include_data *inc)
+{
+	size_t i = 0;
+	while (i < inc->stashed_nr) {
+		int ret = inc->fn(inc->stashed[i].var, inc->stashed[i].value,
+				  inc->data);
+		if (ret)
+			return ret;
+
+		/*
+		 * If it is an include, skip to next entry of the same depth if
+		 * the URL doesn't match
+		 */
+		if (inc->stashed[i].url) {
+			struct strbuf pattern = STRBUF_INIT;
+			struct string_list_item *url_item;
+			int found = 0;
+
+			strbuf_addstr(&pattern, inc->stashed[i].url);
+			add_trailing_starstar_for_dir(&pattern);
+			for_each_string_list_item(url_item, &inc->remote_urls) {
+				if (!wildmatch(pattern.buf, url_item->string,
+					       WM_PATHNAME)) {
+					found = 1;
+					break;
+				}
+			}
+			strbuf_release(&pattern);
+			if (found) {
+				i++;
+			} else {
+				int depth = inc->stashed[i].depth;
+
+				i++;
+				while (i < inc->stashed_nr &&
+				       inc->stashed[i].depth != depth)
+					i++;
+			}
+		} else {
+			i++;
+		}
+	}
+	return 0;
+}
+
 static int git_config_include(const char *var, const char *value, void *data)
 {
 	struct config_include_data *inc = data;
+	const char *remote_name;
+	size_t remote_name_len;
 	const char *cond, *key;
 	size_t cond_len;
-	int ret;
+	int ret = 0;
+
+	if (!parse_config_key(var, "remote", &remote_name, &remote_name_len,
+			      &key) &&
+	    remote_name &&
+	    !strcmp(key, "url"))
+		string_list_append(&inc->remote_urls, value);
 
 	/*
 	 * Pass along all values, including "include" directives; this makes it
 	 * possible to query information on the includes themselves.
 	 */
-	ret = inc->fn(var, value, inc->data);
-	if (ret < 0)
-		return ret;
+	if (inc->stashed_nr || starts_with(var, "includeif.hasremoteurl:")) {
+		struct stashed_var *last;
+
+		/*
+		 * Start or continue using the stash. (A false positive on
+		 * "includeif.hasremoteurl:?.path" is fine here - this just
+		 * means that some config variables unnecessarily go through
+		 * the stash before being passed to the callback.)
+		 */
+		ALLOC_GROW_BY(inc->stashed, inc->stashed_nr, 1,
+			      inc->stashed_alloc);
+		last = &inc->stashed[inc->stashed_nr - 1];
+		last->var = xstrdup(var);
+		last->value = xstrdup(value);
+		last->depth = inc->current_stash_depth;
+	} else {
+		ret = inc->fn(var, value, inc->data);
+		if (ret < 0)
+			return ret;
+	}
 
 	if (!strcmp(var, "include.path"))
 		ret = handle_path_include(value, inc);
 
 	if (!parse_config_key(var, "includeif", &cond, &cond_len, &key) &&
-	    (cond && include_condition_is_true(inc->opts, cond, cond_len)) &&
-	    !strcmp(key, "path"))
-		ret = handle_path_include(value, inc);
+	    cond && !strcmp(key, "path")) {
+		const char *url;
+		size_t url_len;
+
+		if (skip_prefix_mem(cond, cond_len, "hasremoteurl:", &url,
+				    &url_len)) {
+			inc->stashed[inc->stashed_nr - 1].url =
+				xmemdupz(url, url_len);
+			inc->current_stash_depth++;
+			ret = handle_path_include(value, inc);
+			inc->current_stash_depth--;
+		} else if (include_condition_is_true(inc->opts, cond, cond_len)) {
+			ret = handle_path_include(value, inc);
+		}
+	}
 
 	return ret;
 }
@@ -1933,6 +2032,7 @@  int config_with_options(config_fn_t fn, void *data,
 			const struct config_options *opts)
 {
 	struct config_include_data inc = CONFIG_INCLUDE_INIT;
+	int ret;
 
 	if (opts->respect_includes) {
 		inc.fn = fn;
@@ -1950,17 +2050,25 @@  int config_with_options(config_fn_t fn, void *data,
 	 * regular lookup sequence.
 	 */
 	if (config_source && config_source->use_stdin) {
-		return git_config_from_stdin(fn, data);
+		ret = git_config_from_stdin(fn, data);
 	} else if (config_source && config_source->file) {
-		return git_config_from_file(fn, config_source->file, data);
+		ret = git_config_from_file(fn, config_source->file, data);
 	} else if (config_source && config_source->blob) {
 		struct repository *repo = config_source->repo ?
 			config_source->repo : the_repository;
-		return git_config_from_blob_ref(fn, repo, config_source->blob,
+		ret = git_config_from_blob_ref(fn, repo, config_source->blob,
 						data);
+	} else {
+		ret = do_git_config_sequence(opts, fn, data);
 	}
 
-	return do_git_config_sequence(opts, fn, data);
+	if (inc.stashed_nr) {
+		execute_stashed(&inc);
+		inc.stashed_nr = 0;
+	}
+
+	string_list_clear(&inc.remote_urls, 0);
+	return ret;
 }
 
 static void configset_iter(struct config_set *cs, config_fn_t fn, void *data)
diff --git a/t/t1300-config.sh b/t/t1300-config.sh
index 9ff46f3b04..ea15f7fd46 100755
--- a/t/t1300-config.sh
+++ b/t/t1300-config.sh
@@ -2387,4 +2387,64 @@  test_expect_success '--get and --get-all with --fixed-value' '
 	test_must_fail git config --file=config --get-regexp --fixed-value fixed+ non-existent
 '
 
+test_expect_success 'includeIf.hasremoteurl' '
+	test_create_repo hasremoteurlTest &&
+
+	cat >"$(pwd)"/include-this <<-\EOF &&
+	[user]
+		this = this-is-included
+	EOF
+	cat >"$(pwd)"/dont-include-that <<-\EOF &&
+	[user]
+		that = that-is-not-included
+	EOF
+	cat >>hasremoteurlTest/.git/config <<-EOF &&
+	[includeIf "hasremoteurl:foo"]
+		path = "$(pwd)/include-this"
+	[includeIf "hasremoteurl:bar"]
+		path = "$(pwd)/dont-include-that"
+	[remote "foo"]
+		url = foo
+	EOF
+
+	echo this-is-included >expect-this &&
+	git -C hasremoteurlTest config --get user.this >actual-this &&
+	test_cmp expect-this actual-this &&
+
+	test_must_fail git -C hasremoteurlTest config --get user.that
+'
+
+test_expect_success 'includeIf.hasremoteurl respects last-config-wins' '
+	test_create_repo hasremoteurlTestOverride &&
+
+	cat >"$(pwd)"/include-two-three <<-\EOF &&
+	[user]
+		two = included-config
+		three = included-config
+	EOF
+	cat >>hasremoteurlTestOverride/.git/config <<-EOF &&
+	[remote "foo"]
+		url = foo
+	[user]
+		one = main-config
+		two = main-config
+	[includeIf "hasremoteurl:foo"]
+		path = "$(pwd)/include-two-three"
+	[user]
+		three = main-config
+	EOF
+
+	echo main-config >expect-main-config &&
+	echo included-config >expect-included-config &&
+
+	git -C hasremoteurlTestOverride config --get user.one >actual &&
+	test_cmp expect-main-config actual &&
+
+	git -C hasremoteurlTestOverride config --get user.two >actual &&
+	test_cmp expect-included-config actual &&
+
+	git -C hasremoteurlTestOverride config --get user.three >actual &&
+	test_cmp expect-main-config actual
+'
+
 test_done