[0/6,RFC] config.c: use struct for config reading state

Message ID	pull.1463.git.git.1677631097.gitgitgadget@gmail.com (mailing list archive)
Headers	show Return-Path: <git-owner@vger.kernel.org> Message-Id: <pull.1463.git.git.1677631097.gitgitgadget@gmail.com> From: "Glen Choo via GitGitGadget" <gitgitgadget@gmail.com> Date: Wed, 01 Mar 2023 00:38:11 +0000 Subject: [PATCH 0/6] [RFC] config.c: use struct for config reading state Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: Jonathan Tan <jonathantanmy@google.com>, Emily Shaffer <nasamuffin@google.com>, Jeff King <peff@peff.net>, Derrick Stolee <derrickstolee@github.com>, Glen Choo <chooglen@google.com> Precedence: bulk
Series	config.c: use struct for config reading state \| expand [0/6,RFC] config.c: use struct for config reading state [1/6] config.c: plumb config_source through static fns [2/6] config.c: don't assign to "cf" directly [3/6] config.c: create config_reader and the_reader [4/6] config.c: plumb the_reader through callbacks [5/6] config.c: remove current_config_kvi [6/6] config.c: remove current_parsing_scope

Philippe Blain via GitGitGadget March 1, 2023, 12:38 a.m. UTC

This RFC is preparation for config.[ch] to be libified as as part of the
libification effort that Emily described in [1]. One of the first goals is
to read config from a file, but the trouble with how config.c is written
today is that all reading operations rely on global state, so before turning
that into a library, we'd want to make that state non-global.

This series gets us about halfway there; it does enough plumbing for a
workable-but-kinda-ugly library interface, but with a little bit more work,
I think we can get rid of global state in-tree as well. That requires a fair
amount of work though, so I'd like to get thoughts on that before starting
work.

= Description

This series extracts the global config reading state into "struct
config_reader" and plumbs it through the config reading machinery. It's very
similar to how we've plumbed "struct repository" and other 'context objects'
in the past, except:

 * The global state (named "the_reader") for the git process lives in a
   config.c static variable, and not on "the_repository". See 3/6 for the
   rationale.

 * I've stopped short of adding "struct config_reader" to config.h public
   functions, since that would affect non-config.c callers.

If we stop right here, it's quite easy to extend it to a future config-lib.h
without having to adjust the config.h interface:

 * Move the core config reading functionality from config.c to config-lib.c.

 * Have config-lib.h accept "struct config_reader" as an arg.

 * Have config.h call config-lib.h while passing "the_reader".

and I have some WIP patches that do just that [3], but I think they can be
significantly improved if we go a bit further...

= Leftover bits and RFC

With a bit more work on the config machinery, we could make it so that
config reading stops being global even without adjusting non-config.c
callers. The idea is pretty simple: have the config machinery initialize an
internal "struct config_reader" every time we read config and expose that
state to the config callbacks (instead of, in this series, asking the caller
to initialize "struct config_reader" themselves). I believe that only config
callbacks are accessing this state, e.g. because they use the low-level
information (like builtin/config.c printing the filename and scope of the
value) or for error reporting (like git_parse_int() reporting the filename
and line number of the value it failed to parse), and only config callbacks
should be accessing this state anyway.

The catch (aka the reason I stopped halfway through) is that I couldn't find
a way to expose "struct config_reader" state without some fairly big
changes, complexity-wise or LoC-wise, e.g.

 * We could add "struct config_reader" to "config_fn_t", i.e.
   
   -typedef int (*config_fn_t)(const char *var, const char *val, void
   *data); +typedef int (*config_fn_t)(const struct config_reader *reader,
   const char *var, const char *val, void *data);
   
   which isn't complex at all, except that there are ~100 config_fn_t
   implementations [3] and a good number of them may never reference
   "reader". If the churn is tolerable, I think this a good way forward.

 * We could create a new kind of "config_fn_t" that accepts "struct
   config_reader", e.g.
   
   typedef int (*config_fn_t)(const char *var, const char *val, void *data);
   +typedef int (*config_state_fn_t)(const struct config_reader *reader,
   const char *var, const char *val, void *data);
   
   and only adjust the callers that would actually reference "reader". This
   is less churn, but I couldn't find a great way to do this kind of
   'switching between config callback types' elegantly.

 * We could smuggle "struct config_reader" to callback functions in a way
   that interested callers could see it, but uninterested callers could
   ignore. One trick that Jonathan Tan came up with (though not necessarily
   endorsed) would be to allocate a struct for the config value + "struct
   config_reader", then, interested callers could use "offset_of" to recover
   the "struct config_reader". It's a little hacky, but it's low-churn.

= Questions

 * Is this worth merging without the extra work? There are some cleanups in
   this series that could make it valuable, but there are also some hacks
   (see 4/6) that aren't so great.
 * Is the extra work even worth it?
 * Do any of the ideas seem more promising than the others? Are there other
   ideas I'm missing?

[1]
https://lore.kernel.org/git/CAJoAoZ=Cig_kLocxKGax31sU7Xe4==BGzC__Bg2_pr7krNq6MA@mail.gmail.com
[2]
https://github.com/chooglen/git/compare/config/structify-reading...chooglen:git:config/read-without-globals
[3] This is a rough estimate based on "git grep"-ing callers of the config.h
functions. I vaguely recall callbacks being called "old-style", with the
suggestion that we should replace them with the "new-style" constant time
git_config_get_*() family of functions. That would decrease the number of
config callbacks significantly.

Glen Choo (6):
  config.c: plumb config_source through static fns
  config.c: don't assign to "cf" directly
  config.c: create config_reader and the_reader
  config.c: plumb the_reader through callbacks
  config.c: remove current_config_kvi
  config.c: remove current_parsing_scope

 config.c | 489 ++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 287 insertions(+), 202 deletions(-)


base-commit: dadc8e6dacb629f46aee39bde90b6f09b73722eb
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1463%2Fchooglen%2Fconfig%2Fstructify-reading-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1463/chooglen/config/structify-reading-v1
Pull-Request: https://github.com/git/git/pull/1463

Jonathan Tan March 6, 2023, 7:57 p.m. UTC | #1

"Glen Choo via GitGitGadget" <gitgitgadget@gmail.com> writes:
>  * We could add "struct config_reader" to "config_fn_t", i.e.
>    
>    -typedef int (*config_fn_t)(const char *var, const char *val, void
>    *data); +typedef int (*config_fn_t)(const struct config_reader *reader,
>    const char *var, const char *val, void *data);
>    
>    which isn't complex at all, except that there are ~100 config_fn_t
>    implementations [3] and a good number of them may never reference
>    "reader". If the churn is tolerable, I think this a good way forward.
> 
>  * We could create a new kind of "config_fn_t" that accepts "struct
>    config_reader", e.g.
>    
>    typedef int (*config_fn_t)(const char *var, const char *val, void *data);
>    +typedef int (*config_state_fn_t)(const struct config_reader *reader,
>    const char *var, const char *val, void *data);
>    
>    and only adjust the callers that would actually reference "reader". This
>    is less churn, but I couldn't find a great way to do this kind of
>    'switching between config callback types' elegantly.

To reduce churn, one thing that could be done alongside is to convert
config-using code (which is...practically the rest of Git) to start
using the configset interface (we seem to be using configsets internally
anyway, as evidenced by repo_config()). That way, we would reduce the
number of implementations of config_fn_t.

>  * We could smuggle "struct config_reader" to callback functions in a way
>    that interested callers could see it, but uninterested callers could
>    ignore. One trick that Jonathan Tan came up with (though not necessarily
>    endorsed) would be to allocate a struct for the config value + "struct
>    config_reader", then, interested callers could use "offset_of" to recover
>    the "struct config_reader". It's a little hacky, but it's low-churn.

Indeed, although we should probably use this as a last resort.

> = Questions
> 
>  * Is this worth merging without the extra work? There are some cleanups in
>    this series that could make it valuable, but there are also some hacks
>    (see 4/6) that aren't so great.

I'm leaning towards merging it now, but can go either way, since the
cost of churn is limited to one single file, but so are the benefits.
If it was up to me to decide, I would merge it now, because this opens
up a lot of work that other contributors could individually do (in
particular, converting individual config code paths so that we don't
need to reference the_reader as a global anymore).

I don't see 4/6 as a hack. It is true that the nature of the config_fn_t
callback could change so that passing the reader would end up being
done in yet another different way, but firstly, I don't think that will
happen for quite some time, and secondly, it might not happen at all
(right now, I think what's most likely to happen is that the rest of the
Git code moves to configsets and only a fraction of the Git code would
need to do low-level parsing, which would not have a problem passing the
reader through the data object since they would probably need to pass
other context anyway).

>  * Is the extra work even worth it?

Depends on which extra work, but I think that eliminating the the_reader
global would really be useful (and, as far as I know, the whole reason
for this effort). From the Git codebase's perspective, doing this would
(as far as I know) eliminate the need for pushing and popping cf, and
make multithreaded multi-repo operations less error-prone (e.g. we
can spawn a thread operating on a submodule and that thread can itself
read the configs of nested submodules without worrying about clobbering
global state...well, there is thread-local storage, but as far as I know
this is not portable).

>  * Do any of the ideas seem more promising than the others? Are there other
>    ideas I'm missing?

Hopefully I answered this in my answers to the other questions.

Glen Choo March 6, 2023, 9:45 p.m. UTC | #2

Jonathan Tan <jonathantanmy@google.com> writes:

> "Glen Choo via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>  * We could create a new kind of "config_fn_t" that accepts "struct
>>    config_reader", e.g.
>>    
>>    typedef int (*config_fn_t)(const char *var, const char *val, void *data);
>>    +typedef int (*config_state_fn_t)(const struct config_reader *reader,
>>    const char *var, const char *val, void *data);
>>    
>>    and only adjust the callers that would actually reference "reader". This
>>    is less churn, but I couldn't find a great way to do this kind of
>>    'switching between config callback types' elegantly.
>
> To reduce churn, one thing that could be done alongside is to convert
> config-using code (which is...practically the rest of Git) to start
> using the configset interface (we seem to be using configsets internally
> anyway, as evidenced by repo_config()). That way, we would reduce the
> number of implementations of config_fn_t.

By configset interface, I believe you mean the O(1) lookup functions
like git_config_get_int() (which rely on the value being cached, but
don't necessarily accept "struct config_set" as an arg)? I think that
makes sense both from a performance and maintenance perspective.

>> = Questions
>> 
>>  * Is this worth merging without the extra work? There are some cleanups in
>>    this series that could make it valuable, but there are also some hacks
>>    (see 4/6) that aren't so great.
> I don't see 4/6 as a hack. It is true that the nature of the config_fn_t
> callback could change so that passing the reader would end up being
> done in yet another different way, but firstly, I don't think that will
> happen for quite some time, and secondly, it might not happen at all
> (right now, I think what's most likely to happen is that the rest of the
> Git code moves to configsets and only a fraction of the Git code would
> need to do low-level parsing, which would not have a problem passing the
> reader through the data object since they would probably need to pass
> other context anyway).

Given how painful it is to change the config_fn_t signature, I think it
is important to get as right as possible the first time. After I sent
this out, I thought of yet another possible config_fn_t signature
(since most callbacks only need diagnostic information, we could pass
"struct key_value_info" instead of the more privileged "struct
config_reader"), but given how many functions we'd have to change, it
seemed extremely difficult to even begin experimenting with this
different signature.

So I think you're right that we'd want to start by removing as many
config_fn_t implementations as possible. Perhaps we'd also want to
consider creating new abstractions for other situations where the key is
not known, e.g. "remote.<name>" and "branch.<name>".

Jonathan Tan March 6, 2023, 10:38 p.m. UTC | #3

Glen Choo <chooglen@google.com> writes:
> By configset interface, I believe you mean the O(1) lookup functions
> like git_config_get_int() (which rely on the value being cached, but
> don't necessarily accept "struct config_set" as an arg)? I think that
> makes sense both from a performance and maintenance perspective.

Ah, yes. (More precisely, not the one where you call something like
repo_config(), passing a callback function that.)

> Given how painful it is to change the config_fn_t signature, I think it
> is important to get as right as possible the first time. After I sent
> this out, I thought of yet another possible config_fn_t signature
> (since most callbacks only need diagnostic information, we could pass
> "struct key_value_info" instead of the more privileged "struct
> config_reader"), but given how many functions we'd have to change, it
> seemed extremely difficult to even begin experimenting with this
> different signature.

Yeah, the first change is the hardest. I think passing it a single
struct (so, instead of key, value, data, reader, and then any future
fields we would need) to which we can add fields later would mean that
we wouldn't need any changes beyond the first, though.

Ævar Arnfjörð Bjarmason March 7, 2023, 11:57 a.m. UTC | #4

On Wed, Mar 01 2023, Glen Choo via GitGitGadget wrote:

> This RFC is preparation for config.[ch] to be libified as as part of the
> libification effort that Emily described in [1]. One of the first goals is
> to read config from a file, but the trouble with how config.c is written
> today is that all reading operations rely on global state, so before turning
> that into a library, we'd want to make that state non-global.
>
> This series gets us about halfway there; it does enough plumbing for a
> workable-but-kinda-ugly library interface, but with a little bit more work,
> I think we can get rid of global state in-tree as well. That requires a fair
> amount of work though, so I'd like to get thoughts on that before starting
> work.
>
> = Description
>
> This series extracts the global config reading state into "struct
> config_reader" and plumbs it through the config reading machinery. It's very
> similar to how we've plumbed "struct repository" and other 'context objects'
> in the past, except:
>
>  * The global state (named "the_reader") for the git process lives in a
>    config.c static variable, and not on "the_repository". See 3/6 for the
>    rationale.

I agree with the overall direction, but don't think that rationale in
3/6 is sufficient to go in this "the_reader" direction, as opposed to
sticking with and extending "the_repository" approach.

For orthagonal reasons (getting rid of some of the API duplication) I've
been carrying a patch to get rid of the "configset" part of the *public*
API, i.e. to have API users always use the "repo_config_*()" or
"git_config_*()" variants, that patch is at:
https://github.com/avar/git/commit/0233297a359bbda43a902dd0213aacdca82faa34

All of the rationale in your 3/6 is true now, but as that patch shows
the reason for why we have "the_repository" is for the trivial reason
that we want to access the repo's "config" member.

It's a bit distasteful, but that change argues that just mocking up a
"struct repository" with a "config" member pointing to a new configset
is better than maintaining an entirely different API just for those
cases where we need to parse a one-off file or whatever.

I think that going in that direction neatly solves the issues you're
noting here and in your 3/6, i.e. we'd always have this in the "repo"
object, so we'd just stick the persistent "reader" variables in the
"struct repository"'s "config" member.

Glen Choo March 7, 2023, 6:22 p.m. UTC | #5

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> On Wed, Mar 01 2023, Glen Choo via GitGitGadget wrote:
>
>> This series extracts the global config reading state into "struct
>> config_reader" and plumbs it through the config reading machinery. It's very
>> similar to how we've plumbed "struct repository" and other 'context objects'
>> in the past, except:
>>
>>  * The global state (named "the_reader") for the git process lives in a
>>    config.c static variable, and not on "the_repository". See 3/6 for the
>>    rationale.
>
> I agree with the overall direction, but don't think that rationale in
> 3/6 is sufficient to go in this "the_reader" direction, as opposed to
> sticking with and extending "the_repository" approach.
>
> For orthagonal reasons (getting rid of some of the API duplication) I've
> been carrying a patch to get rid of the "configset" part of the *public*
> API, i.e. to have API users always use the "repo_config_*()" or
> "git_config_*()" variants, that patch is at:
> https://github.com/avar/git/commit/0233297a359bbda43a902dd0213aacdca82faa34

Those patches are probably worth sending, even if only as RFC. I found
it pretty hard to draft a substantial response without effectively doing
a full review of the patch.

> It's a bit distasteful, but that change argues that just mocking up a
> "struct repository" with a "config" member pointing to a new configset
> is better than maintaining an entirely different API just for those
> cases where we need to parse a one-off file or whatever.
>
> I think that going in that direction neatly solves the issues you're
> noting here and in your 3/6, i.e. we'd always have this in the "repo"
> object, so we'd just stick the persistent "reader" variables in the
> "struct repository"'s "config" member.

If I understand your proposal correctly, we would move the config
variables to the_repository. Then, any time a caller would like to work
with an individual file, it would init a new "struct repository" with a
clean set of config members (using repo_init_repo_blank_config() or
something) and reuse the repo_config_* API?

It is a workable solution, e.g. that approach would work around the
failures in test-tool and scalar that I observed. In the spirit of
libification, this feels like a kludge, though, since we'd be reverting
to using "struct repository" for more things instead of using more
well-scoped interfaces. IMO a better future for the config_set API would
be to move it into configset.c or something, where only users who want
the low level API would use it and everyone else would just pretend it
doesn't exist. This would be a little like libgit2's organization, where
'general config', 'config parsing' and 'in-memory config value
representations' are separate files, e.g.

  https://github.com/libgit2/libgit2/blob/main/src/libgit2/config.h
  https://github.com/libgit2/libgit2/blob/main/src/libgit2/config_parse.h
  https://github.com/libgit2/libgit2/blob/main/src/libgit2/config_entries.h

I also hesitate to put the config variables on the_repository, because
in the long term, I think "struct config_reader" can and should be
purely internal to config.c. But if we start advertising its existence
via the_repository, that might be an invitation to (ab)use that API and
make that transition harder.

Ævar Arnfjörð Bjarmason March 7, 2023, 6:36 p.m. UTC | #6

On Tue, Mar 07 2023, Glen Choo wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>> On Wed, Mar 01 2023, Glen Choo via GitGitGadget wrote:
>>
>>> This series extracts the global config reading state into "struct
>>> config_reader" and plumbs it through the config reading machinery. It's very
>>> similar to how we've plumbed "struct repository" and other 'context objects'
>>> in the past, except:
>>>
>>>  * The global state (named "the_reader") for the git process lives in a
>>>    config.c static variable, and not on "the_repository". See 3/6 for the
>>>    rationale.
>>
>> I agree with the overall direction, but don't think that rationale in
>> 3/6 is sufficient to go in this "the_reader" direction, as opposed to
>> sticking with and extending "the_repository" approach.
>>
>> For orthagonal reasons (getting rid of some of the API duplication) I've
>> been carrying a patch to get rid of the "configset" part of the *public*
>> API, i.e. to have API users always use the "repo_config_*()" or
>> "git_config_*()" variants, that patch is at:
>> https://github.com/avar/git/commit/0233297a359bbda43a902dd0213aacdca82faa34
>
> Those patches are probably worth sending, even if only as RFC. I found
> it pretty hard to draft a substantial response without effectively doing
> a full review of the patch.

Yes, sorry. It's part of some changes on top of my outstanding config
API changes (just re-rolled at
https://lore.kernel.org/git/cover-v6-0.9-00000000000-20230307T180516Z-avarab@gmail.com/). Hopefully
those will land soon after the upcoming release, I'll try to submit this
(along with related changes) soon afterwards.

>> It's a bit distasteful, but that change argues that just mocking up a
>> "struct repository" with a "config" member pointing to a new
>> configset is better than maintaining an entirely different API just
>> for those cases where we need to parse a one-off file or whatever.
>>
>> I think that going in that direction neatly solves the issues you're
>> noting here and in your 3/6, i.e. we'd always have this in the "repo"
>> object, so we'd just stick the persistent "reader" variables in the
>> "struct repository"'s "config" member.
>
> If I understand your proposal correctly, we would move the config
> variables to the_repository. Then, any time a caller would like to work
> with an individual file, it would init a new "struct repository" with a
> clean set of config members (using repo_init_repo_blank_config() or
> something) and reuse the repo_config_* API?

It's certainly a hack, but so is introducing a new "the_reader"
singleton whose lifetime we need to manage seperately from
"the_repository" in the common case :)

I think a better argument for this is probably that if you try to change
repository.h so that we define "struct repository" thusly (The
"hash_algo" field being still there due to a very common macro):

	struct repository {
	        const struct git_hash_algo *hash_algo;
	        struct config_set *config;
	};

And then try to:

	make config.o

You'll get:
	
	$ make config.o
	    CC config.o
	config.c: In function ‘include_by_branch’:
	config.c:311:46: error: ‘struct repository’ has no member named ‘gitdir’
	  311 |         const char *refname = !the_repository->gitdir ?
	      |                                              ^~
	config.c: In function ‘repo_read_config’:
	config.c:2523:30: error: ‘struct repository’ has no member named ‘commondir’
	 2523 |         opts.commondir = repo->commondir;
	      |                              ^~
	config.c:2524:28: error: ‘struct repository’ has no member named ‘gitdir’
	 2524 |         opts.git_dir = repo->gitdir;
	      |                            ^~
	make: *** [Makefile:2719: config.o] Error 1

I.e. almost all of the config code doesn't care about the repository at
all, it just needs the "struct config_set" that's in the repository
struct.

With the linked-to change I'm arguing that just mocking it up sucks less
than carrying a duplicate set of API functions just for those rare cases
where we need to one-off read a config file.

But...

> It is a workable solution, e.g. that approach would work around the
> failures in test-tool and scalar that I observed. In the spirit of
> libification, this feels like a kludge, though, since we'd be reverting
> to using "struct repository" for more things instead of using more
> well-scoped interfaces. IMO a better future for the config_set API would
> be to move it into configset.c or something, where only users who want
> the low level API would use it and everyone else would just pretend it
> doesn't exist. This would be a little like libgit2's organization, where
> 'general config', 'config parsing' and 'in-memory config value
> representations' are separate files, e.g.
>
>   https://github.com/libgit2/libgit2/blob/main/src/libgit2/config.h
>   https://github.com/libgit2/libgit2/blob/main/src/libgit2/config_parse.h
>   https://github.com/libgit2/libgit2/blob/main/src/libgit2/config_entries.h
>
> I also hesitate to put the config variables on the_repository, because
> in the long term, I think "struct config_reader" can and should be
> purely internal to config.c. But if we start advertising its existence
> via the_repository, that might be an invitation to (ab)use that API and
> make that transition harder.

...yes it's a kludge, but I'd think if you're interested in more
generally fixing it that a better use of time would be to narrowly focus
on those cases where we don't have a "repository" now, i.e. the
configset API users my linked-to patch shows & amends.

Everything else that uses git_config_get_*(), repo_config_get_*() will
either use an already set up "the_repository", or lazily init it, see
the git_config_check_init() API users.

Or, we could have repo_config() and friends not take a "struct
repository", but another config-specific object which has the subset of
the three fields from that struct which it actually needs (config,
gitdir & commondir).

The mocking function I added in the linked-to commit was just a way of
getting to that via the shortest means.

Junio C Hamano March 7, 2023, 7:36 p.m. UTC | #7

Glen Choo <chooglen@google.com> writes:

> ... In the spirit of
> libification, this feels like a kludge, though, since we'd be reverting
> to using "struct repository" for more things instead of using more
> well-scoped interfaces.

If you include "populate from system-wide, per-user, and repository
specific configuration files" as part of the API being libified,
your configsets cannot avoid being tied to a repository.  But I do
not think the reader needs to be in the repository.

> IMO a better future for the config_set API would
> be to move it into configset.c or something, where only users who want
> the low level API would use it and everyone else would just pretend it
> doesn't exist.

Isn't the use of the reader object purely transitory while you
populate the keys and values in a configset from a single file?  At
the layer to read and populate a configset from a single "source"
file, you still do not need repository.

Only when you say "I have a 'repo' instance and I want to read the
config variables from appropriate places", you call such a "read and
populate configset from a single source" helper three or four times
to populate repo->config.  Once a configset is populated, it or its
contents do not depend on the reader instance to function, so I do
not see how it benefits to have the reader in the repository object.

Glen Choo March 7, 2023, 10:53 p.m. UTC | #8

Junio C Hamano <gitster@pobox.com> writes:

> If you include "populate from system-wide, per-user, and repository
> specific configuration files" as part of the API being libified,
> your configsets cannot avoid being tied to a repository.  But I do
> not think the reader needs to be in the repository.
>
> [...]
>
> Isn't the use of the reader object purely transitory while you
> populate the keys and values in a configset from a single file?  At
> the layer to read and populate a configset from a single "source"
> file, you still do not need repository.
>
> [...]
> Only when you say "I have a 'repo' instance and I want to read the
> config variables from appropriate places", you call such a "read and
> populate configset from a single source" helper three or four times
> to populate repo->config.  Once a configset is populated, it or its
> contents do not depend on the reader instance to function, so I do
> not see how it benefits to have the reader in the repository object.

Yes, exactly. Having a config_set on the repository makes sense, but I
don't see a good reason to have the reader on the repository.

If Ævar sends his series soon, it would be fruitful to see how that
series interacts with this one :)

Ævar Arnfjörð Bjarmason March 8, 2023, 9:17 a.m. UTC | #9

On Tue, Mar 07 2023, Glen Choo wrote:

> Junio C Hamano <gitster@pobox.com> writes:
>
>> If you include "populate from system-wide, per-user, and repository
>> specific configuration files" as part of the API being libified,
>> your configsets cannot avoid being tied to a repository.  But I do
>> not think the reader needs to be in the repository.
>>
>> [...]
>>
>> Isn't the use of the reader object purely transitory while you
>> populate the keys and values in a configset from a single file?  At
>> the layer to read and populate a configset from a single "source"
>> file, you still do not need repository.
>>
>> [...]
>> Only when you say "I have a 'repo' instance and I want to read the
>> config variables from appropriate places", you call such a "read and
>> populate configset from a single source" helper three or four times
>> to populate repo->config.  Once a configset is populated, it or its
>> contents do not depend on the reader instance to function, so I do
>> not see how it benefits to have the reader in the repository object.
>
> Yes, exactly. Having a config_set on the repository makes sense, but I
> don't see a good reason to have the reader on the repository.

Isn't Junio suggesting something different here?

I hadn't looked closely at this aspect of it, and just took it as a
given that we needed to persist this data outside of the configset
lifetime.

If that's not the case then we don't need it in the file scope, nor a
"struct repository" or whatever, and could just have it materialized by
git_config_check_init(), no? I.e. when we create the configset we'd
create it, and throw it away after the configset is created?

I.e. to address this note in your initial RFC:

	I think we can get rid of global state in-tree as well. That requires a fair
	amount of work though, so I'd like to get thoughts on that before starting
	work.

> If Ævar sends his series soon, it would be fruitful to see how that
> series interacts with this one :)

I tried merging this topic with that, and it didn't conflict textually
or semantically. I just raised it because I think with your 3/6 you're
needlessly tying yourself in knots, i.e. with this part:

	A more typical approach would be to put this struct on "the_repository",
	but that's a worse fit for this use case since config reading is not
	scoped to a repository. E.g. we can read config before the repository is
	known ("read_very_early_config()"), blatantly ignore the repo
	("read_protected_config()"), or read only from a file
	("git_config_from_file()"). This is especially evident in t5318 and
	t9210, where test-tool and scalar parse config but don't fully
	initialize "the_repository".

Out of those examples read_very_early_config() and
read_protected_config() already call config_with_options(), which
optionally uses a "struct repository" via the "repo" member of "struct
git_config_source".

I think we may have gotten lost in the weeds over what amounts to an
asthetic preference about how to do polymorphism in C. I'd be fine with
mocking up the "struct repository", but you could equally prepare and
pass some new "config_state" struct that would contain the required
information (including the configset).

As well as this in the CL:

	The catch (aka the reason I stopped halfway through) is that I
	couldn't find a way to expose "struct config_reader" state
	without some fairly big changes, complexity-wise or LoC-wise[...]

I didn't look into exactly why config_fn_t would need your new "reader",
but if you accept that we could stick such a thing into "the_repository"
then there's no catch or need for the churn to change all those
callbacks.

Of course something that wants to use the API as a "real" library would
need to use some alternate mechanism, as you reference in adding a new
"config_state_fn_t". You note:

	but I couldn't find a great way to do this kind of 'switching
	between config callback types' elegantly.

So, I don't know, but I was suggesting looking into doing that based on
"the_repository" in play...

Ævar Arnfjörð Bjarmason March 8, 2023, 10:32 a.m. UTC | #10

On Mon, Mar 06 2023, Jonathan Tan wrote:

> Glen Choo <chooglen@google.com> writes:
>> By configset interface, I believe you mean the O(1) lookup functions
>> like git_config_get_int() (which rely on the value being cached, but
>> don't necessarily accept "struct config_set" as an arg)? I think that
>> makes sense both from a performance and maintenance perspective.
>
> Ah, yes. (More precisely, not the one where you call something like
> repo_config(), passing a callback function that.)
>
>> Given how painful it is to change the config_fn_t signature, I think it
>> is important to get as right as possible the first time. After I sent
>> this out, I thought of yet another possible config_fn_t signature
>> (since most callbacks only need diagnostic information, we could pass
>> "struct key_value_info" instead of the more privileged "struct
>> config_reader"), but given how many functions we'd have to change, it
>> seemed extremely difficult to even begin experimenting with this
>> different signature.
>
> Yeah, the first change is the hardest. I think passing it a single
> struct (so, instead of key, value, data, reader, and then any future
> fields we would need) to which we can add fields later would mean that
> we wouldn't need any changes beyond the first, though.

The more I've looked at this the more I'm convinced this is the wrong
direction.

For the configset API users we already have the line number, source
etc. in the "util" member, i.e. when we have an error in any API user
that uses the configset they can error about the specific line that
config came from.

I think this may have been conflated because e.g. for the configset to
get the "scope" we need to go from do_git_config_sequence(), which will
currently set "current_parsing_scope", all the way down to
configset_add_value(), and there we'll make use of the
"config_set_callback", which is a config_fn_t.

But that's all internal "static" functions, except
git_config_from_file() and git_config_from_file_with_options(), but
those have only a handful of callers.

But that's *different* than the user callbacks, which will be invoked
through a loop in configset_iter(), i.e. *after* we've parsed the
config, and are just getting the line number, scope etc. from the
configset.

There's other edge cases, e.g. current_config_line() will access the
global, but it only has two callers (one if we exclude the test
helper). But I think the answer there is to change the
config_read_push_default() code, not to give every current "config_fn_t"
implementation an extra parameter.

Glen Choo March 8, 2023, 11:09 p.m. UTC | #11

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> On Mon, Mar 06 2023, Jonathan Tan wrote:
>
>> Glen Choo <chooglen@google.com> writes:
>>> By configset interface, I believe you mean the O(1) lookup functions
>>> like git_config_get_int() (which rely on the value being cached, but
>>> don't necessarily accept "struct config_set" as an arg)? I think that
>>> makes sense both from a performance and maintenance perspective.
>>
>> Ah, yes. (More precisely, not the one where you call something like
>> repo_config(), passing a callback function that.)
>>
>>> Given how painful it is to change the config_fn_t signature, I think it
>>> is important to get as right as possible the first time. After I sent
>>> this out, I thought of yet another possible config_fn_t signature
>>> (since most callbacks only need diagnostic information, we could pass
>>> "struct key_value_info" instead of the more privileged "struct
>>> config_reader"), but given how many functions we'd have to change, it
>>> seemed extremely difficult to even begin experimenting with this
>>> different signature.
>>
>> Yeah, the first change is the hardest. I think passing it a single
>> struct (so, instead of key, value, data, reader, and then any future
>> fields we would need) to which we can add fields later would mean that
>> we wouldn't need any changes beyond the first, though.
>
> For the configset API users we already have the line number, source
> etc. in the "util" member, i.e. when we have an error in any API user
> that uses the configset they can error about the specific line that
> config came from.

Yeah, and I think we should plumb the "util" (actually "struct
key_value_info") back to the users who need that diagnostic info, which
achieves the same purpose as plumbing the "config_reader" to the
callback functions (since the callback functions only need to read
diagnostic information), but it's better since it exposes fewer gory
details and we can refactor those using coccinelle. The difficult part
is replacing the callbacks with the configset API in the first place.

> I think this may have been conflated because e.g. for the configset to
> get the "scope" we need to go from do_git_config_sequence(), which will
> currently set "current_parsing_scope", all the way down to
> configset_add_value(), and there we'll make use of the
> "config_set_callback", which is a config_fn_t.

I think this is half-right (at least from a historical perspective). We
started by just reading auxiliary info like line number and file name
from the "current config source", but when we started caching values in
config sets, we had to find another way to share this auxiliary info.
The solution that 0d44a2dacc (config: return configset value for
current_config_ functions, 2016-05-26) gave us is to also cache the
auxiliary info, and then when we iterate the configset, we set the
global "current_config_kvi" to the cached value.

So they aren't conflated per se, since the reason for their existence is
so that API users don't have to care whether or not they are iterating a
file or iterating a configset. But as you've observed...

> But that's all internal "static" functions, except
> git_config_from_file() and git_config_from_file_with_options(), but
> those have only a handful of callers.
>
> But that's *different* than the user callbacks, which will be invoked
> through a loop in configset_iter(), i.e. *after* we've parsed the
> config, and are just getting the line number, scope etc. from the
> configset.

in practice, very few users read (and should be reading) config directly
from files. Most want the 'config for the whole repo' (which is handled
by the repo_* and git_* functions) and others will explicitly call
git_config_from_file*() so I don't think the config API needs to keep
users ignorant of 'whether the current config is cached or not'.

We could convert callbacks to the configset API, and if we then we plumb
the key_value_info to the configset API users, maybe we can retire
"current_config_kvi".

> There's other edge cases, e.g. current_config_line() will access the
> global, but it only has two callers (one if we exclude the test
> helper). But I think the answer there is to change the
> config_read_push_default() code, not to give every current "config_fn_t"
> implementation an extra parameter.

I agree that we should rewrite config_read_push_default(), but wouldn't
we still need to expose auxiliary information (line number, scope) to
users of the callback API? e.g. config.c:die_bad_number() uses the file
name [*], and builtin/config.c definitely needs it for 'git config -l'.
Also, an express purpose for this series is to prepare
git_config_from_file() to be used by callers out-of-tree, which would
definitely need that information.

I'm not sure if you're proposing to move state from "the_reader" to
something like "the_repository.config_state". I'd hesitate to take that
approach, since we're just swapping one global for another.

[*] config.c:die_bad_number() is actually a bit broken because it
doesn't use the cached kvi info from the config set. I'll probably send
a fixup patch to fix this.

Glen Choo March 8, 2023, 11:18 p.m. UTC | #12

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>> Yes, exactly. Having a config_set on the repository makes sense, but I
>> don't see a good reason to have the reader on the repository.
>
> [...]
>
> I hadn't looked closely at this aspect of it, and just took it as a
> given that we needed to persist this data outside of the configset
> lifetime.

There really isn't a need to persist the "config_reader" state, we'd
only need it while reading the file, and as you've hinted, once we've
cached the info in the configset, we'd just use that instead.

> If that's not the case then we don't need it in the file scope, nor a
> "struct repository" or whatever, and could just have it materialized by
> git_config_check_init(), no? I.e. when we create the configset we'd
> create it, and throw it away after the configset is created?

Yes, except that I'm proposing that we should do it even lower in the
chain, like do_config_from().

> 	The catch (aka the reason I stopped halfway through) is that I
> 	couldn't find a way to expose "struct config_reader" state
> 	without some fairly big changes, complexity-wise or LoC-wise[...]
>
> I didn't look into exactly why config_fn_t would need your new "reader",

Ah, this is what I was referencing in the CL here:

  I believe that only config callbacks are accessing this [config
  reader] state, e.g. because they use the low-level information (like
  builtin/config.c printing the filename and scope of the value) or for
  error reporting (like git_parse_int() reporting the filename and line
  number of the value it failed to parse)

> but if you accept that we could stick such a thing into "the_repository"
> then there's no catch or need for the churn to change all those
> callbacks.
>
> Of course something that wants to use the API as a "real" library would
> need to use some alternate mechanism, as you reference in adding a new
> "config_state_fn_t". You note:
>
> 	but I couldn't find a great way to do this kind of 'switching
> 	between config callback types' elegantly.
>
> So, I don't know, but I was suggesting looking into doing that based on
> "the_repository" in play...

And yes, since the primary purpose is to make git_config_from_file()
usable by a library caller (and secondarily, prepare for a future where
reading config is thread-safer because of less global state, as Jonathan
discussed [1]), I'd prefer not to use the_repository.

1. https://lore.kernel.org/git/20230306195756.3399115-1-jonathantanmy@google.com

[0/6,RFC] config.c: use struct for config reading state

Message

Comments