mbox series

[v2,0/3] remote: replace static variables with struct remote_state

Message ID 20211013193127.76537-1-chooglen@google.com (mailing list archive)
Headers show
Series remote: replace static variables with struct remote_state | expand

Message

Glen Choo Oct. 13, 2021, 7:31 p.m. UTC
This series aims to make the remotes subsystem work with
non-the_repository, which will allow submodule remotes to be accessed
in-process, rather than through child processes. This is accomplished by
creating a struct remote_state and adding it to struct repository.

One motivation for this is that it allows future submodule commands to run
in-process. An example is an RFC series of mine [1], where I tried to implement
"git branch --recurse-submodules" in-process but couldn't figure out how to
read the remotes of a submodule.

v2 aims to catch all of the instances of struct remote_state that I had
missed in v1, particularly with the struct branch functions. Because
more repo_* functions were added, the preparatory patches have been
reorganized.

In this version, my biggest uncertainty is in the interfaces of the
repo_* functions. Some functions expect a "struct repository" and one
of its contained members (e.g. "struct branch"). This interface allows
for incorrect behavior if the caller supplies a "struct repository"
instance that is unrelated from the "struct branch". This concern was
raised by Junio in [2].

While this concern is valid, I decided to keep this interface for a few
reasons:

1. The correct way of calling the function is 'obvious'.
2. It is relatively easy to get the contained object (struct branch/remote)
   and its containing struct repository/remote_state (e.g. we don't pass
   struct branch or remote through long call chains). For "struct
   branch", callers usually get the branch from the repo and use it
   immediately. For "struct remote", we don't use container objects
   outside of static functions. If you are interested in seeing all of
   the call sites, you can see a sample commit in [3].
3. The separation between container/contained objects allows us to
   reason better about what the correct interface is. e.g. we might be
   tempted to include a backpointer from struct branch to struct
   remote_state so that we can pass around struct branch and be
   confident that struct branch has all of the information it needs.

   However, I believe that some questions *shouldn't* be answered by
   just struct branch. The prime example in this series is
   branch_get_push() - it conceptually answers 'What is the pushremote
   of this branch', but the behavior we want is closer to 'If
   configured, give me the pushremote for the branch. Otherwise, give me
   the default pushremote of the repo.'. This is arguably a relevant
   detail that should be exposed to callers.

If we want the interface to prevent misuse, we can check that the
correct repository is passed at runtime. Alternatively, if we are
convinced that some questions can only be answered in the context of a
repository, we can take things one step further by using (struct
repository, branch_name) instead of (struct repository, struct branch).

A different concern is that struct repository is added to some callback
signatures even though the function body doesn't use it e.g.
repo_remote_for_branch(). However, this might be more of an accident of
history than anything else - the difference between remote and
pushremote is that remote always defaults to 'origin', whereas
the default value of pushremote is configurable.

Changes since v1:

* In v1, we moved static variables into the_repository->remote_state in
  two steps: static variables > static remote_state >
  the_repository->remote_state. In v2, make this change in one step:
  static variables > the_repository->remote_state.
* Add more instances of repo_* that were missed.

[1] https://lore.kernel.org/git/20210921232529.81811-1-chooglen@google.com/
[2] https://lore.kernel.org/git/xmqq4k9so15i.fsf@gitster.g/
[3] https://github.com/chooglen/git/commit/173e0268fd076044dd6b3cae893eedfa33965942

Glen Choo (3):
  remote: move static variables into per-repository struct
  remote: use remote_state parameter internally
  remote: add struct repository parameter to external functions

 remote.c     | 305 +++++++++++++++++++++++++++++++--------------------
 remote.h     | 130 +++++++++++++++++++---
 repository.c |   8 ++
 repository.h |   4 +
 4 files changed, 312 insertions(+), 135 deletions(-)

Range-diff against v1:
1:  6972ba4dcb = 1:  6972ba4dcb remote: move static variables into per-repository struct
2:  71b1da4389 = 2:  71b1da4389 remote: use remote_state parameter internally
3:  ff12771f06 = 3:  ff12771f06 remote: add struct repository parameter to external functions

Comments

Junio C Hamano Oct. 13, 2021, 8:11 p.m. UTC | #1
Glen Choo <chooglen@google.com> writes:

> While this concern is valid, I decided to keep this interface for a few
> reasons:
>
> 1. The correct way of calling the function is 'obvious'.
> 2. It is relatively easy to get the contained object (struct branch/remote)
>    and its containing struct repository/remote_state (e.g. we don't pass
>    struct branch or remote through long call chains). For "struct
>    branch", callers usually get the branch from the repo and use it
>    immediately. For "struct remote", we don't use container objects
>    outside of static functions. If you are interested in seeing all of
>    the call sites, you can see a sample commit in [3].
> 3. The separation between container/contained objects allows us to
>    reason better about what the correct interface is. e.g. we might be
>    tempted to include a backpointer from struct branch to struct
>    remote_state so that we can pass around struct branch and be
>    confident that struct branch has all of the information it needs.

I am not following.  None of the above reasons argue for forcing the
functions that take contained object to also take its container.

>    However, I believe that some questions *shouldn't* be answered by
>    just struct branch. The prime example in this series is
>    branch_get_push() - it conceptually answers 'What is the pushremote
>    of this branch', but the behavior we want is closer to 'If
>    configured, give me the pushremote for the branch. Otherwise, give me
>    the default pushremote of the repo.'. This is arguably a relevant
>    detail that should be exposed to callers.

It is a good example why such a function can just take an instance
of branch, and the caller,

 (1) who does care about the fallback, can be assured that the logic
     falls back to the correct repository; and

 (2) who does not care about the fallback and sees it a mere
     implementation detail of "I am on this branch; give me the
     remote to push to", do not have to know what, other than the
     branch object, needs to be passed.

if we explicitly record a branch object which repository it was
taken from.

There may be some other (real) reason where the resistance comes
from, that you may not be telling us, though.  But in what was
described in the message I am responding to, I didn't see much
convincing reason to argue _for_ keeping the contained objects
ignorant of the container and forcing callers to pass both to
functions that use both the container and contained to compute
something.

Thanks.
Junio C Hamano Oct. 13, 2021, 8:27 p.m. UTC | #2
Junio C Hamano <gitster@pobox.com> writes:

> There may be some other (real) reason where the resistance comes
> from, that you may not be telling us, though.  But in what was
> described in the message I am responding to, I didn't see much
> convincing reason to argue _for_ keeping the contained objects
> ignorant of the container and forcing callers to pass both to
> functions that use both the container and contained to compute
> something.

I am not you, so I can only speculate, but the real reason _could_
be that it makes it simpler to formulate steps 2 and 3 mechanically.
After adding "repo" parameter to a function that used to take, say,
a "branch", in step 3, a future clean-up series could add a .repo
member to branch objects and remove the "repo" parameter from such
function.

I think that approach would make more work to get to the final
state, though.
Glen Choo Oct. 13, 2021, 9:56 p.m. UTC | #3
Junio C Hamano <gitster@pobox.com> writes:

>>    However, I believe that some questions *shouldn't* be answered by
>>    just struct branch. The prime example in this series is
>>    branch_get_push() - it conceptually answers 'What is the pushremote
>>    of this branch', but the behavior we want is closer to 'If
>>    configured, give me the pushremote for the branch. Otherwise, give me
>>    the default pushremote of the repo.'. This is arguably a relevant
>>    detail that should be exposed to callers.
>
> It is a good example why such a function can just take an instance
> of branch, and the caller,
>
>  (1) who does care about the fallback, can be assured that the logic
>      falls back to the correct repository; and
>
>  (2) who does not care about the fallback and sees it a mere
>      implementation detail of "I am on this branch; give me the
>      remote to push to", do not have to know what, other than the
>      branch object, needs to be passed.
>
> if we explicitly record a branch object which repository it was
> taken from.
>
> There may be some other (real) reason where the resistance comes
> from, that you may not be telling us, though.  But in what was
> described in the message I am responding to, I didn't see much
> convincing reason to argue _for_ keeping the contained objects
> ignorant of the container and forcing callers to pass both to
> functions that use both the container and contained to compute
> something.

My primary line of thinking is that adding the backpointer from struct
branch to struct repository adds "unnecessary" coupling between the two
objects, which causes the more concrete problems of:

* Inconsistency between other functions that use struct repository as a
  "context object". We now find ourselves having to justify why branch
  functions can bypass the context object using a special pointer,
  whereas other functions and structs (say, struct object) cannot.
* Interface misuse, where callers can now dereference branch->repository
  even though it is meant to be internal. This is also a possible source
  of inconsistency.
* Extra memory consumption.

A counterpoint to what I said is [1], where Jonathan eventually added
the backpointer from struct ref_store to struct repository, which does
give the nice benefits of referential integrity and abstraction that you
cited. However in most of that series, struct ref_store is meant to be
the context object, but due to poor internal abstraction, ref_store ends
up depending on repository in some form or another and thus the
backpointer is more defensible (as opposed to passing two context
objects). I do not think we are in the same position with struct branch;
struct branch seems sufficiently separated to me.

I'm not opposed to adding a backpointer if it helps us get to a good
final state, but I currently suspect that the final state still involves
passing around struct repository as a context object.

[1] https://lore.kernel.org/git/xmqqlf3gib0p.fsf@gitster.g/
Glen Choo Oct. 13, 2021, 10 p.m. UTC | #4
Junio C Hamano <gitster@pobox.com> writes:

> I am not you, so I can only speculate, but the real reason _could_
> be that it makes it simpler to formulate steps 2 and 3 mechanically.

This is somewhat true, however...

> I think that approach would make more work to get to the final
> state, though.

I found that if I were to add the backpointer, it would have been better
to add it early and avoid this extra work that you mention here :)
Junio C Hamano Oct. 13, 2021, 11:37 p.m. UTC | #5
Glen Choo <chooglen@google.com> writes:

> My primary line of thinking is that adding the backpointer from struct
> branch to struct repository adds "unnecessary" coupling between the two
> objects,...

meaning, that the paragraphs below will explain why the reference is
unnecessary?

> ...which causes the more concrete problems of:

> * Inconsistency between other functions that use struct repository as a
>   "context object". We now find ourselves having to justify why branch
>   functions can bypass the context object using a special pointer,
>   whereas other functions and structs (say, struct object) cannot.

Given that the API has been evolved over time from a "we deal only
with a single repository and everybody else can just fork into a
separate repository" to "now this function can work in the specified
repository", it is understandable that some still do take repository
as a separate parameter, even though the original counterpart that
did not take a repository pointer took an object that clearly can
belong to only one repository at a time.  They would need to be
cleaned up, and they do not make a good excuse to add more of them,
when we know we are dealing with objects that belong only to one
repository, like the "remote.c" contents we have been discussing.

IOW, we need to start somewhere to clean things up, so either we
teach multi-repository support to remote.c from day one, or we pass
a repository as a separate and redundant parameter, with the
intention to later clean up.  I suspect that this piece is isolated
enough that it is simpler to use as a starting point, and then the
callers into remote.c API can then be cleaned up, gradually extending
the circle.

The original counerpart may not have been about "the other object"
to begin with (e.g. "we have this string (not an object); resolve it
to an object name in the context of the given repository"), in which
case they are fine as they are, but I think most of the contents of
"remote.c" do not fall into this category.

> * Interface misuse, where callers can now dereference branch->repository
>   even though it is meant to be internal. This is also a possible source
>   of inconsistency.

I do not think it is meant to be internal to begin with.  If we
declare, after careful consideration, that an in-core branch object
given by remote.c API always belongs to one and only one repository,
then following the branch->repository pointer should be the
documented and kosher way to find out the repository by anybody who
is given the branch object.  A function that takes repo and branch
to pretend as if it can work with an inconsistent pair is an
invidation for the interface misuse, but I do not think it is a
misuse of the API for a codepath that obtained a branch instance
inspecting what repository it came from.

A function that takes a pair of remote and a branch object for
whatever reason may want to have an assert() that makes sure they
came from the same repository, by the way.

Also, some of the functions involved in this topic may need to say
"I have this string, please resolve it as an object name in the
context of this repository", and call a function that takes a
repository as "context".  The repository they use must be given by
their caller somehow---is there any valid case where it must be
different from the other parameter (i.e. contained object) they are
dealing with?  I do not think so.  So such a function would have to
say "resolve this string in the context of the repository
branch->repo" because the function was given an in-core branch
instance.

> * Extra memory consumption.

This is true.  It however is the only valid excuse I can fully agree
with for being hesitant to have "where did I come from" information.
We do grow each in-core branch and remote instance by an 8-byte
pointer (but we reduce the stackframe consumption by 8-byte along
each step in the callchain as we do not have to pass a repository as
a separate pointer parameter).

> objects). I do not think we are in the same position with struct branch;
> struct branch seems sufficiently separated to me.

Sufficiently separated in the sense that you take a branch from the
parent repository and throwing it at a function with a pointer to an
in-core repository for a submodule would make some sort of sense?  

Sorry, but I do not follow this part of your argument, at all.
Glen Choo Oct. 14, 2021, 1:25 a.m. UTC | #6
Junio C Hamano <gitster@pobox.com> writes:

> IOW, we need to start somewhere to clean things up, so either we
> teach multi-repository support to remote.c from day one, or we pass
> a repository as a separate and redundant parameter, with the
> intention to later clean up.

I authored this topic with the assumption that passing a 'redundant'
context object was the path forward, not something to clean up. I think
this explains most of the divergence in opinion.

> ...I suspect that this piece is isolated
> enough that it is simpler to use as a starting point, and then the
> callers into remote.c API can then be cleaned up, gradually extending
> the circle.

Agree that this piece is a good place to start such work.

> I do not think it is meant to be internal to begin with.  If we
> declare, after careful consideration, that an in-core branch object
> given by remote.c API always belongs to one and only one repository,
> then following the branch->repository pointer should be the
> documented and kosher way to find out the repository by anybody who
> is given the branch object.

In the specific case of an in-core object that can refer only to one
repository, I think this is a fairly straightforward way of maintaining
referential integrity. However I don't think that the approach of
"contained object pointing to container" generalizes well - if I were
passed a struct remote_state and I needed to access its repository
later, is the ideal interface a backpointer, or is there a deeper
problem with how things are structured?

IOW, whether or not we should use the backpointer seems to depend on the
specifics of each use case. It might be productive to decide on
guidelines, especially with regards to repository.

> ...A function that takes repo and branch
> to pretend as if it can work with an inconsistent pair is an
> invidation for the interface misuse.

Fair.

>> objects). I do not think we are in the same position with struct branch;
>> struct branch seems sufficiently separated to me.
>
> Sufficiently separated in the sense that you take a branch from the
> parent repository and throwing it at a function with a pointer to an
> in-core repository for a submodule would make some sort of sense?  

I meant "sufficiently separated" in the sense that a struct branch is
meaningful to callers, even in the absence of a repository. Even in the
case where we *might* care about the repository e.g. getting a default
for remote_for_branch(), there are callers that make do with just
branch->remote_name.

This was meant to contrast to ref_store, which relies on its specific
repository for much of its internals and is difficult to separate. I am
worried about the knock-on effects of taking a _relatively_ clean
abstraction layer in struct branch and tying it back to the repository
layer.

That said, you have rightly noted that we should not pretend that a
branch from repo A can be thrown together with a pointer to repo B in
any kind of meaningful way (or at least, not yet). A backpointer can
prevent this.

I'm not fully convinced either way so I'll take time to mull over it; I
would also like to start on the right foot with this subsystem.