mbox series

[RFC,0/2] share a config between submodule and superproject

Message ID 20210408233936.533342-1-emilyshaffer@google.com (mailing list archive)
Headers show
Series share a config between submodule and superproject | expand

Message

Emily Shaffer April 8, 2021, 11:39 p.m. UTC
Hi folks,

Especially after es/config-based-hooks makes its way to 'next' and
'master', we think it would be really useful to have a config that
applies to a whole "project" - where "project" means "superproject and
its submodules." There's a longer defense in patch 2, but that's the
speedy summary.

I'm not wild about the implementation of this solution as it calls out
to 'git rev-parse' and 'git ls-files' - not once, but twice! - and I
understand that is considerably slower on some platforms than others.
But I'm open to suggestions - I couldn't find any other examples of a
repo figuring out whether it's a submodule or not. (I thought there may
be some, as 'repository.submodule_prefix' exists, but it seems like
that's only set in some special cases during operations initated from
the superproject.)

I'm hoping to work on some other submodule-centric stuff over the coming
months, and it might end up being very useful to be able to tell "am I a
submodule?" and "how do I talk to my superproject?" more generally - so
I'm really open to figuring out a better way than this, if folks have
ideas.

Patch 1 is a small refactor that we can take or leave - I found
"SCOPE_SUBMODULE" to be pretty ambiguous, especially since it seems to
refer to configs from .gitmodules. Even though I decided that
"superproject" was a better name than "submodule" I still wasn't super
happy with the ambiguity. But we can drop it if folks don't want to
rename.

Thanks a bunch,
 - Emily

Emily Shaffer (2):
  config: rename "submodule" scope to "gitmodules"
  config: add 'config.superproject' file

 Documentation/git-config.txt   |  21 +++++-
 builtin/config.c               |  10 ++-
 config.c                       |  30 +++++++-
 config.h                       |   5 +-
 submodule-config.c             |   2 +-
 submodule.c                    |  29 ++++++++
 submodule.h                    |   6 ++
 t/t1311-superproject-config.sh | 124 +++++++++++++++++++++++++++++++++
 8 files changed, 220 insertions(+), 7 deletions(-)
 create mode 100755 t/t1311-superproject-config.sh

Comments

Ævar Arnfjörð Bjarmason April 14, 2021, 10:32 a.m. UTC | #1
On Fri, Apr 09 2021, Emily Shaffer wrote:

> I'm hoping to work on some other submodule-centric stuff over the coming
> months, and it might end up being very useful to be able to tell "am I a
> submodule?" and "how do I talk to my superproject?" more generally - so
> I'm really open to figuring out a better way than this, if folks have
> ideas.
>
> Patch 1 is a small refactor that we can take or leave - I found
> "SCOPE_SUBMODULE" to be pretty ambiguous, especially since it seems to
> refer to configs from .gitmodules. Even though I decided that
> "superproject" was a better name than "submodule" I still wasn't super
> happy with the ambiguity. But we can drop it if folks don't want to
> rename.

This is less on your patch, and more on the larger work you're
suggesting, but the two are kind of related. Skip to the paragraph
starting with "But why" below for the relevance :)

I very much wish that we could eventually make the use of submodules
totally transparent, i.e. (taking the example of git.git):

 * You clone, and we just get objects from
   https://github.com/cr-marcstevens/sha1collisiondetection.git too

 * The fact that we have:

   160000 commit 855827c583bc30645ba427885caa40c5b81764d2  sha1collisiondetection

   Would become totally invisible to most users unless they run some
   gutsy ls-tree/files comand.

   We used to have a full git dir at sha1collisiondetection/.git and all
   the UX issues that entailed (e.g. switching to an old commit without
   the submodule).

   Now it's a stub and the actual repo is at
   .git/modules/sha1collisiondetection/, so we're kind of partially
   there.

 * I would think that the next (but big) logical step would be to use
   some combination of delta islands, upcoming sparse indexes etc. to
   actually share the object stores of the parent and submodule.

   Things like "git fsck" which now just punt on COMMIT would need to
   become smarter, but e.g. we could repack (or not, with islands)
   between parent and submodule.

I would think that this end goal makes more sense than the current
status quo of teaching every command that needs to e.g. grep the tree to
have a "--recurse-submodules". The distinction would be invisible to the
likes of "git-grep".

It would mean more complexity in e.g. "git commit", but we can imagine
if you wanted a cross-submodule commit it could do those commits
recursively, update parent COMMIT entries etc. (and even, optionally,
push out the submodule changes). That particular thing being so ad-hoc
is a *very* frequent pain point in submodule use.

But why am I talking about this here when all you're suggesting is
another config level?

Well, I think (but have not carefully thought about) that this
CONFIG_SCOPE_GITMODULES is probably a narrow net improvement now. If you
set most options in your .git/config to you that's the same logical
project, why shouldn't you get your diff setting or whatever because you
cd'd to a submodule "in the same project" (from the view of the user).

But I think that for a wider "improve submodules" effort it's worth
someone (and right now, that sounds like it's you) thinking about where
we're going with the feature. Maybe with some technical doc identifying
the most common pain points, what we propose (or could envision) doing
about them.

So e.g. in this case, having per-submodule config could be a step
forward, but it could also be one more step of walking in a circle.

I.e. don't think any user asked for or wanted to stitch together
multiple .git directories into one linked pseudo-checkout, that's
ultimately something we're exposing as an implementation detail. If we
no longer expose that implementation detail, would we be stuck
supporting what's ultimately a workaround feature?

None of that means we shouldn't have that one step forward that solves
real problems today.

But I think we should think about the end goal(s) sooner than
later. E.g. in your case, do you *really* want another config level, or
is it just the easiest way to get what you actually want, which is for a
"git config" in the submodule dir to perhaps consider its .git/config
and .git/modules/sha1collisiondetection/config as the same file for the
purposes of config parsing? Sans things like the remote URLs etc.
Emily Shaffer April 15, 2021, 9:25 p.m. UTC | #2
On Wed, Apr 14, 2021 at 12:32:17PM +0200, Ævar Arnfjörð Bjarmason wrote:
> On Fri, Apr 09 2021, Emily Shaffer wrote:
> 
> > I'm hoping to work on some other submodule-centric stuff over the coming
> > months, and it might end up being very useful to be able to tell "am I a
> > submodule?" and "how do I talk to my superproject?" more generally - so
> > I'm really open to figuring out a better way than this, if folks have
> > ideas.
> >
> > Patch 1 is a small refactor that we can take or leave - I found
> > "SCOPE_SUBMODULE" to be pretty ambiguous, especially since it seems to
> > refer to configs from .gitmodules. Even though I decided that
> > "superproject" was a better name than "submodule" I still wasn't super
> > happy with the ambiguity. But we can drop it if folks don't want to
> > rename.
> 
> This is less on your patch, and more on the larger work you're
> suggesting, but the two are kind of related. Skip to the paragraph
> starting with "But why" below for the relevance :)
> 
> I very much wish that we could eventually make the use of submodules
> totally transparent, i.e. (taking the example of git.git):
> 
>  * You clone, and we just get objects from
>    https://github.com/cr-marcstevens/sha1collisiondetection.git too
> 
>  * The fact that we have:
> 
>    160000 commit 855827c583bc30645ba427885caa40c5b81764d2  sha1collisiondetection
> 
>    Would become totally invisible to most users unless they run some
>    gutsy ls-tree/files comand.
> 
>    We used to have a full git dir at sha1collisiondetection/.git and all
>    the UX issues that entailed (e.g. switching to an old commit without
>    the submodule).
> 
>    Now it's a stub and the actual repo is at
>    .git/modules/sha1collisiondetection/, so we're kind of partially
>    there.

Side note: when I was writing the tests for patch 2 in this series I
noticed it was still really easy to end up with a full git dir at e.g.
sha1collisiondetection/.git, if you are trying to create a new repo to
use as a submodule (easily could be the case when working on a
"greenfield" project and you're the original author). There is
definitely a reason that I copied the (IMO) hack from the other
submodule test suite using the trash directory as a remote for my new
submodule. ;) I wonder whether I was just doing it wrong, or if we need
some established flow (maybe with `git submodule` subcommand) to create
a brand new submodule, not cloned from somewhere, and put its gitdir
inside of .git/modules?

>  * I would think that the next (but big) logical step would be to use
>    some combination of delta islands, upcoming sparse indexes etc. to
>    actually share the object stores of the parent and submodule.

Interesting - I'm trying to think of reasons not to and coming up blank,
but I also don't have much firsthand experience with the area of the
code that looks through the object store, so what do I know?

>    Things like "git fsck" which now just punt on COMMIT would need to
>    become smarter, but e.g. we could repack (or not, with islands)
>    between parent and submodule.
> 
> I would think that this end goal makes more sense than the current
> status quo of teaching every command that needs to e.g. grep the tree to
> have a "--recurse-submodules". The distinction would be invisible to the
> likes of "git-grep".

Yeah, I see where you're going, I think. Teaching everyone
--recurse-submodules or to respect the config setting
(core.recurseSubmodules? whatever it is) is inherently fragile, since it
relies on human reviewers to remember to chide patch authors to think of
the submodules use case. Neat.

> It would mean more complexity in e.g. "git commit", but we can imagine
> if you wanted a cross-submodule commit it could do those commits
> recursively, update parent COMMIT entries etc. (and even, optionally,
> push out the submodule changes). That particular thing being so ad-hoc
> is a *very* frequent pain point in submodule use.

Yeah, it sounds like you're describing the approach I was hoping to use
for commit-with-recursion:

 - Note each submodule with staged changes, as well as the superproject
 - (Optional? but might be nice) Open an editor with all the commit
   messages separated by scissors, so you can easily refer back to or
   modify the submodule commit messages while writing the superproject
   commit message
 - Generate all the submodule commits with the supplied commit-msgs
 - Take the commit IDs of all the newly created commits, stage them in
   the superproject, and generate the superproject commit with the
   supplied commit-msg

Maybe the editor bit is too much, but Jonathan Nieder at least really
liked that idea :) But the bit you're talking about - generating the
submodules first and then staging and committing in the superproject "on
the fly" - was the approach I was hoping to take.

> 
> But why am I talking about this here when all you're suggesting is
> another config level?
> 
> Well, I think (but have not carefully thought about) that this
> CONFIG_SCOPE_GITMODULES is probably a narrow net improvement now. If you
> set most options in your .git/config to you that's the same logical
> project, why shouldn't you get your diff setting or whatever because you
> cd'd to a submodule "in the same project" (from the view of the user).
> 
> But I think that for a wider "improve submodules" effort it's worth
> someone (and right now, that sounds like it's you) thinking about where
> we're going with the feature. Maybe with some technical doc identifying
> the most common pain points, what we propose (or could envision) doing
> about them.
> 
> So e.g. in this case, having per-submodule config could be a step
> forward, but it could also be one more step of walking in a circle.
> 
> I.e. don't think any user asked for or wanted to stitch together
> multiple .git directories into one linked pseudo-checkout, that's
> ultimately something we're exposing as an implementation detail. If we
> no longer expose that implementation detail, would we be stuck
> supporting what's ultimately a workaround feature?
> 
> None of that means we shouldn't have that one step forward that solves
> real problems today.
> 
> But I think we should think about the end goal(s) sooner than
> later.

Yeah, this is actually a good nudge for me. Internally we've got a big
nice doc explaining all our submodule plans for the next 6-9 months -
but I should probably get to sharing that with the list ;) I'd say to
look for it either this Friday or next Friday.

> E.g. in your case, do you *really* want another config level, or
> is it just the easiest way to get what you actually want, which is for a
> "git config" in the submodule dir to perhaps consider its .git/config
> and .git/modules/sha1collisiondetection/config as the same file for the
> purposes of config parsing? Sans things like the remote URLs etc.

As for this specific case, I want what is in the patch. Using a new
config file doesn't feel like a compromise to me - I actually would
prefer users to be able to explicitly choose shared vs. repo-specific
configs, rather than for we Git devs to implicitly decide which configs
are fine to share and which aren't. (I could also see having explicitly
shared or non-shared configs making it easier for wrappers to leverage
the Git config infrastructure, without mirroring our own "list of
configs to not share to submodule" for themselves.)

This RFC is mostly here to enable shared hooks, as you might have
guessed - but even with hooks, it's easy to imagine wanting a blend of
inherited vs. per-repo hooks. For example, I want to inherit a hook to
create a Gerrit Change-Id footer in my superproject and all my
submodules, definitely - but if my superproject is written in C and
includes a submodule which is in, I dunno, Rust or Zig or Perl or
whatever people are writing these days, I definitely don't want to try
and run my C linter from my superproject on my 15 Rust submodules - and
I definitely don't want to disable it in each one.

 - Emily
Junio C Hamano April 15, 2021, 9:41 p.m. UTC | #3
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> I very much wish that we could eventually make the use of submodules
> totally transparent, i.e. (taking the example of git.git):
>
>  * You clone, and we just get objects from
>    https://github.com/cr-marcstevens/sha1collisiondetection.git too
>
>  * The fact that we have:
>
>    160000 commit 855827c583bc30645ba427885caa40c5b81764d2  sha1collisiondetection
> ...

I am afraid that the story is not that simple (I wish it were).

There are at last two opposing ways submodules are to be used.  The
original motivation was to borrow an external project as part of
your project, and the way we use SHA1DC is fairly close to it (but
not quite).  In the context of such a usage

	git commit -m "message" --recurse-submodules

would often not be an appropriate operation.  A message that is
suitable in the context of the entire project would not be in the
context of the project that is bound to your project as a submodule,
and for your changes to be reusable by the folks who own the borrowed
project to make sense, your change should be defensible on its own,
"it helps this project that happens to use you as a submodule" by
itself is not all that convincing.

The other way submodule often gets used is to artificially split a
logically single project into many subdirectories and make them into
separate repositories, the top-level project binding them as
submodules.  An submodule in such an arrangement may not even make
sense as a standalone project---this pattern was only brought into
usage because without the more recent inventions like lazy/partial
clones and sparse checkouts, large projects did not fit within a
single repository.

With such an arrangement, of course it makes perfect sense for
things like

	git commit -m "message" --recurse-submodules
	git grep --recurse-submodules

to work as if you are working inside a single repository, by
definition.  You are splitting a logically single project into
multiple submodules as a workaround, and then still wanting to
treat them as a single project, after all.

Supporting those who want to use "collection of submodules as if it
were a single monolithic project" well is a worthy goal, but I do
not think it is healthy to assume that is the only use and forget
about use cases that would benefit from a clear boundary at
submodules (e.g. not sharing commit log message, a change at the
toplevel project may consist of multiple changes at the submoudle
level, etc.).

Thanks.