[v9,0/3] teach submodules to know they're submodules

Message ID 20220310004423.2627181-1-emilyshaffer@google.com (mailing list archive)

Emily Shaffer March 10, 2022, 12:44 a.m. UTC
For the original cover letter, see
https://lore.kernel.org/git/20210611225428.1208973-1-emilyshaffer%40google.com.

CI run: https://github.com/nasamuffin/git/actions/runs/1954710601

Since v8:

Only a couple of minor fixes.

Junio pointed out that I could write the tests better using --type=bool
and 'test_cmp_config', and that we could be a little more careful about
when to give up on 'git rev-parse --show-superproject-working-tree'.

Glen mentioned that builtin/submodule--helper.c:run_update_procedure() is called
unconditionally earlier in the same function where I had added the
config in git-submodule.sh. So, I moved the config set into
submodule--helper.c to reduce possible edge cases where the config might
not be set.

Otherwise, this series is pretty much unchanged.

Since v7:

Actually a fairly large rework. Rather than keeping the path from gitdir
to gitdir, just keep a boolean under 'submodule.hasSuperproject'. The
idea is that from this boolean, we can decide whether to traverse the
filesystem looking for a superproject.

Because this simplifies the implementation, I compressed the three
middle commits into one. As proof-of-concept, I added a patch at the end
to check for this boolean when running `git rev-parse
--show-superproject-working-tree`.
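To make the "decide from the boolean" idea concrete, here is a minimal caller-side sketch in shell. It is not the code from patch 3; the helper name is hypothetical, and the policy it encodes (only ask for the superproject when the value reads as true) is an assumption made for illustration.

```shell
# Hypothetical helper, not part of this series: print the superproject's
# working tree only when the repository at "$1" records that it has one.
superproject_worktree_if_recorded () {
	repo=$1
	# --type=bool canonicalizes the stored value to "true" or "false".
	# An unset key makes "git config" exit non-zero with no output.
	has_super=$(git -C "$repo" config --type=bool --get \
		submodule.hasSuperproject 2>/dev/null) || has_super=
	if test "$has_super" != "true"
	then
		# Assumed policy: nothing recorded, so skip the filesystem walk.
		return 1
	fi
	git -C "$repo" rev-parse --show-superproject-working-tree
}
```

A caller would run something like `superproject_worktree_if_recorded path/to/sub` and treat a non-zero exit as "not known to be a submodule".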

One thing I'm not sure about: in the tests, I check whether the config
is set, but not what the boolean value of it is. Is there a better way
to do that? For example, I could imagine someone deciding to set
`submodule.hasSuperproject = false` and the tests would not function
correctly in that case. I think we don't really normalize the value on a
boolean config like that, so I didn't want to write a lot of comparison
to check if the value is 1 or true or True or TRUE or Yes or .... Am I
overthinking it?
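For what it's worth, 'git config --type=bool' already canonicalizes every accepted spelling to "true" or "false" on read, so a comparison against one string is enough. A rough sketch outside the test suite follows; the helper name is made up for this example.

```shell
# Hypothetical helper: succeed only if key "$2" in repository "$1" is set
# to a spelling Git treats as boolean true (true, yes, on, 1, any case).
config_is_true () {
	repo=$1
	key=$2
	# An unset key, or a value that is not a valid boolean, makes
	# "git config" fail; treat both as "not true".
	value=$(git -C "$repo" config --type=bool --get "$key" 2>/dev/null) ||
		return 1
	test "$value" = "true"
}
```

With this, `config_is_true sub submodule.hasSuperproject` gives the same answer whether the file says `true`, `Yes` or `1`, and fails for `false` or an unset key.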

The other thing I'm not sure about: since it's just a bool, we're not
restricted to setting this config only when we have both gitdir paths
available. That makes me want to set the config any time we are doing
something with submodules anyway, like any time 'git-submodule--helper'
is used. But that helper seems to be called in the context of the
superproject, not of the submodules, so adding this config for each
submodule we touch would be a second child process. Is there some other
common entry point for submodules that we can use?

 - Emily

Since v6:

I've dropped the fifth commit to use this new config for `git rev-parse
--show-superproject-working-tree`. I think it did more harm than good -
that tool uses an odd way to determine whether the superproject is
actually the superproject, anyways.

I poked a little bit at trying to find some benchmark to demonstrate
that "submodule.superprojectGitDir" is actually faster - but I didn't
end up able to write one without writing a ton of new code to traverse
the filesystem. To be honest, I'm not all that interested in performance
- I want the config added for correctness, instead.

So, the only real changes between v6 and v7 are some documentation
changes suggested by Jonathan Tan
(https://lore.kernel.org/git/20211117234300.2598132-1-jonathantanmy%40google.com).

Since v5:

A couple things. Firstly, a semantics change *back* to the semantics of
v3 - we map from gitdir to gitdir, *not* from common dir to common dir,
so that theoretically a submodule with multiple worktrees in multiple
superproject worktrees will be able to figure out which worktree of the
superproject it's in. (Realistically, that's not really possible right
now, but I'd like to change that soon.)

Secondly, a rewording of comments and commit messages to indicate that
this isn't a cache of some expensive operation, but rather intended to
be the source of truth for all submodules. I also added a fifth commit
rewriting `git rev-parse --show-superproject-working-tree` to
demonstrate what that means in practice - but from a practical
standpoint, I'm a little worried about that fifth patch. More details in
the patch 5 description.

I did discuss Ævar's idea of relying on in-process filesystem digging to
find the superproject's gitdir with the rest of the Google team, but in
the end decided that there are some worries about filesystem digging in
this way (namely, some ugly interactions with network drives that are
actually already an issue for Googler Linux machines). Plus, the allure
of being able to definitively know that we're a submodule is pretty
strong. ;) But overall, this is the direction I'd prefer to keep going
in, rather than trying to guess from the filesystem going forward.

Since v4:

The only real change here is a slight semantics change to map from
<submodule gitdir> to <superproject common git dir>. In every case
*except* for when the superproject has a worktree, this changes nothing.
For the case when the superproject has a worktree, this means that now
submodules will refer to the general superproject common dir (e.g. no
worktree-specific refs or configs or whatnot).

I *think* that because a submodule should exist in the context of the
common dir, not the worktree gitdir, that is ok. However, it does mean
it would be difficult to do something like sharing a config specific to
the worktree (the initial goal of this series).

$ROOT/.git
$ROOT/.git/config.superproject <- shared by $ROOT/.git/modules/sub
$ROOT/.git/modules/sub <- points to $ROOT/.git
$ROOT/.git/worktrees/wt
$ROOT/.git/worktrees/wt/config.superproject <- contains a certain config-based pre-commit hook

If the submodule only knows about the common dir, that is tough, because
the submodule would basically have to guess which worktree it's in from
its own path. There would be no way for '$WT/sub' to inherit
'$ROOT/.git/worktrees/wt/config.superproject'.

That said... right now, we don't support submodules in worktrees very
well at all. A submodule in a worktree will get a brand new gitdir in
$ROOT/.git/worktrees/modules/ (and that brand new gitdir would point to
the super's common dir). So I think we can punt on this entire question
until we teach submodules and worktrees to play more gracefully together
(it's on my long list...), and at that time we can probably introduce a
pointer from $ROOT/.git/modules/sub/worktrees/wt/ to
$ROOT/.git/worktrees/wt/....

Or, to summarize the long ramble above: "this is still kind of weird
with worktrees, but let's fix it later when we fix worktrees more
thoroughly".

(More rambling about worktree weirdness here:
https://lore.kernel.org/git/YYRaII8YWVxlBqsF%40google.com )


Since v3, a pretty major change: the semantics of
submodule.superprojectGitDir has changed, to point from the submodule's
gitdir to the superproject's gitdir (in v3 and earlier, we kept a path
from the submodule's *worktree* to the superproject's gitdir instead).
This cleans up some of the confusions about the behavior when a
submodule worktree moves around in the superproject's tree, or in a
future when we support submodules having multiple worktrees.

I also tried to simplify the tests to use 'test-tool path-utils
relative_path' everywhere - I think that makes them much more clear for
a test reader, but if you're reviewing and it isn't obvious what we're
testing for, please speak up.

I think this is pretty mature and there was a lot of general agreement
that the gitdir->gitdir association was the way to go, so please be
brutal and look for nits, leaks, etc. this round ;)
[/v4 cover letter]

Emily Shaffer (3):
  t7400-submodule-basic: modernize inspect() helper
  introduce submodule.hasSuperproject record
  rev-parse: short-circuit superproject worktree when config unset

 Documentation/config/submodule.txt |  6 ++++
 builtin/submodule--helper.c        | 11 +++++++
 submodule.c                        | 30 ++++++++++++++++++
 t/t1500-rev-parse.sh               | 10 +++++-
 t/t7400-submodule-basic.sh         | 42 ++++++++++++-------------
 t/t7406-submodule-update.sh        |  8 +++++
 t/t7412-submodule-absorbgitdirs.sh | 50 ++++++++++++++++++++++++++++--
 7 files changed, 131 insertions(+), 26 deletions(-)

Range-diff against v8:
-:  ---------- > 1:  251510c687 t7400-submodule-basic: modernize inspect() helper
1:  34cbfd81ee ! 2:  da01dc7c10 introduce submodule.hasSuperproject record
    @@ builtin/submodule--helper.c: static int clone_submodule(struct module_clone_data
      	free(sm_alternate);
      	free(error_strategy);
      
    -
    - ## git-submodule.sh ##
    -@@ git-submodule.sh: cmd_update()
    - 			;;
    - 		esac
    +@@ builtin/submodule--helper.c: static int run_update_procedure(int argc, const char **argv, const char *prefix)
    + 
    + 	free(prefixed_path);
      
    -+		# Note that the submodule is a submodule.
    -+		git -C "$sm_path" config submodule.hasSuperproject "true"
    ++	/*
    ++	 * This entry point is always called from a submodule, so this is a
    ++	 * good place to set a hint that this repo is a submodule.
    ++	 */
    ++	git_config_set("submodule.hasSuperproject", "true");
     +
    - 		if test -n "$recursive"
    - 		then
    - 			(
    + 	if (!oideq(&update_data.oid, &update_data.suboid) || update_data.force)
    + 		return do_run_update_procedure(&update_data);
    + 
     
      ## submodule.c ##
     @@ submodule.c: static void relocate_single_git_dir_into_superproject(const char *path)
    @@ submodule.c: static void relocate_single_git_dir_into_superproject(const char *p
      
      	relocate_gitdir(path, real_old_git_dir, real_new_git_dir);
      
    -+	/*
    -+	 * Note location of superproject's gitdir. Because the submodule already
    -+	 * has a gitdir and local config, we can store this pointer from
    -+	 * worktree config to worktree config, if the submodule has
    -+	 * extensions.worktreeConfig set.
    -+	 */
     +	strbuf_addf(&config_path, "%s/config", real_new_git_dir);
     +	git_configset_init(&sub_cs);
     +	git_configset_add_file(&sub_cs, config_path.buf);
    @@ t/t7400-submodule-basic.sh: inspect() {
      	git -C "$sub_dir" diff-files --exit-code &&
     +
     +	# Ensure that submodule.hasSuperproject is set.
    -+	git -C "$sub_dir" config "submodule.hasSuperproject"
    ++	test_cmp_config -C "$sub_dir" true --type=bool "submodule.hasSuperproject"
     +
      	git -C "$sub_dir" clean -n -d -x >untracked
      }
    @@ t/t7406-submodule-update.sh: test_expect_success 'submodule update --quiet passe
      
     +test_expect_success 'submodule update adds submodule.hasSuperproject to older repos' '
     +	(cd super &&
    -+	 git -C submodule config --unset submodule.hasSuperproject &&
    ++	 test_unconfig submodule.hasSuperproject &&
     +	 git submodule update &&
    -+	 git -C submodule config submodule.hasSuperproject
    ++	 test_cmp_config -C submodule true --type=bool submodule.hasSuperproject
     +	)
     +'
     +
    @@ t/t7412-submodule-absorbgitdirs.sh: test_expect_success 'absorb the git dir' '
     -	test_cmp expect.2 actual.2
     +	test_cmp expect.2 actual.2 &&
     +
    -+	git -C sub1 config submodule.hasSuperproject
    ++	test_cmp_config -C sub1 true --type=bool submodule.hasSuperproject
      '
      
      test_expect_success 'absorbing does not fail for deinitialized submodules' '
    @@ t/t7412-submodule-absorbgitdirs.sh: test_expect_success 'absorb the git dir in a
     -	test_cmp expect.2 actual.2
     +	test_cmp expect.2 actual.2 &&
     +
    -+	git -C sub1/nested config submodule.hasSuperproject
    ++	test_cmp_config -C sub1/nested true --type=bool submodule.hasSuperproject
      '
      
      test_expect_success 're-setup nested submodule' '
    @@ t/t7412-submodule-absorbgitdirs.sh: test_expect_success 'absorbing fails for a s
     +	git submodule absorbgitdirs sub4 &&
     +
     +	# make sure the submodule noted the superproject
    -+	git -C sub4 config submodule.hasSuperproject
    ++	test_cmp_config -C sub4 true --type=bool submodule.hasSuperproject
     +	)
     +'
     +
    @@ t/t7412-submodule-absorbgitdirs.sh: test_expect_success 'absorbing fails for a s
     +	git submodule absorbgitdirs sub5 &&
     +
     +	# make sure the submodule noted the superproject
    -+	git -C sub5 config submodule.hasSuperproject
    ++	test_cmp_config -C sub5 true --type=bool submodule.hasSuperproject
     +	)
     +'
     +
2:  c14ee8760f < -:  ---------- rev-parse: short-circuit superproject worktree when config unset
-:  ---------- > 3:  1893a84fdc rev-parse: short-circuit superproject worktree when config unset

Comments

Ævar Arnfjörð Bjarmason March 11, 2022, 9:09 a.m. UTC | #1
On Wed, Mar 09 2022, Emily Shaffer wrote:

> For the original cover letter, see
> https://lore.kernel.org/git/20210611225428.1208973-1-emilyshaffer%40google.com.
>
> CI run: https://github.com/nasamuffin/git/actions/runs/1954710601
>
> Since v8:
>
> Only a couple of minor fixes.
>
> Junio pointed out that I could write the tests better using --type=bool
> and 'test_cmp_config', and that we could be a little more careful about
> when to give up on 'git rev-parse --show-superproject-working-tree'.
>
> Glen mentioned that builtin/submodule--helper.c:run_update_procedure() is called
> unconditionally earlier in the same function where I had added the
> config in git-submodule.sh. So, I moved the config set into
> submodule--helper.c to reduce possible edge cases where the config might
> not be set.
>
> Otherwise, this series is pretty much unchanged.
>
> Since v7:
>
> Actually a fairly large rework. Rather than keeping the path from gitdir
> to gitdir, just keep a boolean under 'submodule.hasSuperproject'. The
> idea is that from this boolean, we can decide whether to traverse the
> filesystem looking for a superproject.
>
> Because this simplifies the implementation, I compressed the three
> middle commits into one. As proof-of-concept, I added a patch at the end
> to check for this boolean when running `git rev-parse
> --show-superproject-working-tree`.
>
> One thing I'm not sure about: in the tests, I check whether the config
> is set, but not what the boolean value of it is. Is there a better way
> to do that? For example, I could imagine someone deciding to set
> `submodule.hasSuperproject = false` and the tests would not function
> correctly in that case. I think we don't really normalize the value on a
> boolean config like that, so I didn't want to write a lot of comparison
> to check if the value is 1 or true or True or TRUE or Yes or .... Am I
> overthinking it?
>
> The other thing I'm not sure about: since it's just a bool, we're not
> restricted to setting this config only when we have both gitdir paths
> available. That makes me want to set the config any time we are doing
> something with submodules anyway, like any time 'git-submodule--helper'
> is used. But that helper seems to be called in the context of the
> superproject, not of the submodules, so adding this config for each
> submodule we touch would be a second child process. Is there some other
> common entry point for submodules that we can use?

I really don't mean to bring up the same points again, but I'm still
genuinely unsure what this is intended to solve in the end.

I.e. from the original RFC we went from it being for optimizations for
the shellscript "git rev-parse", to suggestions that the configured path
would be "canonical" in a way we couldn't discover on-the-fly (i.e. some
of Jonathan's noted edge cases [1]).

But now it's a boolean indicating "it's there, discover it", and the
implied (but not really explicitly stated) reason in 2/3 is that it's
purely for optimization purposes at this point.

But it's an optimization without a benchmark.

In [1] Jonathan (if I understood it correctly, see [2]) might have
suggested this is important to deal with some Google in-house NFS-a-like
auto-mounting software, i.e. the "walking up" is truly expensive in some
scenarios.

I do worry a bit that we'll be creating behavior edge cases related to
this. If the problem being solved is specific to a relatively obscure
setup, is it worth it? And if so, perhaps there should be an "I need
this optimization" setting guarding it?

But I don't know, a concrete case where this series makes a difference
would really help.

I tried to come up with one before[3] and all I could find was fleeting
cases we'd see go away with the migration of the remaining parts of
git-submodule.sh to C, which we already have in-flight patches for (or
rather, Glen is AFAIK at series 1/2 of submitting those, with 1/2
in-flight).

In any case I think lifting the bits of [3] where we assert that this
doesn't introduce any behavior change with a GIT_TEST_* knob would be
valuable.

I.e. as long as the intent isn't a behavior change let's test that
get_superproject_working_tree() doesn't need this across the entire test
suite, with specific tests that opt-in to the behavior (or do a whole
test suite run in that mode), rather than the default being
opt-out.

An opt-out is just a recipe for growing accidental implicit
dependencies, which explicitly isn't what we want for a "just an
optimization" knob. We do the same sort of opt-in/opt-out testing for
e.g. split index, untracked cache etc (see the GIT_TEST_* bits in
ci/run-build-and-tests.sh). AFAICT a fix-up of just adding the
git_env_bool() here to this code in your 3/3 would do it:

	if (!git_env_bool("GIT_TEST_NO_SUBMODULE_HAS_SUPERPROJECT", 0) &&
	    !git_config_get_bool("submodule.hassuperproject", &has_superproject_cfg)
	    && !has_superproject_cfg)

And then adding GIT_TEST_NO_SUBMODULE_HAS_SUPERPROJECT=true to
linux-TEST-vars in ci/run-build-and-tests.sh. The tests that do rely on
submodule.hassuperproject would need to set
GIT_TEST_NO_SUBMODULE_HAS_SUPERPROJECT=false of course...
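To spell out when the branch guarded by that three-part condition is taken, here is the same logic as a small shell truth function. It only models the condition; the function name and the "unset" token are inventions for this sketch, and the knob itself is still just a proposal.

```shell
# Models the C condition quoted above.
#   $1: value of the proposed GIT_TEST_NO_SUBMODULE_HAS_SUPERPROJECT knob
#       ("true" or "false")
#   $2: state of submodule.hasSuperproject ("true", "false" or "unset")
# Succeeds only when the knob is off and the config is explicitly false,
# i.e. git_config_get_bool() found the key and stored 0.
config_branch_taken () {
	knob=$1
	cfg=$2
	test "$knob" != "true" && test "$cfg" = "false"
}
```

So `config_branch_taken false false` succeeds, while turning the knob on, leaving the config unset, or setting it to true all fall through to the existing discovery code.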

1. https://lore.kernel.org/git/YgF5V2Y0Btr8B4cd@google.com/
2. https://lore.kernel.org/git/220212.864k53yfws.gmgdl@evledraar.gmail.com/
3. https://lore.kernel.org/git/RFC-cover-0.2-00000000000-20211117T113134Z-avarab@gmail.com/

Junio C Hamano March 13, 2022, 5:43 a.m. UTC | #2
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> But now it's a boolean indicating "it's there, discover it", and the
> implied (but not really explicitly stated) reason in 2/3 is that it's
> purely for optimization purposes at this point.

You may know that I have a separate checkout of the 'todo' branch at
path "Meta" in my working tree.

I could use the hasSuperproject=false setting there, to say "this is
*NOT* a submodule, even though the parent directory is a working tree of a
different repository, it is not our superproject, so do *NOT* bother
to go up to discover anything".

If that configuration weren't there in the "Meta/.git/config", the
parent directory of "Meta" (which has its own ".git") cannot tell if
that "Meta" thing is a submodule being prepared that hasn't been
added yet, or is never intended to be a submodule.  I would
imagine that "git add X" can later be taught to refuse to add X if
there is X/.git and X/.git/config explicitly says that it
does not have a superproject.
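A rough sketch of the check such a "git add X" could make, written as a shell function. Nothing like this exists in "git add" today; the function name is hypothetical and the use of --local (so that only X's own config counts) is an assumption.

```shell
# Hypothetical check, not existing "git add" behavior: succeed if the
# directory "$1" has its own .git and explicitly declares that it has
# no superproject.
declares_no_superproject () {
	dir=$1
	# Only a directory with its own .git can carry such a declaration.
	test -e "$dir/.git" || return 1
	# --local restricts the lookup to the repository's own config file.
	value=$(git -C "$dir" config --local --type=bool --get \
		submodule.hasSuperproject 2>/dev/null) || return 1
	test "$value" = "false"
}
```

For the "Meta" checkout above, `declares_no_superproject Meta` would succeed once `submodule.hasSuperproject=false` is in Meta/.git/config, and fail while the key is unset or true.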

So, I am not sure if it is a good characterization that it is for
optimization at all.