Message ID | 20220210044152.78352-8-chooglen@google.com (mailing list archive) |
---|---|
State | Superseded |
Series | fetch --recurse-submodules: fetch unpopulated submodules |
Glen Choo <chooglen@google.com> writes: > @@ -1273,10 +1277,6 @@ static void calculate_changed_submodule_paths(struct repository *r, > struct strvec argv = STRVEC_INIT; > struct string_list_item *name; > > - /* No need to check if there are no submodules configured */ > - if (!submodule_from_path(r, NULL, NULL)) > - return; > - It looks to me that this hunk reverts 18322bad (fetch: skip on-demand checking when no submodules are configured, 2011-09-09), which tried to avoid high cost computation when we know there is no submodule. Intended? Perhaps it should be replaced with an equivalent check that (1) still says "we do care about submodules" even if the current checkout has no submodules (i.e. ls-files shows no gitlinks), but (2) says "no, there is nothing interesting" when $GIT_COMMON_DIR/modules/ is empty or some other cheap check we can use? > +get_fetch_task_from_index(struct submodule_parallel_fetch *spf, > + const char **default_argv, struct strbuf *err) > { > - for (; spf->count < spf->r->index->cache_nr; spf->count++) { > - const struct cache_entry *ce = spf->r->index->cache[spf->count]; > + for (; spf->index_count < spf->r->index->cache_nr; spf->index_count++) { > + const struct cache_entry *ce = > + spf->r->index->cache[spf->index_count]; > struct fetch_task *task; > > if (!S_ISGITLINK(ce->ce_mode)) > @@ -1495,6 +1499,15 @@ get_fetch_task(struct submodule_parallel_fetch *spf, > if (!task) > continue; > > + /* > + * We might have already considered this submodule > + * because we saw it when iterating the changed > + * submodule names. 
> + */ > + if (string_list_lookup(&spf->seen_submodule_names, > + task->sub->name)) > + continue; > + > switch (get_fetch_recurse_config(task->sub, spf)) > { > default: > @@ -1542,7 +1555,69 @@ get_fetch_task(struct submodule_parallel_fetch *spf, > strbuf_addf(err, _("Fetching submodule %s%s\n"), > spf->prefix, ce->name); > > - spf->count++; > + spf->index_count++; > + return task; > + } > + return NULL; > +} Sorry, but I am confused. If we are gathering which submodules to fetch from the changes to gitlinks in the range of superproject changes, why do we even need to scan the index (i.e. the current checkout in the superproject) to begin with? If it was changed, we'd know get_fetch_task_from_changed() would take care of it, and if there was no change to the submodule between the superproject's commits before and after the fetch, there is nothing gained from fetching in the submodules, no? > +static struct fetch_task * > +get_fetch_task_from_changed(struct submodule_parallel_fetch *spf, > + const char **default_argv, struct strbuf *err) > +{ > @@ -1553,7 +1628,10 @@ static int get_next_submodule(struct child_process *cp, struct strbuf *err, > { > struct submodule_parallel_fetch *spf = data; > const char *default_argv = NULL; > - struct fetch_task *task = get_fetch_task(spf, &default_argv, err); > + struct fetch_task *task = > + get_fetch_task_from_index(spf, &default_argv, err); > + if (!task) > + task = get_fetch_task_from_changed(spf, &default_argv, err); Hmph, interesting. So if "from index" grabbed some submodules already, then the "from the changes in the superproject, we know these submodules need refreshing" does not happen at all? I am afraid that I am still not following this...
Glen Choo <chooglen@google.com> writes: > submodule.c has a seemingly-unrelated change that teaches the "find > changed submodules" rev walk to call is_repository_shallow(). This fixes > what I believe is a legitimate bug - the rev walk would fail on a > shallow repo. > > Our test suite did not catch this prior to this commit because we skip > the rev walk if .gitmodules is not found, and thus the test suite did > not attempt the rev walk on a shallow clone. After this commit, > we always attempt to find changed submodules (regardless of whether > there is a .gitmodules file), and the test suite noticed the bug. Is this bug present without the other code introduced in this patch? If yes, it's better to put the bugfix in a separate patch with a test that would have failed but now passes. Some more high-level comments: > @@ -1273,10 +1277,6 @@ static void calculate_changed_submodule_paths(struct repository *r, > struct strvec argv = STRVEC_INIT; > struct string_list_item *name; > > - /* No need to check if there are no submodules configured */ > - if (!submodule_from_path(r, NULL, NULL)) > - return; I think this is removed because "no submodules configured" here actually means "no submodules configured in the index", but submodules may be configured in the superproject commits we're fetching. I wonder if this should be mentioned in the commit message, but I'm OK either way. > struct submodule_parallel_fetch { > - int count; > + int index_count; > + int changed_count; Here (and elsewhere) we're checking both the index and the superproject commits for .gitmodules. Do we still need to check the index? > @@ -1495,6 +1499,15 @@ get_fetch_task(struct submodule_parallel_fetch *spf, > if (!task) > continue; > > + /* > + * We might have already considered this submodule > + * because we saw it when iterating the changed > + * submodule names. 
> + */ > + if (string_list_lookup(&spf->seen_submodule_names, > + task->sub->name)) > + continue; [snip] > + /* > + * We might have already considered this submodule > + * because we saw it in the index. > + */ > + if (string_list_lookup(&spf->seen_submodule_names, item.string)) > + continue; Hmm...it's odd that the checks happen in both places, when theoretically we would do one after the other, so this check would only need to be in one place. Maybe this is because of how we had to implement it (looping over everything every time when we get the next fetch task) but if it's easy to avoid, that would be great. > +# Cleans up after tests that checkout branches other than the main ones > +# in the tests. > +checkout_main_branches() { > + git -C downstream checkout --recurse-submodules super && > + git -C downstream/submodule checkout --recurse-submodules sub && > + git -C downstream/submodule/subdir/deepsubmodule checkout --recurse-submodules deep > +} If we need to clean up in this way, I think it's better if we store a pristine copy somewhere (e.g. pristine-downstream), delete downstream, and copy it over when we need to. > +# Test that we can fetch submodules in other branches by running fetch > +# in a branch that has no submodules. > +test_expect_success 'setup downstream branch without submodules' ' > + ( > + cd downstream && > + git checkout --recurse-submodules -b no-submodules && > + rm .gitmodules && > + git rm submodule && > + git add .gitmodules && > + git commit -m "no submodules" && > + git checkout --recurse-submodules super > + ) > +' The tip of the branch indeed doesn't have any submodules, but when fetching this branch, we might end up fetching some of the tip's ancestors (depending on the repo we're fetching into), which do have submodules. If we need a branch without submodules, I think that all ancestors should not have submodules too. 
That might be an argument for creating our own downstream and upstream repos instead of reusing the existing ones. > +test_expect_success "'--recurse-submodules=on-demand' should fetch submodule commits if the submodule is changed but the index has no submodules" ' > + test_when_finished "checkout_main_branches" && > + git -C downstream fetch --recurse-submodules && > + # Create new superproject commit with updated submodules > + add_upstream_commit && > + ( > + cd submodule && > + ( > + cd subdir/deepsubmodule && > + git fetch && Hmm...I thought submodule/subdir/deepsubmodule is upstream. Why is it fetching? > + # Fetch the new superproject commit > + ( > + cd downstream && > + git switch --recurse-submodules no-submodules && > + git fetch --recurse-submodules=on-demand >../actual.out 2>../actual.err && > + git checkout --recurse-submodules origin/super 2>../actual-checkout.err This patch set is about fetching, so the checkout here seems odd. To verify that the fetch happened successfully, I think that we should obtain the hashes of the commits that we expect to be fetched from upstream, and then verify that they are present downstream. > + # Assert that we can checkout the superproject commit with --recurse-submodules > + ! grep -E "error: Submodule .+ could not be updated" actual-checkout.err Negative greps are error-prone, since they will also appear to work if the message was just misspelled. We should probably check that the expected commit is present instead. > +# Test that we properly fetch the submodules in the index as well as > +# submodules in other branches. 
> +test_expect_success 'setup downstream branch with other submodule' ' > + mkdir submodule2 && > + ( > + cd submodule2 && > + git init && > + echo sub2content >sub2file && > + git add sub2file && > + git commit -a -m new && > + git branch -M sub2 > + ) && > + git checkout -b super-sub2-only && > + git submodule add "$pwd/submodule2" submodule2 && > + git commit -m "add sub2" && > + git checkout super && > + ( > + cd downstream && > + git fetch --recurse-submodules origin && > + git checkout super-sub2-only && > + # Explicitly run "git submodule update" because sub2 is new > + # and has not been cloned. > + git submodule update --init && > + git checkout --recurse-submodules super > + ) > +' I couldn't see the submodule in the index to be fetched; maybe it's there somewhere but it's not obvious to me. Also, why do we need to run "git submodule update"? This patch set concerns itself with fetching existing submodules, not cloning new ones. > +test_expect_success "'--recurse-submodules' should fetch submodule commits in changed submodules and the index" ' Same comment about where in the index is the submodule to be fetched.
Junio C Hamano <gitster@pobox.com> writes: >> +static struct fetch_task * >> +get_fetch_task_from_changed(struct submodule_parallel_fetch *spf, >> + const char **default_argv, struct strbuf *err) >> +{ > >> @@ -1553,7 +1628,10 @@ static int get_next_submodule(struct child_process *cp, struct strbuf *err, >> { >> struct submodule_parallel_fetch *spf = data; >> const char *default_argv = NULL; >> - struct fetch_task *task = get_fetch_task(spf, &default_argv, err); >> + struct fetch_task *task = >> + get_fetch_task_from_index(spf, &default_argv, err); >> + if (!task) >> + task = get_fetch_task_from_changed(spf, &default_argv, err); > > Hmph, interesting. So if "from index" grabbed some submodules > already, then the "from the changes in the superproject, we know > these submodules need refreshing" does not happen at all? I am afraid > that I am still not following this... Hm, perhaps the following will help: - get_next_submodule() is an iterator, specifically, it is a get_next_task_fn passed to run_processes_parallel_tr2(). It gets called until it is exhausted. - Since get_next_submodule() is an iterator, I've implemented get_fetch_task_from_index() and get_fetch_task_from_changed() as iterators (they return NULL when they are exhausted). So in practice: - We repeatedly call get_next_submodule(), which tries to get a fetch task by calling the get_fetch_task_* functions. - If get_fetch_task_from_index() returns non-NULL, get_next_submodule() uses that fetch task. - Eventually, we will have considered every submodule in the index. At that point, get_fetch_task_from_index() is exhausted and always returns NULL. - Since get_fetch_task_from_index() returns NULL, get_next_submodule() now gets its fetch tasks from get_fetch_task_from_changed(). - Eventually, we will also have considered every changed submodule, and get_fetch_task_from_changed() is exhausted. - get_next_submodule() has now been exhausted and we are done.
As for the other questions, I'll dig a bit deeper before getting back to you with answers.
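[Editor's illustration] Glen's description is the classic resumable-generator shape. A toy model of it — integers standing in for index entries and changed-submodule names, every identifier made up for illustration rather than taken from git's real API — might look like:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Miniature version of the pattern described above: each producer
 * keeps its cursor in shared state, skips uninteresting entries, and
 * returns one item per call, or NULL once exhausted. Negative values
 * play the role of filtered-out entries (like non-gitlink index
 * entries, or submodules with recursion disabled).
 */
struct spf_sketch {
	const int *index_items;   int index_nr;   int index_count;
	const int *changed_items; int changed_nr; int changed_count;
};

static const int *task_from_index(struct spf_sketch *s)
{
	for (; s->index_count < s->index_nr; s->index_count++) {
		const int *item = &s->index_items[s->index_count];
		if (*item < 0)
			continue;	/* filtered, like !S_ISGITLINK(...) */
		s->index_count++;	/* resume after this entry next call */
		return item;
	}
	return NULL;
}

static const int *task_from_changed(struct spf_sketch *s)
{
	for (; s->changed_count < s->changed_nr; s->changed_count++) {
		const int *item = &s->changed_items[s->changed_count];
		if (*item < 0)
			continue;
		s->changed_count++;
		return item;
	}
	return NULL;
}

/* Shaped like get_next_submodule(): drain one generator, then the other. */
static const int *next_task(struct spf_sketch *s)
{
	const int *task = task_from_index(s);
	if (!task)
		task = task_from_changed(s);
	return task;
}
```

The two signs Junio mentions are both visible here: the loops start from a saved cursor rather than zero, and they return as soon as they find one acceptable item.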
Glen Choo <chooglen@google.com> writes: > Junio C Hamano <gitster@pobox.com> writes: > >>> +static struct fetch_task * >>> +get_fetch_task_from_changed(struct submodule_parallel_fetch *spf, >>> + const char **default_argv, struct strbuf *err) >>> +{ >> >>> @@ -1553,7 +1628,10 @@ static int get_next_submodule(struct child_process *cp, struct strbuf *err, >>> { >>> struct submodule_parallel_fetch *spf = data; >>> const char *default_argv = NULL; >>> - struct fetch_task *task = get_fetch_task(spf, &default_argv, err); >>> + struct fetch_task *task = >>> + get_fetch_task_from_index(spf, &default_argv, err); >>> + if (!task) >>> + task = get_fetch_task_from_changed(spf, &default_argv, err); >> >> Hmph, interesting. So if "from index" grabbed some submodules >> already, then the "from the changes in the superproject, we know >> these submodules need refreshing" does not happen at all? I am afraid >> that I am still not following this... > > Hm, perhaps the following will help: > > - get_next_submodule() is an iterator, specifically, it is a > get_next_task_fn passed to run_processes_parallel_tr2(). It gets > called until it is exhausted. Ahh, yeah, I totally forgot how we designed these things to work. Even though these functions have a loop, (1) they start iterating at the point where they left off in the last call, and (2) they return as soon as they find the first item in the loop, which should have stood out as a typical generator pattern, but somehow I missed these signs. > So in practice: > ... > - get_next_submodule() has now been exhausted and we are done. But my original question (based on my misunderstanding that a single call to these would grab all submodules that need fetching by inspecting either the index or the history) still stands, doesn't it? Presumably the "history scan" part is because we assume that we already had all the necessary submodule commits to check out any superproject commits before this recursive fetch started.
That is the reason why we do not scan the history behind the "old tips". We inspect only the history newer than them, leading to the "new tips", and try to grab all submodule commits that newly appear, to ensure that we can check out all the superproject commits we just obtained and have no missing submodule commits necessary to do so. Combined with the assumption on the state before this fetch that we had all necessary submodule commits to check out the superproject commits up to "old tips", we maintain the invariant that we can check out any superproject commits recursively, no? If we are doing so, especially with this series where we do the "history scan" to complete submodule commits necessary for all new commits in the superproject, regardless of the branch being checked out in the superproject, why do we still need to scan the index to ensure that the current checkout can recurse down the submodule without hitting a missing commit? The only case the "index scan" may make a difference is when the assumption, the invariant that we can check out any superproject commits recursively, did not hold before we started the fetch, no?
Jonathan Tan <jonathantanmy@google.com> writes: >> @@ -1495,6 +1499,15 @@ get_fetch_task(struct submodule_parallel_fetch *spf, >> if (!task) >> continue; >> >> + /* >> + * We might have already considered this submodule >> + * because we saw it when iterating the changed >> + * submodule names. >> + */ >> + if (string_list_lookup(&spf->seen_submodule_names, >> + task->sub->name)) >> + continue; > > [snip] >> + /* >> + * We might have already considered this submodule >> + * because we saw it in the index. >> + */ >> + if (string_list_lookup(&spf->seen_submodule_names, item.string)) >> + continue; > > Hmm...it's odd that the checks happen in both places, when theoretically > we would do one after the other, so this check would only need to be in > one place. Maybe this is because of how we had to implement it (looping > over everything every time when we get the next fetch task) but if it's > easy to avoid, that would be great. Yes, in order for the code to be correct, we only need this check once, but I chose to check twice for defensiveness. That is, we avoid creating implicit dependencies between the functions like "function A does not consider whether a submodule has already been fetched, so it must always be called before function B". Perhaps there is another concern that overrides this? e.g. performance.
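[Editor's illustration] The "guard in both producers" defensiveness Glen describes can be modeled with a toy seen-set — a linear-scan array standing in for git's sorted `string_list`; all names below are illustrative:

```c
#include <assert.h>
#include <string.h>

/*
 * Toy "seen" set: a linear scan over a fixed-size array. The real code
 * uses git's string_list; only the double-guard idea matters here.
 */
struct seen {
	const char *names[16];
	int nr;
};

static int seen_lookup(const struct seen *s, const char *name)
{
	for (int i = 0; i < s->nr; i++)
		if (!strcmp(s->names[i], name))
			return 1;
	return 0;
}

/*
 * What each producer does before emitting a task: skip names the other
 * producer already claimed, otherwise claim the name ourselves.
 */
static int producer_take(struct seen *s, const char *name)
{
	if (seen_lookup(s, name))
		return 0;	/* already handled by the other producer */
	s->names[s->nr++] = name;
	return 1;
}
```

Because both producers perform the same lookup-then-claim step, neither depends on being called first — the property Glen is defending — at the cost of one redundant lookup per submodule.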
Glen Choo <chooglen@google.com> writes: > Teach "git fetch" to fetch cloned, changed submodules regardless of > whether they are populated (this is in addition to the current behavior > of fetching populated submodules). Reviewers (and myself) have rightfully asked why "git fetch" should continue to bother looking for submodules in the index if it already fetches all of the changed submodules. The reasons for this are twofold: 1. The primary reason is that --recurse-submodules, aka --recurse-submodules=yes, does an unconditional fetch in each of the submodules regardless of whether they have been changed by a superproject commit. This is the behavior exercised by e.g. t/t5526-fetch-submodules.sh:101: test_expect_success "fetch --recurse-submodules recurses into submodules" ' # Creates commits in the submodules but NOT the superproject add_upstream_commit && ( cd downstream && git fetch --recurse-submodules >../actual.out 2>../actual.err ) && test_must_be_empty actual.out && # Assert that the new submodule commits have been fetched and that # no superproject commit was fetched verify_fetch_result actual.err ' Thus, we continue to check the index to implement this unconditional fetching behavior. 2. In the --recurse-submodules=on-demand case, it can be correct to ignore the index because "on-demand" only requires us to fetch changed submodules. But in the event that a submodule is both changed and populated, we may prefer to read the index instead of the superproject commit, because the contents of the index are more obvious and more actionable to the user. For example, we print the path of the submodule when attempting to fetch a submodule for debugging purposes: - For a populated submodule, we print "Fetching submodule <path>" - For an unpopulated submodule, we print "Fetching submodule <path> at commit foo" Presumably, the user would prefer to see the "populated submodule" message because it's easier to work with, e.g. 
"git -C <path> <fix-the-problem>" instead of "git checkout --recurse-submodules <commit-with-submodule> && git <fix-the-problem>". The latter is not a sufficient reason to read the index and then the changed submodule list (because we could try to read the changed submodule configuration from the index), but since we need to support --recurse-submodules=yes, this implementation is convenient for achieving both goals.
Jonathan Tan <jonathantanmy@google.com> writes: > Glen Choo <chooglen@google.com> writes: >> submodule.c has a seemingly-unrelated change that teaches the "find >> changed submodules" rev walk to call is_repository_shallow(). This fixes >> what I believe is a legitimate bug - the rev walk would fail on a >> shallow repo. >> >> Our test suite did not catch this prior to this commit because we skip >> the rev walk if .gitmodules is not found, and thus the test suite did >> not attempt the rev walk on a shallow clone. After this commit, >> we always attempt to find changed submodules (regardless of whether >> there is a .gitmodules file), and the test suite noticed the bug. > > Is this bug present without the other code introduced in this patch? If > yes, it's better to put the bugfix in a separate patch with a test that > would have failed but now passes. Makes sense, I'll do so. >> @@ -1273,10 +1277,6 @@ static void calculate_changed_submodule_paths(struct repository *r, >> struct strvec argv = STRVEC_INIT; >> struct string_list_item *name; >> >> - /* No need to check if there are no submodules configured */ >> - if (!submodule_from_path(r, NULL, NULL)) >> - return; > > I think this is removed because "no submodules configured" here actually > means "no submodules configured in the index", but submodules may be > configured in the superproject commits we're fetching. > > I wonder if this should be mentioned in the commit message, but I'm OK > either way. Yes, your interpretation is correct. Though, as Junio mentioned in <xmqqtud6e3r8.fsf@gitster.g>, I think we'd prefer to have _some kind_ of check, even though this one no longer makes sense. > >> struct submodule_parallel_fetch { >> - int count; >> + int index_count; >> + int changed_count; > > Here (and elsewhere) we're checking both the index and the superproject > commits for .gitmodules. Do we still need to check the index? 
Since this is a frequently asked question, I answered this elsewhere, namely <kl6lczjp7nwj.fsf@chooglen-macbookpro.roam.corp.google.com>. >> +# Cleans up after tests that checkout branches other than the main ones >> +# in the tests. >> +checkout_main_branches() { >> + git -C downstream checkout --recurse-submodules super && >> + git -C downstream/submodule checkout --recurse-submodules sub && >> + git -C downstream/submodule/subdir/deepsubmodule checkout --recurse-submodules deep >> +} > > If we need to clean up in this way, I think it's better if we store a > pristine copy somewhere (e.g. pristine-downstream), delete downstream, > and copy it over when we need to. The need for cleanup isn't that big; this just checks out the right branches after we've checked out _other_ branches. If we remove the checkout, we won't need this any more, and... >> + # Fetch the new superproject commit >> + ( >> + cd downstream && >> + git switch --recurse-submodules no-submodules && >> + git fetch --recurse-submodules=on-demand >../actual.out 2>../actual.err && >> + git checkout --recurse-submodules origin/super 2>../actual-checkout.err > > This patch set is about fetching, so the checkout here seems odd. To > verify that the fetch happened successfully, I think that we should > obtain the hashes of the commits that we expect to be fetched from > upstream, and then verify that they are present downstream. IIUC this feedback correctly, the checkout is just an indirect way of checking if we have the commit, so it makes more sense to just check if we have the commit. But explicitly checking for the commit (with "git cat-file -e" I assume?) is probably overkill - verify_fetch_result() already checks for this by grep-ing the output of "git fetch". So I think it's ok to drop the checkout and not check for the commit (beyond verify_fetch_result()). >> +# Test that we can fetch submodules in other branches by running fetch >> +# in a branch that has no submodules. 
>> +test_expect_success 'setup downstream branch without submodules' ' >> + ( >> + cd downstream && >> + git checkout --recurse-submodules -b no-submodules && >> + rm .gitmodules && >> + git rm submodule && >> + git add .gitmodules && >> + git commit -m "no submodules" && >> + git checkout --recurse-submodules super >> + ) >> +' > > The tip of the branch indeed doesn't have any submodules, but when > fetching this branch, we might end up fetching some of the tip's > ancestors (depending on the repo we're fetching into), which do have > submodules. If we need a branch without submodules, I think that all > ancestors should not have submodules too. > > That might be an argument for creating our own downstream and upstream > repos instead of reusing the existing ones. I think I just made a silly wording error, I meant a "commit" or "working tree state" without submodules, not a branch. The behavior I wanted to test is whether or not changed submodules are fetched in the absence of submodules and .gitmodules in the index/working tree. >> +test_expect_success "'--recurse-submodules=on-demand' should fetch submodule commits if the submodule is changed but the index has no submodules" ' >> + test_when_finished "checkout_main_branches" && >> + git -C downstream fetch --recurse-submodules && >> + # Create new superproject commit with updated submodules >> + add_upstream_commit && >> + ( >> + cd submodule && >> + ( >> + cd subdir/deepsubmodule && >> + git fetch && > > Hmm...I thought submodule/subdir/deepsubmodule is upstream. Why is it > fetching? 
Ah, deepsubmodule is a submodule in the "submodule/" repo, whose remote is in "deepsubmodule/": test_expect_success setup ' mkdir deepsubmodule && ( cd deepsubmodule && git init && echo deepsubcontent > deepsubfile && git add deepsubfile && git commit -m new deepsubfile && git branch -M deep ) && mkdir submodule && ( cd submodule && git init && echo subcontent > subfile && git add subfile && git submodule add "$pwd/deepsubmodule" subdir/deepsubmodule && git commit -a -m new && git branch -M sub ) So we fetch in "submodule/subdir/deepsubmodule" to get a new deepsubmodule and (non-deep) submodule commit. Both of these commits are then used to construct a new superproject commit. If this is too confusing, maybe I should try to make the test simpler. > >> + # Assert that we can checkout the superproject commit with --recurse-submodules >> + ! grep -E "error: Submodule .+ could not be updated" actual-checkout.err > > Negative greps are error-prone, since they will also appear to work if > the message was just misspelled. We should probably check that the > expected commit is present instead. That's a good point, I hadn't considered that. >> +# Test that we properly fetch the submodules in the index as well as >> +# submodules in other branches. >> +test_expect_success 'setup downstream branch with other submodule' ' >> + mkdir submodule2 && >> + ( >> + cd submodule2 && >> + git init && >> + echo sub2content >sub2file && >> + git add sub2file && >> + git commit -a -m new && >> + git branch -M sub2 >> + ) && >> + git checkout -b super-sub2-only && >> + git submodule add "$pwd/submodule2" submodule2 && >> + git commit -m "add sub2" && >> + git checkout super && >> + ( >> + cd downstream && >> + git fetch --recurse-submodules origin && >> + git checkout super-sub2-only && >> + # Explicitly run "git submodule update" because sub2 is new >> + # and has not been cloned. 
>> + git submodule update --init && >> + git checkout --recurse-submodules super >> + ) >> +' > > I couldn't see the submodule in the index to be fetched; maybe it's > there somewhere but it's not obvious to me. If it helps, I've updated the description to: # In downstream, init "submodule2", but do not check it out while # fetching. This lets us assert that unpopulated submodules can be # fetched. The 'submodules in the index' are the existing submodules ("submodule" and "submodule/subdir/deepsubmodule"), and the changed, unpopulated submodule is "submodule2". > Also, why do we need to run "git submodule update"? This patch set > concerns itself with fetching existing submodules, not cloning new > ones. In this setup step, 'downstream' clones 'submodule2' using "git submodule update". So from the perspective of the following tests, 'submodule2' is an existing submodule. We could have cloned 'submodule2' in an earlier setup step, but it wouldn't have been needed until these tests.
diff --git a/Documentation/fetch-options.txt b/Documentation/fetch-options.txt index e967ff1874..38dad13683 100644 --- a/Documentation/fetch-options.txt +++ b/Documentation/fetch-options.txt @@ -185,15 +185,23 @@ endif::git-pull[] ifndef::git-pull[] --recurse-submodules[=yes|on-demand|no]:: This option controls if and under what conditions new commits of - populated submodules should be fetched too. It can be used as a - boolean option to completely disable recursion when set to 'no' or to - unconditionally recurse into all populated submodules when set to - 'yes', which is the default when this option is used without any - value. Use 'on-demand' to only recurse into a populated submodule - when the superproject retrieves a commit that updates the submodule's - reference to a commit that isn't already in the local submodule - clone. By default, 'on-demand' is used, unless - `fetch.recurseSubmodules` is set (see linkgit:git-config[1]). + submodules should be fetched too. When recursing through submodules, + `git fetch` always attempts to fetch "changed" submodules, that is, a + submodule that has commits that are referenced by a newly fetched + superproject commit but are missing in the local submodule clone. A + changed submodule can be fetched as long as it is present locally e.g. + in `$GIT_DIR/modules/` (see linkgit:gitsubmodules[7]); if the upstream + adds a new submodule, that submodule cannot be fetched until it is + cloned e.g. by `git submodule update`. ++ +When set to 'on-demand', only changed submodules are fetched. When set +to 'yes', all populated submodules are fetched and submodules that are +both unpopulated and changed are fetched. When set to 'no', submodules +are never fetched. ++ +When unspecified, this uses the value of `fetch.recurseSubmodules` if it +is set (see linkgit:git-config[1]), defaulting to 'on-demand' if unset. +When this option is used without any value, it defaults to 'yes'. 
endif::git-pull[] -j:: diff --git a/Documentation/git-fetch.txt b/Documentation/git-fetch.txt index 550c16ca61..e9d364669a 100644 --- a/Documentation/git-fetch.txt +++ b/Documentation/git-fetch.txt @@ -287,12 +287,10 @@ include::transfer-data-leaks.txt[] BUGS ---- -Using --recurse-submodules can only fetch new commits in already checked -out submodules right now. When e.g. upstream added a new submodule in the -just fetched commits of the superproject the submodule itself cannot be -fetched, making it impossible to check out that submodule later without -having to do a fetch again. This is expected to be fixed in a future Git -version. +Using --recurse-submodules can only fetch new commits in submodules that are +present locally e.g. in `$GIT_DIR/modules/`. If the upstream adds a new +submodule, that submodule cannot be fetched until it is cloned e.g. by `git +submodule update`. This is expected to be fixed in a future Git version. SEE ALSO -------- diff --git a/submodule.c b/submodule.c index d695dcadf4..0c02bbc9c3 100644 --- a/submodule.c +++ b/submodule.c @@ -22,6 +22,7 @@ #include "parse-options.h" #include "object-store.h" #include "commit-reach.h" +#include "shallow.h" static int config_update_recurse_submodules = RECURSE_SUBMODULES_OFF; static int initialized_fetch_ref_tips; @@ -907,6 +908,9 @@ static void collect_changed_submodules(struct repository *r, save_warning = warn_on_object_refname_ambiguity; warn_on_object_refname_ambiguity = 0; + /* make sure shallows are read */ + is_repository_shallow(the_repository); + repo_init_revisions(r, &rev, NULL); setup_revisions(argv->nr, argv->v, &rev, &s_r_opt); warn_on_object_refname_ambiguity = save_warning; @@ -1273,10 +1277,6 @@ static void calculate_changed_submodule_paths(struct repository *r, struct strvec argv = STRVEC_INIT; struct string_list_item *name; - /* No need to check if there are no submodules configured */ - if (!submodule_from_path(r, NULL, NULL)) - return; - strvec_push(&argv, "--"); /* argv[0] 
program name */ oid_array_for_each_unique(&ref_tips_after_fetch, append_oid_to_argv, &argv); @@ -1347,7 +1347,8 @@ int submodule_touches_in_range(struct repository *r, } struct submodule_parallel_fetch { - int count; + int index_count; + int changed_count; struct strvec args; struct repository *r; const char *prefix; @@ -1357,6 +1358,7 @@ struct submodule_parallel_fetch { int result; struct string_list changed_submodule_names; + struct string_list seen_submodule_names; /* Pending fetches by OIDs */ struct fetch_task **oid_fetch_tasks; @@ -1367,6 +1369,7 @@ struct submodule_parallel_fetch { #define SPF_INIT { \ .args = STRVEC_INIT, \ .changed_submodule_names = STRING_LIST_INIT_DUP, \ + .seen_submodule_names = STRING_LIST_INIT_DUP, \ .submodules_with_errors = STRBUF_INIT, \ } @@ -1481,11 +1484,12 @@ static struct repository *get_submodule_repo_for(struct repository *r, } static struct fetch_task * -get_fetch_task(struct submodule_parallel_fetch *spf, - const char **default_argv, struct strbuf *err) +get_fetch_task_from_index(struct submodule_parallel_fetch *spf, + const char **default_argv, struct strbuf *err) { - for (; spf->count < spf->r->index->cache_nr; spf->count++) { - const struct cache_entry *ce = spf->r->index->cache[spf->count]; + for (; spf->index_count < spf->r->index->cache_nr; spf->index_count++) { + const struct cache_entry *ce = + spf->r->index->cache[spf->index_count]; struct fetch_task *task; if (!S_ISGITLINK(ce->ce_mode)) @@ -1495,6 +1499,15 @@ get_fetch_task(struct submodule_parallel_fetch *spf, if (!task) continue; + /* + * We might have already considered this submodule + * because we saw it when iterating the changed + * submodule names. 
+ */ + if (string_list_lookup(&spf->seen_submodule_names, + task->sub->name)) + continue; + switch (get_fetch_recurse_config(task->sub, spf)) { default: @@ -1542,7 +1555,69 @@ get_fetch_task(struct submodule_parallel_fetch *spf, strbuf_addf(err, _("Fetching submodule %s%s\n"), spf->prefix, ce->name); - spf->count++; + spf->index_count++; + return task; + } + return NULL; +} + +static struct fetch_task * +get_fetch_task_from_changed(struct submodule_parallel_fetch *spf, + const char **default_argv, struct strbuf *err) +{ + for (; spf->changed_count < spf->changed_submodule_names.nr; + spf->changed_count++) { + struct string_list_item item = + spf->changed_submodule_names.items[spf->changed_count]; + struct changed_submodule_data *cs_data = item.util; + struct fetch_task *task; + + /* + * We might have already considered this submodule + * because we saw it in the index. + */ + if (string_list_lookup(&spf->seen_submodule_names, item.string)) + continue; + + task = fetch_task_create(spf->r, cs_data->path, + cs_data->super_oid); + if (!task) + continue; + + switch (get_fetch_recurse_config(task->sub, spf)) { + default: + case RECURSE_SUBMODULES_DEFAULT: + case RECURSE_SUBMODULES_ON_DEMAND: + *default_argv = "on-demand"; + break; + case RECURSE_SUBMODULES_ON: + *default_argv = "yes"; + break; + case RECURSE_SUBMODULES_OFF: + continue; + } + + task->repo = get_submodule_repo_for(spf->r, task->sub->path, + cs_data->super_oid); + if (!task->repo) { + fetch_task_release(task); + free(task); + + strbuf_addf(err, _("Could not access submodule '%s'\n"), + cs_data->path); + continue; + } + if (!is_tree_submodule_active(spf->r, cs_data->super_oid, + task->sub->path)) + continue; + + if (!spf->quiet) + strbuf_addf(err, + _("Fetching submodule %s%s at commit %s\n"), + spf->prefix, task->sub->path, + find_unique_abbrev(cs_data->super_oid, + DEFAULT_ABBREV)); + spf->changed_count++; return task; } return NULL; @@ -1553,7 +1628,10 @@ static int get_next_submodule(struct child_process 
*cp, struct strbuf *err, { struct submodule_parallel_fetch *spf = data; const char *default_argv = NULL; - struct fetch_task *task = get_fetch_task(spf, &default_argv, err); + struct fetch_task *task = + get_fetch_task_from_index(spf, &default_argv, err); + if (!task) + task = get_fetch_task_from_changed(spf, &default_argv, err); if (task) { struct strbuf submodule_prefix = STRBUF_INIT; @@ -1573,6 +1651,7 @@ static int get_next_submodule(struct child_process *cp, struct strbuf *err, *task_cb = task; strbuf_release(&submodule_prefix); + string_list_insert(&spf->seen_submodule_names, task->sub->name); return 1; } diff --git a/t/t5526-fetch-submodules.sh b/t/t5526-fetch-submodules.sh index cb18f0ac21..f37dca4e09 100755 --- a/t/t5526-fetch-submodules.sh +++ b/t/t5526-fetch-submodules.sh @@ -399,6 +399,223 @@ test_expect_success "'--recurse-submodules=on-demand' recurses as deep as necess verify_fetch_result actual.err ' +# Cleans up after tests that checkout branches other than the main ones +# in the tests. +checkout_main_branches() { + git -C downstream checkout --recurse-submodules super && + git -C downstream/submodule checkout --recurse-submodules sub && + git -C downstream/submodule/subdir/deepsubmodule checkout --recurse-submodules deep +} + +# Test that we can fetch submodules in other branches by running fetch +# in a branch that has no submodules. 
+test_expect_success 'setup downstream branch without submodules' ' + ( + cd downstream && + git checkout --recurse-submodules -b no-submodules && + rm .gitmodules && + git rm submodule && + git add .gitmodules && + git commit -m "no submodules" && + git checkout --recurse-submodules super + ) +' + +test_expect_success "'--recurse-submodules=on-demand' should fetch submodule commits if the submodule is changed but the index has no submodules" ' + test_when_finished "checkout_main_branches" && + git -C downstream fetch --recurse-submodules && + # Create new superproject commit with updated submodules + add_upstream_commit && + ( + cd submodule && + ( + cd subdir/deepsubmodule && + git fetch && + git checkout -q FETCH_HEAD + ) && + git add subdir/deepsubmodule && + git commit -m "new deep submodule" + ) && + git add submodule && + git commit -m "new submodule" && + + # Fetch the new superproject commit + ( + cd downstream && + git switch --recurse-submodules no-submodules && + git fetch --recurse-submodules=on-demand >../actual.out 2>../actual.err && + git checkout --recurse-submodules origin/super 2>../actual-checkout.err + ) && + test_must_be_empty actual.out && + git rev-parse --short HEAD >superhead && + git -C submodule rev-parse --short HEAD >subhead && + git -C deepsubmodule rev-parse --short HEAD >deephead && + verify_fetch_result actual.err && + + # Assert that the fetch happened at the non-HEAD commits + grep "Fetching submodule submodule at commit $superhead" actual.err && + grep "Fetching submodule submodule/subdir/deepsubmodule at commit $subhead" actual.err && + + # Assert that we can checkout the superproject commit with --recurse-submodules + ! grep -E "error: Submodule .+ could not be updated" actual-checkout.err +' + +test_expect_success "'--recurse-submodules' should fetch submodule commits if the submodule is changed but the index has no submodules" ' + test_when_finished "checkout_main_branches" && + # Fetch any leftover commits from other tests. 
+ git -C downstream fetch --recurse-submodules && + # Create new superproject commit with updated submodules + add_upstream_commit && + ( + cd submodule && + ( + cd subdir/deepsubmodule && + git fetch && + git checkout -q FETCH_HEAD + ) && + git add subdir/deepsubmodule && + git commit -m "new deep submodule" + ) && + git add submodule && + git commit -m "new submodule" && + + # Fetch the new superproject commit + ( + cd downstream && + git switch --recurse-submodules no-submodules && + git fetch --recurse-submodules >../actual.out 2>../actual.err && + git checkout --recurse-submodules origin/super 2>../actual-checkout.err + ) && + test_must_be_empty actual.out && + git rev-parse --short HEAD >superhead && + git -C submodule rev-parse --short HEAD >subhead && + git -C deepsubmodule rev-parse --short HEAD >deephead && + verify_fetch_result actual.err && + + # Assert that the fetch happened at the non-HEAD commits + grep "Fetching submodule submodule at commit $superhead" actual.err && + grep "Fetching submodule submodule/subdir/deepsubmodule at commit $subhead" actual.err && + + # Assert that we can checkout the superproject commit with --recurse-submodules + ! grep -E "error: Submodule .+ could not be updated" actual-checkout.err +' + +test_expect_success "'--recurse-submodules' should ignore changed, inactive submodules" ' + test_when_finished "checkout_main_branches" && + # Fetch any leftover commits from other tests. 
+ git -C downstream fetch --recurse-submodules && + # Create new superproject commit with updated submodules + add_upstream_commit && + ( + cd submodule && + ( + cd subdir/deepsubmodule && + git fetch && + git checkout -q FETCH_HEAD + ) && + git add subdir/deepsubmodule && + git commit -m "new deep submodule" + ) && + git add submodule && + git commit -m "new submodule" && + + # Fetch the new superproject commit + ( + cd downstream && + git switch --recurse-submodules no-submodules && + git -c submodule.submodule.active=false fetch --recurse-submodules >../actual.out 2>../actual.err + ) && + test_must_be_empty actual.out && + git rev-parse --short HEAD >superhead && + # Neither should be fetched because the submodule is inactive + rm subhead && + rm deephead && + verify_fetch_result actual.err +' + +# Test that we properly fetch the submodules in the index as well as +# submodules in other branches. +test_expect_success 'setup downstream branch with other submodule' ' + mkdir submodule2 && + ( + cd submodule2 && + git init && + echo sub2content >sub2file && + git add sub2file && + git commit -a -m new && + git branch -M sub2 + ) && + git checkout -b super-sub2-only && + git submodule add "$pwd/submodule2" submodule2 && + git commit -m "add sub2" && + git checkout super && + ( + cd downstream && + git fetch --recurse-submodules origin && + git checkout super-sub2-only && + # Explicitly run "git submodule update" because sub2 is new + # and has not been cloned. + git submodule update --init && + git checkout --recurse-submodules super + ) +' + +test_expect_success "'--recurse-submodules' should fetch submodule commits in changed submodules and the index" ' + test_when_finished "checkout_main_branches" && + # Fetch any leftover commits from other tests. 
+ git -C downstream fetch --recurse-submodules && + # Create new commit in origin/super + add_upstream_commit && + ( + cd submodule && + ( + cd subdir/deepsubmodule && + git fetch && + git checkout -q FETCH_HEAD + ) && + git add subdir/deepsubmodule && + git commit -m "new deep submodule" + ) && + git add submodule && + git commit -m "new submodule" && + + # Create new commit in origin/super-sub2-only + git checkout super-sub2-only && + ( + cd submodule2 && + test_commit --no-tag foo + ) && + git add submodule2 && + git commit -m "new submodule2" && + + git checkout super && + ( + cd downstream && + git fetch --recurse-submodules >../actual.out 2>../actual.err && + git checkout --recurse-submodules origin/super-sub2-only 2>../actual-checkout.err + ) && + test_must_be_empty actual.out && + + # Assert that the submodules in the super branch are fetched + git rev-parse --short HEAD >superhead && + git -C submodule rev-parse --short HEAD >subhead && + git -C deepsubmodule rev-parse --short HEAD >deephead && + verify_fetch_result actual.err && + # Assert that submodule is read from the index, not from a commit + ! grep "Fetching submodule submodule at commit" actual.err && + + # Assert that super-sub2-only and submodule2 were fetched even + # though another branch is checked out + super_sub2_only_head=$(git rev-parse --short super-sub2-only) && + grep -E "\.\.${super_sub2_only_head}\s+super-sub2-only\s+-> origin/super-sub2-only" actual.err && + grep "Fetching submodule submodule2 at commit $super_sub2_only_head" actual.err && + sub2head=$(git -C submodule2 rev-parse --short HEAD) && + grep -E "\.\.${sub2head}\s+sub2\s+-> origin/sub2" actual.err && + + # Assert that we can checkout the superproject commit with --recurse-submodules + ! 
grep -E "error: Submodule .+ could not be updated" actual-checkout.err +' + test_expect_success "'--recurse-submodules=on-demand' stops when no new submodule commits are found in the superproject (and ignores config)" ' add_upstream_commit && echo a >> file &&
"git fetch --recurse-submodules" only considers populated submodules (i.e. submodules that can be found by iterating the index), which makes "git fetch" behave differently depending on which commit is checked out. As a result, even if the user has initialized all submodules correctly, they may not fetch the necessary submodule commits, and commands like "git checkout --recurse-submodules" might fail.

Teach "git fetch" to fetch cloned, changed submodules regardless of whether they are populated (this is in addition to the current behavior of fetching populated submodules). Since a submodule may be encountered multiple times (via the list of populated submodules or via the list of changed submodules), maintain a list of seen submodules to avoid fetching a submodule more than once.

Signed-off-by: Glen Choo <chooglen@google.com>
---
submodule.c has a seemingly unrelated change that teaches the "find changed submodules" rev walk to call is_repository_shallow(). This fixes what I believe is a legitimate bug: the rev walk would fail on a shallow repo. Our test suite did not catch this prior to this commit because we skip the rev walk if .gitmodules is not found, and thus the test suite never attempted the rev walk on a shallow clone. After this commit, we always attempt to find changed submodules (regardless of whether there is a .gitmodules file), and the test suite noticed the bug.

 Documentation/fetch-options.txt |  26 ++--
 Documentation/git-fetch.txt     |  10 +-
 submodule.c                     | 101 +++++++++++++--
 t/t5526-fetch-submodules.sh     | 217 ++++++++++++++++++++++++++++++++
 4 files changed, 328 insertions(+), 26 deletions(-)