Message ID | 20220210044152.78352-8-chooglen@google.com (mailing list archive) |
---|---|
State | Superseded |
Series | fetch --recurse-submodules: fetch unpopulated submodules |
Glen Choo <chooglen@google.com> writes: > @@ -1273,10 +1277,6 @@ static void calculate_changed_submodule_paths(struct repository *r, > struct strvec argv = STRVEC_INIT; > struct string_list_item *name; > > - /* No need to check if there are no submodules configured */ > - if (!submodule_from_path(r, NULL, NULL)) > - return; > - It looks to me that this hunk reverts 18322bad (fetch: skip on-demand checking when no submodules are configured, 2011-09-09), which tried to avoid high cost computation when we know there is no submodule. Intended? Perhaps it should be replaced with an equivalent check that (1) still says "we do care about submodules" even if the current checkout has no submodules (i.e. ls-files shows no gitlinks), but (2) says "no, there is nothing interesting" when $GIT_COMMON_DIR/modules/ is empty or some other cheap check we can use? > +get_fetch_task_from_index(struct submodule_parallel_fetch *spf, > + const char **default_argv, struct strbuf *err) > { > - for (; spf->count < spf->r->index->cache_nr; spf->count++) { > - const struct cache_entry *ce = spf->r->index->cache[spf->count]; > + for (; spf->index_count < spf->r->index->cache_nr; spf->index_count++) { > + const struct cache_entry *ce = > + spf->r->index->cache[spf->index_count]; > struct fetch_task *task; > > if (!S_ISGITLINK(ce->ce_mode)) > @@ -1495,6 +1499,15 @@ get_fetch_task(struct submodule_parallel_fetch *spf, > if (!task) > continue; > > + /* > + * We might have already considered this submodule > + * because we saw it when iterating the changed > + * submodule names. 
> + */ > + if (string_list_lookup(&spf->seen_submodule_names, > + task->sub->name)) > + continue; > + > switch (get_fetch_recurse_config(task->sub, spf)) > { > default: > @@ -1542,7 +1555,69 @@ get_fetch_task(struct submodule_parallel_fetch *spf, > strbuf_addf(err, _("Fetching submodule %s%s\n"), > spf->prefix, ce->name); > > - spf->count++; > + spf->index_count++; > + return task; > + } > + return NULL; > +} Sorry, but I am confused. If we are gathering which submodules to fetch from the changes to gitlinks in the range of superproject changes, why do we even need to scan the index (i.e. the current checkout in the superproject) to begin with? If it was changed, we'd know get_fetch_task_from_changed() would take care of it, and if there was no change to the submodule between the superproject's commits before and after the fetch, there is nothing gained from fetching in the submodules, no? > +static struct fetch_task * > +get_fetch_task_from_changed(struct submodule_parallel_fetch *spf, > + const char **default_argv, struct strbuf *err) > +{ > @@ -1553,7 +1628,10 @@ static int get_next_submodule(struct child_process *cp, struct strbuf *err, > { > struct submodule_parallel_fetch *spf = data; > const char *default_argv = NULL; > - struct fetch_task *task = get_fetch_task(spf, &default_argv, err); > + struct fetch_task *task = > + get_fetch_task_from_index(spf, &default_argv, err); > + if (!task) > + task = get_fetch_task_from_changed(spf, &default_argv, err); Hmph, interesting. So if "from index" grabbed some submodules already, then the "from the changes in the superproject, we know these submodules need refreshing" does not happen at all? I am afraid that I am still not following this...
Glen Choo <chooglen@google.com> writes: > submodule.c has a seemingly-unrelated change that teaches the "find > changed submodules" rev walk to call is_repository_shallow(). This fixes > what I believe is a legitimate bug - the rev walk would fail on a > shallow repo. > > Our test suite did not catch this prior to this commit because we skip > the rev walk if .gitmodules is not found, and thus the test suite did > not attempt the rev walk on a shallow clone. After this commit, > we always attempt to find changed submodules (regardless of whether > there is a .gitmodules file), and the test suite noticed the bug. Is this bug present without the other code introduced in this patch? If yes, it's better to put the bugfix in a separate patch with a test that would have failed but now passes. Some more high-level comments: > @@ -1273,10 +1277,6 @@ static void calculate_changed_submodule_paths(struct repository *r, > struct strvec argv = STRVEC_INIT; > struct string_list_item *name; > > - /* No need to check if there are no submodules configured */ > - if (!submodule_from_path(r, NULL, NULL)) > - return; I think this is removed because "no submodules configured" here actually means "no submodules configured in the index", but submodules may be configured in the superproject commits we're fetching. I wonder if this should be mentioned in the commit message, but I'm OK either way. > struct submodule_parallel_fetch { > - int count; > + int index_count; > + int changed_count; Here (and elsewhere) we're checking both the index and the superproject commits for .gitmodules. Do we still need to check the index? > @@ -1495,6 +1499,15 @@ get_fetch_task(struct submodule_parallel_fetch *spf, > if (!task) > continue; > > + /* > + * We might have already considered this submodule > + * because we saw it when iterating the changed > + * submodule names. 
> + */ > + if (string_list_lookup(&spf->seen_submodule_names, > + task->sub->name)) > + continue; [snip] > + /* > + * We might have already considered this submodule > + * because we saw it in the index. > + */ > + if (string_list_lookup(&spf->seen_submodule_names, item.string)) > + continue; Hmm...it's odd that the checks happen in both places, when theoretically we would do one after the other, so this check would only need to be in one place. Maybe this is because of how we had to implement it (looping over everything every time when we get the next fetch task) but if it's easy to avoid, that would be great. > +# Cleans up after tests that checkout branches other than the main ones > +# in the tests. > +checkout_main_branches() { > + git -C downstream checkout --recurse-submodules super && > + git -C downstream/submodule checkout --recurse-submodules sub && > + git -C downstream/submodule/subdir/deepsubmodule checkout --recurse-submodules deep > +} If we need to clean up in this way, I think it's better if we store a pristine copy somewhere (e.g. pristine-downstream), delete downstream, and copy it over when we need to. > +# Test that we can fetch submodules in other branches by running fetch > +# in a branch that has no submodules. > +test_expect_success 'setup downstream branch without submodules' ' > + ( > + cd downstream && > + git checkout --recurse-submodules -b no-submodules && > + rm .gitmodules && > + git rm submodule && > + git add .gitmodules && > + git commit -m "no submodules" && > + git checkout --recurse-submodules super > + ) > +' The tip of the branch indeed doesn't have any submodules, but when fetching this branch, we might end up fetching some of the tip's ancestors (depending on the repo we're fetching into), which do have submodules. If we need a branch without submodules, I think that all ancestors should not have submodules too. 
That might be an argument for creating our own downstream and upstream repos instead of reusing the existing ones. > +test_expect_success "'--recurse-submodules=on-demand' should fetch submodule commits if the submodule is changed but the index has no submodules" ' > + test_when_finished "checkout_main_branches" && > + git -C downstream fetch --recurse-submodules && > + # Create new superproject commit with updated submodules > + add_upstream_commit && > + ( > + cd submodule && > + ( > + cd subdir/deepsubmodule && > + git fetch && Hmm...I thought submodule/subdir/deepsubmodule is upstream. Why is it fetching? > + # Fetch the new superproject commit > + ( > + cd downstream && > + git switch --recurse-submodules no-submodules && > + git fetch --recurse-submodules=on-demand >../actual.out 2>../actual.err && > + git checkout --recurse-submodules origin/super 2>../actual-checkout.err This patch set is about fetching, so the checkout here seems odd. To verify that the fetch happened successfully, I think that we should obtain the hashes of the commits that we expect to be fetched from upstream, and then verify that they are present downstream. > + # Assert that we can checkout the superproject commit with --recurse-submodules > + ! grep -E "error: Submodule .+ could not be updated" actual-checkout.err Negative greps are error-prone, since they will also appear to work if the message was just misspelled. We should probably check that the expected commit is present instead. > +# Test that we properly fetch the submodules in the index as well as > +# submodules in other branches. 
> +test_expect_success 'setup downstream branch with other submodule' ' > + mkdir submodule2 && > + ( > + cd submodule2 && > + git init && > + echo sub2content >sub2file && > + git add sub2file && > + git commit -a -m new && > + git branch -M sub2 > + ) && > + git checkout -b super-sub2-only && > + git submodule add "$pwd/submodule2" submodule2 && > + git commit -m "add sub2" && > + git checkout super && > + ( > + cd downstream && > + git fetch --recurse-submodules origin && > + git checkout super-sub2-only && > + # Explicitly run "git submodule update" because sub2 is new > + # and has not been cloned. > + git submodule update --init && > + git checkout --recurse-submodules super > + ) > +' I couldn't see the submodule in the index to be fetched; maybe it's there somewhere but it's not obvious to me. Also, why do we need to run "git submodule update"? This patch set concerns itself with fetching existing submodules, not cloning new ones. > +test_expect_success "'--recurse-submodules' should fetch submodule commits in changed submodules and the index" ' Same comment about where in the index is the submodule to be fetched.
Junio C Hamano <gitster@pobox.com> writes: >> +static struct fetch_task * >> +get_fetch_task_from_changed(struct submodule_parallel_fetch *spf, >> + const char **default_argv, struct strbuf *err) >> +{ > >> @@ -1553,7 +1628,10 @@ static int get_next_submodule(struct child_process *cp, struct strbuf *err, >> { >> struct submodule_parallel_fetch *spf = data; >> const char *default_argv = NULL; >> - struct fetch_task *task = get_fetch_task(spf, &default_argv, err); >> + struct fetch_task *task = >> + get_fetch_task_from_index(spf, &default_argv, err); >> + if (!task) >> + task = get_fetch_task_from_changed(spf, &default_argv, err); > > Hmph, interesting. So if "from index" grabbed some submodules > already, then the "from the changes in the superproject, we know > these submodules need refreshing" does not happen at all? I am afraid > that I am still not following this... Hm, perhaps the following will help: - get_next_submodule() is an iterator, specifically, it is a get_next_task_fn passed to run_processes_parallel_tr2(). It gets called until it is exhausted. - Since get_next_submodule() is an iterator, I've implemented get_fetch_task_from_index() and get_fetch_task_from_changed() as iterators (they return NULL when they are exhausted). So in practice: - We repeatedly call get_next_submodule(), which tries to get a fetch task by calling the get_fetch_task_* functions. - If get_fetch_task_from_index() returns non-NULL, get_next_submodule() uses that fetch task. - Eventually, we will have considered every submodule in the index. At that point, get_fetch_task_from_index() is exhausted and always returns NULL. - Since get_fetch_task_from_index() returns NULL, get_next_submodule() now gets its fetch tasks from get_fetch_task_from_changed(). - Eventually, we will also have considered every changed submodule, and get_fetch_task_from_changed() is exhausted. - get_next_submodule() has now been exhausted and we are done.
As for the other questions, I'll dig a bit deeper before getting back to you with answers.
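[Editor's illustration] Glen's description is the classic resumable-generator shape. A toy model of it — integers standing in for index entries and changed-submodule names, every identifier made up for illustration rather than taken from git's real API — might look like:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Miniature version of the pattern described above: each producer
 * keeps its cursor in shared state, skips uninteresting entries, and
 * returns one item per call, or NULL once exhausted. Negative values
 * play the role of filtered-out entries (like non-gitlink index
 * entries, or submodules with recursion disabled).
 */
struct spf_sketch {
	const int *index_items;   int index_nr;   int index_count;
	const int *changed_items; int changed_nr; int changed_count;
};

static const int *task_from_index(struct spf_sketch *s)
{
	for (; s->index_count < s->index_nr; s->index_count++) {
		const int *item = &s->index_items[s->index_count];
		if (*item < 0)
			continue;	/* filtered, like !S_ISGITLINK(...) */
		s->index_count++;	/* resume after this entry next call */
		return item;
	}
	return NULL;
}

static const int *task_from_changed(struct spf_sketch *s)
{
	for (; s->changed_count < s->changed_nr; s->changed_count++) {
		const int *item = &s->changed_items[s->changed_count];
		if (*item < 0)
			continue;
		s->changed_count++;
		return item;
	}
	return NULL;
}

/* Shaped like get_next_submodule(): drain one generator, then the other. */
static const int *next_task(struct spf_sketch *s)
{
	const int *task = task_from_index(s);
	if (!task)
		task = task_from_changed(s);
	return task;
}
```

The two signs Junio mentions are both visible here: the loops start from a saved cursor rather than zero, and they return as soon as they find one acceptable item.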
Glen Choo <chooglen@google.com> writes: > Junio C Hamano <gitster@pobox.com> writes: > >>> +static struct fetch_task * >>> +get_fetch_task_from_changed(struct submodule_parallel_fetch *spf, >>> + const char **default_argv, struct strbuf *err) >>> +{ >> >>> @@ -1553,7 +1628,10 @@ static int get_next_submodule(struct child_process *cp, struct strbuf *err, >>> { >>> struct submodule_parallel_fetch *spf = data; >>> const char *default_argv = NULL; >>> - struct fetch_task *task = get_fetch_task(spf, &default_argv, err); >>> + struct fetch_task *task = >>> + get_fetch_task_from_index(spf, &default_argv, err); >>> + if (!task) >>> + task = get_fetch_task_from_changed(spf, &default_argv, err); >> >> Hmph, interesting. So if "from index" grabbed some submodules >> already, then the "from the changes in the superproject, we know >> these submodules need refreshing" does not happen at all? I am afraid >> that I am still not following this... > > Hm, perhaps the following will help: > > - get_next_submodule() is an iterator, specifically, it is a > get_next_task_fn passed to run_processes_parallel_tr2(). It gets > called until it is exhausted. Ahh, yeah, I totally forgot how we designed these things to work. Even though these functions have a loop, (1) they start iterating at the point where they left off in the last call, and (2) they return as soon as they find the first item in the loop, which should have stood out as a typical generator pattern, but somehow I missed these signs. > So in practice: > ... > - get_next_submodule() has now been exhausted and we are done. But my original question (based on my misunderstanding that a single call to these would grab all submodules that need fetching by inspecting either the index or the history) still stands, doesn't it? Presumably the "history scan" part is because we assume that we already had all the necessary submodule commits to check out any superproject commits before this recursive fetch started.
That is the reason why we do not scan the history behind the "old tips". We inspect only the history newer than them, leading to the "new tips", and try to grab all submodule commits that newly appear, to ensure that we can check out all the superproject commits we just obtained and have no missing submodule commits necessary to do so. Combined with the assumption on the state before this fetch that we had all necessary submodule commits to check out the superproject commits up to "old tips", we maintain the invariant that we can check out any superproject commits recursively, no? If we are doing so, especially with this series where we do the "history scan" to complete submodule commits necessary for all new commits in the superproject, regardless of the branch being checked out in the superproject, why do we still need to scan the index to ensure that the current checkout can recurse down the submodule without hitting a missing commit? The only case the "index scan" may make a difference is when the assumption, the invariant that we can check out any superproject commits recursively, did not hold before we started the fetch, no?
Jonathan Tan <jonathantanmy@google.com> writes: >> @@ -1495,6 +1499,15 @@ get_fetch_task(struct submodule_parallel_fetch *spf, >> if (!task) >> continue; >> >> + /* >> + * We might have already considered this submodule >> + * because we saw it when iterating the changed >> + * submodule names. >> + */ >> + if (string_list_lookup(&spf->seen_submodule_names, >> + task->sub->name)) >> + continue; > > [snip] >> + /* >> + * We might have already considered this submodule >> + * because we saw it in the index. >> + */ >> + if (string_list_lookup(&spf->seen_submodule_names, item.string)) >> + continue; > > Hmm...it's odd that the checks happen in both places, when theoretically > we would do one after the other, so this check would only need to be in > one place. Maybe this is because of how we had to implement it (looping > over everything every time when we get the next fetch task) but if it's > easy to avoid, that would be great. Yes, in order for the code to be correct, we only need this check once, but I chose to check twice for defensiveness. That is, we avoid creating implicit dependencies between the functions like "function A does not consider whether a submodule has already been fetched, so it must always be called before function B". Perhaps there is another concern that overrides this? e.g. performance.
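[Editor's illustration] The "guard in both producers" defensiveness Glen describes can be modeled with a toy seen-set — a linear-scan array standing in for git's sorted `string_list`; all names below are illustrative:

```c
#include <assert.h>
#include <string.h>

/*
 * Toy "seen" set: a linear scan over a fixed-size array. The real code
 * uses git's string_list; only the double-guard idea matters here.
 */
struct seen {
	const char *names[16];
	int nr;
};

static int seen_lookup(const struct seen *s, const char *name)
{
	for (int i = 0; i < s->nr; i++)
		if (!strcmp(s->names[i], name))
			return 1;
	return 0;
}

/*
 * What each producer does before emitting a task: skip names the other
 * producer already claimed, otherwise claim the name ourselves.
 */
static int producer_take(struct seen *s, const char *name)
{
	if (seen_lookup(s, name))
		return 0;	/* already handled by the other producer */
	s->names[s->nr++] = name;
	return 1;
}
```

Because both producers perform the same lookup-then-claim step, neither depends on being called first — the property Glen is defending — at the cost of one redundant lookup per submodule.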
Glen Choo <chooglen@google.com> writes: > Teach "git fetch" to fetch cloned, changed submodules regardless of > whether they are populated (this is in addition to the current behavior > of fetching populated submodules). Reviewers (and myself) have rightfully asked why "git fetch" should continue to bother looking for submodules in the index if it already fetches all of the changed submodules. The reasons for this are twofold: 1. The primary reason is that --recurse-submodules, aka --recurse-submodules=yes, does an unconditional fetch in each of the submodules regardless of whether they have been changed by a superproject commit. This is the behavior exercised by e.g. t/t5526-fetch-submodules.sh:101: test_expect_success "fetch --recurse-submodules recurses into submodules" ' # Creates commits in the submodules but NOT the superproject add_upstream_commit && ( cd downstream && git fetch --recurse-submodules >../actual.out 2>../actual.err ) && test_must_be_empty actual.out && # Assert that the new submodule commits have been fetched and that # no superproject commit was fetched verify_fetch_result actual.err ' Thus, we continue to check the index to implement this unconditional fetching behavior. 2. In the --recurse-submodules=on-demand case, it can be correct to ignore the index because "on-demand" only requires us to fetch changed submodules. But in the event that a submodule is both changed and populated, we may prefer to read the index instead of the superproject commit, because the contents of the index are more obvious and more actionable to the user. For example, we print the path of the submodule when attempting to fetch a submodule for debugging purposes: - For a populated submodule, we print "Fetching submodule <path>" - For an unpopulated submodule, we print "Fetching submodule <path> at commit foo" Presumably, the user would prefer to see the "populated submodule" message because it's easier to work with, e.g. 
"git -C <path> <fix-the-problem>" instead of "git checkout --recurse-submodules <commit-with-submodule> && git <fix-the-problem>". The latter is not a sufficient reason to read the index and then the changed submodule list (because we could try to read the changed submodule configuration from the index), but since we need to support --recurse-submodules=yes, this implementation is convenient for achieving both goals.
Jonathan Tan <jonathantanmy@google.com> writes: > Glen Choo <chooglen@google.com> writes: >> submodule.c has a seemingly-unrelated change that teaches the "find >> changed submodules" rev walk to call is_repository_shallow(). This fixes >> what I believe is a legitimate bug - the rev walk would fail on a >> shallow repo. >> >> Our test suite did not catch this prior to this commit because we skip >> the rev walk if .gitmodules is not found, and thus the test suite did >> not attempt the rev walk on a shallow clone. After this commit, >> we always attempt to find changed submodules (regardless of whether >> there is a .gitmodules file), and the test suite noticed the bug. > > Is this bug present without the other code introduced in this patch? If > yes, it's better to put the bugfix in a separate patch with a test that > would have failed but now passes. Makes sense, I'll do so. >> @@ -1273,10 +1277,6 @@ static void calculate_changed_submodule_paths(struct repository *r, >> struct strvec argv = STRVEC_INIT; >> struct string_list_item *name; >> >> - /* No need to check if there are no submodules configured */ >> - if (!submodule_from_path(r, NULL, NULL)) >> - return; > > I think this is removed because "no submodules configured" here actually > means "no submodules configured in the index", but submodules may be > configured in the superproject commits we're fetching. > > I wonder if this should be mentioned in the commit message, but I'm OK > either way. Yes, your interpretation is correct. Though, as Junio mentioned in <xmqqtud6e3r8.fsf@gitster.g>, I think we'd prefer to have _some kind_ of check, even though this one no longer makes sense. > >> struct submodule_parallel_fetch { >> - int count; >> + int index_count; >> + int changed_count; > > Here (and elsewhere) we're checking both the index and the superproject > commits for .gitmodules. Do we still need to check the index? 
Since this is a frequently asked question, I answered this elsewhere, namely <kl6lczjp7nwj.fsf@chooglen-macbookpro.roam.corp.google.com>. >> +# Cleans up after tests that checkout branches other than the main ones >> +# in the tests. >> +checkout_main_branches() { >> + git -C downstream checkout --recurse-submodules super && >> + git -C downstream/submodule checkout --recurse-submodules sub && >> + git -C downstream/submodule/subdir/deepsubmodule checkout --recurse-submodules deep >> +} > > If we need to clean up in this way, I think it's better if we store a > pristine copy somewhere (e.g. pristine-downstream), delete downstream, > and copy it over when we need to. The need for cleanup isn't that big; this just checks out the right branches after we've checked out _other_ branches. If we remove the checkout, we won't need this any more, and... >> + # Fetch the new superproject commit >> + ( >> + cd downstream && >> + git switch --recurse-submodules no-submodules && >> + git fetch --recurse-submodules=on-demand >../actual.out 2>../actual.err && >> + git checkout --recurse-submodules origin/super 2>../actual-checkout.err > > This patch set is about fetching, so the checkout here seems odd. To > verify that the fetch happened successfully, I think that we should > obtain the hashes of the commits that we expect to be fetched from > upstream, and then verify that they are present downstream. IIUC this feedback correctly, the checkout is just an indirect way of checking if we have the commit, so it makes more sense to just check if we have the commit. But explicitly checking for the commit (with "git cat-file -e" I assume?) is probably overkill - verify_fetch_result() already checks for this by grep-ing the output of "git fetch". So I think it's ok to drop the checkout and not check for the commit (beyond verify_fetch_result()). >> +# Test that we can fetch submodules in other branches by running fetch >> +# in a branch that has no submodules. 
>> +test_expect_success 'setup downstream branch without submodules' ' >> + ( >> + cd downstream && >> + git checkout --recurse-submodules -b no-submodules && >> + rm .gitmodules && >> + git rm submodule && >> + git add .gitmodules && >> + git commit -m "no submodules" && >> + git checkout --recurse-submodules super >> + ) >> +' > > The tip of the branch indeed doesn't have any submodules, but when > fetching this branch, we might end up fetching some of the tip's > ancestors (depending on the repo we're fetching into), which do have > submodules. If we need a branch without submodules, I think that all > ancestors should not have submodules too. > > That might be an argument for creating our own downstream and upstream > repos instead of reusing the existing ones. I think I just made a silly wording error, I meant a "commit" or "working tree state" without submodules, not a branch. The behavior I wanted to test is whether or not changed submodules are fetched in the absence of submodules and .gitmodules in the index/working tree. >> +test_expect_success "'--recurse-submodules=on-demand' should fetch submodule commits if the submodule is changed but the index has no submodules" ' >> + test_when_finished "checkout_main_branches" && >> + git -C downstream fetch --recurse-submodules && >> + # Create new superproject commit with updated submodules >> + add_upstream_commit && >> + ( >> + cd submodule && >> + ( >> + cd subdir/deepsubmodule && >> + git fetch && > > Hmm...I thought submodule/subdir/deepsubmodule is upstream. Why is it > fetching? 
Ah, deepsubmodule is a submodule in the "submodule/" repo, whose remote is in "deepsubmodule/": test_expect_success setup ' mkdir deepsubmodule && ( cd deepsubmodule && git init && echo deepsubcontent > deepsubfile && git add deepsubfile && git commit -m new deepsubfile && git branch -M deep ) && mkdir submodule && ( cd submodule && git init && echo subcontent > subfile && git add subfile && git submodule add "$pwd/deepsubmodule" subdir/deepsubmodule && git commit -a -m new && git branch -M sub ) So we fetch in "submodule/subdir/deepsubmodule" to get a new deepsubmodule and (non-deep) submodule commit. Both of these commits are then used to construct a new superproject commit. If this is too confusing, maybe I should try to make the test simpler. > >> + # Assert that we can checkout the superproject commit with --recurse-submodules >> + ! grep -E "error: Submodule .+ could not be updated" actual-checkout.err > > Negative greps are error-prone, since they will also appear to work if > the message was just misspelled. We should probably check that the > expected commit is present instead. That's a good point, I hadn't considered that. >> +# Test that we properly fetch the submodules in the index as well as >> +# submodules in other branches. >> +test_expect_success 'setup downstream branch with other submodule' ' >> + mkdir submodule2 && >> + ( >> + cd submodule2 && >> + git init && >> + echo sub2content >sub2file && >> + git add sub2file && >> + git commit -a -m new && >> + git branch -M sub2 >> + ) && >> + git checkout -b super-sub2-only && >> + git submodule add "$pwd/submodule2" submodule2 && >> + git commit -m "add sub2" && >> + git checkout super && >> + ( >> + cd downstream && >> + git fetch --recurse-submodules origin && >> + git checkout super-sub2-only && >> + # Explicitly run "git submodule update" because sub2 is new >> + # and has not been cloned. 
>> + git submodule update --init && >> + git checkout --recurse-submodules super >> + ) >> +' > > I couldn't see the submodule in the index to be fetched; maybe it's > there somewhere but it's not obvious to me. If it helps, I've updated the description to: # In downstream, init "submodule2", but do not check it out while # fetching. This lets us assert that unpopulated submodules can be # fetched. The 'submodules in the index' are the existing submodules ("submodule" and "submodule/subdir/deepsubmodule"), and the changed, unpopulated submodule is "submodule2". > Also, why do we need to run "git submodule update"? This patch set > concerns itself with fetching existing submodules, not cloning new > ones. In this setup step, 'downstream' clones 'submodule2' using "git submodule update". So from the perspective of the following tests, 'submodule2' is an existing submodule. We could have cloned 'submodule2' in an earlier setup step, but it wouldn't have been needed until these tests.
diff --git a/Documentation/fetch-options.txt b/Documentation/fetch-options.txt index e967ff1874..38dad13683 100644 --- a/Documentation/fetch-options.txt +++ b/Documentation/fetch-options.txt @@ -185,15 +185,23 @@ endif::git-pull[] ifndef::git-pull[] --recurse-submodules[=yes|on-demand|no]:: This option controls if and under what conditions new commits of - populated submodules should be fetched too. It can be used as a - boolean option to completely disable recursion when set to 'no' or to - unconditionally recurse into all populated submodules when set to - 'yes', which is the default when this option is used without any - value. Use 'on-demand' to only recurse into a populated submodule - when the superproject retrieves a commit that updates the submodule's - reference to a commit that isn't already in the local submodule - clone. By default, 'on-demand' is used, unless - `fetch.recurseSubmodules` is set (see linkgit:git-config[1]). + submodules should be fetched too. When recursing through submodules, + `git fetch` always attempts to fetch "changed" submodules, that is, a + submodule that has commits that are referenced by a newly fetched + superproject commit but are missing in the local submodule clone. A + changed submodule can be fetched as long as it is present locally e.g. + in `$GIT_DIR/modules/` (see linkgit:gitsubmodules[7]); if the upstream + adds a new submodule, that submodule cannot be fetched until it is + cloned e.g. by `git submodule update`. ++ +When set to 'on-demand', only changed submodules are fetched. When set +to 'yes', all populated submodules are fetched and submodules that are +both unpopulated and changed are fetched. When set to 'no', submodules +are never fetched. ++ +When unspecified, this uses the value of `fetch.recurseSubmodules` if it +is set (see linkgit:git-config[1]), defaulting to 'on-demand' if unset. +When this option is used without any value, it defaults to 'yes'. 
endif::git-pull[] -j:: diff --git a/Documentation/git-fetch.txt b/Documentation/git-fetch.txt index 550c16ca61..e9d364669a 100644 --- a/Documentation/git-fetch.txt +++ b/Documentation/git-fetch.txt @@ -287,12 +287,10 @@ include::transfer-data-leaks.txt[] BUGS ---- -Using --recurse-submodules can only fetch new commits in already checked -out submodules right now. When e.g. upstream added a new submodule in the -just fetched commits of the superproject the submodule itself cannot be -fetched, making it impossible to check out that submodule later without -having to do a fetch again. This is expected to be fixed in a future Git -version. +Using --recurse-submodules can only fetch new commits in submodules that are +present locally e.g. in `$GIT_DIR/modules/`. If the upstream adds a new +submodule, that submodule cannot be fetched until it is cloned e.g. by `git +submodule update`. This is expected to be fixed in a future Git version. SEE ALSO -------- diff --git a/submodule.c b/submodule.c index d695dcadf4..0c02bbc9c3 100644 --- a/submodule.c +++ b/submodule.c @@ -22,6 +22,7 @@ #include "parse-options.h" #include "object-store.h" #include "commit-reach.h" +#include "shallow.h" static int config_update_recurse_submodules = RECURSE_SUBMODULES_OFF; static int initialized_fetch_ref_tips; @@ -907,6 +908,9 @@ static void collect_changed_submodules(struct repository *r, save_warning = warn_on_object_refname_ambiguity; warn_on_object_refname_ambiguity = 0; + /* make sure shallows are read */ + is_repository_shallow(the_repository); + repo_init_revisions(r, &rev, NULL); setup_revisions(argv->nr, argv->v, &rev, &s_r_opt); warn_on_object_refname_ambiguity = save_warning; @@ -1273,10 +1277,6 @@ static void calculate_changed_submodule_paths(struct repository *r, struct strvec argv = STRVEC_INIT; struct string_list_item *name; - /* No need to check if there are no submodules configured */ - if (!submodule_from_path(r, NULL, NULL)) - return; - strvec_push(&argv, "--"); /* argv[0] 
program name */ oid_array_for_each_unique(&ref_tips_after_fetch, append_oid_to_argv, &argv); @@ -1347,7 +1347,8 @@ int submodule_touches_in_range(struct repository *r, } struct submodule_parallel_fetch { - int count; + int index_count; + int changed_count; struct strvec args; struct repository *r; const char *prefix; @@ -1357,6 +1358,7 @@ struct submodule_parallel_fetch { int result; struct string_list changed_submodule_names; + struct string_list seen_submodule_names; /* Pending fetches by OIDs */ struct fetch_task **oid_fetch_tasks; @@ -1367,6 +1369,7 @@ struct submodule_parallel_fetch { #define SPF_INIT { \ .args = STRVEC_INIT, \ .changed_submodule_names = STRING_LIST_INIT_DUP, \ + .seen_submodule_names = STRING_LIST_INIT_DUP, \ .submodules_with_errors = STRBUF_INIT, \ } @@ -1481,11 +1484,12 @@ static struct repository *get_submodule_repo_for(struct repository *r, } static struct fetch_task * -get_fetch_task(struct submodule_parallel_fetch *spf, - const char **default_argv, struct strbuf *err) +get_fetch_task_from_index(struct submodule_parallel_fetch *spf, + const char **default_argv, struct strbuf *err) { - for (; spf->count < spf->r->index->cache_nr; spf->count++) { - const struct cache_entry *ce = spf->r->index->cache[spf->count]; + for (; spf->index_count < spf->r->index->cache_nr; spf->index_count++) { + const struct cache_entry *ce = + spf->r->index->cache[spf->index_count]; struct fetch_task *task; if (!S_ISGITLINK(ce->ce_mode)) @@ -1495,6 +1499,15 @@ get_fetch_task(struct submodule_parallel_fetch *spf, if (!task) continue; + /* + * We might have already considered this submodule + * because we saw it when iterating the changed + * submodule names. 
+ */ + if (string_list_lookup(&spf->seen_submodule_names, + task->sub->name)) + continue; + switch (get_fetch_recurse_config(task->sub, spf)) { default: @@ -1542,7 +1555,69 @@ get_fetch_task(struct submodule_parallel_fetch *spf, strbuf_addf(err, _("Fetching submodule %s%s\n"), spf->prefix, ce->name); - spf->count++; + spf->index_count++; + return task; + } + return NULL; +} + +static struct fetch_task * +get_fetch_task_from_changed(struct submodule_parallel_fetch *spf, + const char **default_argv, struct strbuf *err) +{ + for (; spf->changed_count < spf->changed_submodule_names.nr; + spf->changed_count++) { + struct string_list_item item = + spf->changed_submodule_names.items[spf->changed_count]; + struct changed_submodule_data *cs_data = item.util; + struct fetch_task *task; + + /* + * We might have already considered this submodule + * because we saw it in the index. + */ + if (string_list_lookup(&spf->seen_submodule_names, item.string)) + continue; + + task = fetch_task_create(spf->r, cs_data->path, + cs_data->super_oid); + if (!task) + continue; + + switch (get_fetch_recurse_config(task->sub, spf)) { + default: + case RECURSE_SUBMODULES_DEFAULT: + case RECURSE_SUBMODULES_ON_DEMAND: + *default_argv = "on-demand"; + break; + case RECURSE_SUBMODULES_ON: + *default_argv = "yes"; + break; + case RECURSE_SUBMODULES_OFF: + continue; + } + + task->repo = get_submodule_repo_for(spf->r, task->sub->path, + cs_data->super_oid); + if (!task->repo) { + fetch_task_release(task); + free(task); + + strbuf_addf(err, _("Could not access submodule '%s'\n"), + cs_data->path); + continue; + } + if (!is_tree_submodule_active(spf->r, cs_data->super_oid, + task->sub->path)) + continue; + + if (!spf->quiet) + strbuf_addf(err, + _("Fetching submodule %s%s at commit %s\n"), + spf->prefix, task->sub->path, + find_unique_abbrev(cs_data->super_oid, + DEFAULT_ABBREV)); + spf->changed_count++; return task; } return NULL; @@ -1553,7 +1628,10 @@ static int get_next_submodule(struct child_process 
*cp, struct strbuf *err, { struct submodule_parallel_fetch *spf = data; const char *default_argv = NULL; - struct fetch_task *task = get_fetch_task(spf, &default_argv, err); + struct fetch_task *task = + get_fetch_task_from_index(spf, &default_argv, err); + if (!task) + task = get_fetch_task_from_changed(spf, &default_argv, err); if (task) { struct strbuf submodule_prefix = STRBUF_INIT; @@ -1573,6 +1651,7 @@ static int get_next_submodule(struct child_process *cp, struct strbuf *err, *task_cb = task; strbuf_release(&submodule_prefix); + string_list_insert(&spf->seen_submodule_names, task->sub->name); return 1; } diff --git a/t/t5526-fetch-submodules.sh b/t/t5526-fetch-submodules.sh index cb18f0ac21..f37dca4e09 100755 --- a/t/t5526-fetch-submodules.sh +++ b/t/t5526-fetch-submodules.sh @@ -399,6 +399,223 @@ test_expect_success "'--recurse-submodules=on-demand' recurses as deep as necess verify_fetch_result actual.err ' +# Cleans up after tests that checkout branches other than the main ones +# in the tests. +checkout_main_branches() { + git -C downstream checkout --recurse-submodules super && + git -C downstream/submodule checkout --recurse-submodules sub && + git -C downstream/submodule/subdir/deepsubmodule checkout --recurse-submodules deep +} + +# Test that we can fetch submodules in other branches by running fetch +# in a branch that has no submodules. 
+test_expect_success 'setup downstream branch without submodules' ' + ( + cd downstream && + git checkout --recurse-submodules -b no-submodules && + rm .gitmodules && + git rm submodule && + git add .gitmodules && + git commit -m "no submodules" && + git checkout --recurse-submodules super + ) +' + +test_expect_success "'--recurse-submodules=on-demand' should fetch submodule commits if the submodule is changed but the index has no submodules" ' + test_when_finished "checkout_main_branches" && + git -C downstream fetch --recurse-submodules && + # Create new superproject commit with updated submodules + add_upstream_commit && + ( + cd submodule && + ( + cd subdir/deepsubmodule && + git fetch && + git checkout -q FETCH_HEAD + ) && + git add subdir/deepsubmodule && + git commit -m "new deep submodule" + ) && + git add submodule && + git commit -m "new submodule" && + + # Fetch the new superproject commit + ( + cd downstream && + git switch --recurse-submodules no-submodules && + git fetch --recurse-submodules=on-demand >../actual.out 2>../actual.err && + git checkout --recurse-submodules origin/super 2>../actual-checkout.err + ) && + test_must_be_empty actual.out && + git rev-parse --short HEAD >superhead && + git -C submodule rev-parse --short HEAD >subhead && + git -C deepsubmodule rev-parse --short HEAD >deephead && + verify_fetch_result actual.err && + + # Assert that the fetch happened at the non-HEAD commits + grep "Fetching submodule submodule at commit $superhead" actual.err && + grep "Fetching submodule submodule/subdir/deepsubmodule at commit $subhead" actual.err && + + # Assert that we can checkout the superproject commit with --recurse-submodules + ! grep -E "error: Submodule .+ could not be updated" actual-checkout.err +' + +test_expect_success "'--recurse-submodules' should fetch submodule commits if the submodule is changed but the index has no submodules" ' + test_when_finished "checkout_main_branches" && + # Fetch any leftover commits from other tests. 
+ git -C downstream fetch --recurse-submodules && + # Create new superproject commit with updated submodules + add_upstream_commit && + ( + cd submodule && + ( + cd subdir/deepsubmodule && + git fetch && + git checkout -q FETCH_HEAD + ) && + git add subdir/deepsubmodule && + git commit -m "new deep submodule" + ) && + git add submodule && + git commit -m "new submodule" && + + # Fetch the new superproject commit + ( + cd downstream && + git switch --recurse-submodules no-submodules && + git fetch --recurse-submodules >../actual.out 2>../actual.err && + git checkout --recurse-submodules origin/super 2>../actual-checkout.err + ) && + test_must_be_empty actual.out && + git rev-parse --short HEAD >superhead && + git -C submodule rev-parse --short HEAD >subhead && + git -C deepsubmodule rev-parse --short HEAD >deephead && + verify_fetch_result actual.err && + + # Assert that the fetch happened at the non-HEAD commits + grep "Fetching submodule submodule at commit $superhead" actual.err && + grep "Fetching submodule submodule/subdir/deepsubmodule at commit $subhead" actual.err && + + # Assert that we can checkout the superproject commit with --recurse-submodules + ! grep -E "error: Submodule .+ could not be updated" actual-checkout.err +' + +test_expect_success "'--recurse-submodules' should ignore changed, inactive submodules" ' + test_when_finished "checkout_main_branches" && + # Fetch any leftover commits from other tests. 
+ git -C downstream fetch --recurse-submodules && + # Create new superproject commit with updated submodules + add_upstream_commit && + ( + cd submodule && + ( + cd subdir/deepsubmodule && + git fetch && + git checkout -q FETCH_HEAD + ) && + git add subdir/deepsubmodule && + git commit -m "new deep submodule" + ) && + git add submodule && + git commit -m "new submodule" && + + # Fetch the new superproject commit + ( + cd downstream && + git switch --recurse-submodules no-submodules && + git -c submodule.submodule.active=false fetch --recurse-submodules >../actual.out 2>../actual.err + ) && + test_must_be_empty actual.out && + git rev-parse --short HEAD >superhead && + # Neither should be fetched because the submodule is inactive + rm subhead && + rm deephead && + verify_fetch_result actual.err +' + +# Test that we properly fetch the submodules in the index as well as +# submodules in other branches. +test_expect_success 'setup downstream branch with other submodule' ' + mkdir submodule2 && + ( + cd submodule2 && + git init && + echo sub2content >sub2file && + git add sub2file && + git commit -a -m new && + git branch -M sub2 + ) && + git checkout -b super-sub2-only && + git submodule add "$pwd/submodule2" submodule2 && + git commit -m "add sub2" && + git checkout super && + ( + cd downstream && + git fetch --recurse-submodules origin && + git checkout super-sub2-only && + # Explicitly run "git submodule update" because sub2 is new + # and has not been cloned. + git submodule update --init && + git checkout --recurse-submodules super + ) +' + +test_expect_success "'--recurse-submodules' should fetch submodule commits in changed submodules and the index" ' + test_when_finished "checkout_main_branches" && + # Fetch any leftover commits from other tests. 
+ git -C downstream fetch --recurse-submodules && + # Create new commit in origin/super + add_upstream_commit && + ( + cd submodule && + ( + cd subdir/deepsubmodule && + git fetch && + git checkout -q FETCH_HEAD + ) && + git add subdir/deepsubmodule && + git commit -m "new deep submodule" + ) && + git add submodule && + git commit -m "new submodule" && + + # Create new commit in origin/super-sub2-only + git checkout super-sub2-only && + ( + cd submodule2 && + test_commit --no-tag foo + ) && + git add submodule2 && + git commit -m "new submodule2" && + + git checkout super && + ( + cd downstream && + git fetch --recurse-submodules >../actual.out 2>../actual.err && + git checkout --recurse-submodules origin/super-sub2-only 2>../actual-checkout.err + ) && + test_must_be_empty actual.out && + + # Assert that the submodules in the super branch are fetched + git rev-parse --short HEAD >superhead && + git -C submodule rev-parse --short HEAD >subhead && + git -C deepsubmodule rev-parse --short HEAD >deephead && + verify_fetch_result actual.err && + # Assert that submodule is read from the index, not from a commit + ! grep "Fetching submodule submodule at commit" actual.err && + + # Assert that super-sub2-only and submodule2 were fetched even + # though another branch is checked out + super_sub2_only_head=$(git rev-parse --short super-sub2-only) && + grep -E "\.\.${super_sub2_only_head}\s+super-sub2-only\s+-> origin/super-sub2-only" actual.err && + grep "Fetching submodule submodule2 at commit $super_sub2_only_head" actual.err && + sub2head=$(git -C submodule2 rev-parse --short HEAD) && + grep -E "\.\.${sub2head}\s+sub2\s+-> origin/sub2" actual.err && + + # Assert that we can checkout the superproject commit with --recurse-submodules + ! 
grep -E "error: Submodule .+ could not be updated" actual-checkout.err +' + test_expect_success "'--recurse-submodules=on-demand' stops when no new submodule commits are found in the superproject (and ignores config)" ' add_upstream_commit && echo a >> file &&
"git fetch --recurse-submodules" only considers populated submodules (i.e. submodules that can be found by iterating the index), which makes "git fetch" behave differently depending on which commit is checked out. As a result, even if the user has initialized all submodules correctly, they may not fetch the necessary submodule commits, and commands like "git checkout --recurse-submodules" might fail.

Teach "git fetch" to fetch cloned, changed submodules regardless of whether they are populated (this is in addition to the current behavior of fetching populated submodules). Since a submodule may be encountered multiple times (via the list of populated submodules or via the list of changed submodules), maintain a list of seen submodules to avoid fetching a submodule more than once.

Signed-off-by: Glen Choo <chooglen@google.com>
---
submodule.c has a seemingly unrelated change that teaches the "find changed submodules" rev walk to call is_repository_shallow(). This fixes what I believe is a legitimate bug: the rev walk would fail on a shallow repo. Our test suite did not catch this prior to this commit because we skip the rev walk if .gitmodules is not found, and thus the test suite never attempted the rev walk on a shallow clone. After this commit, we always attempt to find changed submodules (regardless of whether there is a .gitmodules file), and the test suite noticed the bug.

 Documentation/fetch-options.txt |  26 ++--
 Documentation/git-fetch.txt     |  10 +-
 submodule.c                     | 101 +++++++++++++--
 t/t5526-fetch-submodules.sh     | 217 ++++++++++++++++++++++++++++++++
 4 files changed, 328 insertions(+), 26 deletions(-)