Message ID | pull.797.git.1605904586929.gitgitgadget@gmail.com |
---|---|
State | New, archived |
Series | clone: --filter=tree:0 implies fetch.recurseSubmodules=no |
On Fri, Nov 20, 2020 at 08:36:26PM +0000, Derrick Stolee via GitGitGadget wrote:

> I decided that even if I did populate the submodules, the nature of
> treeless clones makes me not want to care about the contents of commits
> other than those that I am explicitly navigating to.
>
> This loop of tree fetches can be avoided by adding
> --no-recurse-submodules to the 'git fetch' command or setting
> fetch.recurseSubmodules=no.
>
> To make this as painless as possible for future users of treeless
> clones, automatically set fetch.recurseSubmodules=no at clone time.

I think it's definitely worth adjusting the defaults here to make this
less painful out of the box. But I wonder if this is too big a hammer,
and the on-demand logic just needs to be a bit less aggressive.

If I clone with tree:0, I'm still going to get the tree for the thing
I'm actually checking out (assuming a non-bare repo). It would be
reasonable to recursively fetch the submodules in that commit to check
them out (assuming you specified --recurse-submodules).

If I then fetch again, I'll end up with another tree that I'm about to
check out. So likewise, would it make sense to fetch any updates from
there?

I.e., I think the problem is that whether or not we cared about
submodules in the first place, the default "on-demand" setting of
fetch.recurseSubmodules is very eager to poke through history looking
at the .gitmodules file to see if there is anything worth also
fetching. But:

  - if we know there are no active submodules in the first place
    (because you did not say --recurse-submodules), should it skip that
    poking? That seems like it shouldn't be too hard.

  - during its poking, should it set the necessary variables so that it
    never demand-fetches from a promisor remote? I suspect this part
    may be hard, because "fetch" and "checkout" are distinct operations
    (so during the "fetch" we don't yet have the new tree demand-fetched
    by checkout; in fact the user might not even be interested in
    checking it out yet).

Given the difficulties in the latter case, this may be the best we can
do. But in that case, what happens when we _do_ care about submodules,
and do:

  git clone --recurse-submodules --filter=tree:0 ...
  git fetch
  git merge origin

Will we correctly fetch-on-demand the submodules we need during the
merge operation? If so, then that user experience is probably pretty
reasonable.

-Peff
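For readers who want to reproduce the failure mode being discussed, the
commands from the cover letter plus the two workarounds look like this
(a sketch; github.com/git/git is simply the repository Stolee tested
against):

    # Treeless single-branch clone, as in the cover letter.
    git clone --filter=tree:0 --single-branch --branch=master \
        https://github.com/git/git

    # Without this patch, the next command walks history looking for
    # submodule changes and demand-fetches many trees along the way.
    git -C git fetch origin "+refs/heads/*:refs/remotes/origin/*"

    # Per-invocation workaround:
    git -C git fetch --no-recurse-submodules origin

    # Persistent workaround (what the patch automates at clone time):
    git -C git config fetch.recurseSubmodules no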
Hi Stolee,

On 20-11-20 15 h 36, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The partial clone feature has several modes, but only a few are quick
> for a server to process using reachability bitmaps:
>
> * Blobless: --filter=blob:none downloads all commits and trees and
>   fetches necessary blobs on demand.
>
> * Treeless: --filter=tree:0 downloads all commits and fetches necessary
>   trees and blobs on demand.
>
> This treeless mode is most similar to a shallow clone in the total size
> (it only adds the commit objects for the full history). This makes
> treeless clones an interesting replacement for shallow clones. A user
> can run more commands in a treeless clone than in a shallow clone,
> especially 'git log' (no pathspec).
>
> In particular, servers can still serve 'git fetch' requests quickly by
> calculating the difference between commit wants and haves using
> bitmaps.
>
> I was testing this feature with this in mind, and I knew that some
> trees would be downloaded multiple times when checking out a new
> branch, but I did not expect to discover a significant issue with
> 'git fetch', at least in repositories with submodules.
>
> I was testing these commands:
>
>   $ git clone --filter=tree:0 --single-branch --branch=master \
>       https://github.com/git/git
>   $ git -C git fetch origin "+refs/heads/*:refs/remotes/origin/*"
>
> This fetch command started downloading several pack-files of trees
> before completing the command. I never let it finish since I got so
> impatient with the repeated downloads. During debugging, I found that
> the stack triggering promisor_remote_get_direct() was going through
> fetch_populated_submodules(). Notice that I did not recurse into the
> submodules in the original clone, so the sha1collisiondetection
> submodule is not initialized. Even so, my 'git fetch' was scanning
> commits for updates to submodules.

I'm not super familiar with the inner workings of
fetch_populated_submodules(), but it seems weird that this function
does anything in that case. It should do nothing, as the submodule is
not populated. Maybe it would be worth it to investigate what exactly
is happening?

> I decided that even if I did populate the submodules, the nature of
> treeless clones makes me not want to care about the contents of commits
> other than those that I am explicitly navigating to.
>
> This loop of tree fetches can be avoided by adding
> --no-recurse-submodules to the 'git fetch' command or setting
> fetch.recurseSubmodules=no.
>
> To make this as painless as possible for future users of treeless
> clones, automatically set fetch.recurseSubmodules=no at clone time.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> clone: --filter=tree:0 implies fetch.recurseSubmodules=no
>
> While testing different partial clone options, I stumbled across this
> one. My initial thought was that we were parsing commits and loading
> their root trees unnecessarily, but I see that doesn't happen after
> this change.
>
> Here are some recent discussions about using --filter=tree:0:
>
> [1] https://lore.kernel.org/git/aa7b89ee-08aa-7943-6a00-28dcf344426e@syntevo.com/
> [2] https://lore.kernel.org/git/cover.1588633810.git.me@ttaylorr.com/
> [3] https://lore.kernel.org/git/58274817-7ac6-b6ae-0d10-22485dfe5e0e@syntevo.com/
>
> Thanks,
> -Stolee
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-797%2Fderrickstolee%2Ftree-0-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-797/derrickstolee/tree-0-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/797
>
>  list-objects-filter-options.c | 4 ++++
>  t/t5616-partial-clone.sh      | 6 ++++++
>  2 files changed, 10 insertions(+)

In any case I think such a change would also need a doc update, probably
in Documentation/fetch-options.txt and Documentation/config/fetch.txt.

Cheers,

Philippe.
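Something along these lines could work for the fetch.recurseSubmodules
entry in Documentation/config/fetch.txt (hypothetical wording, not part
of the patch as posted):

    Note that `git clone --filter=tree:0` sets this variable to `no` in
    the resulting repository, because recursive submodule fetching in a
    treeless clone would otherwise trigger a large number of on-demand
    tree fetches from the promisor remote.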
On 11/20/2020 7:04 PM, Jeff King wrote:
> On Fri, Nov 20, 2020 at 08:36:26PM +0000, Derrick Stolee via GitGitGadget wrote:
>
>> I decided that even if I did populate the submodules, the nature of
>> treeless clones makes me not want to care about the contents of commits
>> other than those that I am explicitly navigating to.
>>
>> This loop of tree fetches can be avoided by adding
>> --no-recurse-submodules to the 'git fetch' command or setting
>> fetch.recurseSubmodules=no.
>>
>> To make this as painless as possible for future users of treeless
>> clones, automatically set fetch.recurseSubmodules=no at clone time.
>
> I think it's definitely worth adjusting the defaults here to make this
> less painful out of the box. But I wonder if this is too big a hammer,
> and the on-demand logic just needs to be a bit less aggressive.

Yes, you are probably right. I really should have labeled this patch as
"RFC", as that was my intent.

> If I clone with tree:0, I'm still going to get the tree for the thing
> I'm actually checking out (assuming a non-bare repo). It would be
> reasonable to recursively fetch the submodules in that commit to
> check them out (assuming you specified --recurse-submodules).
>
> If I then fetch again, I'll end up with another tree that I'm about to
> check out. So likewise, would it make sense to fetch any updates from
> there?

One thing that is different is that we will fetch the trees we need
_during that checkout_, so why do it preemptively in the fetch?

Further, the number of trees being fetched is _not_ the number of ref
tips, but seems to be related to the total number of commits that
differ. There must be a rev walk looking for which commits changed the
.gitmodules file or the commit link in the root tree.

> I.e., I think the problem is that whether or not we cared about
> submodules in the first place, the default "on-demand" setting of
> fetch.recurseSubmodules is very eager to poke through history looking
> at the .gitmodules file to see if there is anything worth also
> fetching. But:
>
>  - if we know there are no active submodules in the first place
>    (because you did not say --recurse-submodules), should it skip that
>    poking? That seems like it shouldn't be too hard.

This is a good direction to pursue.

>  - during its poking, should it set the necessary variables so that it
>    never demand-fetches from a promisor remote? I suspect this part
>    may be hard, because "fetch" and "checkout" are distinct operations
>    (so during the "fetch" we don't yet have the new tree demand-fetched
>    by checkout; in fact the user might not even be interested in
>    checking it out yet).

I also think this is a good idea. In particular, should we consider
making the "submodule fetch" part of the "promisor"? That is, we only
fetch our submodule on checkout? Can we assume that the commit will
still exist on the remote, to some extent?

Naturally, this only really applies for --filter=tree:0, since in cases
like blobless clones, we would still want the commits and trees from the
submodule.

> Given the difficulties in the latter case, this may be the best we can
> do. But in that case, what happens when we _do_ care about submodules,
> and do:
>
>   git clone --recurse-submodules --filter=tree:0 ...
>   git fetch
>   git merge origin
>
> Will we correctly fetch-on-demand the submodules we need during the
> merge operation? If so, then that user experience is probably pretty
> reasonable.

Hopefully. Notably, the filter option does _not_ apply recursively to
the submodules, so even if we try to make the superproject a partial
clone, the submodule is not partial.

More research is required. Let's drop this patch, as I don't currently
have time to do the necessary deep dive. If someone else has time to
look into this, I'd be happy to review a better patch.

Thanks,
-Stolee
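One way to observe the non-recursive behavior of the filter that Stolee
mentions (a sketch; sha1collisiondetection is git.git's submodule path,
and remote.origin.partialclonefilter is the config key that a partial
clone records):

    git clone --recurse-submodules --filter=tree:0 https://github.com/git/git
    # The superproject records the promisor filter:
    git -C git config remote.origin.partialclonefilter    # prints: tree:0
    # The submodule does not, i.e. it was cloned in full:
    git -C git/sha1collisiondetection config remote.origin.partialclonefilter \
        || echo "submodule is not a partial clone"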
On Mon, Nov 23, 2020 at 10:18:42AM -0500, Derrick Stolee wrote:

> > If I clone with tree:0, I'm still going to get the tree for the thing
> > I'm actually checking out (assuming a non-bare repo). It would be
> > reasonable to recursively fetch the submodules in that commit to
> > check them out (assuming you specified --recurse-submodules).
> >
> > If I then fetch again, I'll end up with another tree that I'm about to
> > check out. So likewise, would it make sense to fetch any updates from
> > there?
>
> One thing that is different is that we will fetch the trees we need
> _during that checkout_, so why do it preemptively in the fetch?
>
> Further, the number of trees being fetched is _not_ the number of ref
> tips, but seems to be related to the total number of commits that
> differ. There must be a rev walk looking for which commits changed the
> .gitmodules file or the commit link in the root tree.

Yeah, that's exactly what the "on-demand" thing is. I said before it was
looking for .gitmodules, but I think it has to be looking for updated
gitlinks. I'm pretty sure the logic only kicks in if we have a
.gitmodules, though, to avoid the extra traversal when you're not using
submodules (but I guess we're all now paying that traversal cost in
git.git, even if we don't populate the modules).

I think the goal is to do all of the fetches at once, so after "git
fetch" you can then run checkout, merge, etc., without worrying that
you'll need network access later.

But it's not actually "checkout" that does the fetching. It will update
the gitlink, but that will just show a diff against the content of the
submodule. You have to actually "git submodule update" to update the
repository, which will then auto-fetch (though I suspect that with
appropriate config or command-line options, checkout can be convinced
to trigger "submodule update").

So in that sense, maybe your patch is the most sensible thing. If you're
not checking out immediately, we probably have no business in a partial
clone guessing at trees you _might_ check out later, especially if it
involves demand-fetching them from the server. And if you do check out
and "submodule update" immediately, then we'd do the fetch then anyway.
We can punt on the logic to walk the trees, only looking at ones we
actually _do_ have locally, until somebody else wants to work on it.

> >  - during its poking, should it set the necessary variables so that
> >    it never demand-fetches from a promisor remote? I suspect this
> >    part may be hard, because "fetch" and "checkout" are distinct
> >    operations (so during the "fetch" we don't yet have the new tree
> >    demand-fetched by checkout; in fact the user might not even be
> >    interested in checking it out yet).
>
> I also think this is a good idea. In particular, should we consider
> making the "submodule fetch" part of the "promisor"? That is, we
> only fetch our submodule on checkout? Can we assume that the commit
> will still exist on the remote, to some extent?

I think issues of whether the commit will exist are outside the scope
here. "submodule update" has to decide if we have it, or how to get it
(if it can). But it won't fail a checkout in that case; you'll just be
left with a diff between the gitlink and what's in the submodule repo.

I'm not quite sure what you're asking with the rest of it. From this:

> Naturally, this only really applies for --filter=tree:0, since in
> cases like blobless clones, we would still want the commits and trees
> from the submodule.

it sounds like you're asking whether the submodule should also be using
the same filter. Maybe, but I think it would depend on the workflow and
the module (and probably is something that you'd want to be able to
configure independently). At any rate, I think that's orthogonal to
this issue.

> > Given the difficulties in the latter case, this may be the best we
> > can do. But in that case, what happens when we _do_ care about
> > submodules, and do:
> >
> >   git clone --recurse-submodules --filter=tree:0 ...
> >   git fetch
> >   git merge origin
> >
> > Will we correctly fetch-on-demand the submodules we need during the
> > merge operation? If so, then that user experience is probably pretty
> > reasonable.

Answering my own question: no, you'd run "git submodule update"
afterwards.

> Hopefully. Notably, the filter option does _not_ apply recursively
> to the submodules, so even if we try to make the superproject a partial
> clone, the submodule is not partial.

Yep, but I do think that's orthogonal.

> More research is required. Let's drop this patch, as I don't currently
> have time to do the necessary deep dive. If someone else has time to
> look into this, I'd be happy to review a better patch.

I don't mind dropping it, but I've actually come around to the idea
that your patch, or something like it, is probably a strict
improvement.

-Peff
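Spelled out as a sequence of commands, the user experience Peff
describes would be (a sketch; https://example.com/repo.git stands in
for any superproject with submodules):

    git clone --recurse-submodules --filter=tree:0 https://example.com/repo.git repo
    cd repo
    git fetch                  # with the patch: no submodule history walk
    git merge origin/master    # demand-fetches superproject trees as needed
    git submodule update       # only here are new submodule commits fetched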
diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
index defd3dfd10..249939dfa5 100644
--- a/list-objects-filter-options.c
+++ b/list-objects-filter-options.c
@@ -376,6 +376,10 @@ void partial_clone_register(
 		       expand_list_objects_filter_spec(filter_options));
 	free(filter_name);
 
+	if (filter_options->choice == LOFC_TREE_DEPTH &&
+	    !filter_options->tree_exclude_depth)
+		git_config_set("fetch.recursesubmodules", "no");
+
 	/* Make sure the config info are reset */
 	promisor_remote_reinit();
 }
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index f4d49d8335..b2eaf78069 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -341,6 +341,12 @@ test_expect_success 'partial clone with sparse filter succeeds' '
 	)
 '
 
+test_expect_success '--filter=tree:0 sets fetch.recurseSubmodules=no' '
+	rm -rf dst &&
+	git clone --filter=tree:0 "file://$(pwd)/src" dst &&
+	test_cmp_config -C dst no fetch.recursesubmodules
+'
+
 test_expect_success 'partial clone with unresolvable sparse filter fails cleanly' '
 	rm -rf dst.git &&
 	test_must_fail git clone --no-local --bare \
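To check the new behavior by hand, outside of the test suite (a sketch
mirroring the added test; 'src' is any local repository served over
file://):

    git clone --filter=tree:0 "file://$(pwd)/src" dst
    git -C dst config fetch.recursesubmodules
    # Expected output with the patch applied: no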