diff mbox series

[2/9] maintenance: add prefetch task

Message ID 85118ed5f19468d5051dd1579e35cae3c3114d24.1596731425.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Maintenance II: prefetch, loose-objects, incremental-repack tasks | expand

Commit Message

Philippe Blain via GitGitGadget Aug. 6, 2020, 4:30 p.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

When working with very large repositories, an incremental 'git fetch'
command can download a large amount of data. If there are many other
users pushing to a common repo, then this data can rival the initial
pack-file size of a 'git clone' of a medium-size repo.

Users may want to keep the data on their local repos as close as
possible to the data on the remote repos by fetching periodically in
the background. This can break up a large daily fetch into several
smaller hourly fetches.

The task is called "prefetch" because it is work done in advance
of a foreground fetch to make that 'git fetch' command much faster.

However, if we simply ran 'git fetch <remote>' in the background,
then the user running a foregroudn 'git fetch <remote>' would lose
some important feedback when a new branch appears or an existing
branch updates. This is especially true if a remote branch is
force-updated and this isn't noticed by the user because it occurred
in the background. Further, the functionality of 'git push
--force-with-lease' becomes suspect.

When running 'git fetch <remote> <options>' in the background, use
the following options for careful updating:

1. --no-tags prevents getting a new tag when a user wants to see
   the new tags appear in their foreground fetches.

2. --refmap= removes the configured refspec which usually updates
   refs/remotes/<remote>/* with the refs advertised by the remote.
   While this looks confusing, this was documented and tested by
   b40a50264ac (fetch: document and test --refmap="", 2020-01-21),
   including this sentence in the documentation:

	Providing an empty `<refspec>` to the `--refmap` option
	causes Git to ignore the configured refspecs and rely
	entirely on the refspecs supplied as command-line arguments.

3. By adding a new refspec "+refs/heads/*:refs/prefetch/<remote>/*"
   we can ensure that we actually load the new values somewhere in
   our refspace while not updating refs/heads or refs/remotes. By
   storing these refs here, the commit-graph job will update the
   commit-graph with the commits from these hidden refs.

4. --prune will delete the refs/prefetch/<remote> refs that no
   longer appear on the remote.

5. --no-write-fetch-head prevents updating FETCH_HEAD.

We've been using this step as a critical background job in Scalar
[1] (and VFS for Git). This solved a pain point that was showing up
in user reports: fetching was a pain! Users do not like waiting to
download the data that was created while they were away from their
machines. After implementing background fetch, the foreground fetch
commands sped up significantly because they mostly just update refs
and download a small amount of new data. The effect is especially
dramatic when paried with --no-show-forced-udpates (through
fetch.showForcedUpdates=false).

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/FetchStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-maintenance.txt | 15 +++++++++
 builtin/gc.c                      | 52 +++++++++++++++++++++++++++++++
 t/t7900-maintenance.sh            | 25 +++++++++++++++
 3 files changed, 92 insertions(+)

Comments

Emily Shaffer Aug. 12, 2020, 11:10 p.m. UTC | #1
On Thu, Aug 06, 2020 at 04:30:17PM +0000, Derrick Stolee via GitGitGadget wrote:
> 
> 3. By adding a new refspec "+refs/heads/*:refs/prefetch/<remote>/*"
>    we can ensure that we actually load the new values somewhere in
>    our refspace while not updating refs/heads or refs/remotes. By
>    storing these refs here, the commit-graph job will update the
>    commit-graph with the commits from these hidden refs.

[emily] How does the content of refs/prefetch/* get used?
[jrnieder] How does the content of refs/prefetch/* get cleaned up?
[jonathantan] refs/prefetch/* will get used in negotiation so it is
useful to keep these.
> 
> 4. --prune will delete the refs/prefetch/<remote> refs that no
>    longer appear on the remote.
[jrnieder] this is what cleans up refs/prefetch/* later :)

> +static int fetch_remote(const char *remote, struct maintenance_opts *opts)
> +{
> +	struct child_process child = CHILD_PROCESS_INIT;
> +
> +	child.git_cmd = 1;
> +	strvec_pushl(&child.args, "fetch", remote, "--prune", "--no-tags",
> +		     "--no-write-fetch-head", "--refmap=", NULL);
[jonathantan] It would be good to pass --recurse-submodules=no.
[jrnieder] Since we are specifying the refmap here, if we were to
recurse into submodules, we'd have to be careful about making sure
refmap gets propagated correctly.

> +static int fill_each_remote(struct remote *remote, void *cbdata)
[jrnieder] Since this is a callback that happens for each remote, maybe
this should be named to indicate it only fills one remote at a time
instead. Nit :)
> +{
> +	struct string_list *remotes = (struct string_list *)cbdata;
> +
> +	string_list_append(remotes, remote->name);
> +	return 0;
> +}
> +
> +static int maintenance_task_prefetch(struct maintenance_opts *opts)
> +{
> +	int result = 0;
> +	struct string_list_item *item;
> +	struct string_list remotes = STRING_LIST_INIT_DUP;
> +
> +	if (for_each_remote(fill_each_remote, &remotes)) {
> +		error(_("failed to fill remotes"));
> +		result = 1;
> +		goto cleanup;
> +	}
> +
> +	for (item = remotes.items;
> +	     item && item < remotes.items + remotes.nr;
> +	     item++)

Is there a reason not to use for_each_string_list_item() instead? This
would be more brief and also easier to read (less thinking about what
the loop is doing).

> +		result |= fetch_remote(item->string, opts);
> +
> +cleanup:
> +	string_list_clear(&remotes, 0);
> +	return result;
> +}
> +

>  enum maintenance_task_label {
> +	TASK_PREFETCH,
[jrnieder] Nit: Is there a sort order for these? Should we establish an order
early on (e.g. alphabetical)?
>  	TASK_GC,
>  	TASK_COMMIT_GRAPH,

> +test_expect_success 'prefetch multiple remotes' '
> +	git clone . clone1 &&
> +	git clone . clone2 &&
> +	git remote add remote1 "file://$(pwd)/clone1" &&
> +	git remote add remote2 "file://$(pwd)/clone2" &&
> +	git -C clone1 switch -c one &&
> +	git -C clone2 switch -c two &&
> +	test_commit -C clone1 one &&
> +	test_commit -C clone2 two &&
> +	GIT_TRACE2_EVENT="$(pwd)/run-prefetch.txt" git maintenance run --task=prefetch 2>/dev/null &&
> +	fetchargs="--prune --no-tags --no-write-fetch-head --refmap= --quiet" &&
> +	test_subcommand git fetch remote1 $fetchargs +refs/heads/\\*:refs/prefetch/remote1/\\* <run-prefetch.txt &&
[jrnieder] In practice, why were all the \\ needed? Trying to figure out
where Git is using a shell that would need the * escaped and finding it
hard to reason about.
> +	test_subcommand git fetch remote2 $fetchargs +refs/heads/\\*:refs/prefetch/remote2/\\* <run-prefetch.txt &&
> +	test_path_is_missing .git/refs/remotes &&
> +	test_cmp clone1/.git/refs/heads/one .git/refs/prefetch/remote1/one &&
> +	test_cmp clone2/.git/refs/heads/two .git/refs/prefetch/remote2/two &&
[jrnieder] Should we use test_cmp_rev instead to make compatible with
packed-refs?
> +	git log prefetch/remote1/one &&
> +	git log prefetch/remote2/two
[jonathantan] Why do we use 'git log' to check? I'm a little confused
about what's going on; if you just want to check that the refs are
present you could use 'git rev-parse' instead?
Derrick Stolee Aug. 14, 2020, 1:28 a.m. UTC | #2
On 8/12/2020 7:10 PM, Emily Shaffer wrote:
> On Thu, Aug 06, 2020 at 04:30:17PM +0000, Derrick Stolee via GitGitGadget wrote:
>>
>> 3. By adding a new refspec "+refs/heads/*:refs/prefetch/<remote>/*"
>>    we can ensure that we actually load the new values somewhere in
>>    our refspace while not updating refs/heads or refs/remotes. By
>>    storing these refs here, the commit-graph job will update the
>>    commit-graph with the commits from these hidden refs.
> 
> [emily] How does the content of refs/prefetch/* get used?
> [jrnieder] How does the content of refs/prefetch/* get cleaned up?
> [jonathantan] refs/prefetch/* will get used in negotiation so it is
> useful to keep these.
>>
>> 4. --prune will delete the refs/prefetch/<remote> refs that no
>>    longer appear on the remote.
> [jrnieder] this is what cleans up refs/prefetch/* later :)
> 
>> +static int fetch_remote(const char *remote, struct maintenance_opts *opts)
>> +{
>> +	struct child_process child = CHILD_PROCESS_INIT;
>> +
>> +	child.git_cmd = 1;
>> +	strvec_pushl(&child.args, "fetch", remote, "--prune", "--no-tags",
>> +		     "--no-write-fetch-head", "--refmap=", NULL);
> [jonathantan] It would be good to pass --recurse-submodules=no.
> [jrnieder] Since we are specifying the refmap here, if we were to
> recurse into submodules, we'd have to be careful about making sure
> refmap gets propagated correctly.

Good point.

>> +static int fill_each_remote(struct remote *remote, void *cbdata)
> [jrnieder] Since this is a callback that happens for each remote, maybe
> this should be named to indicate it only fills one remote at a time
> instead. Nit :)

Sure. append_remote() is better.

>> +{
>> +	struct string_list *remotes = (struct string_list *)cbdata;
>> +
>> +	string_list_append(remotes, remote->name);
>> +	return 0;
>> +}
>> +
>> +static int maintenance_task_prefetch(struct maintenance_opts *opts)
>> +{
>> +	int result = 0;
>> +	struct string_list_item *item;
>> +	struct string_list remotes = STRING_LIST_INIT_DUP;
>> +
>> +	if (for_each_remote(fill_each_remote, &remotes)) {
>> +		error(_("failed to fill remotes"));
>> +		result = 1;
>> +		goto cleanup;
>> +	}
>> +
>> +	for (item = remotes.items;
>> +	     item && item < remotes.items + remotes.nr;
>> +	     item++)
> 
> Is there a reason not to use for_each_string_list_item() instead? This
> would be more brief and also easier to read (less thinking about what
> the loop is doing).

I just keep forgetting about that macro. Thanks.

>> +		result |= fetch_remote(item->string, opts);
>> +
>> +cleanup:
>> +	string_list_clear(&remotes, 0);
>> +	return result;
>> +}
>> +
> 
>>  enum maintenance_task_label {
>> +	TASK_PREFETCH,
> [jrnieder] Nit: Is there a sort order for these? Should we establish an order
> early on (e.g. alphabetical)?

There _is_ an order, and maybe that should be up for debate.
This one is really about fetching new refs before (possibly)
deleting unreachable data or doing a full repack in the 'gc'
task. The rest of the tasks are after the 'gc' task (in an
unimportant order) because they are likely to be no-ops if the
'gc' task does work.

Of course, users can customize this order using the
--task=<task> options.

>>  	TASK_GC,
>>  	TASK_COMMIT_GRAPH,
> 
>> +test_expect_success 'prefetch multiple remotes' '
>> +	git clone . clone1 &&
>> +	git clone . clone2 &&
>> +	git remote add remote1 "file://$(pwd)/clone1" &&
>> +	git remote add remote2 "file://$(pwd)/clone2" &&
>> +	git -C clone1 switch -c one &&
>> +	git -C clone2 switch -c two &&
>> +	test_commit -C clone1 one &&
>> +	test_commit -C clone2 two &&
>> +	GIT_TRACE2_EVENT="$(pwd)/run-prefetch.txt" git maintenance run --task=prefetch 2>/dev/null &&
>> +	fetchargs="--prune --no-tags --no-write-fetch-head --refmap= --quiet" &&
>> +	test_subcommand git fetch remote1 $fetchargs +refs/heads/\\*:refs/prefetch/remote1/\\* <run-prefetch.txt &&

> [jrnieder] In practice, why were all the \\ needed? Trying to figure out
> where Git is using a shell that would need the * escaped and finding it
> hard to reason about.

It's the 'grep' way down inside test_subcommand that will
fail if it interprets these '*' incorrectly.

>> +	test_subcommand git fetch remote2 $fetchargs +refs/heads/\\*:refs/prefetch/remote2/\\* <run-prefetch.txt &&
>> +	test_path_is_missing .git/refs/remotes &&
>> +	test_cmp clone1/.git/refs/heads/one .git/refs/prefetch/remote1/one &&
>> +	test_cmp clone2/.git/refs/heads/two .git/refs/prefetch/remote2/two &&

> [jrnieder] Should we use test_cmp_rev instead to make compatible with
> packed-refs?

While that would be good, these are refs across two .git directories.

I suppose that I could fetch normally and check refs/remotes/remote1/one
versus refs/prefetch/remote1/one.

>> +	git log prefetch/remote1/one &&
>> +	git log prefetch/remote2/two

> [jonathantan] Why do we use 'git log' to check? I'm a little confused
> about what's going on; if you just want to check that the refs are
> present you could use 'git rev-parse' instead?

The main point is to demonstrate that we have the actual objects
in the current repository, not just refs storing object IDs that
don't actually exist in the object database. Does rev-parse check
that the object exists? (A quick check in my local repository seems
to say no.)

Reorganizing the test to work with test_cmp_rev results in the
following:

test_expect_success 'prefetch multiple remotes' '
	git clone . clone1 &&
	git clone . clone2 &&
	git remote add remote1 "file://$(pwd)/clone1" &&
	git remote add remote2 "file://$(pwd)/clone2" &&
	git -C clone1 switch -c one &&
	git -C clone2 switch -c two &&
	test_commit -C clone1 one &&
	test_commit -C clone2 two &&
	GIT_TRACE2_EVENT="$(pwd)/run-prefetch.txt" git maintenance run --task=prefetch 2>/dev/null &&
	fetchargs="--prune --no-tags --no-write-fetch-head --recurse-submodules=no --refmap= --quiet" &&
	test_subcommand git fetch remote1 $fetchargs +refs/heads/\\*:refs/prefetch/remote1/\\* <run-prefetch.txt &&
	test_subcommand git fetch remote2 $fetchargs +refs/heads/\\*:refs/prefetch/remote2/\\* <run-prefetch.txt &&
	test_path_is_missing .git/refs/remotes &&
	git log prefetch/remote1/one &&
	git log prefetch/remote2/two &&
	git fetch --all &&
	test_cmp_rev refs/remotes/remote1/one refs/prefetch/remote1/one &&
	test_cmp_rev refs/remotes/remote2/two refs/prefetch/remote2/two
'

Thanks,
-Stolee
diff mbox series

Patch

diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index e1a2a8902c..bb0d5eded4 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -57,6 +57,21 @@  since it will not expire `.graph` files that were in the previous
 `commit-graph-chain` file. They will be deleted by a later run based on
 the expiration delay.
 
+prefetch::
+	The `prefetch` task updates the object directory with the latest
+	objects from all registered remotes. For each remote, a `git fetch`
+	command is run. The refmap is custom to avoid updating local or remote
+	branches (those in `refs/heads` or `refs/remotes`). Instead, the
+	remote refs are stored in `refs/prefetch/<remote>/`. Also, tags are
+	not updated.
++
+This is done to avoid disrupting the remote-tracking branches. The end users
+expect these refs to stay unmoved unless they initiate a fetch.  With prefetch
+task, however, the objects necessary to complete a later real fetch would
+already be obtained, so the real fetch would go faster.  In the ideal case,
+it will just become an update to bunch of remote-tracking branches without
+any object transfer.
+
 gc::
 	Cleanup unnecessary files and optimize the local repository. "GC"
 	stands for "garbage collection," but this task performs many
diff --git a/builtin/gc.c b/builtin/gc.c
index 8162bca974..942b7ea535 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -29,6 +29,7 @@ 
 #include "tree.h"
 #include "promisor-remote.h"
 #include "refs.h"
+#include "remote.h"
 
 #define FAILED_RUN "failed to run %s"
 
@@ -843,6 +844,52 @@  static int maintenance_task_commit_graph(struct maintenance_opts *opts)
 	return 1;
 }
 
+static int fetch_remote(const char *remote, struct maintenance_opts *opts)
+{
+	struct child_process child = CHILD_PROCESS_INIT;
+
+	child.git_cmd = 1;
+	strvec_pushl(&child.args, "fetch", remote, "--prune", "--no-tags",
+		     "--no-write-fetch-head", "--refmap=", NULL);
+
+	if (opts->quiet)
+		strvec_push(&child.args, "--quiet");
+
+	strvec_pushf(&child.args, "+refs/heads/*:refs/prefetch/%s/*", remote);
+
+	return !!run_command(&child);
+}
+
+static int fill_each_remote(struct remote *remote, void *cbdata)
+{
+	struct string_list *remotes = (struct string_list *)cbdata;
+
+	string_list_append(remotes, remote->name);
+	return 0;
+}
+
+static int maintenance_task_prefetch(struct maintenance_opts *opts)
+{
+	int result = 0;
+	struct string_list_item *item;
+	struct string_list remotes = STRING_LIST_INIT_DUP;
+
+	if (for_each_remote(fill_each_remote, &remotes)) {
+		error(_("failed to fill remotes"));
+		result = 1;
+		goto cleanup;
+	}
+
+	for (item = remotes.items;
+	     item && item < remotes.items + remotes.nr;
+	     item++)
+		result |= fetch_remote(item->string, opts);
+
+cleanup:
+	string_list_clear(&remotes, 0);
+	return result;
+}
+
 static int maintenance_task_gc(struct maintenance_opts *opts)
 {
 	struct child_process child = CHILD_PROCESS_INIT;
@@ -880,6 +927,7 @@  struct maintenance_task {
 };
 
 enum maintenance_task_label {
+	TASK_PREFETCH,
 	TASK_GC,
 	TASK_COMMIT_GRAPH,
 
@@ -888,6 +936,10 @@  enum maintenance_task_label {
 };
 
 static struct maintenance_task tasks[] = {
+	[TASK_PREFETCH] = {
+		"prefetch",
+		maintenance_task_prefetch,
+	},
 	[TASK_GC] = {
 		"gc",
 		maintenance_task_gc,
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index 406dc7c303..3dd99ef660 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -58,4 +58,29 @@  test_expect_success 'run --task duplicate' '
 	test_i18ngrep "cannot be selected multiple times" err
 '
 
+test_expect_success 'run --task=prefetch with no remotes' '
+	git maintenance run --task=prefetch 2>err &&
+	test_must_be_empty err
+'
+
+test_expect_success 'prefetch multiple remotes' '
+	git clone . clone1 &&
+	git clone . clone2 &&
+	git remote add remote1 "file://$(pwd)/clone1" &&
+	git remote add remote2 "file://$(pwd)/clone2" &&
+	git -C clone1 switch -c one &&
+	git -C clone2 switch -c two &&
+	test_commit -C clone1 one &&
+	test_commit -C clone2 two &&
+	GIT_TRACE2_EVENT="$(pwd)/run-prefetch.txt" git maintenance run --task=prefetch 2>/dev/null &&
+	fetchargs="--prune --no-tags --no-write-fetch-head --refmap= --quiet" &&
+	test_subcommand git fetch remote1 $fetchargs +refs/heads/\\*:refs/prefetch/remote1/\\* <run-prefetch.txt &&
+	test_subcommand git fetch remote2 $fetchargs +refs/heads/\\*:refs/prefetch/remote2/\\* <run-prefetch.txt &&
+	test_path_is_missing .git/refs/remotes &&
+	test_cmp clone1/.git/refs/heads/one .git/refs/prefetch/remote1/one &&
+	test_cmp clone2/.git/refs/heads/two .git/refs/prefetch/remote2/two &&
+	git log prefetch/remote1/one &&
+	git log prefetch/remote2/two
+'
+
 test_done