diff mbox series

fetch: remove fetch_if_missing=0

Message ID 20191101203830.231676-1-jonathantanmy@google.com (mailing list archive)
State New, archived
Headers show
Series fetch: remove fetch_if_missing=0 | expand

Commit Message

Jonathan Tan Nov. 1, 2019, 8:38 p.m. UTC
In fetch_pack() (and all functions it calls), pass
OBJECT_INFO_SKIP_FETCH_OBJECT whenever we query an object that could be
a tree or blob that we do not want to be lazy-fetched even if it is
absent. Thus, the only lazy-fetches occurring for trees and blobs are
when resolving deltas.

Thus, we can remove fetch_if_missing=0 from builtin/fetch.c. Remove
this, and also add a test ensuring that such objects are not
lazy-fetched. (We might be able to remove fetch_if_missing=0 from other
places too, but I have limited myself to builtin/fetch.c in this commit
because I have not written tests for the other commands yet.)

Note that commits and tags may still be lazy-fetched. I limited myself
to objects that could be trees or blobs here because Git does not
support creating such commit- and tag-excluding clones yet, and even if
such a clone were manually created, Git does not have good support for
fetching a single commit (when fetching a commit, it and all its
ancestors would be sent).

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
I've verified that this also solves the bug explained in:
https://public-inbox.org/git/20191007181825.13463-1-jonathantanmy@google.com/

Quoting what I wrote from there:

> (Alternatively, we
> could add the equivalent of OBJECT_INFO_SKIP_FETCH_OBJECT to functions
> like parse_commit() that are used by files like negotiator/default.c, or
> split up commit parsing into object reading - which already has that
> flag - and commit parsing.)

This commit adds OBJECT_INFO_SKIP_FETCH_OBJECT to functions, but not the
parse_commit() mentioned - as stated above, this commit only handles
objects that could be trees or blobs.
---
 builtin/fetch.c          |  5 ++-
 fetch-pack.c             |  3 +-
 t/t5616-partial-clone.sh | 69 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 73 insertions(+), 4 deletions(-)

Comments

Jonathan Nieder Nov. 1, 2019, 10:05 p.m. UTC | #1
Hi,

Jonathan Tan wrote:

> In fetch_pack() (and all functions it calls), pass
> OBJECT_INFO_SKIP_FETCH_OBJECT whenever we query an object that could be
> a tree or blob that we do not want to be lazy-fetched even if it is
> absent. Thus, the only lazy-fetches occurring for trees and blobs are
> when resolving deltas.
>
> Thus, we can remove fetch_if_missing=0 from builtin/fetch.c. Remove
> this, and also add a test ensuring that such objects are not
> lazy-fetched. (We might be able to remove fetch_if_missing=0 from other
> places too, but I have limited myself to builtin/fetch.c in this commit
> because I have not written tests for the other commands yet.)

Hooray!  Thanks much, this looks easier to maintain.

> Note that commits and tags may still be lazy-fetched. I limited myself
> to objects that could be trees or blobs here because Git does not
> support creating such commit- and tag-excluding clones yet, and even if
> such a clone were manually created, Git does not have good support for
> fetching a single commit (when fetching a commit, it and all its
> ancestors would be sent).

Is there a place we could put a NEEDSWORK comment to avoid confusion
when debugging if this gets introduced later?

Even if not, this seems like a sensible choice.

> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> I've verified that this also solves the bug explained in:
> https://public-inbox.org/git/20191007181825.13463-1-jonathantanmy@google.com/

Might be worth mentioning the example from there in the commit message
as well, to help explain the context behind the change.

I would still be in favor of applying that more conservative change to
"master", even this late in the -rc cycle.

[...]
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -1074,7 +1074,8 @@ static int check_exist_and_connected(struct ref *ref_map)
>  	 * we need all direct targets to exist.
>  	 */
>  	for (r = rm; r; r = r->next) {
> -		if (!has_object_file(&r->old_oid))
> +		if (!has_object_file_with_flags(&r->old_oid,
> +						OBJECT_INFO_SKIP_FETCH_OBJECT))

Yes.

[...]
> @@ -1755,8 +1756,6 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
>  
>  	packet_trace_identity("fetch");
>  
> -	fetch_if_missing = 0;
> -

This is the scary part, but in an "uncomfortably exciting" sense rather
than a worrying one.  Thanks for adding a test.

[...]
> --- a/fetch-pack.c
> +++ b/fetch-pack.c
> @@ -673,7 +673,8 @@ static void mark_complete_and_common_ref(struct fetch_negotiator *negotiator,
>  		struct object *o;
>  
>  		if (!has_object_file_with_flags(&ref->old_oid,
> -						OBJECT_INFO_QUICK))
> +						OBJECT_INFO_QUICK |
> +							OBJECT_INFO_SKIP_FETCH_OBJECT))

Should we make OBJECT_INFO_QUICK always imply
OBJECT_INFO_SKIP_FETCH_OBJECT?  I would suspect that if we are willing to
avoid checking thoroughly locally, checking remotely would be even more
undesirable.

[...]
> --- a/t/t5616-partial-clone.sh
> +++ b/t/t5616-partial-clone.sh
> @@ -296,6 +296,75 @@ test_expect_success 'partial clone with unresolvable sparse filter fails cleanly
>  	test_i18ngrep "unable to parse sparse filter data in" err
>  '
>  
> +setup_triangle () {
> +	rm -rf big-blob.txt server client promisor-remote &&
> +
> +	touch big-blob.txt &&

Tests seem to prefer spelling this as

	>big-blob.txt &&

because that specifes the content of the file.

> +	for i in $(seq 1 100)
> +	do
> +		echo line $i >>big-blob.txt
> +	done &&

Should this use test_seq for better portability?

nit: can avoid a subshell:

	test_seq 1 100 | sed -e 's/^/line /' >big-blob.txt

[...]
> +test_expect_success 'fetch lazy-fetches only to resolve deltas' '
> +	setup_triangle &&
> +
> +	# Exercise to make sure it works. Git will not fetch anything from the
> +	# promisor remote other than for the big blob (because it needs to
> +	# resolve the delta).
> +	GIT_TRACE_PACKET="$(pwd)/trace" git -C client \
> +		fetch "file://$(pwd)/server" master &&
> +
> +	# Verify the assumption that the client needed to fetch the delta base
> +	# to resolve the delta.
> +	git hash-object big-blob.txt >hash &&
> +	grep "want $(cat hash)" trace

nit: can avoid using cat:

	hash=$(git hash-object big-blob.txt) &&
	grep "want $hash" trace

Thanks and hope that helps,
Jonathan
Junio C Hamano Nov. 2, 2019, 5:55 a.m. UTC | #2
Jonathan Nieder <jrnieder@gmail.com> writes:

>> @@ -1755,8 +1756,6 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
>>  
>>  	packet_trace_identity("fetch");
>>  
>> -	fetch_if_missing = 0;
>> -
>
> This is the scary part, but in an "uncomfortably exciting" sense rather
> than a worrying one.  Thanks for adding a test.
>
> [...]
>> --- a/fetch-pack.c
>> +++ b/fetch-pack.c
>> @@ -673,7 +673,8 @@ static void mark_complete_and_common_ref(struct fetch_negotiator *negotiator,
>>  		struct object *o;
>>  
>>  		if (!has_object_file_with_flags(&ref->old_oid,
>> -						OBJECT_INFO_QUICK))
>> +						OBJECT_INFO_QUICK |
>> +							OBJECT_INFO_SKIP_FETCH_OBJECT))
>
> Should we make OBJECT_INFO_QUICK always imply
> OBJECT_INFO_SKIP_FETCH_OBJECT?  I would suspect that if we are willing to
> avoid checking thoroughly locally, checking remotely would be even more
> undesirable.

I think I've seen this mentioned a few times during this cycle in
the list archive ;-)

>> +	for i in $(seq 1 100)
>> +	do
>> +		echo line $i >>big-blob.txt
>> +	done &&
>
> Should this use test_seq for better portability?

Yup.

> nit: can avoid a subshell:
>
> 	test_seq 1 100 | sed -e 's/^/line /' >big-blob.txt

Yeah, but it costs process start-up and "sed" that may be rather
heavyweight.  At least

	for i in $(test_seq ...)
	do
		echo line $i
	done >big-blob.txt

would save repeated opening and closing the file, I'd think.

>> +	git hash-object big-blob.txt >hash &&
>> +	grep "want $(cat hash)" trace
>
> nit: can avoid using cat:
>
> 	hash=$(git hash-object big-blob.txt) &&
> 	grep "want $hash" trace
>
> Thanks and hope that helps,

Thanks, both.
Junio C Hamano Nov. 2, 2019, 5:59 a.m. UTC | #3
Jonathan Nieder <jrnieder@gmail.com> writes:

> I would still be in favor of applying that more conservative change to
> "master", even this late in the -rc cycle.

I believe <20191023233403.145457-1-jonathantanmy@google.com> is the
latest incarnation of the "conservative" approach.  I think I can
buy that, but let me see if there are interactions with other topics
first.
Eric Sunshine Nov. 2, 2019, 6:11 a.m. UTC | #4
On Sat, Nov 2, 2019 at 1:55 AM Junio C Hamano <gitster@pobox.com> wrote:
> Jonathan Nieder <jrnieder@gmail.com> writes:
> > nit: can avoid a subshell:
> >
> >       test_seq 1 100 | sed -e 's/^/line /' >big-blob.txt
>
> Yeah, but it costs process start-up and "sed" that may be rather
> heavyweight.  At least
>
>         for i in $(test_seq ...)
>         do
>                 echo line $i
>         done >big-blob.txt
>
> would save repeated opening and closing the file, I'd think.

More bikeshedding:

    printf "line %d\n" $(test_seq 1 100) >big-blob.txt

is reasonably concise, perhaps easier to grok than the one-liner
incorporating 'sed', and shouldn't run afoul of command-line length
limitation.
Jonathan Tan Nov. 5, 2019, 6:53 p.m. UTC | #5
Thanks for your review.

> > Note that commits and tags may still be lazy-fetched. I limited myself
> > to objects that could be trees or blobs here because Git does not
> > support creating such commit- and tag-excluding clones yet, and even if
> > such a clone were manually created, Git does not have good support for
> > fetching a single commit (when fetching a commit, it and all its
> > ancestors would be sent).
> 
> Is there a place we could put a NEEDSWORK comment to avoid confusion
> when debugging if this gets introduced later?
> 
> Even if not, this seems like a sensible choice.

Done in the test.

> > I've verified that this also solves the bug explained in:
> > https://public-inbox.org/git/20191007181825.13463-1-jonathantanmy@google.com/
> 
> Might be worth mentioning the example from there in the commit message
> as well, to help explain the context behind the change.
> 
> I would still be in favor of applying that more conservative change to
> "master", even this late in the -rc cycle.

If we're applying that change first, then this no longer fixes any bug,
but is just a code cleanup. (If we don't apply that change, then I'll
include that example in the commit message.)

> Should we make OBJECT_INFO_QUICK always imply
> OBJECT_INFO_SKIP_FETCH_OBJECT?  I would suspect that if we are willing to
> avoid checking thoroughly locally, checking remotely would be even more
> undesirable.

As I wrote in [1], the implication does not really go both ways. I think
it's better to keep them separate. (Or, at least, we don't need to make
the decision in this patch.)

[1] https://public-inbox.org/git/20191011220822.154063-1-jonathantanmy@google.com/

> > +test_expect_success 'fetch lazy-fetches only to resolve deltas' '
> > +	setup_triangle &&
> > +
> > +	# Exercise to make sure it works. Git will not fetch anything from the
> > +	# promisor remote other than for the big blob (because it needs to
> > +	# resolve the delta).
> > +	GIT_TRACE_PACKET="$(pwd)/trace" git -C client \
> > +		fetch "file://$(pwd)/server" master &&
> > +
> > +	# Verify the assumption that the client needed to fetch the delta base
> > +	# to resolve the delta.
> > +	git hash-object big-blob.txt >hash &&
> > +	grep "want $(cat hash)" trace
> 
> nit: can avoid using cat:
> 
> 	hash=$(git hash-object big-blob.txt) &&
> 	grep "want $hash" trace

I think it's less error-prone if we always have a "git" command on its
own on a line, to avoid losing its error code. When piped into another
invocation, or when command-substituted into an argument (e.g. "echo
$(git hash-object foo)"), we lose its error code.
Jonathan Nieder Nov. 5, 2019, 6:58 p.m. UTC | #6
Jonathan Tan wrote:

> I think it's less error-prone if we always have a "git" command on its
> own on a line, to avoid losing its error code. When piped into another
> invocation, or when command-substituted into an argument (e.g. "echo
> $(git hash-object foo)"), we lose its error code.

Sure, but assignment is a special case:

	if myvar=$(false)
	then
		echo yes
	else
		echo no
	fi

prints "no".

See "Don't use command substitution in a way that discards git's exit
code" in t/README for more details.

Thanks,
Jonathan
diff mbox series

Patch

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 0c345b5dfe..5ff7367dd7 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1074,7 +1074,8 @@  static int check_exist_and_connected(struct ref *ref_map)
 	 * we need all direct targets to exist.
 	 */
 	for (r = rm; r; r = r->next) {
-		if (!has_object_file(&r->old_oid))
+		if (!has_object_file_with_flags(&r->old_oid,
+						OBJECT_INFO_SKIP_FETCH_OBJECT))
 			return -1;
 	}
 
@@ -1755,8 +1756,6 @@  int cmd_fetch(int argc, const char **argv, const char *prefix)
 
 	packet_trace_identity("fetch");
 
-	fetch_if_missing = 0;
-
 	/* Record the command line for the reflog */
 	strbuf_addstr(&default_rla, "fetch");
 	for (i = 1; i < argc; i++)
diff --git a/fetch-pack.c b/fetch-pack.c
index 0130b44112..37178e2d34 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -673,7 +673,8 @@  static void mark_complete_and_common_ref(struct fetch_negotiator *negotiator,
 		struct object *o;
 
 		if (!has_object_file_with_flags(&ref->old_oid,
-						OBJECT_INFO_QUICK))
+						OBJECT_INFO_QUICK |
+							OBJECT_INFO_SKIP_FETCH_OBJECT))
 			continue;
 		o = parse_object(the_repository, &ref->old_oid);
 		if (!o)
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index 79f7b65f8c..1081ed2de0 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -296,6 +296,75 @@  test_expect_success 'partial clone with unresolvable sparse filter fails cleanly
 	test_i18ngrep "unable to parse sparse filter data in" err
 '
 
+setup_triangle () {
+	rm -rf big-blob.txt server client promisor-remote &&
+
+	touch big-blob.txt &&
+	for i in $(seq 1 100)
+	do
+		echo line $i >>big-blob.txt
+	done &&
+
+	# Create a server with 2 commits: a commit with a big blob and a child
+	# commit with an incremental change. Also, create a partial clone
+	# client that only contains the first commit.
+	git init server &&
+	git -C server config --local uploadpack.allowfilter 1 &&
+	cp big-blob.txt server &&
+	git -C server add big-blob.txt &&
+	git -C server commit -m "initial" &&
+	git clone --bare --filter=tree:0 "file://$(pwd)/server" client &&
+	echo another line >>server/big-blob.txt &&
+	git -C server commit -am "append line to big blob" &&
+
+	# Create a promisor remote that only contains the blob from the first
+	# commit, and set it as the promisor remote of client. Thus, whenever
+	# the client lazy fetches, the lazy fetch will succeed only if it is
+	# for this blob.
+	git init promisor-remote &&
+	test_commit -C promisor-remote one && # so that ref advertisement is not empty
+	git -C promisor-remote config --local uploadpack.allowanysha1inwant 1 &&
+	git -C promisor-remote hash-object -w --stdin <big-blob.txt &&
+	git -C client remote set-url origin "file://$(pwd)/promisor-remote"
+}
+
+test_expect_success 'fetch lazy-fetches only to resolve deltas' '
+	setup_triangle &&
+
+	# Exercise to make sure it works. Git will not fetch anything from the
+	# promisor remote other than for the big blob (because it needs to
+	# resolve the delta).
+	GIT_TRACE_PACKET="$(pwd)/trace" git -C client \
+		fetch "file://$(pwd)/server" master &&
+
+	# Verify the assumption that the client needed to fetch the delta base
+	# to resolve the delta.
+	git hash-object big-blob.txt >hash &&
+	grep "want $(cat hash)" trace
+'
+
+test_expect_success 'fetch lazy-fetches only to resolve deltas, protocol v2' '
+	setup_triangle &&
+
+	git -C server config --local protocol.version 2 &&
+	git -C client config --local protocol.version 2 &&
+	git -C promisor-remote config --local protocol.version 2 &&
+
+	# Exercise to make sure it works. Git will not fetch anything from the
+	# promisor remote other than for the big blob (because it needs to
+	# resolve the delta).
+	GIT_TRACE_PACKET="$(pwd)/trace" git -C client \
+		fetch "file://$(pwd)/server" master &&
+
+	# Verify that protocol version 2 was used.
+	grep "fetch< version 2" trace &&
+
+	# Verify the assumption that the client needed to fetch the delta base
+	# to resolve the delta.
+	git hash-object big-blob.txt >hash &&
+	grep "want $(cat hash)" trace
+'
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd