fetch: remove fetch_if_missing=0

Message ID	20191101203830.231676-1-jonathantanmy@google.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=gx+g=YZ=vger.kernel.org=git-owner@kernel.org> Date: Fri, 1 Nov 2019 13:38:30 -0700 Message-Id: <20191101203830.231676-1-jonathantanmy@google.com> Mime-Version: 1.0 Subject: [PATCH] fetch: remove fetch_if_missing=0 From: Jonathan Tan <jonathantanmy@google.com> To: git@vger.kernel.org Cc: Jonathan Tan <jonathantanmy@google.com>, gitster@pobox.com Content-Type: text/plain; charset="UTF-8" Sender: git-owner@vger.kernel.org Precedence: bulk
Series	fetch: remove fetch_if_missing=0 \| expand fetch: remove fetch_if_missing=0

Jonathan Tan Nov. 1, 2019, 8:38 p.m. UTC

In fetch_pack() (and all functions it calls), pass
OBJECT_INFO_SKIP_FETCH_OBJECT whenever we query an object that could be
a tree or blob that we do not want to be lazy-fetched even if it is
absent. Thus, the only lazy-fetches occurring for trees and blobs are
when resolving deltas.

Thus, we can remove fetch_if_missing=0 from builtin/fetch.c. Remove
this, and also add a test ensuring that such objects are not
lazy-fetched. (We might be able to remove fetch_if_missing=0 from other
places too, but I have limited myself to builtin/fetch.c in this commit
because I have not written tests for the other commands yet.)

Note that commits and tags may still be lazy-fetched. I limited myself
to objects that could be trees or blobs here because Git does not
support creating such commit- and tag-excluding clones yet, and even if
such a clone were manually created, Git does not have good support for
fetching a single commit (when fetching a commit, it and all its
ancestors would be sent).

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
I've verified that this also solves the bug explained in:
https://public-inbox.org/git/20191007181825.13463-1-jonathantanmy@google.com/

Quoting what I wrote from there:

> (Alternatively, we
> could add the equivalent of OBJECT_INFO_SKIP_FETCH_OBJECT to functions
> like parse_commit() that are used by files like negotiator/default.c, or
> split up commit parsing into object reading - which already has that
> flag - and commit parsing.)

This commit adds OBJECT_INFO_SKIP_FETCH_OBJECT to functions, but not the
parse_commit() mentioned - as stated above, this commit only handles
objects that could be trees or blobs.
---
 builtin/fetch.c          |  5 ++-
 fetch-pack.c             |  3 +-
 t/t5616-partial-clone.sh | 69 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 73 insertions(+), 4 deletions(-)

Jonathan Nieder Nov. 1, 2019, 10:05 p.m. UTC | #1

Hi,

Jonathan Tan wrote:

> In fetch_pack() (and all functions it calls), pass
> OBJECT_INFO_SKIP_FETCH_OBJECT whenever we query an object that could be
> a tree or blob that we do not want to be lazy-fetched even if it is
> absent. Thus, the only lazy-fetches occurring for trees and blobs are
> when resolving deltas.
>
> Thus, we can remove fetch_if_missing=0 from builtin/fetch.c. Remove
> this, and also add a test ensuring that such objects are not
> lazy-fetched. (We might be able to remove fetch_if_missing=0 from other
> places too, but I have limited myself to builtin/fetch.c in this commit
> because I have not written tests for the other commands yet.)

Hooray!  Thanks much, this looks easier to maintain.

> Note that commits and tags may still be lazy-fetched. I limited myself
> to objects that could be trees or blobs here because Git does not
> support creating such commit- and tag-excluding clones yet, and even if
> such a clone were manually created, Git does not have good support for
> fetching a single commit (when fetching a commit, it and all its
> ancestors would be sent).

Is there a place we could put a NEEDSWORK comment to avoid confusion
when debugging if this gets introduced later?

Even if not, this seems like a sensible choice.

> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> I've verified that this also solves the bug explained in:
> https://public-inbox.org/git/20191007181825.13463-1-jonathantanmy@google.com/

Might be worth mentioning the example from there in the commit message
as well, to help explain the context behind the change.

I would still be in favor of applying that more conservative change to
"master", even this late in the -rc cycle.

[...]
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -1074,7 +1074,8 @@ static int check_exist_and_connected(struct ref *ref_map)
>  	 * we need all direct targets to exist.
>  	 */
>  	for (r = rm; r; r = r->next) {
> -		if (!has_object_file(&r->old_oid))
> +		if (!has_object_file_with_flags(&r->old_oid,
> +						OBJECT_INFO_SKIP_FETCH_OBJECT))

Yes.

[...]
> @@ -1755,8 +1756,6 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
>  
>  	packet_trace_identity("fetch");
>  
> -	fetch_if_missing = 0;
> -

This is the scary part, but in an "uncomfortably exciting" sense rather
than a worrying one.  Thanks for adding a test.

[...]
> --- a/fetch-pack.c
> +++ b/fetch-pack.c
> @@ -673,7 +673,8 @@ static void mark_complete_and_common_ref(struct fetch_negotiator *negotiator,
>  		struct object *o;
>  
>  		if (!has_object_file_with_flags(&ref->old_oid,
> -						OBJECT_INFO_QUICK))
> +						OBJECT_INFO_QUICK |
> +							OBJECT_INFO_SKIP_FETCH_OBJECT))

Should we make OBJECT_INFO_QUICK always imply
OBJECT_INFO_SKIP_FETCH_OBJECT?  I would suspect that if we are willing to
avoid checking thoroughly locally, checking remotely would be even more
undesirable.

[...]
> --- a/t/t5616-partial-clone.sh
> +++ b/t/t5616-partial-clone.sh
> @@ -296,6 +296,75 @@ test_expect_success 'partial clone with unresolvable sparse filter fails cleanly
>  	test_i18ngrep "unable to parse sparse filter data in" err
>  '
>  
> +setup_triangle () {
> +	rm -rf big-blob.txt server client promisor-remote &&
> +
> +	touch big-blob.txt &&

Tests seem to prefer spelling this as

	>big-blob.txt &&

because that specifes the content of the file.

> +	for i in $(seq 1 100)
> +	do
> +		echo line $i >>big-blob.txt
> +	done &&

Should this use test_seq for better portability?

nit: can avoid a subshell:

	test_seq 1 100 | sed -e 's/^/line /' >big-blob.txt

[...]
> +test_expect_success 'fetch lazy-fetches only to resolve deltas' '
> +	setup_triangle &&
> +
> +	# Exercise to make sure it works. Git will not fetch anything from the
> +	# promisor remote other than for the big blob (because it needs to
> +	# resolve the delta).
> +	GIT_TRACE_PACKET="$(pwd)/trace" git -C client \
> +		fetch "file://$(pwd)/server" master &&
> +
> +	# Verify the assumption that the client needed to fetch the delta base
> +	# to resolve the delta.
> +	git hash-object big-blob.txt >hash &&
> +	grep "want $(cat hash)" trace

nit: can avoid using cat:

	hash=$(git hash-object big-blob.txt) &&
	grep "want $hash" trace

Thanks and hope that helps,
Jonathan

Junio C Hamano Nov. 2, 2019, 5:55 a.m. UTC | #2

Jonathan Nieder <jrnieder@gmail.com> writes:

>> @@ -1755,8 +1756,6 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
>>  
>>  	packet_trace_identity("fetch");
>>  
>> -	fetch_if_missing = 0;
>> -
>
> This is the scary part, but in an "uncomfortably exciting" sense rather
> than a worrying one.  Thanks for adding a test.
>
> [...]
>> --- a/fetch-pack.c
>> +++ b/fetch-pack.c
>> @@ -673,7 +673,8 @@ static void mark_complete_and_common_ref(struct fetch_negotiator *negotiator,
>>  		struct object *o;
>>  
>>  		if (!has_object_file_with_flags(&ref->old_oid,
>> -						OBJECT_INFO_QUICK))
>> +						OBJECT_INFO_QUICK |
>> +							OBJECT_INFO_SKIP_FETCH_OBJECT))
>
> Should we make OBJECT_INFO_QUICK always imply
> OBJECT_INFO_SKIP_FETCH_OBJECT?  I would suspect that if we are willing to
> avoid checking thoroughly locally, checking remotely would be even more
> undesirable.

I think I've seen this mentioned a few times during this cycle in
the list archive ;-)

>> +	for i in $(seq 1 100)
>> +	do
>> +		echo line $i >>big-blob.txt
>> +	done &&
>
> Should this use test_seq for better portability?

Yup.

> nit: can avoid a subshell:
>
> 	test_seq 1 100 | sed -e 's/^/line /' >big-blob.txt

Yeah, but it costs process start-up and "sed" that may be rather
heavyweight.  At least

	for i in $(test_seq ...)
	do
		echo line $i
	done >big-blob.txt

would save repeated opening and closing the file, I'd think.

>> +	git hash-object big-blob.txt >hash &&
>> +	grep "want $(cat hash)" trace
>
> nit: can avoid using cat:
>
> 	hash=$(git hash-object big-blob.txt) &&
> 	grep "want $hash" trace
>
> Thanks and hope that helps,

Thanks, both.

Junio C Hamano Nov. 2, 2019, 5:59 a.m. UTC | #3

Jonathan Nieder <jrnieder@gmail.com> writes:

> I would still be in favor of applying that more conservative change to
> "master", even this late in the -rc cycle.

I believe <20191023233403.145457-1-jonathantanmy@google.com> is the
latest incarnation of the "conservative" approach.  I think I can
buy that, but let me see if there are interactions with other topics
first.

Eric Sunshine Nov. 2, 2019, 6:11 a.m. UTC | #4

On Sat, Nov 2, 2019 at 1:55 AM Junio C Hamano <gitster@pobox.com> wrote:
> Jonathan Nieder <jrnieder@gmail.com> writes:
> > nit: can avoid a subshell:
> >
> >       test_seq 1 100 | sed -e 's/^/line /' >big-blob.txt
>
> Yeah, but it costs process start-up and "sed" that may be rather
> heavyweight.  At least
>
>         for i in $(test_seq ...)
>         do
>                 echo line $i
>         done >big-blob.txt
>
> would save repeated opening and closing the file, I'd think.

More bikeshedding:

    printf "line %d\n" $(test_seq 1 100) >big-blob.txt

is reasonably concise, perhaps easier to grok than the one-liner
incorporating 'sed', and shouldn't run afoul of command-line length
limitation.

Jonathan Tan Nov. 5, 2019, 6:53 p.m. UTC | #5

Thanks for your review.

> > Note that commits and tags may still be lazy-fetched. I limited myself
> > to objects that could be trees or blobs here because Git does not
> > support creating such commit- and tag-excluding clones yet, and even if
> > such a clone were manually created, Git does not have good support for
> > fetching a single commit (when fetching a commit, it and all its
> > ancestors would be sent).
> 
> Is there a place we could put a NEEDSWORK comment to avoid confusion
> when debugging if this gets introduced later?
> 
> Even if not, this seems like a sensible choice.

Done in the test.

> > I've verified that this also solves the bug explained in:
> > https://public-inbox.org/git/20191007181825.13463-1-jonathantanmy@google.com/
> 
> Might be worth mentioning the example from there in the commit message
> as well, to help explain the context behind the change.
> 
> I would still be in favor of applying that more conservative change to
> "master", even this late in the -rc cycle.

If we're applying that change first, then this no longer fixes any bug,
but is just a code cleanup. (If we don't apply that change, then I'll
include that example in the commit message.)

> Should we make OBJECT_INFO_QUICK always imply
> OBJECT_INFO_SKIP_FETCH_OBJECT?  I would suspect that if we are willing to
> avoid checking thoroughly locally, checking remotely would be even more
> undesirable.

As I wrote in [1], the implication does not really go both ways. I think
it's better to keep them separate. (Or, at least, we don't need to make
the decision in this patch.)

[1] https://public-inbox.org/git/20191011220822.154063-1-jonathantanmy@google.com/

> > +test_expect_success 'fetch lazy-fetches only to resolve deltas' '
> > +	setup_triangle &&
> > +
> > +	# Exercise to make sure it works. Git will not fetch anything from the
> > +	# promisor remote other than for the big blob (because it needs to
> > +	# resolve the delta).
> > +	GIT_TRACE_PACKET="$(pwd)/trace" git -C client \
> > +		fetch "file://$(pwd)/server" master &&
> > +
> > +	# Verify the assumption that the client needed to fetch the delta base
> > +	# to resolve the delta.
> > +	git hash-object big-blob.txt >hash &&
> > +	grep "want $(cat hash)" trace
> 
> nit: can avoid using cat:
> 
> 	hash=$(git hash-object big-blob.txt) &&
> 	grep "want $hash" trace

I think it's less error-prone if we always have a "git" command on its
own on a line, to avoid losing its error code. When piped into another
invocation, or when command-substituted into an argument (e.g. "echo
$(git hash-object foo)"), we lose its error code.

Jonathan Nieder Nov. 5, 2019, 6:58 p.m. UTC | #6

Jonathan Tan wrote:

> I think it's less error-prone if we always have a "git" command on its
> own on a line, to avoid losing its error code. When piped into another
> invocation, or when command-substituted into an argument (e.g. "echo
> $(git hash-object foo)"), we lose its error code.

Sure, but assignment is a special case:

	if myvar=$(false)
	then
		echo yes
	else
		echo no
	fi

prints "no".

See "Don't use command substitution in a way that discards git's exit
code" in t/README for more details.

Thanks,
Jonathan

fetch: remove fetch_if_missing=0

Commit Message

Comments

Patch