[v2] push: don't fetch commit object when checking existence

Message ID 20240522201559.1677959-1-tom@compton.nu (mailing list archive)
State Accepted
Commit 6549c41ead833c8d8c4098806a29399433065516
Series [v2] push: don't fetch commit object when checking existence

Commit Message

Tom Hughes May 22, 2024, 8:15 p.m. UTC
If we're checking to see whether to tell the user to do a fetch
before pushing, there's no need for us to actually fetch the object
from the remote if the clone is partial.

Because the promisor doesn't do negotiation, actually trying to
fetch the new head can be very expensive, as it will try to include
history that we already have. It also just results in rejecting the
push with a different message, and in behavior that differs from
that of a clone that is not partial.

Signed-off-by: Tom Hughes <tom@compton.nu>
---
 remote.c                 |  2 +-
 t/t0410-partial-clone.sh | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+), 1 deletion(-)

Comments

Junio C Hamano May 22, 2024, 8:55 p.m. UTC | #1
Tom Hughes <tom@compton.nu> writes:

> diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
> index 88a66f0904..7797391c03 100755
> --- a/t/t0410-partial-clone.sh
> +++ b/t/t0410-partial-clone.sh
> @@ -689,6 +689,25 @@ test_expect_success 'lazy-fetch when accessing object not in the_repository' '
>  	! grep "[?]$FILE_HASH" out
>  '
>  
> +test_expect_success 'push should not fetch new commit objects' '
> +	rm -rf server client &&
> +	test_create_repo server &&
> +	test_config -C server uploadpack.allowfilter 1 &&
> +	test_config -C server uploadpack.allowanysha1inwant 1 &&
> +	test_commit -C server server1 &&

OK, we create the source that allows a partial clone.

> +	git clone --filter=blob:none "file://$(pwd)/server" client &&
> +	test_commit -C client client1 &&

And make a clone out of it, without blobs.

> +	test_commit -C server server2 &&
> +	COMMIT=$(git -C server rev-parse server2) &&

Then we create a new commit that the client does not yet have.

> +	test_must_fail git -C client push 2>err &&

We try to overwrite it.  We expect it to fail with "not a fast forward".

> +	grep "fetch first" err &&

May want to use "test_grep" but this script does not use it, so
being consistent with the surrounding tests is good.

> +	git -C client rev-list --objects --missing=print "$COMMIT" >objects &&
> +	grep "^[?]$COMMIT" objects
> +'

OK.

>  . "$TEST_DIRECTORY"/lib-httpd.sh
>  start_httpd

Looking good.  Thanks, will queue.

Tom Hughes May 22, 2024, 9:46 p.m. UTC | #2
On 22/05/2024 21:55, Junio C Hamano wrote:
> Tom Hughes <tom@compton.nu> writes:
>
>> +test_expect_success 'push should not fetch new commit objects' '
>> +	rm -rf server client &&
>> +	test_create_repo server &&
>> +	test_config -C server uploadpack.allowfilter 1 &&
>> +	test_config -C server uploadpack.allowanysha1inwant 1 &&
>> +	test_commit -C server server1 &&
> 
> OK, we create the source that allows a partial clone.
> 
>> +	git clone --filter=blob:none "file://$(pwd)/server" client &&
>> +	test_commit -C client client1 &&
> 
> And make a clone out of it, without blobs.
> 
>> +	test_commit -C server server2 &&
>> +	COMMIT=$(git -C server rev-parse server2) &&
> 
> Then we create a new commit that the client does not yet have.
> 
>> +	test_must_fail git -C client push 2>err &&
> 
> We try to overwrite it.  We expect it to fail with "not a fast forward".

Well that is what it would fail with at the moment, but it's not
what would happen with a non-partial clone - a non-partial clone
would fail with "fetch first" instead.

This patch makes both cases consistent, although that wasn't the
main driver. The main driver was to stop it fetching 100MB or
more of history in the large repository I was working with when
the upstream has just one new commit.

>> +	grep "fetch first" err &&
> 
> May want to use "test_grep" but this script does not use it, so
> being consistent with the surrounding tests is good.

So here we are testing that it's a "fetch first" rather than
a "not a fast forward".

>> +	git -C client rev-list --objects --missing=print "$COMMIT" >objects &&
>> +	grep "^[?]$COMMIT" objects
>> +'
> 
> OK.

and also that it hasn't fetched the new commit.

Tom

Junio C Hamano May 22, 2024, 9:58 p.m. UTC | #3
Tom Hughes <tom@compton.nu> writes:

>>> +	test_must_fail git -C client push 2>err &&
>> We try to overwrite it.  We expect it to fail with "not a fast
>> forward".
>
> Well that is what it would fail with at the moment, but it's not
> what would happen with a non-partial clone - a non-partial clone
> would fail with "fetch first" instead.

Oh, don't get me wrong.  I wasn't trying to split hairs between the
two error modes and their phrasing.  The "fetch-first" from
set_ref_status_for_push() is done before we even initiate the
transfer to stop the operation, with a cheap check, that will
eventually lead to "not a fast forward" error.  IOW, in my mind,
they are the same errors, just diagnosed at two different places in
the code and their messages phrased differently.

> So here we are testing that it's a "fetch first" rather than
> a "not a fast forward".

I think that is being overly specific, but that is fine.  As I said,
to the end users, these two errors mean the same thing (they would
need to fetch first and then integrate their changes before pushing
it out again), so it is plausible that we may in the future decide
that we want to use the same message.  When it happens, this test
must change, which may even be a good thing (it makes it clear what
the fallout from such a change looks like).

>>> +	git -C client rev-list --objects --missing=print "$COMMIT" >objects &&
>>> +	grep "^[?]$COMMIT" objects
>>> +'
>> OK.
>
> and also that it hasn't fetched the new commit.

Yes, and this is a good check that will stand the test of time, even
across a change to rephrase the error message.

Thanks.
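
To make the ordering discussed above concrete, here is a minimal, self-contained C sketch of the cheap pre-transfer check. It is a toy model, not git's code: the `toy_` names, the struct fields, and the `TOY_REJECT_LATER_CHECK` placeholder are all assumptions made for illustration. Only the order of the first checks (tag, then "do we have the remote's old object") follows the hunk quoted in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for git's REF_STATUS_REJECT_* values. */
enum toy_reject {
	TOY_REJECT_NONE = 0,
	TOY_REJECT_ALREADY_EXISTS,
	TOY_REJECT_FETCH_FIRST,
	TOY_REJECT_LATER_CHECK /* placeholder for checks made after the existence test */
};

/* Hypothetical, simplified view of one ref being pushed. */
struct toy_ref {
	const char *name;      /* e.g. "refs/heads/main" */
	bool deletion;         /* the push deletes the ref */
	bool old_is_null;      /* the remote has no old value for the ref */
	bool have_old_locally; /* the remote's current tip is in our object store */
	bool is_fast_forward;  /* only meaningful when have_old_locally is true */
};

/* Decide, without any network transfer, whether the push can be rejected early. */
static enum toy_reject toy_reject_reason(const struct toy_ref *ref)
{
	assert(ref && ref->name);

	if (ref->deletion || ref->old_is_null)
		return TOY_REJECT_NONE;
	if (!strncmp(ref->name, "refs/tags/", 10))
		return TOY_REJECT_ALREADY_EXISTS;
	if (!ref->have_old_locally)
		return TOY_REJECT_FETCH_FIRST; /* cheap: no object was fetched to learn this */
	return ref->is_fast_forward ? TOY_REJECT_NONE : TOY_REJECT_LATER_CHECK;
}
```

In this model a missing old object short-circuits to "fetch first", so the later, more expensive checks are reached only when the remote's tip is already available locally.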

Jeff King May 23, 2024, 8:58 a.m. UTC | #4
On Wed, May 22, 2024 at 09:15:40PM +0100, Tom Hughes wrote:

> diff --git a/remote.c b/remote.c
> index 2b650b813b..20395bbbd0 100644
> --- a/remote.c
> +++ b/remote.c
> @@ -1773,7 +1773,7 @@ void set_ref_status_for_push(struct ref *remote_refs, int send_mirror,
>  		if (!reject_reason && !ref->deletion && !is_null_oid(&ref->old_oid)) {
>  			if (starts_with(ref->name, "refs/tags/"))
>  				reject_reason = REF_STATUS_REJECT_ALREADY_EXISTS;
> -			else if (!repo_has_object_file(the_repository, &ref->old_oid))
> +			else if (!repo_has_object_file_with_flags(the_repository, &ref->old_oid, OBJECT_INFO_SKIP_FETCH_OBJECT))
>  				reject_reason = REF_STATUS_REJECT_FETCH_FIRST;
>  			else if (!lookup_commit_reference_gently(the_repository, &ref->old_oid, 1) ||
>  				 !lookup_commit_reference_gently(the_repository, &ref->new_oid, 1))

This makes sense to me, as we're just speculatively asking "do we have
the object". I think for that reason it would also be reasonable to use
OBJECT_INFO_QUICK here, which would avoid a fruitless re-scan of the
local objects/ directory. We often pair the two[1].

In practice, though, I think fetching the missing object is going to be
much more expensive than a local re-scan. We tend to notice the latter
only when you have a large number of objects to check, and here we're
basically limited by the number of non-fast-forward refs you're trying
to push.

So I also think it would be OK to leave it here and only do QUICK if
somebody ever notices it.

-Peff

[1] We've talked about unifying those two flags, since they so often
    come together. There's some discussion in:

      https://lore.kernel.org/git/20191011220822.154063-1-jonathantanmy@google.com/

    that they could become one flag, but these two:

      https://lore.kernel.org/git/20190909222101.GB31319@sigill.intra.peff.net/

      https://lore.kernel.org/git/20200322054916.GB578498@coredump.intra.peff.net/

    argue that QUICK implies SKIP_FETCH, but not always the other way
    around. (Obviously getting a bit off topic for your patch; if
    anything, I think this call site would just use both for now).
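
As a rough illustration of how the two flags discussed above differ, the following self-contained C sketch models an existence check against a toy object store. Everything here is hypothetical: the `toy_`/`TOY_` names, the flag values, and the counters are invented for the example, and the lazy fetch is only counted, not performed. The toy also assumes the promisor can always supply the object, which is why an unflagged lookup reports the object as present.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bits; the names echo the thread, the values are made up. */
#define TOY_SKIP_FETCH_OBJECT (1u << 0)
#define TOY_QUICK             (1u << 1)

struct toy_store {
	const char *const *local; /* object ids present locally */
	size_t nr_local;
	bool has_promisor;        /* partial clone with a promisor remote */
	unsigned rescans;         /* times local storage was re-scanned */
	unsigned lazy_fetches;    /* times the promisor was asked */
};

static bool toy_lookup(const struct toy_store *s, const char *oid)
{
	for (size_t i = 0; i < s->nr_local; i++)
		if (!strcmp(s->local[i], oid))
			return true;
	return false;
}

/* "Do we have this object?", with optional re-scan and lazy fetch. */
static bool toy_has_object(struct toy_store *s, const char *oid, unsigned flags)
{
	assert(s && oid);

	if (toy_lookup(s, oid))
		return true;
	if (!(flags & TOY_QUICK)) {
		s->rescans++; /* second look at local storage */
		if (toy_lookup(s, oid))
			return true;
	}
	if (s->has_promisor && !(flags & TOY_SKIP_FETCH_OBJECT)) {
		s->lazy_fetches++; /* the expensive step the patch avoids */
		return true;       /* toy assumption: the promisor supplies it */
	}
	return false;
}
```

With no flags, a miss costs one re-scan plus one lazy fetch. With the skip-fetch flag alone it costs only the re-scan, and with both flags it costs neither, which is the trade-off weighed in the comment above.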

Patch

diff --git a/remote.c b/remote.c
index 2b650b813b..20395bbbd0 100644
--- a/remote.c
+++ b/remote.c
@@ -1773,7 +1773,7 @@ void set_ref_status_for_push(struct ref *remote_refs, int send_mirror,
 		if (!reject_reason && !ref->deletion && !is_null_oid(&ref->old_oid)) {
 			if (starts_with(ref->name, "refs/tags/"))
 				reject_reason = REF_STATUS_REJECT_ALREADY_EXISTS;
-			else if (!repo_has_object_file(the_repository, &ref->old_oid))
+			else if (!repo_has_object_file_with_flags(the_repository, &ref->old_oid, OBJECT_INFO_SKIP_FETCH_OBJECT))
 				reject_reason = REF_STATUS_REJECT_FETCH_FIRST;
 			else if (!lookup_commit_reference_gently(the_repository, &ref->old_oid, 1) ||
 				 !lookup_commit_reference_gently(the_repository, &ref->new_oid, 1))
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 88a66f0904..7797391c03 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -689,6 +689,25 @@ test_expect_success 'lazy-fetch when accessing object not in the_repository' '
 	! grep "[?]$FILE_HASH" out
 '
 
+test_expect_success 'push should not fetch new commit objects' '
+	rm -rf server client &&
+	test_create_repo server &&
+	test_config -C server uploadpack.allowfilter 1 &&
+	test_config -C server uploadpack.allowanysha1inwant 1 &&
+	test_commit -C server server1 &&
+
+	git clone --filter=blob:none "file://$(pwd)/server" client &&
+	test_commit -C client client1 &&
+
+	test_commit -C server server2 &&
+	COMMIT=$(git -C server rev-parse server2) &&
+
+	test_must_fail git -C client push 2>err &&
+	grep "fetch first" err &&
+	git -C client rev-list --objects --missing=print "$COMMIT" >objects &&
+	grep "^[?]$COMMIT" objects
+'
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd