diff mbox series

bundle-uri: refresh packed_git if unbundle succeed

Message ID pull.1730.git.1715742069966.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series bundle-uri: refresh packed_git if unbundle succeed | expand

Commit Message

Xing Xin May 15, 2024, 3:01 a.m. UTC
From: Xing Xin <xingxin.xx@bytedance.com>

When using the bundle-uri mechanism with a bundle list containing
multiple interrelated bundles, we encountered a bug where tips from
downloaded bundles were not being discovered, resulting in rather slow
clones. This was particularly problematic when employing the heuristic
`creationTokens`.

And this is easy to reproduce. Suppose we have a repository with a
single branch `main` pointing to commit `A`, firstly we create a base
bundle with

  git bundle create base.bundle main

Then let's add a new commit `B` on top of `A`, so that an incremental
bundle for `main` can be created with

  git bundle create incr.bundle A..main

Now we can generate a bundle list with the following content:

  [bundle]
      version = 1
      mode = all
      heuristic = creationToken

  [bundle "base"]
      uri = base.bundle
      creationToken = 1

  [bundle "incr"]
      uri = incr.bundle
      creationToken = 2

A fresh clone with the bundle list above would give the expected
`refs/bundles/main` pointing at `B` in new repository, in other words we
already had everything locally from the bundles, but git would still
download everything from server as if we got nothing.

So why the `refs/bundles/main` is not discovered? After some digging I
found that:

1. when unbundling a downloaded bundle, a `verify_bundle` is called to
   check its prerequisites if any. The verify procedure would find oids
   so `packed_git` is initialized.

2. after unbundled all bundles, we would enter `do_fetch_pack_v2`,
   during which `mark_complete_and_common_ref` and `mark_tips` would
   find oids with `OBJECT_INFO_QUICK` flag set, so no new packs would be
   enlisted if `packed_git` has already initialized in 1.

Back to the example above, when unbunding `incr.bundle`, `base.pack` is
enlisted to `packed_git` bacause of the prerequisites to verify. Then we
can not find `B` for negotiation at a latter time bacause `B` exists in
`incr.pack` which is not enlisted in `packed_git`.

This commit fixes this by adding a `reprepare_packed_git` call for every
successfully unbundled bundle, which ensures to enlist all generated
packs from bundle uri. And a set of negotiation related tests are added.

Signed-off-by: Xing Xin <xingxin.xx@bytedance.com>
---
    bundle-uri: refresh packed_git if unbundle succeed

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1730%2Fblanet%2Fxx%2Fbundle-uri-bug-using-bundle-list-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1730

 bundle-uri.c                |   3 +
 t/t5558-clone-bundle-uri.sh | 129 ++++++++++++++++++++++++++++++++++++
 2 files changed, 132 insertions(+)


base-commit: 83f1add914c6b4682de1e944ec0d1ac043d53d78

Comments

Patrick Steinhardt May 17, 2024, 5 a.m. UTC | #1
On Wed, May 15, 2024 at 03:01:09AM +0000, blanet via GitGitGadget wrote:
> From: Xing Xin <xingxin.xx@bytedance.com>

Long time no see :)

> When using the bundle-uri mechanism with a bundle list containing
> multiple interrelated bundles, we encountered a bug where tips from
> downloaded bundles were not being discovered, resulting in rather slow
> clones. This was particularly problematic when employing the heuristic
> `creationTokens`.
> 
> And this is easy to reproduce. Suppose we have a repository with a
> single branch `main` pointing to commit `A`, firstly we create a base
> bundle with
> 
>   git bundle create base.bundle main
> 
> Then let's add a new commit `B` on top of `A`, so that an incremental
> bundle for `main` can be created with
> 
>   git bundle create incr.bundle A..main
> 
> Now we can generate a bundle list with the following content:
> 
>   [bundle]
>       version = 1
>       mode = all
>       heuristic = creationToken
> 
>   [bundle "base"]
>       uri = base.bundle
>       creationToken = 1
> 
>   [bundle "incr"]
>       uri = incr.bundle
>       creationToken = 2
> 
> A fresh clone with the bundle list above would give the expected
> `refs/bundles/main` pointing at `B` in new repository, in other words we
> already had everything locally from the bundles, but git would still
> download everything from server as if we got nothing.
> 
> So why the `refs/bundles/main` is not discovered? After some digging I
> found that:
> 
> 1. when unbundling a downloaded bundle, a `verify_bundle` is called to
>    check its prerequisites if any. The verify procedure would find oids
>    so `packed_git` is initialized.
> 
> 2. after unbundled all bundles, we would enter `do_fetch_pack_v2`,
>    during which `mark_complete_and_common_ref` and `mark_tips` would
>    find oids with `OBJECT_INFO_QUICK` flag set, so no new packs would be
>    enlisted if `packed_git` has already initialized in 1.

And I assume we do not want it to not use `OBJECT_INFO_QUICK`?

> Back to the example above, when unbunding `incr.bundle`, `base.pack` is
> enlisted to `packed_git` bacause of the prerequisites to verify. Then we
> can not find `B` for negotiation at a latter time bacause `B` exists in
> `incr.pack` which is not enlisted in `packed_git`.

Okay, the explanation feels sensible.

> This commit fixes this by adding a `reprepare_packed_git` call for every
> successfully unbundled bundle, which ensures to enlist all generated
> packs from bundle uri. And a set of negotiation related tests are added.

This makes me wonder though. Do we really need to call
`reprepare_packed_git()` once for every bundle, or can't we instead call
it once at the end once we have fetched all bundles? Reading on.

> Signed-off-by: Xing Xin <xingxin.xx@bytedance.com>
> ---
>     bundle-uri: refresh packed_git if unbundle succeed
> 
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1730%2Fblanet%2Fxx%2Fbundle-uri-bug-using-bundle-list-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/1730
> 
>  bundle-uri.c                |   3 +
>  t/t5558-clone-bundle-uri.sh | 129 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 132 insertions(+)
> 
> diff --git a/bundle-uri.c b/bundle-uri.c
> index ca32050a78f..2b9d36cfd8e 100644
> --- a/bundle-uri.c
> +++ b/bundle-uri.c
> @@ -7,6 +7,7 @@
>  #include "refs.h"
>  #include "run-command.h"
>  #include "hashmap.h"
> +#include "packfile.h"
>  #include "pkt-line.h"
>  #include "config.h"
>  #include "remote.h"
> @@ -376,6 +377,8 @@ static int unbundle_from_file(struct repository *r, const char *file)
>  			       VERIFY_BUNDLE_QUIET)))
>  		return 1;
>  
> +	reprepare_packed_git(r);
> +

So what's hidden here is that `unbundle_from_file()` will also try to
access the bundle's refs right away. Surprisingly, we do so by calling
`refs_update_ref()` with `REF_SKIP_OID_VERIFICATION`, which has the
effect that we basically accept arbitrary object IDs here even if we do
not know those. That's why we didn't have to `reprepare_packed_git()`
before this change.

Now there are two conflicting thoughts here:

  - Either we can now drop `REF_SKIP_OID_VERIFICATION` as the object IDs
    should now be accessible.

  - Or we can avoid calling `reprepare_packed_git()` inside the loop and
    instead call it once after we have fetched all bundles.

The second one feels a bit like premature optimization to me. But the
first item does feel like it could help us to catch broken bundles
because we wouldn't end up creating refs for objects that neither we nor
the bundle have.

Patrick
karthik nayak May 17, 2024, 7:36 a.m. UTC | #2
"blanet via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Xing Xin <xingxin.xx@bytedance.com>>
> When using the bundle-uri mechanism with a bundle list containing
> multiple interrelated bundles, we encountered a bug where tips from
> downloaded bundles were not being discovered, resulting in rather slow
> clones. This was particularly problematic when employing the heuristic
> `creationTokens`.
>
> And this is easy to reproduce. Suppose we have a repository with a
> single branch `main` pointing to commit `A`, firstly we create a base
> bundle with
>
>   git bundle create base.bundle main
>
> Then let's add a new commit `B` on top of `A`, so that an incremental
> bundle for `main` can be created with
>
>   git bundle create incr.bundle A..main
>
> Now we can generate a bundle list with the following content:
>
>   [bundle]
>       version = 1
>       mode = all
>       heuristic = creationToken
>
>   [bundle "base"]
>       uri = base.bundle
>       creationToken = 1
>
>   [bundle "incr"]
>       uri = incr.bundle
>       creationToken = 2
>
> A fresh clone with the bundle list above would give the expected
> `refs/bundles/main` pointing at `B` in new repository, in other words we
> already had everything locally from the bundles, but git would still
> download everything from server as if we got nothing.
>
> So why the `refs/bundles/main` is not discovered? After some digging I
> found that:
>
> 1. when unbundling a downloaded bundle, a `verify_bundle` is called to

s/a//

>    check its prerequisites if any. The verify procedure would find oids
>    so `packed_git` is initialized.
>

So the flow is:
1. `fetch_bundle_list` fetches all the bundles advertised via
`download_bundle_list` to local files.
2. It then calls `unbundle_all_bundles` to unbundle all the bundles.
3. Each bundle is then unbundled using `unbundle_from_file`.
4. Here, we first read the bundle header to get all the prerequisites
for the bundle, this is done in `read_bundle_header`.
5. Then we call `unbundle`, which calls `verify_bundle` to ensure that
the repository does indeed contain the prerequisites mentioned in the
bundle. Then it creates the index from the bundle file.

So because the objects are being checked, the `prepare_packed_git`
function is eventually called, which means that the
`raw_object_store->packed_git` data gets filled in and
`packed_git_initialized` is set.

This means consecutive calls to `prepare_packed_git` doesn't
re-initiate `raw_object_store->packed_git` since
`packed_git_initialized` already is set.

So your explanation makes sense, as a _nit_ I would perhaps add the part
about why consecutive calls to `prepare_packed_git` are ineffective.

> 2. after unbundled all bundles, we would enter `do_fetch_pack_v2`,

s/unbundled/unbundling

>    during which `mark_complete_and_common_ref` and `mark_tips` would
>    find oids with `OBJECT_INFO_QUICK` flag set, so no new packs would be
>    enlisted if `packed_git` has already initialized in 1.
> Back to the example above, when unbunding `incr.bundle`, `base.pack` is
> enlisted to `packed_git` bacause of the prerequisites to verify. Then we
> can not find `B` for negotiation at a latter time bacause `B` exists in
> `incr.pack` which is not enlisted in `packed_git`.
>
> This commit fixes this by adding a `reprepare_packed_git` call for every
> successfully unbundled bundle, which ensures to enlist all generated
> packs from bundle uri. And a set of negotiation related tests are added.
>

The solution makes sense.

> Signed-off-by: Xing Xin <xingxin.xx@bytedance.com>
> ---
>     bundle-uri: refresh packed_git if unbundle succeed
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1730%2Fblanet%2Fxx%2Fbundle-uri-bug-using-bundle-list-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/1730
>
>  bundle-uri.c                |   3 +
>  t/t5558-clone-bundle-uri.sh | 129 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 132 insertions(+)
>
> diff --git a/bundle-uri.c b/bundle-uri.c
> index ca32050a78f..2b9d36cfd8e 100644
> --- a/bundle-uri.c
> +++ b/bundle-uri.c
> @@ -7,6 +7,7 @@
>  #include "refs.h"
>  #include "run-command.h"
>  #include "hashmap.h"
> +#include "packfile.h"
>  #include "pkt-line.h"
>  #include "config.h"
>  #include "remote.h"
> @@ -376,6 +377,8 @@ static int unbundle_from_file(struct repository *r, const char *file)
>  			       VERIFY_BUNDLE_QUIET)))
>  		return 1;
>
> +	reprepare_packed_git(r);
> +
>

Would it make sense to move this to `bundle.c:unbundle()`, since that is
also where the idx is created?

[snip]
Junio C Hamano May 17, 2024, 4:14 p.m. UTC | #3
Patrick Steinhardt <ps@pks.im> writes:

> Now there are two conflicting thoughts here:
>
>   - Either we can now drop `REF_SKIP_OID_VERIFICATION` as the object IDs
>     should now be accessible.
>
>   - Or we can avoid calling `reprepare_packed_git()` inside the loop and
>     instead call it once after we have fetched all bundles.
>
> The second one feels a bit like premature optimization to me. But the
> first item does feel like it could help us to catch broken bundles
> because we wouldn't end up creating refs for objects that neither we nor
> the bundle have.

I like the way your thoughts are structured around here.

I do agree that the latter is a wrong approach---we shouldn't be
trusting what came from elsewhere over the network without first
checking.  We should probably be running the "index-pack --fix-thin"
the unbundling process runs with also the "--fsck-objects" option if
we are not doing so already, and even then, we should make sure that
the object we are making our ref point at have everything behind it.
Xing Xin May 20, 2024, 9:41 a.m. UTC | #4
At 2024-05-17 13:00:49, "Patrick Steinhardt" <ps@pks.im> wrote:
>On Wed, May 15, 2024 at 03:01:09AM +0000, blanet via GitGitGadget wrote:
>> From: Xing Xin <xingxin.xx@bytedance.com>
>
>Long time no see :)

Glad to see you again here :)

>> 
>> So why the `refs/bundles/main` is not discovered? After some digging I
>> found that:
>> 
>> 1. when unbundling a downloaded bundle, a `verify_bundle` is called to
>>    check its prerequisites if any. The verify procedure would find oids
>>    so `packed_git` is initialized.
>> 
>> 2. after unbundled all bundles, we would enter `do_fetch_pack_v2`,
>>    during which `mark_complete_and_common_ref` and `mark_tips` would
>>    find oids with `OBJECT_INFO_QUICK` flag set, so no new packs would be
>>    enlisted if `packed_git` has already initialized in 1.
>
>And I assume we do not want it to not use `OBJECT_INFO_QUICK`?

I think so. For clones or fetches without using bundle-uri, we can hardly
encounter the case that new packs are added during the negotiation process.
So using `OBJECT_INFO_QUICK` can get some performance gain.

>
>> Back to the example above, when unbunding `incr.bundle`, `base.pack` is
>> enlisted to `packed_git` bacause of the prerequisites to verify. Then we
>> can not find `B` for negotiation at a latter time bacause `B` exists in
>> `incr.pack` which is not enlisted in `packed_git`.
>
>Okay, the explanation feels sensible.
>
>> This commit fixes this by adding a `reprepare_packed_git` call for every
>> successfully unbundled bundle, which ensures to enlist all generated
>> packs from bundle uri. And a set of negotiation related tests are added.
>
>This makes me wonder though. Do we really need to call
>`reprepare_packed_git()` once for every bundle, or can't we instead call
>it once at the end once we have fetched all bundles? Reading on.
>
>> Signed-off-by: Xing Xin <xingxin.xx@bytedance.com>
>> ---
>>     bundle-uri: refresh packed_git if unbundle succeed
>> 
>> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1730%2Fblanet%2Fxx%2Fbundle-uri-bug-using-bundle-list-v1
>> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1730/blanet/xx/bundle-uri-bug-using-bundle-list-v1
>> Pull-Request: https://github.com/gitgitgadget/git/pull/1730
>> 
>>  bundle-uri.c                |   3 +
>>  t/t5558-clone-bundle-uri.sh | 129 ++++++++++++++++++++++++++++++++++++
>>  2 files changed, 132 insertions(+)
>> 
>> diff --git a/bundle-uri.c b/bundle-uri.c
>> index ca32050a78f..2b9d36cfd8e 100644
>> --- a/bundle-uri.c
>> +++ b/bundle-uri.c
>> @@ -7,6 +7,7 @@
>>  #include "refs.h"
>>  #include "run-command.h"
>>  #include "hashmap.h"
>> +#include "packfile.h"
>>  #include "pkt-line.h"
>>  #include "config.h"
>>  #include "remote.h"
>> @@ -376,6 +377,8 @@ static int unbundle_from_file(struct repository *r, const char *file)
>>  			       VERIFY_BUNDLE_QUIET)))
>>  		return 1;
>>  
>> +	reprepare_packed_git(r);
>> +
>
>So what's hidden here is that `unbundle_from_file()` will also try to
>access the bundle's refs right away. Surprisingly, we do so by calling
>`refs_update_ref()` with `REF_SKIP_OID_VERIFICATION`, which has the
>effect that we basically accept arbitrary object IDs here even if we do
>not know those. That's why we didn't have to `reprepare_packed_git()`
>before this change.

You are right! I tried dropping this `REF_SKIP_OID_VERIFICATION` flag and
the negotiation works as expected.

After some further digging I find that without `REF_SKIP_OID_VERIFICATION`,
both `write_ref_to_lockfile` for files backend and `reftable_be_transaction_prepare`
for reftable backend would call `parse_object` to check the oid. `parse_object`
can help refresh `packed_git` via `reprepare_packed_git`.

>
>Now there are two conflicting thoughts here:
>
>  - Either we can now drop `REF_SKIP_OID_VERIFICATION` as the object IDs
>    should now be accessible.
>
>  - Or we can avoid calling `reprepare_packed_git()` inside the loop and
>    instead call it once after we have fetched all bundles.
>
>The second one feels a bit like premature optimization to me. But the
>first item does feel like it could help us to catch broken bundles
>because we wouldn't end up creating refs for objects that neither we nor
>the bundle have.

I favor the first approach because a validation on the object IDs we are
writing is a safe guard . And the flag itself was designed to be used in
testing scenarios.

/*
 * Blindly write an object_id. This is useful for testing data corruption
 * scenarios.
 */
#define REF_SKIP_OID_VERIFICATION (1 << 10)
Xing Xin May 20, 2024, 10:19 a.m. UTC | #5
At 2024-05-17 15:36:53, "Karthik Nayak" <karthik.188@gmail.com> wrote:
>"blanet via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Xing Xin <xingxin.xx@bytedance.com>>
>> When using the bundle-uri mechanism with a bundle list containing
>> multiple interrelated bundles, we encountered a bug where tips from
>> downloaded bundles were not being discovered, resulting in rather slow
>> clones. This was particularly problematic when employing the heuristic
>> `creationTokens`.
>>
>> And this is easy to reproduce. Suppose we have a repository with a
>> single branch `main` pointing to commit `A`, firstly we create a base
>> bundle with
>>
>>   git bundle create base.bundle main
>>
>> Then let's add a new commit `B` on top of `A`, so that an incremental
>> bundle for `main` can be created with
>>
>>   git bundle create incr.bundle A..main
>>
>> Now we can generate a bundle list with the following content:
>>
>>   [bundle]
>>       version = 1
>>       mode = all
>>       heuristic = creationToken
>>
>>   [bundle "base"]
>>       uri = base.bundle
>>       creationToken = 1
>>
>>   [bundle "incr"]
>>       uri = incr.bundle
>>       creationToken = 2
>>
>> A fresh clone with the bundle list above would give the expected
>> `refs/bundles/main` pointing at `B` in new repository, in other words we
>> already had everything locally from the bundles, but git would still
>> download everything from server as if we got nothing.
>>
>> So why the `refs/bundles/main` is not discovered? After some digging I
>> found that:
>>
>> 1. when unbundling a downloaded bundle, a `verify_bundle` is called to
>
>s/a//

Thanks!

>
>>    check its prerequisites if any. The verify procedure would find oids
>>    so `packed_git` is initialized.
>>
>
>So the flow is:
>1. `fetch_bundle_list` fetches all the bundles advertised via
>`download_bundle_list` to local files.
>2. It then calls `unbundle_all_bundles` to unbundle all the bundles.
>3. Each bundle is then unbundled using `unbundle_from_file`.
>4. Here, we first read the bundle header to get all the prerequisites
>for the bundle, this is done in `read_bundle_header`.
>5. Then we call `unbundle`, which calls `verify_bundle` to ensure that
>the repository does indeed contain the prerequisites mentioned in the
>bundle. Then it creates the index from the bundle file.
>
>So because the objects are being checked, the `prepare_packed_git`
>function is eventually called, which means that the
>`raw_object_store->packed_git` data gets filled in and
>`packed_git_initialized` is set.
>
>This means consecutive calls to `prepare_packed_git` doesn't
>re-initiate `raw_object_store->packed_git` since
>`packed_git_initialized` already is set.
>
>So your explanation makes sense, as a _nit_ I would perhaps add the part
>about why consecutive calls to `prepare_packed_git` are ineffective.

Thanks my friend, you have expressed this issue more clearly. I will
post a new description based on your explanation with the creationToken
case covered.

>
>> 2. after unbundled all bundles, we would enter `do_fetch_pack_v2`,
>
>s/unbundled/unbundling

Copy that.

>
>> +#include "packfile.h"
>>  #include "pkt-line.h"
>>  #include "config.h"
>>  #include "remote.h"
>> @@ -376,6 +377,8 @@ static int unbundle_from_file(struct repository *r, const char *file)
>>  			       VERIFY_BUNDLE_QUIET)))
>>  		return 1;
>>
>> +	reprepare_packed_git(r);
>> +
>>
>
>Would it make sense to move this to `bundle.c:unbundle()`, since that is
>also where the idx is created?
>

I wonder if we need a mental model that we should `reprepare_packed_git`
that when a new pack and its corresponding idx is generated? Currently
whether to call `reprepare_packed_git` is determined by the caller.

But within the scope of this bug, I tend to remove the
`REF_SKIP_OID_VERIFICATION` flag when writing refs as Patrick suggested.

Xing Xin
Xing Xin May 20, 2024, 11:48 a.m. UTC | #6
At 2024-05-18 00:14:47, "Junio C Hamano" <gitster@pobox.com> wrote:
>Patrick Steinhardt <ps@pks.im> writes:
>
>> Now there are two conflicting thoughts here:
>>
>>   - Either we can now drop `REF_SKIP_OID_VERIFICATION` as the object IDs
>>     should now be accessible.
>>
>>   - Or we can avoid calling `reprepare_packed_git()` inside the loop and
>>     instead call it once after we have fetched all bundles.
>>
>> The second one feels a bit like premature optimization to me. But the
>> first item does feel like it could help us to catch broken bundles
>> because we wouldn't end up creating refs for objects that neither we nor
>> the bundle have.
>
>I like the way your thoughts are structured around here.
>
>I do agree that the latter is a wrong approach---we shouldn't be
>trusting what came from elsewhere over the network without first
>checking.  We should probably be running the "index-pack --fix-thin"
>the unbundling process runs with also the "--fsck-objects" option if
>we are not doing so already, and even then, we should make sure that
>the object we are making our ref point at have everything behind it.

Currently `unbundle` and all its callers are not adding "--fsck-objects".
There is a param `extra_index_pack_args` for `unbundle` but it is
mainly used for progress related options.

Personally I think data from bundles and data received via network
should be treated equally. For "fetch-pack" we now have some configs
such as  "fetch.fsckobjects" and "transfer.fsckobjects" to decide the
behavior, these configs are invisible when we are fetching bundles.

So for bundles I think we have some different ways to go:

  - Acknowledge the configs mentioned above and behave like
    "fetch-pack".

  - Add "--fsck-objects" as a default in `unbundle`.

  - In `unbundle_from_file`, pass in "--fsck-objects" as
    `extra_index_pack_args` for `unbundle` so this only affect the
    bundle-uri related process.

What do you think?

Xing Xin
Junio C Hamano May 20, 2024, 5:19 p.m. UTC | #7
"Xing Xin" <bupt_xingxin@163.com> writes:

> Personally I think data from bundles and data received via network
> should be treated equally.

Yup, that is not personal ;-) but is universally accepted as a good
discipline.  In the case of bundle-uri, the bundle came over the
network so it is even more true that they should be treated the
same.

> For "fetch-pack" we now have some configs
> such as  "fetch.fsckobjects" and "transfer.fsckobjects" to decide the
> behavior, these configs are invisible when we are fetching bundles.

When fetching over network, transport.c:fetch_refs_via_pack() calls
fetch_pack.c:fetch_pack(), which eventually calls get_pack() and the
configuration variables are honored there.  It appears that the
transport layer is unaware of the .fsckobjects configuration knobs.

When fetching from a bundle, transport.c:fetch_refs_from_bundle()
calls bundle.c:unbundle().  This function has three callers, i.e.
"git bundle unbundle", normal fetching from a bundle, and more
recently added bundle-uri codepaths.  

I think one reasonable approach to take is to add an extra parameter
that takes one of three values: (never, use-config, always), and
conditionally add "--fsck-objects" to the command line of the
index-pack.  Teach "git bundle unbundle" the "--fsck-objects" option
so that it can pass 'never' or 'always' from the command line, and
pass 'use-config' from the code paths for normal fetching from a
budnle and bundle-uri.

To implement use-config, you'd probably need to refactor a small
part of fetch-pack.c:get_pack()

	if (fetch_fsck_objects >= 0
	    ? fetch_fsck_objects
	    : transfer_fsck_objects >= 0
	    ? transfer_fsck_objects
	    : 0)
		fsck_objects = 1;

into a public function (to support a caller like unbundle() that
comes from sideways, the new function may also need to call
fetch_pack_setup() to prime them).

A patch series may take a structure like so:

 * define enum { UNBUNDLE_FSCK_NEVER, UNBUNDLE_FSCK_ALWAYS } in
   bundle.h, have bundle.c:unbundle() accept a new parameter of that
   type, and conditionally add "--fsck-objects" to its call to
   "index-pack".  "git bundle unbundle" can pass 'never' to its
   invocation to unbundle() as an easy way to test it.  For the
   other two callers, we can start by passing 'always'.

 * (optional) teach "git bundle unbundle" a new "--fsck-objects"
   option to allow passing 'always' to its call to unbundle().  With
   that, add tests to feed it a bundle with questionable objects in
   it and make sure that unbundling notices.

 * refactor fetch-pack.c:get_pack() to make the fetch-then-transfer
   configuration logic available to external callers.

 * Add UNBUNDLE_FSCK_USE_CONFIG to the enum, enhance unbundle() to
   react to the value by calling the helper function you introduced
   in the previous step.

Thanks.
Xing Xin May 27, 2024, 4:04 p.m. UTC | #8
At 2024-05-21 01:19:02, "Junio C Hamano" <gitster@pobox.com> wrote:
>"Xing Xin" <bupt_xingxin@163.com> writes:
>
>> Personally I think data from bundles and data received via network
>> should be treated equally.
>
>Yup, that is not personal ;-) but is universally accepted as a good
>discipline.  In the case of bundle-uri, the bundle came over the
>network so it is even more true that they should be treated the
>same.
>
>> For "fetch-pack" we now have some configs
>> such as  "fetch.fsckobjects" and "transfer.fsckobjects" to decide the
>> behavior, these configs are invisible when we are fetching bundles.
>
>When fetching over network, transport.c:fetch_refs_via_pack() calls
>fetch_pack.c:fetch_pack(), which eventually calls get_pack() and the
>configuration variables are honored there.  It appears that the
>transport layer is unaware of the .fsckobjects configuration knobs.
>
>When fetching from a bundle, transport.c:fetch_refs_from_bundle()
>calls bundle.c:unbundle().  This function has three callers, i.e.
>"git bundle unbundle", normal fetching from a bundle, and more
>recently added bundle-uri codepaths.  
>
>I think one reasonable approach to take is to add an extra parameter
>that takes one of three values: (never, use-config, always), and
>conditionally add "--fsck-objects" to the command line of the
>index-pack.  Teach "git bundle unbundle" the "--fsck-objects" option
>so that it can pass 'never' or 'always' from the command line, and
>pass 'use-config' from the code paths for normal fetching from a
>budnle and bundle-uri.
>
>To implement use-config, you'd probably need to refactor a small
>part of fetch-pack.c:get_pack()
>
>	if (fetch_fsck_objects >= 0
>	    ? fetch_fsck_objects
>	    : transfer_fsck_objects >= 0
>	    ? transfer_fsck_objects
>	    : 0)
>		fsck_objects = 1;
>
>into a public function (to support a caller like unbundle() that
>comes from sideways, the new function may also need to call
>fetch_pack_setup() to prime them).
>
>A patch series may take a structure like so:
>
> * define enum { UNBUNDLE_FSCK_NEVER, UNBUNDLE_FSCK_ALWAYS } in
>   bundle.h, have bundle.c:unbundle() accept a new parameter of that
>   type, and conditionally add "--fsck-objects" to its call to
>   "index-pack".  "git bundle unbundle" can pass 'never' to its
>   invocation to unbundle() as an easy way to test it.  For the
>   other two callers, we can start by passing 'always'.
>
> * (optional) teach "git bundle unbundle" a new "--fsck-objects"
>   option to allow passing 'always' to its call to unbundle().  With
>   that, add tests to feed it a bundle with questionable objects in
>   it and make sure that unbundling notices.

I just submitted a new series mainly focusing on the unbundle handling during
fetches. I would like to submit a new one for teaching "git bundle unbundle" a
"--fsck-objects" option after this to make changes more targeted.

> * refactor fetch-pack.c:get_pack() to make the fetch-then-transfer
>   configuration logic available to external callers.
>
> * Add UNBUNDLE_FSCK_USE_CONFIG to the enum, enhance unbundle() to

I tend to use `UNBUNDLE_FSCK_FOLLOW_FETCH` because this option is only
used in fetches, though the current implementation is indeed reading configs.

>   react to the value by calling the helper function you introduced
>   in the previous step.

The new patch series is constructed right as you suggested, thanks a lot for
your help. 

Xing Xin
diff mbox series

Patch

diff --git a/bundle-uri.c b/bundle-uri.c
index ca32050a78f..2b9d36cfd8e 100644
--- a/bundle-uri.c
+++ b/bundle-uri.c
@@ -7,6 +7,7 @@ 
 #include "refs.h"
 #include "run-command.h"
 #include "hashmap.h"
+#include "packfile.h"
 #include "pkt-line.h"
 #include "config.h"
 #include "remote.h"
@@ -376,6 +377,8 @@  static int unbundle_from_file(struct repository *r, const char *file)
 			       VERIFY_BUNDLE_QUIET)))
 		return 1;
 
+	reprepare_packed_git(r);
+
 	/*
 	 * Convert all refs/heads/ from the bundle into refs/bundles/
 	 * in the local repository.
diff --git a/t/t5558-clone-bundle-uri.sh b/t/t5558-clone-bundle-uri.sh
index 1ca5f745e73..a5b04d6f187 100755
--- a/t/t5558-clone-bundle-uri.sh
+++ b/t/t5558-clone-bundle-uri.sh
@@ -20,7 +20,10 @@  test_expect_success 'fail to clone from non-bundle file' '
 test_expect_success 'create bundle' '
 	git init clone-from &&
 	git -C clone-from checkout -b topic &&
+
 	test_commit -C clone-from A &&
+	git -C clone-from bundle create A.bundle topic &&
+
 	test_commit -C clone-from B &&
 	git -C clone-from bundle create B.bundle topic
 '
@@ -259,6 +262,132 @@  test_expect_success 'clone bundle list (file, any mode, all failures)' '
 	! grep "refs/bundles/" refs
 '
 
+#########################################################################
+# Clone negotiation related tests begin here
+
+test_expect_success 'negotiation: bundle with part of wanted commits' '
+	test_when_finished rm -rf trace*.txt &&
+	GIT_TRACE_PACKET="$(pwd)/trace-packet.txt" \
+	git clone --no-local --bundle-uri="clone-from/A.bundle" \
+		clone-from nego-bundle-part &&
+	git -C nego-bundle-part for-each-ref --format="%(refname)" >refs &&
+	grep "refs/bundles/" refs >actual &&
+	cat >expect <<-\EOF &&
+	refs/bundles/topic
+	EOF
+	test_cmp expect actual &&
+	# Ensure that refs/bundles/topic are sent as "have".
+	grep "clone> have $(git -C clone-from rev-parse A)" trace-packet.txt
+'
+
+test_expect_success 'negotiation: bundle with all wanted commits' '
+	test_when_finished rm -rf trace*.txt &&
+	GIT_TRACE_PACKET="$(pwd)/trace-packet.txt" \
+	git clone --no-local --single-branch --branch=topic --no-tags \
+		--bundle-uri="clone-from/B.bundle" \
+		clone-from nego-bundle-all &&
+	git -C nego-bundle-all for-each-ref --format="%(refname)" >refs &&
+	grep "refs/bundles/" refs >actual &&
+	cat >expect <<-\EOF &&
+	refs/bundles/topic
+	EOF
+	test_cmp expect actual &&
+	# We already have all needed commits so no "want" needed.
+	! grep "clone> want " trace-packet.txt
+'
+
+test_expect_success 'negotiation: bundle list (no heuristic)' '
+	test_when_finished rm -f trace*.txt &&
+	cat >bundle-list <<-EOF &&
+	[bundle]
+		version = 1
+		mode = all
+
+	[bundle "bundle-1"]
+		uri = file://$(pwd)/clone-from/bundle-1.bundle
+
+	[bundle "bundle-2"]
+		uri = file://$(pwd)/clone-from/bundle-2.bundle
+	EOF
+
+	GIT_TRACE_PACKET="$(pwd)/trace-packet.txt" \
+	git clone --no-local --bundle-uri="file://$(pwd)/bundle-list" \
+		clone-from nego-bundle-list-no-heuristic &&
+
+	git -C nego-bundle-list-no-heuristic for-each-ref --format="%(refname)" >refs &&
+	grep "refs/bundles/" refs >actual &&
+	cat >expect <<-\EOF &&
+	refs/bundles/base
+	refs/bundles/left
+	EOF
+	test_cmp expect actual &&
+	grep "clone> have $(git -C nego-bundle-list-no-heuristic rev-parse refs/bundles/left)" trace-packet.txt
+'
+
+test_expect_success 'negotiation: bundle list (creationToken)' '
+	test_when_finished rm -f trace*.txt &&
+	cat >bundle-list <<-EOF &&
+	[bundle]
+		version = 1
+		mode = all
+		heuristic = creationToken
+
+	[bundle "bundle-1"]
+		uri = file://$(pwd)/clone-from/bundle-1.bundle
+		creationToken = 1
+
+	[bundle "bundle-2"]
+		uri = file://$(pwd)/clone-from/bundle-2.bundle
+		creationToken = 2
+	EOF
+
+	GIT_TRACE_PACKET="$(pwd)/trace-packet.txt" \
+	git clone --no-local --bundle-uri="file://$(pwd)/bundle-list" \
+		clone-from nego-bundle-list-heuristic &&
+
+	git -C nego-bundle-list-heuristic for-each-ref --format="%(refname)" >refs &&
+	grep "refs/bundles/" refs >actual &&
+	cat >expect <<-\EOF &&
+	refs/bundles/base
+	refs/bundles/left
+	EOF
+	test_cmp expect actual &&
+	grep "clone> have $(git -C nego-bundle-list-heuristic rev-parse refs/bundles/left)" trace-packet.txt
+'
+
+test_expect_success 'negotiation: bundle list with all wanted commits' '
+	test_when_finished rm -f trace*.txt &&
+	cat >bundle-list <<-EOF &&
+	[bundle]
+		version = 1
+		mode = all
+		heuristic = creationToken
+
+	[bundle "bundle-1"]
+		uri = file://$(pwd)/clone-from/bundle-1.bundle
+		creationToken = 1
+
+	[bundle "bundle-2"]
+		uri = file://$(pwd)/clone-from/bundle-2.bundle
+		creationToken = 2
+	EOF
+
+	GIT_TRACE_PACKET="$(pwd)/trace-packet.txt" \
+	git clone --no-local --single-branch --branch=left --no-tags \
+		--bundle-uri="file://$(pwd)/bundle-list" \
+		clone-from nego-bundle-list-all &&
+
+	git -C nego-bundle-list-all for-each-ref --format="%(refname)" >refs &&
+	grep "refs/bundles/" refs >actual &&
+	cat >expect <<-\EOF &&
+	refs/bundles/base
+	refs/bundles/left
+	EOF
+	test_cmp expect actual &&
+	# We already have all needed commits so no "want" needed.
+	! grep "clone> want " trace-packet.txt
+'
+
 #########################################################################
 # HTTP tests begin here