Message ID | 20240724144957.3033840-4-toon@iotcl.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | fetch: use bundle URIs when having creationToken heuristic | expand |
Toon Claes <toon@iotcl.com> writes: > At the moment, bundle URIs are only used by git-clone(1). For a clone > the use of bundle URI is trivial, because the repository is empty so > downloading bundles will never result in downloading objects that are in > the repository already. > > For git-fetch(1), this more complicated to use bundle URI. We want to Nit: s/this/it is/ > avoid downloading bundles that only contains objects that are in the > local repository already. > > One way to achieve this is possible when the "creationToken" heuristic > is used for bundle URIs. We attempt to download and unbundle the minimum This first sentence reads a bit weird. Perhaps, "One way to achieve this is to restrict the usage of bundle URIs to the 'creationToken' heuristic for git-fetch(1)."? > number of bundles by creationToken in decreasing order. If we fail to > unbundle (after a successful download) then move to the next > non-downloaded bundle and attempt downloading. Once we succeed in > applying a bundle, move to the previous unapplied bundle and attempt to > unbundle it again. At the end the highest applied creationToken is > written to `fetch.bundleCreationToken` in the git-config. The next time > bundles are advertised by the server, bundles with a lower creationToken > value are ignored. This was already implemented by > 7903efb717 (bundle-uri: download in creationToken order, 2023-01-31) in > fetch_bundles_by_token(). > > Using the creationToken heuristic is optional, but without it the client > has no idea which bundles are new, how to sort them, and which only have > objects the client already has. > > With this knowledge, make git-fetch(1) use bundle URIs from the server, > but only when the creationToken heuristic is used. > I would say that apart from all this it is also worth noting that bundle URIs are still opt-in from the client side. Rest of the patch looks good to me. Thanks [snip]
On Wed, Jul 24, 2024 at 04:49:57PM +0200, Toon Claes wrote: > One way to achieve this is possible when the "creationToken" heuristic > is used for bundle URIs. We attempt to download and unbundle the minimum > number of bundles by creationToken in decreasing order. If we fail to > unbundle (after a successful download) then move to the next > non-downloaded bundle and attempt downloading. Once we succeed in > applying a bundle, move to the previous unapplied bundle and attempt to > unbundle it again. At the end the highest applied creationToken is > written to `fetch.bundleCreationToken` in the git-config. The next time > bundles are advertised by the server, bundles with a lower creationToken > value are ignored. This was already implemented by > 7903efb717 (bundle-uri: download in creationToken order, 2023-01-31) in > fetch_bundles_by_token(). I think Junio essentially asked this already, but I'm still missing the bigger picture here. When the "creationToken" heuristic is applied, the effect of your change is that we'll always favor bundle URIs now over performing proper fetches, right? Now suppose that the server creates new bundled whenever somebody pushes a new change to the default branch. We do not really have information how this bundle is structured. It _could_ be an incremental bundle, and in that case it might be sensible to fetch that bundle. But it could also be that the server generates a full bundle including all objects transitively reachable from that default branch. Now if we started to rely on the "creationToken" heuristic, we would basically end up re-downloading the complete repository, which is a strict regression. Now that scenario is of course hypothetical. But the problem is that the strategy for how bundle URIs are generated are determined by the hosting provider. So ultimately, I expect that the reality will lie somewhere in between and be different depending on which hosting solution you use. All of this to me means that the "creationToken" heuristic is not really a good signal, unless I'm missing something about the way it works. Is there any additional signal provided by the server except for the time when the bundle was created? If so, is that information sufficient to determine whether it makes sense for a client to fetch a bundle instead of performing a "proper" fetch? If not, what is the additional info that we would need to make this schema work properly? So unless I'm missing something, I feel like we need to think bigger and design a heuristic that gives us the information needed. Without such a heuristic, default-enabling may or may not do the right thing, and we have no way to really argue whether it will do as we now depend on server operators to do the right thing. Patrick
Patrick Steinhardt <ps@pks.im> writes: > I think Junio essentially asked this already, but I'm still missing the > bigger picture here. When the "creationToken" heuristic is applied, the > effect of your change is that we'll always favor bundle URIs now over > performing proper fetches, right? Yes, sort of. Bundle URIs are a step before "proper" fetches. A "proper" fetch might still happen after downloading bundles. (I would rather call them "negotiated fetches" or something, but let's just stick with "proper" fetch for now) > Now suppose that the server creates new bundled whenever somebody pushes > a new change to the default branch. We do not really have information > how this bundle is structured. It _could_ be an incremental bundle, and > in that case it might be sensible to fetch that bundle. But it could > also be that the server generates a full bundle including all objects > transitively reachable from that default branch. Now if we started to > rely on the "creationToken" heuristic, we would basically end up > re-downloading the complete repository, which is a strict regression. > > Now that scenario is of course hypothetical. But the problem is that the > strategy for how bundle URIs are generated are determined by the hosting > provider. So ultimately, I expect that the reality will lie somewhere in > between and be different depending on which hosting solution you use. That is true. The mechanism of bundle URIs is mostly outside the control of Git itself. It's up to the Git hosting provider how they use it. This gives them a lot of flexibility, like where to store bundles, how incremental bundles are used, and how often bundles are regenerated. But this also comes with a huge responsibility, due to the scenario you describe above. At GitLab we want to do an initial roll-out of bundle URIs with just one bundle advertised to the client. This bundle is generated every week, and only contains the whole history of default branch (up to the newest commit). But, we'll advertise this bundle with creationToken "1", always. This will cause the client to fetch any bundle for the repository only once. Even when the bundle is updated by the server, the client will not fetch it because the creationToken did not change. > All of this to me means that the "creationToken" heuristic is not really > a good signal, unless I'm missing something about the way it works. Is > there any additional signal provided by the server except for the time > when the bundle was created? The Git hosting provider can use the "creationToken" however they want, at the moment it's up to them to decide on a good strategy. For example, assume you decide to create bundle URIs only for the default branch, then you can choose to use the committer date of the topmost commit of that branch as the creationToken. Imagine you have the following example commit history (committer date of some commits are indicated with a caret below those commits): A --- B --- C --- D --- E --- F --- G ^ ^ ^ 2024-08-02 2024-08-04 2024-08-09 Today (on 2024-08-02, at this point commits D to G don't exist yet) the Git host decides to create a bundle: - git bundle create first.bundle main The server will advertise this bundle with the creationToken being the Unix timestamp of the topmost commit in the bundle: - first.bundle = 2024-08-02 or in Unix 1722549600 When the client clones/fetches, it will download this bundle and store creationToken 1722549600 (= 2024-08-02) in the repo config as `fetch.bundleCreationToken`. A few days later (on 2024-08-04, now D & E are added) the client fetches again, and there are no new bundles, so a "proper" fetch happens. A week after the first bundle (on 2024-08-09, D to G were added since) the Git host decides to create a new (incremental) bundle: - git bundle create second.bundle C..main The server then advertises the following bundles: - first.bundle = 2024-08-02 or in Unix 1722549600 - second.bundle = 2024-08-09 1723154400 Now when the client fetches again, it sees a new bundle "second.bundle", and will download it. This bundle contains the history from C..G. And you might have noticed, the client already has C..E, while it only needs E..G. This is a slight inefficiency, it's a pitfall of bundle URI I don't think we can avoid. By design bundle URIs are used to pre-generate files for clients to use, so they are never built-to-order. It's almost impossible to avoid clients to download bundles that have objects they already have. Well, maybe it is possible. But then the question arises, if the current main branch is at G, do you want the client to download a bundle which has a little bit too much objects, or do you want to have the client do a "proper" fetch and skip any bundle? But all the above assumes there's only one branch in the bundle. With more branches another strategy might be required. > If so, is that information sufficient to determine whether it makes > sense for a client to fetch a bundle instead of performing a "proper" > fetch? If not, what is the additional info that we would need to make > this schema work properly? I think my example above indicates the "creationToken" _can_ be sufficient information for the client to determine if downloading a bundle makes sense, but it depends on a well-though-out strategy of the Git host. So one the hand it gives the host the flexibility of using a strategy, on the other hand it puts a lot of responsibility on them. > So unless I'm missing something, I feel like we need to think bigger and > design a heuristic that gives us the information needed. Without such a > heuristic, default-enabling may or may not do the right thing, and we > have no way to really argue whether it will do as we now depend on > server operators to do the right thing. I think we're absolutely not ready to default-enabling the bundle URI feature in Git. Here at GitLab we're trying pilot bundle URI to figure out how it should/can work. Roll-out to CI runners is our first target and because we have the flexibility to toggle use of bundle URI on the client side, we can test things safely, without affecting any other user. Talking about heuristics, any heuristic will be a _hack_ if you ask me. If you think about it, the bundles them self should be the single source of truth. In fact, a bundle is a packfile with a header prepended that contains refs and their OID in git-show-ref(1) format, and includes which prerequisite OIDs it has. So another approach I think of is to have the client partially download bundles (so it pauses the download once they have received the complete header) and make them parse the header to determine if they continue the download. But this approach feels quite complicated to me. Luckily, with the bundle.heuristic config we can always introduce new values if we discover there is a flaw in the "creationToken" heuristic.
On Fri, Aug 02, 2024 at 03:46:39PM +0200, Toon claes wrote: > Patrick Steinhardt <ps@pks.im> writes: > > > I think Junio essentially asked this already, but I'm still missing the > > bigger picture here. When the "creationToken" heuristic is applied, the > > effect of your change is that we'll always favor bundle URIs now over > > performing proper fetches, right? > > Yes, sort of. Bundle URIs are a step before "proper" fetches. A "proper" > fetch might still happen after downloading bundles. > (I would rather call them "negotiated fetches" or something, but let's > just stick with "proper" fetch for now) > > > Now suppose that the server creates new bundled whenever somebody pushes > > a new change to the default branch. We do not really have information > > how this bundle is structured. It _could_ be an incremental bundle, and > > in that case it might be sensible to fetch that bundle. But it could > > also be that the server generates a full bundle including all objects > > transitively reachable from that default branch. Now if we started to > > rely on the "creationToken" heuristic, we would basically end up > > re-downloading the complete repository, which is a strict regression. > > > > Now that scenario is of course hypothetical. But the problem is that the > > strategy for how bundle URIs are generated are determined by the hosting > > provider. So ultimately, I expect that the reality will lie somewhere in > > between and be different depending on which hosting solution you use. > > That is true. The mechanism of bundle URIs is mostly outside the control > of Git itself. It's up to the Git hosting provider how they use it. This > gives them a lot of flexibility, like where to store bundles, how > incremental bundles are used, and how often bundles are regenerated. But > this also comes with a huge responsibility, due to the scenario you > describe above. > > At GitLab we want to do an initial roll-out of bundle URIs with just one > bundle advertised to the client. This bundle is generated every week, > and only contains the whole history of default branch (up to the newest > commit). But, we'll advertise this bundle with creationToken "1", > always. This will cause the client to fetch any bundle for the > repository only once. Even when the bundle is updated by the server, the > client will not fetch it because the creationToken did not change. Okay. Are details like this documented anywhere in the Git project? Because when we start to build on the preexisting heuristics I think that we should give server operators a guide for how things come together and what the client's expectations can and should be. > > All of this to me means that the "creationToken" heuristic is not really > > a good signal, unless I'm missing something about the way it works. Is > > there any additional signal provided by the server except for the time > > when the bundle was created? > > The Git hosting provider can use the "creationToken" however they want, > at the moment it's up to them to decide on a good strategy. > > For example, assume you decide to create bundle URIs only for the > default branch, then you can choose to use the committer date of the > topmost commit of that branch as the creationToken. > > Imagine you have the following example commit history (committer date of > some commits are indicated with a caret below those commits): > > > A --- B --- C --- D --- E --- F --- G > ^ ^ ^ > 2024-08-02 2024-08-04 2024-08-09 > > Today (on 2024-08-02, at this point commits D to G don't exist yet) the > Git host decides to create a bundle: > > - git bundle create first.bundle main > > The server will advertise this bundle with the creationToken being the > Unix timestamp of the topmost commit in the bundle: > > - first.bundle = 2024-08-02 or in Unix 1722549600 > > When the client clones/fetches, it will download this bundle and store > creationToken 1722549600 (= 2024-08-02) in the repo config as > `fetch.bundleCreationToken`. > > A few days later (on 2024-08-04, now D & E are added) the client fetches > again, and there are no new bundles, so a "proper" fetch happens. > > A week after the first bundle (on 2024-08-09, D to G were added since) > the Git host decides to create a new (incremental) bundle: > > - git bundle create second.bundle C..main > > The server then advertises the following bundles: > > - first.bundle = 2024-08-02 or in Unix 1722549600 > - second.bundle = 2024-08-09 1723154400 > > Now when the client fetches again, it sees a new bundle "second.bundle", > and will download it. This bundle contains the history from C..G. And > you might have noticed, the client already has C..E, while it only needs > E..G. This is a slight inefficiency, it's a pitfall of bundle URI I > don't think we can avoid. By design bundle URIs are used to pre-generate > files for clients to use, so they are never built-to-order. It's almost > impossible to avoid clients to download bundles that have objects they > already have. > > Well, maybe it is possible. But then the question arises, if the current > main branch is at G, do you want the client to download a bundle which > has a little bit too much objects, or do you want to have the client do > a "proper" fetch and skip any bundle? > > But all the above assumes there's only one branch in the bundle. With > more branches another strategy might be required. Okay. What I'm still missing though is the ability for a client do discern whether bundles are incremental or full. In the schema you describe, my assumption is that the server operator will eventually regenerate the first bundle to become a full bundle. Which means that if there was any activity in the repository, then the Unix timestamp of that regenerated bundle will likely be bigger than what the client has stored on disk. Consequently, without information that this is in fact a full bundle, the client would end up fetching it. This is a potentially big waste of time and resources. > > If so, is that information sufficient to determine whether it makes > > sense for a client to fetch a bundle instead of performing a "proper" > > fetch? If not, what is the additional info that we would need to make > > this schema work properly? > > I think my example above indicates the "creationToken" _can_ be > sufficient information for the client to determine if downloading a > bundle makes sense, but it depends on a well-though-out strategy of the > Git host. So one the hand it gives the host the flexibility of using a > strategy, on the other hand it puts a lot of responsibility on them. It does put a big burden on the hoster to get it right, that much is true. But I think the bigger problem is that the hoster _cannot_ get it right, because they have no ability to distribute enough information to the client. For a schema like this to work I think we need to advertise at least two things for each bundle: a unique bundle token, and the token of the base bundle that a bundle may refer to. This would allow us to advertise dependencies between bundles to a client rather easily: a full bundle does not have a base, whereas incremental bundles would point to the previous bundle. A theoretical flow would thus go like this: 1. The server announces the full bundle A.bundle. The client downloads it and stores the unique token Tok(A) in its config. 2. Subsequent fetches notice that Tok(A) announced by the server and stored in the config remain the same. 3. The server creates incremental bundles B.bundle and C.bundle. B.bundle builds on top of A, C builds on top of B. The server announces all three of those bundles. The client determines that it only has A, so to get to C it needs to fetch both B and C. It then stores Tok(C) in its config. One scenario not yet covered is when the server regenerates the full bundle to get A' = A + B + C + D. Ideally, the server would temporarily announce a new bundle D = C + D while also announcing that A' and the result from D are equivalent. Like this, the client knows that it does not have to refetch A', but already has its contents and can thus update its config to point to Tok(A'). But even a schema that is more involved like this one is rather suboptimal. Eventually, the equivalence of Tok(A') and D will not be advertised by the server anymore. So the client will not be able to learn about which incremental bundles to fetch anymore, and would be forced to re-download everything. > > So unless I'm missing something, I feel like we need to think bigger and > > design a heuristic that gives us the information needed. Without such a > > heuristic, default-enabling may or may not do the right thing, and we > > have no way to really argue whether it will do as we now depend on > > server operators to do the right thing. > > I think we're absolutely not ready to default-enabling the bundle URI > feature in Git. Here at GitLab we're trying pilot bundle URI to figure > out how it should/can work. Roll-out to CI runners is our first target > and because we have the flexibility to toggle use of bundle URI on the > client side, we can test things safely, without affecting any other > user. > > Talking about heuristics, any heuristic will be a _hack_ if you ask me. > If you think about it, the bundles them self should be the single source > of truth. In fact, a bundle is a packfile with a header prepended that > contains refs and their OID in git-show-ref(1) format, and includes > which prerequisite OIDs it has. Yes, that is true indeed. > So another approach I think of is to have the client partially download > bundles (so it pauses the download once they have received the complete > header) and make them parse the header to determine if they continue the > download. But this approach feels quite complicated to me. It certainly is more complicated. But it's also the only mechanism that gives you all the information you need, I guess. > Luckily, with the bundle.heuristic config we can always introduce new > values if we discover there is a flaw in the "creationToken" heuristic. Well, I'd claim we have already shown that it's flawed. I think before we enable incremental bundle fetches by default, even if it's only with the "creationToken" set, we need to show and document a way for server operators to make use of this feature. In other words, we need to document how they have to generate bundles in a way that does not cause clients to re-download everything whenever a new bundle is announced. Patrick
diff --git a/builtin/fetch.c b/builtin/fetch.c index 693f02b958..98e811f438 100644 --- a/builtin/fetch.c +++ b/builtin/fetch.c @@ -1694,6 +1694,19 @@ static int do_fetch(struct transport *transport, retcode = 1; } + if (transport_has_remote_bundle_uri(transport)) { + /* + * Only use bundle-URIs when they use the creationToken + * heuristic, this allows us to ensure not downloading bundles + * we don't need. You can read the comments in + * fetch_bundles_by_token() to understand how this works. + */ + if (transport->bundles->heuristic == BUNDLE_HEURISTIC_CREATIONTOKEN) { + if (fetch_bundle_list(the_repository, transport->bundles)) + warning(_("failed to fetch advertised bundles")); + } + } + if (fetch_and_consume_refs(&display_state, transport, transaction, ref_map, &fetch_head, config)) { retcode = 1; diff --git a/t/t5584-fetch-bundle-uri.sh b/t/t5584-fetch-bundle-uri.sh new file mode 100755 index 0000000000..6c2383646e --- /dev/null +++ b/t/t5584-fetch-bundle-uri.sh @@ -0,0 +1,49 @@ +#!/bin/sh + +test_description='test use of bundle URI in "git fetch"' + +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME + +. ./test-lib.sh + +test_expect_success 'set up repos and bundles' ' + git init source && + test_commit -C source A && + git clone --no-local source go-A-to-C && + test_commit -C source B && + git clone --no-local source go-B-to-C && + git clone --no-local source go-B-to-D && + git -C source bundle create B.bundle main && + test_commit -C source C && + git -C source bundle create B-to-C.bundle B..main && + git -C source config uploadpack.advertiseBundleURIs true && + git -C source config bundle.version 1 && + git -C source config bundle.mode all && + git -C source config bundle.heuristic creationToken && + git -C source config bundle.bundle-B.uri "file://$(pwd)/source/B.bundle" && + git -C source config bundle.bundle-B.creationToken 1 && + git -C source config bundle.bundle-B-to-C.uri "file://$(pwd)/source/B-to-C.bundle" && + git -C source config bundle.bundle-B-to-C.creationToken 2 +' + +test_expect_success 'fetches one bundle URI to get up-to-date' ' + git -C go-B-to-C -c transfer.bundleURI=true fetch origin && + test 1 = $(ls go-B-to-C/.git/objects/bundles | wc -l) && + test 2 = $(git -C go-B-to-C config fetch.bundleCreationToken) +' + +test_expect_success 'fetches two bundle URIs to get up-to-date' ' + git -C go-A-to-C -c transfer.bundleURI=true fetch origin && + test 2 = $(ls go-A-to-C/.git/objects/bundles | wc -l) && + test 2 = $(git -C go-A-to-C config fetch.bundleCreationToken) +' + +test_expect_success 'fetches one bundle URI and objects from remote' ' + test_commit -C source D && + git -C go-B-to-D -c transfer.bundleURI=true fetch origin && + test 1 = $(ls go-B-to-D/.git/objects/bundles | wc -l) && + test 2 = $(git -C go-B-to-D config fetch.bundleCreationToken) +' + +test_done
At the moment, bundle URIs are only used by git-clone(1). For a clone the use of bundle URI is trivial, because the repository is empty so downloading bundles will never result in downloading objects that are in the repository already. For git-fetch(1), this more complicated to use bundle URI. We want to avoid downloading bundles that only contains objects that are in the local repository already. One way to achieve this is possible when the "creationToken" heuristic is used for bundle URIs. We attempt to download and unbundle the minimum number of bundles by creationToken in decreasing order. If we fail to unbundle (after a successful download) then move to the next non-downloaded bundle and attempt downloading. Once we succeed in applying a bundle, move to the previous unapplied bundle and attempt to unbundle it again. At the end the highest applied creationToken is written to `fetch.bundleCreationToken` in the git-config. The next time bundles are advertised by the server, bundles with a lower creationToken value are ignored. This was already implemented by 7903efb717 (bundle-uri: download in creationToken order, 2023-01-31) in fetch_bundles_by_token(). Using the creationToken heuristic is optional, but without it the client has no idea which bundles are new, how to sort them, and which only have objects the client already has. With this knowledge, make git-fetch(1) use bundle URIs from the server, but only when the creationToken heuristic is used. Signed-off-by: Toon Claes <toon@iotcl.com> --- builtin/fetch.c | 13 ++++++++++ t/t5584-fetch-bundle-uri.sh | 49 +++++++++++++++++++++++++++++++++++++ 2 files changed, 62 insertions(+) create mode 100755 t/t5584-fetch-bundle-uri.sh -- 2.45.2