Message ID | 6344c225897de1a2d8aa86d610e9eaf1c6ec82b4.1591821067.git.jonathantanmy@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | CDN offloading update | expand |
Jonathan Tan <jonathantanmy@google.com> writes: > +The server advertises the `packfile-uris` capability. > + > +If the client then communicates which protocols (HTTPS, etc.) it supports with > +a `packfile-uris` argument, the server MAY send a `packfile-uris` section > +directly before the `packfile` section (right after `wanted-refs` if it is > +sent) containing URIs of any of the given protocols. The URIs point to > +packfiles that use only features that the client has declared that it supports > +(e.g. ofs-delta and thin-pack). See protocol-v2.txt for the documentation of > +this section. Even if we have some pre-packaged packfiles, if the requestor lacks some feature (by the way, shouldn't we consistently use "capability" instead of "feature" to call things like "ofs-delta"?) to be able to understand them, we would already know that fact (i.e. the lack of capability on the other side) by the time we have to decide if we can give packfile-uris section. OK, this makes sense. > +Clients should then download and index all the given URIs (in addition to > +downloading and indexing the packfile given in the `packfile` section of the > +response) before performing the connectivity check. Looking good. Is there other requirement on the packfile that can be retrieved with the URI listed in the packfile-uris section? It would be clear that it must, together with the contents in the dynamic 'packfile' section, give fully connected set of objects to the other side that has the object it claimed to have. But are we allowed to include extra/unasked-for objects in such a packfile? Or is it better to leave it unspecified? Thanks.
On Wed, Jun 10 2020, Jonathan Tan wrote: > +This is the implementation: a feature, marked experimental, that allows the > +server to be configured by one or more `uploadpack.blobPackfileUri=<sha1> > +<uri>` entries. Whenever the list of objects to be sent is assembled, all such > +blobs are excluded, replaced with URIs. The client will download those URIs, > +expecting them to each point to packfiles containing single blobs. I was poking at this recently to see whether I could change it into the more dumb method I noted in https://public-inbox.org/git/87k1hv6eel.fsf@evledraar.gmail.com/ As an aside on a protocol level could that be supported with this current verb by having the client say "packfile-uris=early" or something like that instead of "packfile-uris"? The server advertising the same, and the client then just requesting packfile-uris before ls-refs or whatever? The server would need to be stateful about what's requested when and serve up something different than the current one-blob-per-pack. Any pointers to where/how to implement that would be welcome, I got lost in the non-linearity of the connect.c/fetch-pack.c/upload-pack.c code yesterday. But I'm mainly replying here to ask if it's intentional that clients are tolerant of the server sending whatever it pleases in the supposedly "single blob" packs. As demonstrated by the tests passing with this patch: diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh index 7d5b17909bb..4fe2030f4c1 100755 --- a/t/t5702-protocol-v2.sh +++ b/t/t5702-protocol-v2.sh @@ -797,11 +797,12 @@ test_expect_success 'when server does not send "ready", expect FLUSH' ' configure_exclusion () { git -C "$1" hash-object "$2" >objh && + echo -n shattered | git -C "$1" hash-object --stdin -w >>objh && git -C "$1" pack-objects "$HTTPD_DOCUMENT_ROOT_PATH/mypack" <objh >packh && git -C "$1" config --add \ "uploadpack.blobpackfileuri" \ - "$(cat objh) $(cat packh) $HTTPD_URL/dumb/mypack-$(cat packh).pack" && - cat objh + "$(head -n 1 objh) $(cat packh) $HTTPD_URL/dumb/mypack-$(cat packh).pack" && + head -n 1 objh } test_expect_success 'part of packfile response provided as URI' ' @@ -820,10 +821,11 @@ test_expect_success 'part of packfile response provided as URI' ' configure_exclusion "$P" my-blob >h && configure_exclusion "$P" other-blob >h2 && - GIT_TRACE=1 GIT_TRACE_PACKET="$(pwd)/log" GIT_TEST_SIDEBAND_ALL=1 \ - git -c protocol.version=2 \ + GIT_TRACE=1 GIT_TRACE2="$(pwd)/log" GIT_TRACE_PACKET="$(pwd)/log" GIT_TEST_SIDEBAND_ALL=1 \ + CHECK_SHATTERED=1 git -c protocol.version=2 \ -c fetch.uriprotocols=http,https \ clone "$HTTPD_URL/smart/http_parent" http_child && + cp "$(pwd)/log" /tmp/clone.log && # Ensure that my-blob and other-blob are in separate packfiles. for idx in http_child/.git/objects/pack/*.idx @@ -832,7 +834,7 @@ test_expect_success 'part of packfile response provided as URI' ' { grep "^[0-9a-f]\{16,\} " out || : } >out.objectlist && - if test_line_count = 1 out.objectlist + if true then if grep $(cat h) out then As you may guess from the "shattered" I was trying to find if the particulars around the partial fsck allowed me to exploit this somehow, I haven't found a way to do that, just be annoying by sending the client more than they asked for, but I could also do that with the normal dialog. Just wondering if the client should be opening the pack and barfing if it has more than one object, or not care.
> On Wed, Jun 10 2020, Jonathan Tan wrote: > > > +This is the implementation: a feature, marked experimental, that allows the > > +server to be configured by one or more `uploadpack.blobPackfileUri=<sha1> > > +<uri>` entries. Whenever the list of objects to be sent is assembled, all such > > +blobs are excluded, replaced with URIs. The client will download those URIs, > > +expecting them to each point to packfiles containing single blobs. > > I was poking at this recently to see whether I could change it into the > more dumb method I noted in > https://public-inbox.org/git/87k1hv6eel.fsf@evledraar.gmail.com/ > > As an aside on a protocol level could that be supported with this > current verb by having the client say "packfile-uris=early" or something > like that instead of "packfile-uris"? Hmm...would the advantage of this be that the client can subsequently report any OIDs it finds as "want"s? I guess the protocol could be extended to support "packfile-uris" at any negotiation step. > The server advertising the same, > and the client then just requesting packfile-uris before ls-refs or > whatever? The server would need to be stateful about what's requested > when and serve up something different than the current > one-blob-per-pack. Statefulness will be difficult. Right now, protocol v2 is stateless, and updating it to be stateful will be difficult, I believe - at least for HTTP, the statelessness design has been long there and other implementations of Git or systems that use Git might have already made that assumption (it is stateless both for protocol v0 and v2). As for serving more than one blob per pack, the current protocol and implementation already allows this. You can see a demonstration by cloning the following repository, which supports it on the server side: GIT_TRACE_PACKET=1 git -c fetch.uriprotocols=https clone \ https://chromium.googlesource.com/chromium/src/base > Any pointers to where/how to implement that would be > welcome, I got lost in the non-linearity of the > connect.c/fetch-pack.c/upload-pack.c code yesterday. upload_pack_v2() in upload-pack.c and do_fetch_pack_v2() in fetch-pack.c have the state machines of the server and client side respectively - I think those would be the first places to look. > But I'm mainly replying here to ask if it's intentional that clients are > tolerant of the server sending whatever it pleases in the supposedly > "single blob" packs. As demonstrated by the tests passing with this > patch: [snip test] Yes, it has the same tolerance w.r.t. the packfile URI packs as w.r.t. the inline packfile that gets sent. > As you may guess from the "shattered" I was trying to find if the > particulars around the partial fsck allowed me to exploit this somehow, > I haven't found a way to do that, just be annoying by sending the client > more than they asked for, but I could also do that with the normal > dialog. Just wondering if the client should be opening the pack and > barfing if it has more than one object, or not care. Ah yes, as you said, it's the same as with the normal dialog.
On Wed, Nov 25 2020, Jonathan Tan wrote: >> On Wed, Jun 10 2020, Jonathan Tan wrote: >> >> > +This is the implementation: a feature, marked experimental, that allows the >> > +server to be configured by one or more `uploadpack.blobPackfileUri=<sha1> >> > +<uri>` entries. Whenever the list of objects to be sent is assembled, all such >> > +blobs are excluded, replaced with URIs. The client will download those URIs, >> > +expecting them to each point to packfiles containing single blobs. >> >> I was poking at this recently to see whether I could change it into the >> more dumb method I noted in >> https://public-inbox.org/git/87k1hv6eel.fsf@evledraar.gmail.com/ >> >> As an aside on a protocol level could that be supported with this >> current verb by having the client say "packfile-uris=early" or something >> like that instead of "packfile-uris"? > > Hmm...would the advantage of this be that the client can subsequently > report any OIDs it finds as "want"s? Yes, as a means-to-an-end of allowing the server CDN part to be dumb and possibly out-of-date. I.e. the current "here's some blobs and then a complimentary pack" requires very close CDN/server coordination. Whereas if you can say early in the dialog "here's a pack that should have most of what you need, then do a want/fetch to get up-to-date" you can just make that pack in a cronjob, the pack could even be corrupted etc. and we'd just optimistically fall back on a full clone. It wouldn't be reporting any OID it finds (too noisy). I haven't experimented, but one dumb way to do it is for the server to serve up a pack with "start negotiating with this SHA", which would just be a recent revision of HEAD guaranteed to be in the served pack. Or the client could scour the pack and e.g. find all commit tips in it by running the equivalent of commit-graph on it first. More expensive, but expensive on the client-side, and allows e.g. re-using that codepath to clone on the basis of a GIT_ALTERNATE_OBJECT_DIRECTORIES containing objects the <remote url> might have already. > I guess the protocol could be extended to support "packfile-uris" at any > negotiation step. *nod* >> The server advertising the same, >> and the client then just requesting packfile-uris before ls-refs or >> whatever? The server would need to be stateful about what's requested >> when and serve up something different than the current >> one-blob-per-pack. > > Statefulness will be difficult. Right now, protocol v2 is stateless, > and updating it to be stateful will be difficult, I believe - at least > for HTTP, the statelessness design has been long there and other > implementations of Git or systems that use Git might have already made > that assumption (it is stateless both for protocol v0 and v2). Sorry, on second thought what I'm suggesting doesn't really require state, just the server to get/read/understand the "client supports this" flags that it already does. > As for serving more than one blob per pack, the current protocol and > implementation already allows this. You can see a demonstration by > cloning the following repository, which supports it on the server side: > > GIT_TRACE_PACKET=1 git -c fetch.uriprotocols=https clone \ > https://chromium.googlesource.com/chromium/src/base So cloning that produces two packs (the packfile-uri one, and the negotiated one), the packfile-uri one being: $ git show-index <objects/pack/pack-d917d9803d3acb03da7ea9e8ebb8e643364ba051.idx 12 3ea853619c232488e683139217585f520ec636e0 (8e33883c) 9371 83d1358e4979bf9aff879cb4150276cd3894463a (0ea17153) 13640 d0f8c611c044b36b8ac57b7a3af18a8d88e4dde2 (af8be7a0) 572 da9ca8b36d5029bd9b18addc054ba9c0b016e6bc (d57fdbac) So 3ea853619 is a commit with a tree da9ca8b36/ , that tree has pointer to win/ via 83d1358e4, which has a blob win/.clang-tidy at d0f8c611c. So this is intended & you're already using it like that. So shouldn't the docs be updated from: [...]one or more `uploadpack.blobPackfileUri=<sha1> <uri>` entries. Whenever the list of objects to be sent is assembled, all such blobs are excluded, replaced with URIs. The client will download those URIs, expecting them to each point to packfiles containing single blobs. To something like: [...]one or more `uploadpack.postWantPackfileUri=<sha1> <uri>` entries. When the server sends the subsequent full pack any objects sent out-of-bound may be excluded. The client will download those URIs, expecting them to each contain some amount of objects. The union of the packs found at these URIs and the server's own returned packfile is expected to yield a valid repository that'll pass an fsck. Aside from the hassle of renaming the "uploadpack.blobPackfileUri" variable to some more accurate "this clearly has nothing to do with blobs per-se, really" name that re-done paragraph seems to more accurately reflect what this is doing & intended to do. Also, why it it necessary to make it an error if the expected hash for the packfile doesn't match, since we're about to do a full fsck connectivity check anyway? The POC patch at the end of this mail[1] shows that we mostly support it not matching now. I suppose you might want to point to an evil CDN, but since it currently needs to fsck the potential for that evilness seems rather limited. It's not a hassle in the current closely coupled server/CDN implementation, but to e.g. have a server dumbly pointing at https://some-cdn.example/REPONAME/some-big-recent-pack.pack it's a hassle, so having it at least optionally support the NULL_SHA would be ideal for that. So I'm curious what you think it adds to the overall security of the feature. >> Any pointers to where/how to implement that would be >> welcome, I got lost in the non-linearity of the >> connect.c/fetch-pack.c/upload-pack.c code yesterday. > > upload_pack_v2() in upload-pack.c and do_fetch_pack_v2() in fetch-pack.c > have the state machines of the server and client side respectively - I > think those would be the first places to look. Thanks. >> But I'm mainly replying here to ask if it's intentional that clients are >> tolerant of the server sending whatever it pleases in the supposedly >> "single blob" packs. As demonstrated by the tests passing with this >> patch: > > [snip test] > > Yes, it has the same tolerance w.r.t. the packfile URI packs as w.r.t. > the inline packfile that gets sent. > >> As you may guess from the "shattered" I was trying to find if the >> particulars around the partial fsck allowed me to exploit this somehow, >> I haven't found a way to do that, just be annoying by sending the client >> more than they asked for, but I could also do that with the normal >> dialog. Just wondering if the client should be opening the pack and >> barfing if it has more than one object, or not care. > > Ah yes, as you said, it's the same as with the normal dialog. 1.: diff --git a/fetch-pack.c b/fetch-pack.c index b10c432315..87d948b023 100644 --- a/fetch-pack.c +++ b/fetch-pack.c @@ -1661,9 +1661,9 @@ static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args, if (memcmp(packfile_uris.items[i].string, packname, the_hash_algo->hexsz)) - die("fetch-pack: pack downloaded from %s does not match expected hash %.*s", - uri, (int) the_hash_algo->hexsz, - packfile_uris.items[i].string); + warning("fetch-pack: pack downloaded from %s does not match expected hash %.*s", + uri, (int) the_hash_algo->hexsz, + packfile_uris.items[i].string); string_list_append_nodup(pack_lockfiles, xstrfmt("%s/pack/pack-%s.keep", diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh index 7d5b17909b..bc0fc4d8e3 100755 --- a/t/t5702-protocol-v2.sh +++ b/t/t5702-protocol-v2.sh @@ -798,9 +798,13 @@ test_expect_success 'when server does not send "ready", expect FLUSH' ' configure_exclusion () { git -C "$1" hash-object "$2" >objh && git -C "$1" pack-objects "$HTTPD_DOCUMENT_ROOT_PATH/mypack" <objh >packh && + fake_sha=$(git hash-object --stdin <packh) && + mv \ + "$HTTPD_DOCUMENT_ROOT_PATH/mypack-$(cat packh).pack" \ + "$HTTPD_DOCUMENT_ROOT_PATH/mypack-$fake_sha.pack" git -C "$1" config --add \ "uploadpack.blobpackfileuri" \ - "$(cat objh) $(cat packh) $HTTPD_URL/dumb/mypack-$(cat packh).pack" && + "$(cat objh) $fake_sha $HTTPD_URL/dumb/mypack-$fake_sha.pack" && cat objh } @@ -852,7 +856,7 @@ test_expect_success 'part of packfile response provided as URI' ' test_line_count = 6 filelist ' -test_expect_success 'fetching with valid packfile URI but invalid hash fails' ' +-test_expect_success 'fetching with valid packfile URI but invalid hash warns' ' P="$HTTPD_DOCUMENT_ROOT_PATH/http_parent" && rm -rf "$P" http_child log && @@ -870,13 +874,12 @@ test_expect_success 'fetching with valid packfile URI but invalid hash fails' ' # the hash of the packfile, since the hash does not matter for this # test as long as it is not the hash of the pack, and it is of the # expected length. - git -C "$P" hash-object other-blob >objh && git -C "$P" pack-objects "$HTTPD_DOCUMENT_ROOT_PATH/mypack" <objh >packh && git -C "$P" config --add \ "uploadpack.blobpackfileuri" \ "$(cat objh) $(cat objh) $HTTPD_URL/dumb/mypack-$(cat packh).pack" && - test_must_fail env GIT_TEST_SIDEBAND_ALL=1 \ + env GIT_TEST_SIDEBAND_ALL=1 \ git -c protocol.version=2 \ -c fetch.uriprotocols=http,https \ clone "$HTTPD_URL/smart/http_parent" http_child 2>err &&
diff --git a/Documentation/technical/packfile-uri.txt b/Documentation/technical/packfile-uri.txt new file mode 100644 index 0000000000..318713abc3 --- /dev/null +++ b/Documentation/technical/packfile-uri.txt @@ -0,0 +1,78 @@ +Packfile URIs +============= + +This feature allows servers to serve part of their packfile response as URIs. +This allows server designs that improve scalability in bandwidth and CPU usage +(for example, by serving some data through a CDN), and (in the future) provides +some measure of resumability to clients. + +This feature is available only in protocol version 2. + +Protocol +-------- + +The server advertises the `packfile-uris` capability. + +If the client then communicates which protocols (HTTPS, etc.) it supports with +a `packfile-uris` argument, the server MAY send a `packfile-uris` section +directly before the `packfile` section (right after `wanted-refs` if it is +sent) containing URIs of any of the given protocols. The URIs point to +packfiles that use only features that the client has declared that it supports +(e.g. ofs-delta and thin-pack). See protocol-v2.txt for the documentation of +this section. + +Clients should then download and index all the given URIs (in addition to +downloading and indexing the packfile given in the `packfile` section of the +response) before performing the connectivity check. + +Server design +------------- + +The server can be trivially made compatible with the proposed protocol by +having it advertise `packfile-uris`, tolerating the client sending +`packfile-uris`, and never sending any `packfile-uris` section. But we should +include some sort of non-trivial implementation in the Minimum Viable Product, +at least so that we can test the client. + +This is the implementation: a feature, marked experimental, that allows the +server to be configured by one or more `uploadpack.blobPackfileUri=<sha1> +<uri>` entries. Whenever the list of objects to be sent is assembled, all such +blobs are excluded, replaced with URIs. The client will download those URIs, +expecting them to each point to packfiles containing single blobs. + +Client design +------------- + +The client has a config variable `fetch.uriprotocols` that determines which +protocols the end user is willing to use. By default, this is empty. + +When the client downloads the given URIs, it should store them with "keep" +files, just like it does with the packfile in the `packfile` section. These +additional "keep" files can only be removed after the refs have been updated - +just like the "keep" file for the packfile in the `packfile` section. + +The division of work (initial fetch + additional URIs) introduces convenient +points for resumption of an interrupted clone - such resumption can be done +after the Minimum Viable Product (see "Future work"). + +Future work +----------- + +The protocol design allows some evolution of the server and client without any +need for protocol changes, so only a small-scoped design is included here to +form the MVP. For example, the following can be done: + + * On the server, more sophisticated means of excluding objects (e.g. by + specifying a commit to represent that commit and all objects that it + references). + * On the client, resumption of clone. If a clone is interrupted, information + could be recorded in the repository's config and a "clone-resume" command + can resume the clone in progress. (Resumption of subsequent fetches is more + difficult because that must deal with the user wanting to use the repository + even after the fetch was interrupted.) + +There are some possible features that will require a change in protocol: + + * Additional HTTP headers (e.g. authentication) + * Byte range support + * Different file formats referenced by URIs (e.g. raw object) diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt index 995f07481e..f9f4e4ddd0 100644 --- a/Documentation/technical/protocol-v2.txt +++ b/Documentation/technical/protocol-v2.txt @@ -323,13 +323,26 @@ included in the client's request: indicating its sideband (1, 2, or 3), and the server may send "0005\2" (a PKT-LINE of sideband 2 with no payload) as a keepalive packet. +If the 'packfile-uris' feature is advertised, the following argument +can be included in the client's request as well as the potential +addition of the 'packfile-uris' section in the server's response as +explained below. + + packfile-uris <comma-separated list of protocols> + Indicates to the server that the client is willing to receive + URIs of any of the given protocols in place of objects in the + sent packfile. Before performing the connectivity check, the + client should download from all given URIs. Currently, the + protocols supported are "http" and "https". + The response of `fetch` is broken into a number of sections separated by delimiter packets (0001), with each section beginning with its section header. Most sections are sent only when the packfile is sent. output = acknowledgements flush-pkt | [acknowledgments delim-pkt] [shallow-info delim-pkt] - [wanted-refs delim-pkt] packfile flush-pkt + [wanted-refs delim-pkt] [packfile-uris delim-pkt] + packfile flush-pkt acknowledgments = PKT-LINE("acknowledgments" LF) (nak | *ack) @@ -347,6 +360,9 @@ header. Most sections are sent only when the packfile is sent. *PKT-LINE(wanted-ref LF) wanted-ref = obj-id SP refname + packfile-uris = PKT-LINE("packfile-uris" LF) *packfile-uri + packfile-uri = PKT-LINE(40*(HEXDIGIT) SP *%x20-ff LF) + packfile = PKT-LINE("packfile" LF) *PKT-LINE(%x01-03 *%x00-ff) @@ -418,6 +434,20 @@ header. Most sections are sent only when the packfile is sent. * The server MUST NOT send any refs which were not requested using 'want-ref' lines. + packfile-uris section + * This section is only included if the client sent + 'packfile-uris' and the server has at least one such URI to + send. + + * Always begins with the section header "packfile-uris". + + * For each URI the server sends, it sends a hash of the pack's + contents (as output by git index-pack) followed by the URI. + + * The hashes are 40 hex characters long. When Git upgrades to a new + hash algorithm, this might need to be updated. (It should match + whatever index-pack outputs after "pack\t" or "keep\t". + packfile section * This section is only included if the client has sent 'want' lines in its request and either requested that no more
Signed-off-by: Jonathan Tan <jonathantanmy@google.com> --- Documentation/technical/packfile-uri.txt | 78 ++++++++++++++++++++++++ Documentation/technical/protocol-v2.txt | 32 +++++++++- 2 files changed, 109 insertions(+), 1 deletion(-) create mode 100644 Documentation/technical/packfile-uri.txt