Message ID | 4d17d560b87746acfd62ff785cc22c09600d4e65.1590789428.git.jonathantanmy@google.com (mailing list archive) |
---|---|
State | New, archived |
Series | CDN offloading update |
Jonathan Tan <jonathantanmy@google.com> writes:

> When Git fetches a pack using dumb HTTP, it reuses the server's name for
> the packfile (which incorporates a hash), which is different from the
> behavior of fetch-pack and receive-pack.

My first two reads of the above mistakenly thought that for some
reason the packfile has the URL to the server encoded in its name,
but that is not what you meant by "the server's name". You rather
meant "the name the server stores the packfile under", "the name the
server gave the packfile", "it reuses the name the server uses for
the packfile". The last rephrase may be the easiest to understand.

> A subsequent patch will allow downloading packs over HTTP(S) as part of
> a fetch. These downloads will not necessarily be from a Git repository,
> and thus may not have a hash as part of its name.

A location that is not necessarily a Git repository can still honor
the naming convention, so I find this a bit weak argument. After
all, the procedure to prepare such a CDN backed file would be using
Git and the (git) "natural" name for the resulting packfile is
easily available to it, isn't it?

I am not necessarily against loosening the limitation of the shape
of the filename, but we may want to say the reason why we want to
name the packfile on the CDN side differently from how Git would
naturally name them. What benefit would we get out from being able
to do so? Perhaps it makes arrangements such as "you can fetch
'pack-v1.0.pack' to become reasonably up-to-date if you at least
have the version v1.0 software", "if the last time you fetched from
us was 2020-05-20 or after, you can fetch 'pack-2020-05-20.pack' and
be assured that you aren't missing anything", etc.? Such a "why
would somebody want to name the packfile differently" would make a
more convincing justification.

> Thus, teach http to pass --stdin to index-pack, so that we have no
> reliance on the server's name for the packfile.

OK.

By definition, if we feed the packdata via --stdin, the index-pack
command would not even _know_ what filename we use, or what name the
other side had. Makes sense.
> Jonathan Tan <jonathantanmy@google.com> writes:
>
> > When Git fetches a pack using dumb HTTP, it reuses the server's name for
> > the packfile (which incorporates a hash), which is different from the
> > behavior of fetch-pack and receive-pack.
>
> My first two reads of the above mistakenly thought that for some
> reason the packfile has the URL to the server encoded in its name,
> but that is not what you meant by "the server's name". You rather
> meant "the name the server stores the packfile under", "the name the
> server gave the packfile", "it reuses the name the server uses for
> the packfile". The last rephrase may be the easiest to understand.

OK - I'll use that.

> > A subsequent patch will allow downloading packs over HTTP(S) as part of
> > a fetch. These downloads will not necessarily be from a Git repository,
> > and thus may not have a hash as part of its name.
>
> A location that is not necessarily a Git repository can still honor
> the naming convention, so I find this a bit weak argument. After
> all, the procedure to prepare such a CDN backed file would be using
> Git and the (git) "natural" name for the resulting packfile is
> easily available to it, isn't it?
>
> I am not necessarily against loosening the limitation of the shape
> of the filename, but we may want to say the reason why we want to
> name the packfile on the CDN side differently from how Git would
> naturally name them. What benefit would we get out from being able
> to do so? Perhaps it makes arrangements such as "you can fetch
> 'pack-v1.0.pack' to become reasonably up-to-date if you at least
> have the version v1.0 software", "if the last time you fetched from
> us was 2020-05-20 or after, you can fetch 'pack-2020-05-20.pack' and
> be assured that you aren't missing anything", etc.? Such a "why
> would somebody want to name the packfile differently" would make a
> more convincing justification.

I didn't want to unnecessarily exclude features like signed URLs, which
may change what the URL looks like - for example, in Google Cloud
Storage, the signed part is a suffix [1]. I'll include this in the
commit message.

Having said that, after rereading my patch:

(1) I'm not sure anymore if the restriction is that there must be a
    hash in the filename. It might be just that the filename must end
    in ".pack.temp". (Having said that, if the filename was not named
    "<hash>.pack.temp", it would look different to the rest of the
    contents of "objects/pack/", which may or may not be fine.)

(2) The filename restriction in question is on the local filename, not
    the URL. We could do any manipulation we want on the URL (e.g. by
    appending ".pack.temp").

And one idea that came up at $DAYJOB is that if we're using a suffix of
the URL as the filename, there may be a clash of names anyway, so we
might as well use the hash instead (which is reported by the server).

I'll take a further look - maybe this patch will no longer be needed.

[1] https://cloud.google.com/storage/docs/access-control/signed-urls
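
The idea of naming the local file after the server-reported hash, rather than after a suffix of the URL, could be sketched roughly as below. This is only an illustration: `pack_tmp_name` is a hypothetical helper, not a function in Git, and the "pack-<hash>.pack.temp" layout is assumed here for the example. Because the name is built purely from validated hex digits, a signed-URL suffix or any other URL decoration never reaches the filesystem.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Git): build a local temporary pack
 * name from a hex hash reported by the server, ignoring the URL entirely.
 * Returns 0 on success, -1 if the hash is empty, contains a non-hex
 * character, or the result does not fit in `out`.
 */
static int pack_tmp_name(char *out, size_t outlen, const char *hash_hex)
{
	size_t i, n = strlen(hash_hex);
	int written;

	if (n == 0)
		return -1;

	/* Accept lowercase hex only, so "/" or ".." can never sneak in. */
	for (i = 0; i < n; i++)
		if (!strchr("0123456789abcdef", hash_hex[i]))
			return -1;

	/* Assumed layout for this sketch: pack-<hash>.pack.temp */
	written = snprintf(out, outlen, "pack-%s.pack.temp", hash_hex);
	if (written < 0 || (size_t)written >= outlen)
		return -1;
	return 0;
}
```

Two downloads whose URLs happen to share a suffix would still get distinct local names under this scheme, as long as their hashes differ.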
diff --git a/http.c b/http.c
index 4882c9f5b2..130e9d6259 100644
--- a/http.c
+++ b/http.c
@@ -2276,9 +2276,9 @@ int finish_http_pack_request(struct http_pack_request *preq)
 {
 	struct packed_git **lst;
 	struct packed_git *p = preq->target;
-	char *tmp_idx;
-	size_t len;
 	struct child_process ip = CHILD_PROCESS_INIT;
+	int tmpfile_fd;
+	int ret = 0;
 
 	close_pack_index(p);
 
@@ -2290,35 +2290,24 @@ int finish_http_pack_request(struct http_pack_request *preq)
 		lst = &((*lst)->next);
 	*lst = (*lst)->next;
 
-	if (!strip_suffix(preq->tmpfile.buf, ".pack.temp", &len))
-		BUG("pack tmpfile does not end in .pack.temp?");
-	tmp_idx = xstrfmt("%.*s.idx.temp", (int)len, preq->tmpfile.buf);
+	tmpfile_fd = xopen(preq->tmpfile.buf, O_RDONLY);
 
 	argv_array_push(&ip.args, "index-pack");
-	argv_array_pushl(&ip.args, "-o", tmp_idx, NULL);
-	argv_array_push(&ip.args, preq->tmpfile.buf);
+	argv_array_push(&ip.args, "--stdin");
 	ip.git_cmd = 1;
-	ip.no_stdin = 1;
+	ip.in = tmpfile_fd;
 	ip.no_stdout = 1;
 
 	if (run_command(&ip)) {
-		unlink(preq->tmpfile.buf);
-		unlink(tmp_idx);
-		free(tmp_idx);
-		return -1;
-	}
-
-	unlink(sha1_pack_index_name(p->hash));
-
-	if (finalize_object_file(preq->tmpfile.buf, sha1_pack_name(p->hash))
-	 || finalize_object_file(tmp_idx, sha1_pack_index_name(p->hash))) {
-		free(tmp_idx);
-		return -1;
+		ret = -1;
+		goto cleanup;
 	}
 
 	install_packed_git(the_repository, p);
-	free(tmp_idx);
-	return 0;
+cleanup:
+	close(tmpfile_fd);
+	unlink(preq->tmpfile.buf);
+	return ret;
 }
 
 struct http_pack_request *new_http_pack_request(
When Git fetches a pack using dumb HTTP, it reuses the server's name for
the packfile (which incorporates a hash), which is different from the
behavior of fetch-pack and receive-pack.

A subsequent patch will allow downloading packs over HTTP(S) as part of
a fetch. These downloads will not necessarily be from a Git repository,
and thus may not have a hash as part of its name.

Thus, teach http to pass --stdin to index-pack, so that we have no
reliance on the server's name for the packfile.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 http.c | 33 +++++++++++----------------------
 1 file changed, 11 insertions(+), 22 deletions(-)
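
For readers unfamiliar with the mechanism, here is a minimal standalone sketch of feeding a file to a child process on its standard input, which is why the child never learns the file's name. It uses plain POSIX calls rather than Git's internal run_command() API, and `run_with_stdin_from_file` is a hypothetical helper written for this illustration only.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not Git code): run argv[0] with its standard
 * input redirected from `path`. Returns the child's exit status, or -1
 * if the file cannot be opened, the fork fails, or the child does not
 * exit normally.
 */
static int run_with_stdin_from_file(const char *path, char *const argv[])
{
	int fd = open(path, O_RDONLY);
	pid_t pid;
	int status;

	if (fd < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(fd);
		return -1;
	}

	if (pid == 0) {
		/* Child: make the file descriptor 0; the path is never passed on. */
		if (fd != STDIN_FILENO) {
			if (dup2(fd, STDIN_FILENO) < 0)
				_exit(127);
			close(fd);
		}
		execvp(argv[0], argv);
		_exit(127); /* only reached if exec failed */
	}

	/* Parent: drop our copy of the descriptor and wait for the child. */
	close(fd);
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Called from inside a repository with an argument vector of "git", "index-pack", "--stdin", this would approximate what the patch arranges through ip.in, with index-pack choosing the final pack name itself from the data it reads.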