diff mbox series

[03/11] dumb-http: store downloaded pack idx as tempfile

Message ID 20241025065806.GC2110355@coredump.intra.peff.net (mailing list archive)
State Accepted
Commit 63aca3f7f14e01cabccf2165b941205a8a9a01dc
Headers show
Series dumb-http pack index v1 regression + cleanups | expand

Commit Message

Jeff King Oct. 25, 2024, 6:58 a.m. UTC
This patch fixes a regression in b1b8dfde69 (finalize_object_file():
implement collision check, 2024-09-26) where fetching a v1 pack idx file
over the dumb-http protocol would cause the fetch to fail.

The core of the issue is that dumb-http stores the idx we fetch from the
remote at the same path that will eventually hold the idx we generate
from "index-pack --stdin". The sequence is something like this:

  0. We realize we need some object X, which we don't have locally, and
     nor does the other side have it as a loose object.

  1. We download the list of remote packs from objects/info/packs.

  2. For each entry in that file, we download each pack index and store
     it locally in .git/objects/pack/pack-$hash.idx (the $hash is not
     something we can verify yet and is given to us by the remote).

  3. We check each pack index we got to see if it has object X. When we
     find a match, we download the matching .pack file from the remote
     to a tempfile. We feed that to "index-pack --stdin", which
     reindexes the pack, rather than trusting that it has what the other
     side claims it does. In most cases, this will end up generating the
     exact same (byte-for-byte) pack index which we'll store at the same
     pack-$hash.idx path, because the index generation and $hash id are
     computed based on what's in the packfile. But:

       a. The other side might have used other options to generate the
          index. For instance we use index v2 by default, but long ago
          it was v1 (and you can still ask for v1 explicitly).

       b. The other side might even use a different mechanism to
          determine $hash. E.g., long ago it was based on the sorted
          list of objects in the packfile, but we switched to using the
          pack checksum in 1190a1acf8 (pack-objects: name pack files
          after trailer hash, 2013-12-05).

The regression we saw in the real world was (3a). A recent client
fetching from a server with a v1 index downloaded that index, then
complained about trying to overwrite it with its own v2 index. This
collision is otherwise harmless; we know we want to replace the remote
version with our local one, but the collision check doesn't realize
that.

There are a few options to fix it:

  - we could teach index-pack a command-line option to ignore only pack
    idx collisions, and use it when the dumb-http code invokes
    index-pack. This would be an awkward thing to expose users to and
    would involve a lot of boilerplate to get the option down to the
    collision code.

  - we could delete the remote .idx file right before running
    index-pack. It should be redundant at that point (since we've just
    downloaded the matching pack). But it feels risky to delete
    something from our own .git/objects based on what the other side has
    said. I'm not entirely positive that a malicious server couldn't lie
    about which pack-$hash.idx it has and get us to delete something
    precious.

  - we can stop co-mingling the downloaded idx files in our local
    objects directory. This is a slightly bigger change but I think
    fixes the root of the problem more directly.

This patch implements the third option. The big design questions are:
where do we store the downloaded files, and how do we manage their
lifetimes?

There are some additional quirks to the dumb-http system we should
consider. Remember that in step 2 we downloaded every pack index, but in
step 3 we may only download some of the matching packs. What happens to
those other idx files now? They sit in the .git/objects/pack directory,
possibly waiting to be used at a later date. That may save bandwidth for
a subsequent fetch, but it also creates a lot of weird corner cases:

  - our local object directory now has semi-untrusted .idx files sitting
    around, without their matching .pack

  - in case 3b, we noted that we might not generate the same hash as the
    other side. In that case even if we download the matching pack,
    our index-pack invocation will store it in a different
    pack-$hash.idx file. And the unmatched .idx will sit there forever.

  - if the server repacks, it may delete the old packs. Now we have
    these orphaned .idx files sitting around locally that will never be
    used (nor deleted).

  - if we repack locally we may delete our local version of the server's
    pack index and not realize we have it. So we'll download it again,
    even though we have all of the objects it mentions.

I think the right solution here is probably some more complex cache
management system: download the remote .idx files to their own storage
directory, mark them as "seen" when we get their matching pack (to avoid
re-downloading even if we repack), and then delete them when the
server's objects/info/refs no longer mentions them.

But since the dumb http protocol is so ancient and so inferior to the
smart http protocol, I don't think it's worth spending a lot of time
creating such a system. For this patch I'm just downloading the idx
files to .git/objects/tmp_pack_*, and marking them as tempfiles to be
deleted when we exit (and due to the name, any we miss due to a crash,
etc, should eventually be removed by "git gc" runs based on timestamps).

That is slightly worse for one case: if we download an idx but not the
matching pack, we won't retain that idx for subsequent runs. But the
flip side is that we're making other cases better (we never hold on to
useless idx files forever). I suspect that worse case does not even come
up often, since it implies that the packs are generated to match
distinct parts of history (i.e., in practice even in a repo with many
packs you're going to end up grabbing all of those packs to do a clone).
If somebody really cares about that, I think the right path forward is a
managed cache directory as above, and this patch is providing the first
step in that direction anyway (by moving things out of the objects/pack/
directory).

There are two test changes. One demonstrates the broken v1 index case
(it double-checks the resulting clone with fsck to be careful, but prior
to this patch it actually fails at the clone step). The other tweaks the
expectation for a test that covers the "slightly worse" case to
accommodate the extra index download.

The code changes are fairly simple. We stop using finalize_object_file()
to copy the remote's index file into place, and leave it as a tempfile.
We give the tempfile a real ".idx" name, since the packfile code expects
that, and thus we make sure it is out of the usual packs/ directory (so
we'd never mistake it for a real local .idx).

We also have to change parse_pack_index(), which creates a temporary
packed_git to access our index (we need this because all of the pack idx
code assumes we have that struct). It reads the index data from the
tempfile, but prior to this patch would speculatively write the
finalized name into the packed_git struct using the pack-$hash we expect
to use.

I was mildly surprised that this worked at all, since we call
verify_pack_index() on the packed_git which mentions the final name
before moving the file into place! But it works because
parse_pack_index() leaves the mmap-ed data in the struct, so the
lazy-open in verify_pack_index() never triggers, and we read from the
tempfile, ignoring the filename in the struct completely. Hacky, but it
works.

After this patch, parse_pack_index() now uses the index filename we pass
in to derive a matching .pack name. This is OK to change because there
are only two callers, both in the dumb http code (and the other passes
in an existing pack-$hash.idx name, so the derived name is going to be
pack-$hash.pack, which is what we were using anyway).

I'll follow up with some more cleanups in that area, but this patch is
sufficient to fix the regression.

Reported-by: fox <fox.gbr@townlong-yak.com>
Signed-off-by: Jeff King <peff@peff.net>
---
 http.c                     | 25 ++++++++++++++++++++-----
 packfile.c                 | 11 ++++++++++-
 t/t5550-http-fetch-dumb.sh | 12 +++++++++++-
 3 files changed, 41 insertions(+), 7 deletions(-)

Comments

Taylor Blau Oct. 25, 2024, 9:18 p.m. UTC | #1
On Fri, Oct 25, 2024 at 02:58:06AM -0400, Jeff King wrote:
> This patch fixes a regression in b1b8dfde69 (finalize_object_file():
> implement collision check, 2024-09-26) where fetching a v1 pack idx file
> over the dumb-http protocol would cause the fetch to fail.

Makes sense, and matches our discussion from elsewhere on the list
sometime last week.

> There are a few options to fix it:
>
>   - we could teach index-pack a command-line option to ignore only pack
>     idx collisions, and use it when the dumb-http code invokes
>     index-pack. This would be an awkward thing to expose users to and
>     would involve a lot of boilerplate to get the option down to the
>     collision code.
>
>   - we could delete the remote .idx file right before running
>     index-pack. It should be redundant at that point (since we've just
>     downloaded the matching pack). But it feels risky to delete
>     something from our own .git/objects based on what the other side has
>     said. I'm not entirely positive that a malicious server couldn't lie
>     about which pack-$hash.idx it has and get us to delete something
>     precious.
>
>   - we can stop co-mingling the downloaded idx files in our local
>     objects directory. This is a slightly bigger change but I think
>     fixes the root of the problem more directly.
>
> This patch implements the third option. The big design questions are:
> where do we store the downloaded files, and how do we manage their
> lifetimes?

Yep, the third option is definitely the most sensible here.

> I think the right solution here is probably some more complex cache
> management system: download the remote .idx files to their own storage
> directory, mark them as "seen" when we get their matching pack (to avoid
> re-downloading even if we repack), and then delete them when the
> server's objects/info/refs no longer mentions them.
>
> But since the dumb http protocol is so ancient and so inferior to the
> smart http protocol, I don't think it's worth spending a lot of time
> creating such a system. For this patch I'm just downloading the idx
> files to .git/objects/tmp_pack_*, and marking them as tempfiles to be
> deleted when we exit (and due to the name, any we miss due to a crash,
> etc, should eventually be removed by "git gc" runs based on timestamps).
>
> That is slightly worse for one case: [...]

OK. I think the more robust version of the solution you identified here
makes sense and is probably the right thing to do if we wanted to spend
more time and effort on fixing this part of the dumb HTTP protocol.

But I'm willing to believe that there are so few users of the protocol
at this point that it's almost certainly not worth the effort. So I'm
supportive of the stopping point that you chose here.

> diff --git a/http.c b/http.c
> index d59e59f66b..9642cad2e3 100644
> --- a/http.c
> +++ b/http.c
> @@ -19,6 +19,7 @@
>  #include "string-list.h"
>  #include "object-file.h"
>  #include "object-store-ll.h"
> +#include "tempfile.h"
>
>  static struct trace_key trace_curl = TRACE_KEY_INIT(CURL);
>  static int trace_curl_data = 1;
> @@ -2388,8 +2389,24 @@ static char *fetch_pack_index(unsigned char *hash, const char *base_url)
>  	strbuf_addf(&buf, "objects/pack/pack-%s.idx", hash_to_hex(hash));
>  	url = strbuf_detach(&buf, NULL);
>
> -	strbuf_addf(&buf, "%s.temp", sha1_pack_index_name(hash));
> -	tmp = strbuf_detach(&buf, NULL);
> +	/*
> +	 * Don't put this into packs/, since it's just temporary and we don't
> +	 * want to confuse it with our local .idx files.  We'll generate our
> +	 * own index if we choose to download the matching packfile.

I actually think putting it in $GIT_DIR/objects/pack is fine and perhaps
preferable, since that's where we put existing in-progress temporary
files. We'll ignore anything that doesn't end with "*.idx", so I think
it would be fine.

I don't feel strongly about it. It just feels a little weird to stick
these temporary files in one place, and stick other (similar) temporary
flies in another.

> diff --git a/t/t5550-http-fetch-dumb.sh b/t/t5550-http-fetch-dumb.sh
> index 3968b82260..706540ab74 100755
> --- a/t/t5550-http-fetch-dumb.sh
> +++ b/t/t5550-http-fetch-dumb.sh
> @@ -335,7 +335,7 @@ test_expect_success 'fetch can handle previously-fetched .idx files' '
>  	count_fetches 1 pack one.trace &&
>  	GIT_TRACE_CURL=$PWD/two.trace git --git-dir=clone_packed_branches.git \
>  		fetch "$HTTPD_URL"/dumb/repo_packed_branches.git branch2:branch2 &&
> -	count_fetches 0 idx two.trace &&
> +	count_fetches 1 idx two.trace &&
>  	count_fetches 1 pack two.trace
>  '
>
> @@ -521,4 +521,14 @@ test_expect_success 'fetching via http alternates works' '
>  	git -c http.followredirects=true clone "$HTTPD_URL/dumb/alt-child.git"
>  '
>
> +test_expect_success 'dumb http can fetch index v1' '
> +	server=$HTTPD_DOCUMENT_ROOT_PATH/idx-v1.git &&
> +	git init --bare "$server" &&
> +	git -C "$server" --work-tree=. commit --allow-empty -m foo &&
> +	git -C "$server" -c pack.indexVersion=1 gc &&
> +
> +	git clone "$HTTPD_URL/dumb/idx-v1.git" &&
> +	git -C idx-v1 fsck
> +'

Very nicely done.

Thanks,
Taylor
Jeff King Oct. 26, 2024, 6:02 a.m. UTC | #2
On Fri, Oct 25, 2024 at 05:18:51PM -0400, Taylor Blau wrote:

> > -	strbuf_addf(&buf, "%s.temp", sha1_pack_index_name(hash));
> > -	tmp = strbuf_detach(&buf, NULL);
> > +	/*
> > +	 * Don't put this into packs/, since it's just temporary and we don't
> > +	 * want to confuse it with our local .idx files.  We'll generate our
> > +	 * own index if we choose to download the matching packfile.
> 
> I actually think putting it in $GIT_DIR/objects/pack is fine and perhaps
> preferable, since that's where we put existing in-progress temporary
> files. We'll ignore anything that doesn't end with "*.idx", so I think
> it would be fine.

But the new tempfile _does_ end with .idx. And it (kinda) has to,
because that's an assumption made by packed_git: it stores
path/to/foo.pack, and then assumes it can access /path/to/foo.idx as
needed.

I say "kinda" because as I mentioned elsewhere, the prior code cheats a
bit and assumes that we can access the idx file through the original
mmap we made, even if the ".idx" the packed_git mentions does not yet
exist. I _think_ that's pretty foolproof and that we'd never unmap or
close an idx (though we certainly do for packs themselves). But it
doesn't feel like a great thing to rely on.

> I don't feel strongly about it. It just feels a little weird to stick
> these temporary files in one place, and stick other (similar) temporary
> flies in another.

One difference is that these are pure tempfiles. We'll never "finalize"
them by renaming them into place, so there's no chance of having to do a
cross-directory link/rename. They just get created and then deleted[1].

Perhaps something like "objects/temp-remote-index/*.idx" would be a
more descriptive name (and certainly what I'd use if making a more full
caching system), but there are complications:

  1. We don't have good cleanup routines for directories. Would that
     just become a semi-permanent empty directory in a dumb-http repo?

  2. git-prune cleans up tmp_* in objects/ and objects/pack, but would
     have to learn to look there, too.

So just sticking them in objects/ seemed like the path of least
resistance to me.

-Peff

[1] They really could be opened anywhere, like /tmp. Which I was tempted
    to do, but since they're non-trivial in size it seemed like keeping
    them inside the repo directory was the best bet.

    Curiously, if we lean into the "we will never close a mapped idx",
    then we could create them, open and map them, and then delete them
    immediately, which is a somewhat common tempfile idiom. At least on
    Unix systems. I'm not sure how Windows would like that. :)
Taylor Blau Oct. 28, 2024, 12:14 a.m. UTC | #3
On Sat, Oct 26, 2024 at 02:02:38AM -0400, Jeff King wrote:
> So just sticking them in objects/ seemed like the path of least
> resistance to me.

OK, reading it all back, I tend to agree that sticking them in objects/
makes sense. Thanks!

Thanks,
Taylor
diff mbox series

Patch

diff --git a/http.c b/http.c
index d59e59f66b..9642cad2e3 100644
--- a/http.c
+++ b/http.c
@@ -19,6 +19,7 @@ 
 #include "string-list.h"
 #include "object-file.h"
 #include "object-store-ll.h"
+#include "tempfile.h"
 
 static struct trace_key trace_curl = TRACE_KEY_INIT(CURL);
 static int trace_curl_data = 1;
@@ -2388,8 +2389,24 @@  static char *fetch_pack_index(unsigned char *hash, const char *base_url)
 	strbuf_addf(&buf, "objects/pack/pack-%s.idx", hash_to_hex(hash));
 	url = strbuf_detach(&buf, NULL);
 
-	strbuf_addf(&buf, "%s.temp", sha1_pack_index_name(hash));
-	tmp = strbuf_detach(&buf, NULL);
+	/*
+	 * Don't put this into packs/, since it's just temporary and we don't
+	 * want to confuse it with our local .idx files.  We'll generate our
+	 * own index if we choose to download the matching packfile.
+	 *
+	 * It's tempting to use xmks_tempfile() here, but it's important that
+	 * the file not exist, otherwise http_get_file() complains. So we
+	 * create a filename that should be unique, and then just register it
+	 * as a tempfile so that it will get cleaned up on exit.
+	 *
+	 * In theory we could hold on to the tempfile and delete these as soon
+	 * as we download the matching pack, but it would take a bit of
+	 * refactoring. Leaving them until the process ends is probably OK.
+	 */
+	tmp = xstrfmt("%s/tmp_pack_%s.idx",
+		      repo_get_object_directory(the_repository),
+		      hash_to_hex(hash));
+	register_tempfile(tmp);
 
 	if (http_get_file(url, tmp, NULL) != HTTP_OK) {
 		error("Unable to get pack index %s", url);
@@ -2427,10 +2444,8 @@  static int fetch_and_setup_pack_index(struct packed_git **packs_head,
 	}
 
 	ret = verify_pack_index(new_pack);
-	if (!ret) {
+	if (!ret)
 		close_pack_index(new_pack);
-		ret = finalize_object_file(tmp_idx, sha1_pack_index_name(sha1));
-	}
 	free(tmp_idx);
 	if (ret)
 		return -1;
diff --git a/packfile.c b/packfile.c
index df4ba67719..16d3bcf7f7 100644
--- a/packfile.c
+++ b/packfile.c
@@ -237,13 +237,22 @@  static struct packed_git *alloc_packed_git(int extra)
 	return p;
 }
 
+static char *pack_path_from_idx(const char *idx_path)
+{
+	size_t len;
+	if (!strip_suffix(idx_path, ".idx", &len))
+		BUG("idx path does not end in .idx: %s", idx_path);
+	return xstrfmt("%.*s.pack", (int)len, idx_path);
+}
+
 struct packed_git *parse_pack_index(unsigned char *sha1, const char *idx_path)
 {
-	const char *path = sha1_pack_name(sha1);
+	char *path = pack_path_from_idx(idx_path);
 	size_t alloc = st_add(strlen(path), 1);
 	struct packed_git *p = alloc_packed_git(alloc);
 
 	memcpy(p->pack_name, path, alloc); /* includes NUL */
+	free(path);
 	hashcpy(p->hash, sha1, the_repository->hash_algo);
 	if (check_packed_git_idx(idx_path, p)) {
 		free(p);
diff --git a/t/t5550-http-fetch-dumb.sh b/t/t5550-http-fetch-dumb.sh
index 3968b82260..706540ab74 100755
--- a/t/t5550-http-fetch-dumb.sh
+++ b/t/t5550-http-fetch-dumb.sh
@@ -335,7 +335,7 @@  test_expect_success 'fetch can handle previously-fetched .idx files' '
 	count_fetches 1 pack one.trace &&
 	GIT_TRACE_CURL=$PWD/two.trace git --git-dir=clone_packed_branches.git \
 		fetch "$HTTPD_URL"/dumb/repo_packed_branches.git branch2:branch2 &&
-	count_fetches 0 idx two.trace &&
+	count_fetches 1 idx two.trace &&
 	count_fetches 1 pack two.trace
 '
 
@@ -521,4 +521,14 @@  test_expect_success 'fetching via http alternates works' '
 	git -c http.followredirects=true clone "$HTTPD_URL/dumb/alt-child.git"
 '
 
+test_expect_success 'dumb http can fetch index v1' '
+	server=$HTTPD_DOCUMENT_ROOT_PATH/idx-v1.git &&
+	git init --bare "$server" &&
+	git -C "$server" --work-tree=. commit --allow-empty -m foo &&
+	git -C "$server" -c pack.indexVersion=1 gc &&
+
+	git clone "$HTTPD_URL/dumb/idx-v1.git" &&
+	git -C idx-v1 fsck
+'
+
 test_done