Message ID: 20241023002806.367082-1-emilyshaffer@google.com
State:      New
Series:     [v2] fetch-pack: don't mark COMPLETE unless we have the full object
(I missed ccing reviewers from last round; adding now, although I know Junio is vacationing. Sorry about that.) On Tue, Oct 22, 2024 at 5:28 PM Emily Shaffer <emilyshaffer@google.com> wrote: > > When fetching, we decide which objects to skip asking for marking > certain commit IDs with the COMPLETE flag. This flag is set in > fetch-pack.c:mark_complete(), which is called from a few different > functions who decide what to mark or not mark. mark_complete() is > insulated against null pointer deref and repeatedly writing the same > commit to the list of objects to skip; because it's the central function > which decides that an object is COMPLETE and doesn't need to be fetched, > let's also insulate it against corruption where the object is not > present (even though we think it is). > > Without this check, it's possible to reach a corrupted state where > fetches can infinitely loop because we decide to skip fetching, but we > don't actually have the skipped commit in our object store. > > This manifested at $DAYJOB in a repo with the following features: > * blob-filtered partial clone enabled > * commit graph enabled > * ref Foo pointing to commit object 6aaaca > * object 6aaaca missing[1] > > With these prerequisites, we noticed that `git fetch` in the repo > produced an infinite loop: > 1. `git fetch` tries to fetch, but thinks it has all objects, so it > noops. > 2. At the end of cmd_fetch(), we try to write_commit_graph_reachable(). > 3. write_commit_graph_reachable() does a reachability walk, including > starting from Foo > 4. The reachability walk tries to peel Foo, and notices it's missing > 6aaaca. > 5. The partial clone machinery asks for a per-object JIT fetch of > 6aaaca. > 6. `git fetch` (child process) is asked to fetch 6aaaca. > 7. We put together the_repository->parsed_objects, adding all commit IDs > reachable from local refs to it > (fetch-pack.c:mark_complete_and_common_refs(), trace region > mark_complete_local_refs). We see Foo, so we add 6aaaca to > the_repository->parsed_objects and mark it as COMPLETE. > 8. cmd_fetch notices that the object ID it was asked for is already > known, so does not fetch anything new. > 9. GOTO 2. > > The culprit is that we're assuming all local refs already must have > objects in place. Let's not assume that, and explicitly check > has_object() before marking objects as COMPLETE. > > NEEDSWORK: It could be valuable to emit a trace event when we try to > mark_complete() an object where we don't actually has_object() (to > understand how often we're having to heal from corruption). > > Signed-off-by: Emily Shaffer <emilyshaffer@google.com> > Helped-by: Jonathan Tan <jonathantanmy@google.com> > > 1: That commit object went missing as a byproduct of this partial clone > gc issue: > https://lore.kernel.org/git/20241001191811.1934900-1-calvinwan@google.com/ > > --- > > > On the list, Junio and Robert Coup suggested that we should notice cases > when the list of object IDs we are trying to fetch gets filtered to > nothing during a lazy fetch. However, this turned out to be quite tricky > to implement - fetch-by-OID stores the OID in a ref object, so it's > challenging to guess that the ref we care about is an OID instead of a > normal ref. It also required some heuristic approach to catch the exact > moment after we removed COMPLETE objects from the list we wanted to > fetch, and to notice that the wrong objects were missing from that list. 
> > (From the list of alternatives I included with v1, v1 was approach iii; > Junio and Robert suggested iv; this patch holds something close to ii.) > > Jonathan Tan suggested that instead, we could be more careful about what > we mark COMPLETE or not; it seems like this is pretty straightforward to > do. > > I do wonder if it's a layering violation to do the has_object() check in > mark_complete(), which is otherwise a pretty stupid function; I included > it there because all of the other flavors of mark_complete_* eventually > boil down to mark_complete(), so we wouldn't need to remember to check > for object existence any other time we're trying to do this COMPLETE > marking thing. > > Note that I added a test to guarantee this behavior works, but without > commit graph enabled, the test passes both before and after; I guess > maybe it's better to add an explicit regression test for the > combination? But, this test fails with reftable - because we don't have > a way (that I know of) to force-create a ref with a bad ID, to force > this error condition. In the test as written I'm writing to > .git/refs/heads/bar directly; that doesn't work for reftable. But `git > update-ref` is too smart to let me set a ref to garbage. Any tips there > are welcome. I keep thinking about this test, and the more I think, the less valuable I believe it is. I think we aren't super in the habit of writing regression tests, but would it be that valuable to write a regression test in this case instead? On the other hand, I think the code diff is quite obviously a good idea, and we can see from the test suite that there isn't really a performance hit from it. Is it necessary to add a test at all? Or, I guess that we could try to inspect how many fetch attempts were needed to do the JIT fetch in this test. I suspect the number will be too high without this patch - I know it recurses at least more than once. I dunno. Anybody have stronger opinions than me? > > The CI run is at > https://github.com/nasamuffin/git/actions/runs/11470039702 - it seems > the reftable tests are the only things failing. > > Also, I am sending this version, but if there are any additional > comments or it requires more changes, please expect Jonathan Tan to take > over driving this patch the rest of the way for me. As previously > stated[2], I'll be OOO after this Friday for most of the rest of the > year; the rest of this week I'm trying to get the rest of my > non-upstream loose ends tied up, so I won't have time to do another > iteration. See folks around Christmastime :) > > - Emily > > 2: https://lore.kernel.org/git/CAJoAoZnovapqMcu72DGR40jRRqRn57uJVTJg82kZ_rohtGDSfQ@mail.gmail.com/ > > Cover letter from v1 follows: > > There are a few alternative approaches for this issue that I talked > about with some folks at $DAYJOB: > > i. Just disabling the commit graph rewrite allows this to fall > eventually into a path where the fetch actually succeeds. I didn't like > this solution - it's just whack-a-mole - so I didn't look too hard into > why it succeeds that way. It *could* make sense to disable commit graph > rewrite when we do a JIT fetch with blob or tree filter provided - but > if later we want to implement commit filter (something we've talked > about at Google) then I'd worry about this situation coming up again. > > ii. We could decide not to mark local refs (and commits reachable from > them) as COMPLETE in the_repository->parsed_objects. 
I didn't try this > solution out, and I'm not sure what the performance implications are, > but Jonathan Tan likes this solution, so I may try it out and see what > breaks shortly. > > iii. We could do all the JIT fetches with --refetch. In my opinion, this > is the safest/most self-healing solution; the JIT fetch only happens > when we really know we're missing the object, so it doesn't make sense > for that fetch to be canceled by any cache. It doesn't have performance > implications as far as I can guess (except that I think we still build > the parsed_objects hash even though we are going to ignore it, but we > already were doing that anyway). Of course, that's what this patch does. > > iv. We could do nothing; when cmd_fetch gets a fetch-by-object-id but > decides there is nothing more to do, it could terminate with an error. > That should stop the infinite recursion, and the error could suggest the > user to run `git fsck` and discover what the problem is. Depending on > the remediation we suggest, though, I think a direct fetch to fix this > particular loop would not work. > > I'm curious to hear thoughts from people who are more expert than me on > partial clone and fetching in general, though. > > This change is also still in RFC, for two reasons: > > First, it's intermittently failing tests for me locally, in weirdly > flaky ways: > > - t0410-partial-clone.sh fails when I run it from prove, but passes when > I run it manually, every time. > - t5601-clone.sh and t5505-remote.sh fail nonsensically on `rm -rf` that > should succeed (and does succeed if I stop the test with test_pause), > which makes me think there's something else borked in my setup, but > I'm not sure what. > - t5616-partial-clone.sh actually does fail in a way that I could see > having to do with this change (since I guess we might download more > packs than usual), but I was so confused by the other two errors I > haven't looked closely yet. > > And secondly, I didn't write tests verifying the breakage and that this > change fixes it yet, either. > > I'm going to work on both those things in the background, but I wanted > to get the description and RFC out early so that folks could take a look > and we could decide which approach is best. 
> > Thanks, > - Emily > --- > fetch-pack.c | 4 +++- > t/t0410-partial-clone.sh | 30 ++++++++++++++++++++++++++++++ > 2 files changed, 33 insertions(+), 1 deletion(-) > > diff --git a/fetch-pack.c b/fetch-pack.c > index f752da93a8..8cb2ce4c54 100644 > --- a/fetch-pack.c > +++ b/fetch-pack.c > @@ -603,7 +603,9 @@ static int mark_complete(const struct object_id *oid) > { > struct commit *commit = deref_without_lazy_fetch(oid, 1); > > - if (commit && !(commit->object.flags & COMPLETE)) { > + if (commit && > + !(commit->object.flags & COMPLETE) && > + has_object(the_repository, oid, 0)) { > commit->object.flags |= COMPLETE; > commit_list_insert(commit, &complete); > } > diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh > index 818700fbec..95de18ec40 100755 > --- a/t/t0410-partial-clone.sh > +++ b/t/t0410-partial-clone.sh > @@ -241,6 +241,36 @@ test_expect_success 'fetching of missing objects works with ref-in-want enabled' > grep "fetch< fetch=.*ref-in-want" trace > ' > > +test_expect_success 'fetching missing objects pointed to by a local ref' ' > + rm -rf reliable-server unreliable-client && > + test_when_finished rm -rf reliable-server unreliable-client && > + test_create_repo reliable-server && > + git -C reliable-server config uploadpack.allowanysha1inwant 1 && > + git -C reliable-server config uploadpack.allowfilter 1 && > + test_commit -C reliable-server foo && > + > + git clone --filter=blob:none "file://$(pwd)/reliable-server" unreliable-client && > + > + # to simulate the unreliable client losing a referenced object by > + # corruption, create the object on the server side, then create only a > + # reference to that object on the client side (without providing the > + # object itself). > + test_commit -C reliable-server bar && > + HASH=$(git -C reliable-server rev-parse HEAD) && > + echo "$HASH" >unreliable-client/.git/refs/heads/bar && > + > + # the object is really missing > + # check if we can rev-parse a partial SHA. partial so we do not fetch it, > + # but barely partial (trim only the last char) so that we do not collide > + test_must_fail git -C unreliable-client rev-parse ${HASH%%?} && > + > + # trigger a remote fetch by checking out `bar` > + git -C unreliable-client switch bar && > + > + # and now we have the missing object > + git -C unreliable-client rev-parse ${HASH%%?} > +' > + > test_expect_success 'fetching of missing objects from another promisor remote' ' > git clone "file://$(pwd)/server" server2 && > test_commit -C server2 bar && > > Range-diff against v1: > 1: 092be0a655 ! 
1: 4db6bbb4cd promisor-remote: always JIT fetch with --refetch > - ## promisor-remote.c ## > -@@ promisor-remote.c: static int fetch_objects(struct repository *repo, > - strvec_pushl(&child.args, "-c", "fetch.negotiationAlgorithm=noop", > - "fetch", remote_name, "--no-tags", > - "--no-write-fetch-head", "--recurse-submodules=no", > -- "--filter=blob:none", "--stdin", NULL); > -+ "--filter=blob:none", "--refetch", "--stdin", NULL); > - if (!git_config_get_bool("promisor.quiet", &quiet) && quiet) > - strvec_push(&child.args, "--quiet"); > - if (start_command(&child)) > + ## fetch-pack.c ## > +@@ fetch-pack.c: static int mark_complete(const struct object_id *oid) > + { > + struct commit *commit = deref_without_lazy_fetch(oid, 1); > + > +- if (commit && !(commit->object.flags & COMPLETE)) { > ++ if (commit && > ++ !(commit->object.flags & COMPLETE) && > ++ has_object(the_repository, oid, 0)) { > + commit->object.flags |= COMPLETE; > + commit_list_insert(commit, &complete); > + } > + > + ## t/t0410-partial-clone.sh ## > +@@ t/t0410-partial-clone.sh: test_expect_success 'fetching of missing objects works with ref-in-want enabled' > + grep "fetch< fetch=.*ref-in-want" trace > + ' > + > ++test_expect_success 'fetching missing objects pointed to by a local ref' ' > ++ rm -rf reliable-server unreliable-client && > ++ test_when_finished rm -rf reliable-server unreliable-client && > ++ test_create_repo reliable-server && > ++ git -C reliable-server config uploadpack.allowanysha1inwant 1 && > ++ git -C reliable-server config uploadpack.allowfilter 1 && > ++ test_commit -C reliable-server foo && > ++ > ++ git clone --filter=blob:none "file://$(pwd)/reliable-server" unreliable-client && > ++ > ++ # to simulate the unreliable client losing a referenced object by > ++ # corruption, create the object on the server side, then create only a > ++ # reference to that object on the client side (without providing the > ++ # object itself). > ++ test_commit -C reliable-server bar && > ++ HASH=$(git -C reliable-server rev-parse HEAD) && > ++ echo "$HASH" >unreliable-client/.git/refs/heads/bar && > ++ > ++ # the object is really missing > ++ # check if we can rev-parse a partial SHA. partial so we do not fetch it, > ++ # but barely partial (trim only the last char) so that we do not collide > ++ test_must_fail git -C unreliable-client rev-parse ${HASH%%?} && > ++ > ++ # trigger a remote fetch by checking out `bar` > ++ git -C unreliable-client switch bar && > ++ > ++ # and now we have the missing object > ++ git -C unreliable-client rev-parse ${HASH%%?} > ++' > ++ > + test_expect_success 'fetching of missing objects from another promisor remote' ' > + git clone "file://$(pwd)/server" server2 && > + test_commit -C server2 bar && > -- > 2.47.0.105.g07ac214952-goog >
On Tue, Oct 22, 2024 at 05:28:05PM -0700, Emily Shaffer wrote: > This change is also still in RFC, for two reasons: > > First, it's intermittently failing tests for me locally, in weirdly > flaky ways: > > - t0410-partial-clone.sh fails when I run it from prove, but passes when > I run it manually, every time. > - t5601-clone.sh and t5505-remote.sh fail nonsensically on `rm -rf` that > should succeed (and does succeed if I stop the test with test_pause), > which makes me think there's something else borked in my setup, but > I'm not sure what. > - t5616-partial-clone.sh actually does fail in a way that I could see > having to do with this change (since I guess we might download more > packs than usual), but I was so confused by the other two errors I > haven't looked closely yet. > > And secondly, I didn't write tests verifying the breakage and that this > change fixes it yet, either. > > I'm going to work on both those things in the background, but I wanted > to get the description and RFC out early so that folks could take a look > and we could decide which approach is best. I am a little confused. Here you say that this patch is still in RFC, but the subject line dropped the RFC present in the first round. What is the state of this patch's readiness? Thanks, Taylor
Taylor Blau <me@ttaylorr.com> writes: > > I'm going to work on both those things in the background, but I wanted > > to get the description and RFC out early so that folks could take a look > > and we could decide which approach is best. > > I am a little confused. Here you say that this patch is still in RFC, > but the subject line dropped the RFC present in the first round. What is > the state of this patch's readiness? > > Thanks, > Taylor As Emily said [1], I'll be taking over driving this patch. The tl;dr is this patch is not ready, so I think you (the interim maintainer) can drop it. This patch strives to avoid marking missing objects as COMPLETE by doing a check in mark_complete(), but deref_without_lazy_fetch() already makes such a check (note how it can return NULL; also its name implies that it knows something about missing objects, namely that it says it won't lazy fetch them) so the question should be: why doesn't it return NULL when an object is missing? It turns out that it first checks the commit graph file and if it's there, then it's considered to be present, so it does not fetch at all. However there are code paths during commit graph writing (executed after every fetch, in builtin/fetch.c) that access the object store directly without going through the commit graph file (search for "object_info" in commit-graph.c), and those functions perform lazy fetches when the object is missing. So there's an infinite loop of "commit graph writer reads X" -> "fetch X" (nothing gets fetched) -> "write new commit graph, as we always do after a fetch" -> "commit graph writer reads X" -> ... So my initial proposal to not mark objects as COMPLETE if they are missing does not work, because they are already not marked as COMPLETE if they are missing. One could say that we should check both the commit graph file and the object store (or, perhaps even better, only the object store) before stating that an object is present, but I think that Git already assumes in some places that a commit is present merely by its presence in the commit graph file, and it's not worth changing this design. One solution to fix this is to make the commit graph writer never lazy-fetch. This closes us off to being able to have missing commits in a partial clone (at least, if we also want to use the commit graph file). This might be a reasonable thing to do - at least, partial clone has been around for a few years and we've not made many concrete steps towards that - but I feel that we shouldn't close ourselves off if there's an alternative. The alternative I'm currently thinking of is to detect if we didn't fetch any packfiles, and if we didn't, don't write a commit graph (and don't GC), much like we do when we have --negotiate-only. (A packfile-less fetch still can cause refs to be rewritten and thus reduce the number of reachable objects, thus enabling a smaller commit graph to be written and some objects to be GC-ed, but I think that this situation still doesn't warrant commit graph writing and/or GC - we can just do those next time.) The main issue is that we don't always know whether a pack is written or not - in particular, if we use something other than "connect" or "stateless-connect" on a remote helper, we won't know if a packfile was sent. We can solve this by (1) only write the commit graph and GC if we know for sure that a packfile was sent, or (2) write the commit graph and GC unless we know for sure that a packfile was not sent. 
I'm leaning towards (1) because it seems more conceptually coherent even though it is a change of behavior (from auto commit graph and GC to no), because I think that the repos that need scalability the most already use protocol v2 during fetching (which does require "connect" or "stateless-connect" from remote helpers, so we're covered here), but am OK with (2) as well. Feel free to let me know if you have any ideas. In the meantime I'll look at (1). [1] https://lore.kernel.org/git/20241023002806.367082-1-emilyshaffer@google.com/
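[Editorial note: to make option (1) above concrete, here is a minimal C sketch, purely for illustration. The helper name and the `received_packfile` flag are hypothetical and this is not the actual builtin/fetch.c code; only write_commit_graph_reachable() is an existing entry point, and its real call site and arguments in `git fetch` may differ.]

#include "git-compat-util.h"
#include "repository.h"
#include "commit-graph.h"

/*
 * Hypothetical sketch of option (1): only do the automatic post-fetch
 * commit-graph write (and, by extension, auto-GC) when we know for sure
 * that a packfile was received. `received_packfile` and this helper are
 * invented for illustration.
 */
static void maybe_write_commit_graph_after_fetch(struct repository *r,
						 int received_packfile)
{
	if (!received_packfile)
		return;	/* nothing new arrived; defer the graph write to a later fetch */

	/* existing entry point used after fetches; default flags/opts here */
	write_commit_graph_reachable(r->objects->odb, 0, NULL);
}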
diff --git a/fetch-pack.c b/fetch-pack.c
index f752da93a8..8cb2ce4c54 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -603,7 +603,9 @@ static int mark_complete(const struct object_id *oid)
 {
 	struct commit *commit = deref_without_lazy_fetch(oid, 1);
 
-	if (commit && !(commit->object.flags & COMPLETE)) {
+	if (commit &&
+	    !(commit->object.flags & COMPLETE) &&
+	    has_object(the_repository, oid, 0)) {
 		commit->object.flags |= COMPLETE;
 		commit_list_insert(commit, &complete);
 	}
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 818700fbec..95de18ec40 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -241,6 +241,36 @@ test_expect_success 'fetching of missing objects works with ref-in-want enabled'
 	grep "fetch< fetch=.*ref-in-want" trace
 '
 
+test_expect_success 'fetching missing objects pointed to by a local ref' '
+	rm -rf reliable-server unreliable-client &&
+	test_when_finished rm -rf reliable-server unreliable-client &&
+	test_create_repo reliable-server &&
+	git -C reliable-server config uploadpack.allowanysha1inwant 1 &&
+	git -C reliable-server config uploadpack.allowfilter 1 &&
+	test_commit -C reliable-server foo &&
+
+	git clone --filter=blob:none "file://$(pwd)/reliable-server" unreliable-client &&
+
+	# to simulate the unreliable client losing a referenced object by
+	# corruption, create the object on the server side, then create only a
+	# reference to that object on the client side (without providing the
+	# object itself).
+	test_commit -C reliable-server bar &&
+	HASH=$(git -C reliable-server rev-parse HEAD) &&
+	echo "$HASH" >unreliable-client/.git/refs/heads/bar &&
+
+	# the object is really missing
+	# check if we can rev-parse a partial SHA. partial so we do not fetch it,
+	# but barely partial (trim only the last char) so that we do not collide
+	test_must_fail git -C unreliable-client rev-parse ${HASH%%?} &&
+
+	# trigger a remote fetch by checking out `bar`
+	git -C unreliable-client switch bar &&
+
+	# and now we have the missing object
+	git -C unreliable-client rev-parse ${HASH%%?}
+'
+
 test_expect_success 'fetching of missing objects from another promisor remote' '
 	git clone "file://$(pwd)/server" server2 &&
 	test_commit -C server2 bar &&
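[Editorial note: the commit message below leaves a NEEDSWORK about emitting a trace event when mark_complete() declines to mark a missing object. As a hedged sketch layered on top of the patch above, it could look like the following. The trace2 category and key strings are invented; trace2_data_string(), oid_to_hex(), and the surrounding function do exist, but this is not part of the proposed patch.]

static int mark_complete(const struct object_id *oid)
{
	struct commit *commit = deref_without_lazy_fetch(oid, 1);

	if (commit && !(commit->object.flags & COMPLETE)) {
		if (!has_object(the_repository, oid, 0)) {
			/*
			 * NEEDSWORK sketch: record how often we decline to
			 * mark a ref tip COMPLETE because its object is
			 * missing, i.e. how often we are healing from
			 * corruption. Category/key names are made up here.
			 */
			trace2_data_string("fetch-pack", the_repository,
					   "mark-complete/missing-object",
					   oid_to_hex(oid));
			return 0;
		}
		commit->object.flags |= COMPLETE;
		commit_list_insert(commit, &complete);
	}

	return 0;
}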
When fetching, we decide which objects to skip asking for by marking
certain commit IDs with the COMPLETE flag. This flag is set in
fetch-pack.c:mark_complete(), which is called from a few different
functions that decide what to mark or not mark. mark_complete() is
insulated against null pointer deref and repeatedly writing the same
commit to the list of objects to skip; because it's the central function
which decides that an object is COMPLETE and doesn't need to be fetched,
let's also insulate it against corruption where the object is not
present (even though we think it is).

Without this check, it's possible to reach a corrupted state where
fetches can infinitely loop because we decide to skip fetching, but we
don't actually have the skipped commit in our object store.

This manifested at $DAYJOB in a repo with the following features:
* blob-filtered partial clone enabled
* commit graph enabled
* ref Foo pointing to commit object 6aaaca
* object 6aaaca missing[1]

With these prerequisites, we noticed that `git fetch` in the repo
produced an infinite loop:
1. `git fetch` tries to fetch, but thinks it has all objects, so it
   noops.
2. At the end of cmd_fetch(), we try to write_commit_graph_reachable().
3. write_commit_graph_reachable() does a reachability walk, including
   starting from Foo.
4. The reachability walk tries to peel Foo, and notices it's missing
   6aaaca.
5. The partial clone machinery asks for a per-object JIT fetch of
   6aaaca.
6. `git fetch` (child process) is asked to fetch 6aaaca.
7. We put together the_repository->parsed_objects, adding all commit IDs
   reachable from local refs to it
   (fetch-pack.c:mark_complete_and_common_refs(), trace region
   mark_complete_local_refs). We see Foo, so we add 6aaaca to
   the_repository->parsed_objects and mark it as COMPLETE.
8. cmd_fetch notices that the object ID it was asked for is already
   known, so does not fetch anything new.
9. GOTO 2.

The culprit is that we're assuming all local refs already must have
objects in place. Let's not assume that, and explicitly check
has_object() before marking objects as COMPLETE.

NEEDSWORK: It could be valuable to emit a trace event when we try to
mark_complete() an object where we don't actually has_object() (to
understand how often we're having to heal from corruption).

Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
Helped-by: Jonathan Tan <jonathantanmy@google.com>

1: That commit object went missing as a byproduct of this partial clone
   gc issue:
   https://lore.kernel.org/git/20241001191811.1934900-1-calvinwan@google.com/

---

On the list, Junio and Robert Coup suggested that we should notice cases
when the list of object IDs we are trying to fetch gets filtered to
nothing during a lazy fetch. However, this turned out to be quite tricky
to implement - fetch-by-OID stores the OID in a ref object, so it's
challenging to guess that the ref we care about is an OID instead of a
normal ref. It also required some heuristic approach to catch the exact
moment after we removed COMPLETE objects from the list we wanted to
fetch, and to notice that the wrong objects were missing from that list.

(From the list of alternatives I included with v1, v1 was approach iii;
Junio and Robert suggested iv; this patch holds something close to ii.)

Jonathan Tan suggested that instead, we could be more careful about what
we mark COMPLETE or not; it seems like this is pretty straightforward to
do.
I do wonder if it's a layering violation to do the has_object() check in
mark_complete(), which is otherwise a pretty stupid function; I included
it there because all of the other flavors of mark_complete_* eventually
boil down to mark_complete(), so we wouldn't need to remember to check
for object existence any other time we're trying to do this COMPLETE
marking thing.

Note that I added a test to guarantee this behavior works, but without
commit graph enabled, the test passes both before and after; I guess
maybe it's better to add an explicit regression test for the
combination? But, this test fails with reftable - because we don't have
a way (that I know of) to force-create a ref with a bad ID, to force
this error condition. In the test as written I'm writing to
.git/refs/heads/bar directly; that doesn't work for reftable. But `git
update-ref` is too smart to let me set a ref to garbage. Any tips there
are welcome.

The CI run is at
https://github.com/nasamuffin/git/actions/runs/11470039702 - it seems
the reftable tests are the only things failing.

Also, I am sending this version, but if there are any additional
comments or it requires more changes, please expect Jonathan Tan to take
over driving this patch the rest of the way for me. As previously
stated[2], I'll be OOO after this Friday for most of the rest of the
year; the rest of this week I'm trying to get the rest of my
non-upstream loose ends tied up, so I won't have time to do another
iteration. See folks around Christmastime :)

- Emily

2: https://lore.kernel.org/git/CAJoAoZnovapqMcu72DGR40jRRqRn57uJVTJg82kZ_rohtGDSfQ@mail.gmail.com/

Cover letter from v1 follows:

There are a few alternative approaches for this issue that I talked
about with some folks at $DAYJOB:

i. Just disabling the commit graph rewrite allows this to fall
eventually into a path where the fetch actually succeeds. I didn't like
this solution - it's just whack-a-mole - so I didn't look too hard into
why it succeeds that way. It *could* make sense to disable commit graph
rewrite when we do a JIT fetch with blob or tree filter provided - but
if later we want to implement commit filter (something we've talked
about at Google) then I'd worry about this situation coming up again.

ii. We could decide not to mark local refs (and commits reachable from
them) as COMPLETE in the_repository->parsed_objects. I didn't try this
solution out, and I'm not sure what the performance implications are,
but Jonathan Tan likes this solution, so I may try it out and see what
breaks shortly.

iii. We could do all the JIT fetches with --refetch. In my opinion, this
is the safest/most self-healing solution; the JIT fetch only happens
when we really know we're missing the object, so it doesn't make sense
for that fetch to be canceled by any cache. It doesn't have performance
implications as far as I can guess (except that I think we still build
the parsed_objects hash even though we are going to ignore it, but we
already were doing that anyway). Of course, that's what this patch does.

iv. We could do nothing; when cmd_fetch gets a fetch-by-object-id but
decides there is nothing more to do, it could terminate with an error.
That should stop the infinite recursion, and the error could suggest the
user to run `git fsck` and discover what the problem is. Depending on
the remediation we suggest, though, I think a direct fetch to fix this
particular loop would not work.

I'm curious to hear thoughts from people who are more expert than me on
partial clone and fetching in general, though.
This change is also still in RFC, for two reasons:

First, it's intermittently failing tests for me locally, in weirdly
flaky ways:

- t0410-partial-clone.sh fails when I run it from prove, but passes when
  I run it manually, every time.
- t5601-clone.sh and t5505-remote.sh fail nonsensically on `rm -rf` that
  should succeed (and does succeed if I stop the test with test_pause),
  which makes me think there's something else borked in my setup, but
  I'm not sure what.
- t5616-partial-clone.sh actually does fail in a way that I could see
  having to do with this change (since I guess we might download more
  packs than usual), but I was so confused by the other two errors I
  haven't looked closely yet.

And secondly, I didn't write tests verifying the breakage and that this
change fixes it yet, either.

I'm going to work on both those things in the background, but I wanted
to get the description and RFC out early so that folks could take a look
and we could decide which approach is best.

Thanks,
- Emily
---
 fetch-pack.c             |  4 +++-
 t/t0410-partial-clone.sh | 30 ++++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+), 1 deletion(-)

Range-diff against v1:
1:  092be0a655 ! 1:  4db6bbb4cd promisor-remote: always JIT fetch with --refetch
    - ## promisor-remote.c ##
    -@@ promisor-remote.c: static int fetch_objects(struct repository *repo,
    - 	strvec_pushl(&child.args, "-c", "fetch.negotiationAlgorithm=noop",
    - 		     "fetch", remote_name, "--no-tags",
    - 		     "--no-write-fetch-head", "--recurse-submodules=no",
    --		     "--filter=blob:none", "--stdin", NULL);
    -+		     "--filter=blob:none", "--refetch", "--stdin", NULL);
    - 	if (!git_config_get_bool("promisor.quiet", &quiet) && quiet)
    - 		strvec_push(&child.args, "--quiet");
    - 	if (start_command(&child))
    + ## fetch-pack.c ##
    +@@ fetch-pack.c: static int mark_complete(const struct object_id *oid)
    + {
    + 	struct commit *commit = deref_without_lazy_fetch(oid, 1);
    +
    +-	if (commit && !(commit->object.flags & COMPLETE)) {
    ++	if (commit &&
    ++	    !(commit->object.flags & COMPLETE) &&
    ++	    has_object(the_repository, oid, 0)) {
    + 		commit->object.flags |= COMPLETE;
    + 		commit_list_insert(commit, &complete);
    + 	}
    +
    + ## t/t0410-partial-clone.sh ##
    +@@ t/t0410-partial-clone.sh: test_expect_success 'fetching of missing objects works with ref-in-want enabled'
    + 	grep "fetch< fetch=.*ref-in-want" trace
    + '
    +
    ++test_expect_success 'fetching missing objects pointed to by a local ref' '
    ++	rm -rf reliable-server unreliable-client &&
    ++	test_when_finished rm -rf reliable-server unreliable-client &&
    ++	test_create_repo reliable-server &&
    ++	git -C reliable-server config uploadpack.allowanysha1inwant 1 &&
    ++	git -C reliable-server config uploadpack.allowfilter 1 &&
    ++	test_commit -C reliable-server foo &&
    ++
    ++	git clone --filter=blob:none "file://$(pwd)/reliable-server" unreliable-client &&
    ++
    ++	# to simulate the unreliable client losing a referenced object by
    ++	# corruption, create the object on the server side, then create only a
    ++	# reference to that object on the client side (without providing the
    ++	# object itself).
    ++	test_commit -C reliable-server bar &&
    ++	HASH=$(git -C reliable-server rev-parse HEAD) &&
    ++	echo "$HASH" >unreliable-client/.git/refs/heads/bar &&
    ++
    ++	# the object is really missing
    ++	# check if we can rev-parse a partial SHA. partial so we do not fetch it,
    ++	# but barely partial (trim only the last char) so that we do not collide
    ++	test_must_fail git -C unreliable-client rev-parse ${HASH%%?} &&
    ++
    ++	# trigger a remote fetch by checking out `bar`
    ++	git -C unreliable-client switch bar &&
    ++
    ++	# and now we have the missing object
    ++	git -C unreliable-client rev-parse ${HASH%%?}
    ++'
    ++
    + test_expect_success 'fetching of missing objects from another promisor remote' '
    + 	git clone "file://$(pwd)/server" server2 &&
    + 	test_commit -C server2 bar &&