Message ID | b0bf576c51a706367a758b8e30eca37edb9c2734.1697200576.git.ps@pks.im (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | commit: detect commits that exist in commit-graph but not in the ODB | expand |
Patrick Steinhardt <ps@pks.im> writes: > @@ -572,8 +572,13 @@ int repo_parse_commit_internal(struct repository *r, > return -1; > if (item->object.parsed) > return 0; > - if (use_commit_graph && parse_commit_in_graph(r, item)) > + if (use_commit_graph && parse_commit_in_graph(r, item)) { > + if (!has_object(r, &item->object.oid, 0)) > + return quiet_on_missing ? -1 : > + error(_("commit %s exists in commit-graph but not in the object database"), > + oid_to_hex(&item->object.oid)); > return 0; > + } Ever since this codepath was introduced by 177722b3 (commit: integrate commit graph with commit parsing, 2018-04-10), we blindly trusted what commit-graph file says. This change is a strict improvement in the correctness department, but there are two things that are a bit worrying. One. The additional check should certainly be cheaper than a full reading and parsing of an object, either from a loose object file or from a pack entry. It may not hurt performance too much, but it still would give us more confidence if we know by how much we are pessimising good cases where the commit-graph does match reality. Our stance on these secondary files that store precomputed values for optimization purposes is in general to use them blindly unless in exceptional cases where the operation values the correctness even when the validity of these secondary files is dubious (e.g., "fsck"), and doing this extra check regardless of the caller at this low level of the callchain is a bit worrying. Another is that by the time parse_commit_in_graph() returns true and we realize that the answer we got is bogus by asking has_object(), item->object.parsed has already been toggled on, so the caller now has a commit object that claimed it was already parsed and does not match reality. Hopefully the caller takes an early exit upon seeing a failure from parse_commit_gently() and the .parsed bit does not matter, but maybe I am missing a case where it does. I dunno. Other than that, sounds very sensible and the code change is clean. Will queue. Thanks. > diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh > index ba65f17dd9..25f8e9e2d3 100755 > --- a/t/t5318-commit-graph.sh > +++ b/t/t5318-commit-graph.sh > @@ -821,4 +821,27 @@ test_expect_success 'overflow during generation version upgrade' ' > ) > ' > > +test_expect_success 'commit exists in commit-graph but not in object database' ' > + test_when_finished "rm -rf repo" && > + git init repo && > + ( > + cd repo && > + > + test_commit A && > + test_commit B && > + test_commit C && > + git commit-graph write --reachable && > + > + # Corrupt the repository by deleting the intermittent commit > + # object. Commands should notice that this object is absent and > + # thus that the repository is corrupt even if the commit graph > + # exists. > + oid=$(git rev-parse B) && > + rm .git/objects/"$(test_oid_to_path "$oid")" && > + > + test_must_fail git rev-parse HEAD~2 2>error && > + grep "error: commit $oid exists in commit-graph but not in the object database" error > + ) > +' > + > test_done
On Fri, Oct 13, 2023 at 11:21:48AM -0700, Junio C Hamano wrote: > Patrick Steinhardt <ps@pks.im> writes: > > > @@ -572,8 +572,13 @@ int repo_parse_commit_internal(struct repository *r, > > return -1; > > if (item->object.parsed) > > return 0; > > - if (use_commit_graph && parse_commit_in_graph(r, item)) > > + if (use_commit_graph && parse_commit_in_graph(r, item)) { > > + if (!has_object(r, &item->object.oid, 0)) > > + return quiet_on_missing ? -1 : > > + error(_("commit %s exists in commit-graph but not in the object database"), > > + oid_to_hex(&item->object.oid)); > > return 0; > > + } > > Ever since this codepath was introduced by 177722b3 (commit: > integrate commit graph with commit parsing, 2018-04-10), we blindly > trusted what commit-graph file says. This change is a strict > improvement in the correctness department, but there are two things > that are a bit worrying. > > One. The additional check should certainly be cheaper than a full > reading and parsing of an object, either from a loose object file or > from a pack entry. It may not hurt performance too much, but it > still would give us more confidence if we know by how much we are > pessimising good cases where the commit-graph does match reality. > Our stance on these secondary files that store precomputed values > for optimization purposes is in general to use them blindly unless > in exceptional cases where the operation values the correctness even > when the validity of these secondary files is dubious (e.g., "fsck"), > and doing this extra check regardless of the caller at this low level > of the callchain is a bit worrying. Fair point indeed. The following is a worst-case scenario benchmark of of the change where we do a full topological walk of all reachable commits in the graph, executed in linux.git. We parse commit parents via `repo_parse_commit_gently()`, so the new code path now basically has to check for object existence of every reachable commit: Benchmark 1: git -c core.commitGraph=true rev-list --topo-order --all (git = master) Time (mean ± σ): 2.913 s ± 0.018 s [User: 2.363 s, System: 0.548 s] Range (min … max): 2.894 s … 2.950 s 10 runs Benchmark 2: git -c core.commitGraph=true rev-list --topo-order --all (git = pks-commit-graph-inconsistency) Time (mean ± σ): 3.834 s ± 0.052 s [User: 3.276 s, System: 0.556 s] Range (min … max): 3.780 s … 3.961 s 10 runs Benchmark 3: git -c core.commitGraph=false rev-list --topo-order --all (git = master) Time (mean ± σ): 13.841 s ± 0.084 s [User: 13.152 s, System: 0.687 s] Range (min … max): 13.714 s … 13.995 s 10 runs Benchmark 4: git -c core.commitGraph=false rev-list --topo-order --all (git = pks-commit-graph-inconsistency) Time (mean ± σ): 13.762 s ± 0.116 s [User: 13.094 s, System: 0.667 s] Range (min … max): 13.645 s … 14.038 s 10 runs Summary git -c core.commitGraph=true rev-list --topo-order --all (git = master) ran 1.32 ± 0.02 times faster than git -c core.commitGraph=true rev-list --topo-order --all (git = pks-commit-graph-inconsistency) 4.72 ± 0.05 times faster than git -c core.commitGraph=false rev-list --topo-order --all (git = pks-commit-graph-inconsistency) 4.75 ± 0.04 times faster than git -c core.commitGraph=false rev-list --topo-order --all (git = master) The added check does lead to a performance regression indeed, which is not all that unexpected. That being said, the commit-graph still results in a significant speedup compared to the case where we don't have it. > Another is that by the time parse_commit_in_graph() returns true and > we realize that the answer we got is bogus by asking has_object(), > item->object.parsed has already been toggled on, so the caller now > has a commit object that claimed it was already parsed and does not > match reality. Hopefully the caller takes an early exit upon seeing > a failure from parse_commit_gently() and the .parsed bit does not > matter, but maybe I am missing a case where it does. I dunno. We could also call `unparse_commit()` when we notice the stale commit graph item. This would be in the same spirit as the rest of this patch as it would lead to an overall safer end state. In any case I'll wait for additional input before sending a v2, most importantly to see whether we think that consistency trumps performance in this case. Personally I'm still of the mind that it should, which also comes from the fact that we were fighting with stale commit graphs several times in production data. Patrick > Other than that, sounds very sensible and the code change is clean. > > Will queue. Thanks. > > > diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh > > index ba65f17dd9..25f8e9e2d3 100755 > > --- a/t/t5318-commit-graph.sh > > +++ b/t/t5318-commit-graph.sh > > @@ -821,4 +821,27 @@ test_expect_success 'overflow during generation version upgrade' ' > > ) > > ' > > > > +test_expect_success 'commit exists in commit-graph but not in object database' ' > > + test_when_finished "rm -rf repo" && > > + git init repo && > > + ( > > + cd repo && > > + > > + test_commit A && > > + test_commit B && > > + test_commit C && > > + git commit-graph write --reachable && > > + > > + # Corrupt the repository by deleting the intermittent commit > > + # object. Commands should notice that this object is absent and > > + # thus that the repository is corrupt even if the commit graph > > + # exists. > > + oid=$(git rev-parse B) && > > + rm .git/objects/"$(test_oid_to_path "$oid")" && > > + > > + test_must_fail git rev-parse HEAD~2 2>error && > > + grep "error: commit $oid exists in commit-graph but not in the object database" error > > + ) > > +' > > + > > test_done
Patrick Steinhardt <ps@pks.im> writes: > Fair point indeed. The following is a worst-case scenario benchmark of > of the change where we do a full topological walk of all reachable > commits in the graph, executed in linux.git. We parse commit parents via > `repo_parse_commit_gently()`, so the new code path now basically has to > check for object existence of every reachable commit: > ... > The added check does lead to a performance regression indeed, which is > not all that unexpected. That being said, the commit-graph still results > in a significant speedup compared to the case where we don't have it. Yeah, I agree that both points are expected. An extra check that is much cheaper than the full parsing is paying a small price to be a bit more careful than before. The question is if the price is small enough. I am still not sure if the extra carefulness is warranted for all normal cases to spend 30% extra cycles Yeah, I agree that both points are expected. An extra check that is much cheaper than the full parsing is paying a small price to be a bit more careful than before. The question is if the price is small enough. I am still not sure if the extra carefulness is warranted for all normal cases to spend 30% extra cycles. Thanks.
On Tue, Oct 17, 2023 at 11:34:53AM -0700, Junio C Hamano wrote: > Patrick Steinhardt <ps@pks.im> writes: > > > Fair point indeed. The following is a worst-case scenario benchmark of > > of the change where we do a full topological walk of all reachable > > commits in the graph, executed in linux.git. We parse commit parents via > > `repo_parse_commit_gently()`, so the new code path now basically has to > > check for object existence of every reachable commit: > > ... > > The added check does lead to a performance regression indeed, which is > > not all that unexpected. That being said, the commit-graph still results > > in a significant speedup compared to the case where we don't have it. > > Yeah, I agree that both points are expected. An extra check that is > much cheaper than the full parsing is paying a small price to be a > bit more careful than before. The question is if the price is small > enough. I am still not sure if the extra carefulness is warranted > for all normal cases to spend 30% extra cycles. > > Thanks. Well, seen in contexts like the thread that spawned this discussion I think it's preferable to take a relatively smaller price compared to disabling the commit graph entirely in some situations. With that in mind, personally I'd be happy with either of two outcomes here: - We take the patch I proposed as a hardening measure, which allows us to use the commit graph in basically any situation where we could parse objects from the ODB without any downside except for a performance hit. - We say that corrupt repositories do not need to be accounted for when using metadata caches like the commit-graph. That would mean in consequence that for the `--missing` flag we would not have to disable commit graphs. The test failure that was exposed in Karthik's test only stems from the fact that we manually corrupt the repository and is not specific to the `--missing` flag. This is further stressed by the new test aded in my own patch, as we can trigger a similar bug when not using `--missing`. For end users, object pruning would happen either via git-gc(1) or via git-maintenance(1), and both of them do know to update commit graphs already. This should significantly decrease the chance that such stale commit graphs would be found out there in the wild, but it's still not impossible of course. Especially in cases where you use lower-level commands like git-repack(1) with cruft packs or git-prune(1), which is often the case for Git hosters, the risk is higher though. Patrick
On Thu, Oct 19, 2023 at 08:45:34AM +0200, Patrick Steinhardt wrote: > On Tue, Oct 17, 2023 at 11:34:53AM -0700, Junio C Hamano wrote: > > Patrick Steinhardt <ps@pks.im> writes: > > > > > Fair point indeed. The following is a worst-case scenario benchmark of > > > of the change where we do a full topological walk of all reachable > > > commits in the graph, executed in linux.git. We parse commit parents via > > > `repo_parse_commit_gently()`, so the new code path now basically has to > > > check for object existence of every reachable commit: > > > ... > > > The added check does lead to a performance regression indeed, which is > > > not all that unexpected. That being said, the commit-graph still results > > > in a significant speedup compared to the case where we don't have it. > > > > Yeah, I agree that both points are expected. An extra check that is > > much cheaper than the full parsing is paying a small price to be a > > bit more careful than before. The question is if the price is small > > enough. I am still not sure if the extra carefulness is warranted > > for all normal cases to spend 30% extra cycles. > > > > Thanks. > > Well, seen in contexts like the thread that spawned this discussion I > think it's preferable to take a relatively smaller price compared to > disabling the commit graph entirely in some situations. With that in > mind, personally I'd be happy with either of two outcomes here: > > - We take the patch I proposed as a hardening measure, which allows > us to use the commit graph in basically any situation where we > could parse objects from the ODB without any downside except for a > performance hit. > > - We say that corrupt repositories do not need to be accounted for > when using metadata caches like the commit-graph. That would mean > in consequence that for the `--missing` flag we would not have to > disable commit graphs. > > The test failure that was exposed in Karthik's test only stems from the > fact that we manually corrupt the repository and is not specific to the > `--missing` flag. This is further stressed by the new test aded in my > own patch, as we can trigger a similar bug when not using `--missing`. There's another way to handle this, which is to conditionally enable the object existence check. This would be less of a performance hit compared to disabling commit graphs altogether with `--missing`, but still ensure that repository corruption was detected. Second, it would not regress performance for all preexisting users of `repo_parse_commit_gently()`. Patrick
Patrick Steinhardt <ps@pks.im> writes: > There's another way to handle this, which is to conditionally enable the > object existence check. This would be less of a performance hit compared > to disabling commit graphs altogether with `--missing`, but still ensure > that repository corruption was detected. Second, it would not regress > performance for all preexisting users of `repo_parse_commit_gently()`. The above was what I meant to suggest when you demonstrated that the code with additional check is still much more performant than running without the commit-graph optimization, yet has observable impact on performance for normal codepaths that do not need the extra carefulness. But I wasn't sure if it is easy to plumb the "do we want to double check? in other words, are we running something like --missing that care the correctness a bit more than usual cases?" bit down from the caller, because this check is so deep in the callchain. Thanks.
On Thu, Oct 19, 2023 at 10:16:56AM -0700, Junio C Hamano wrote: > Patrick Steinhardt <ps@pks.im> writes: > > > There's another way to handle this, which is to conditionally enable the > > object existence check. This would be less of a performance hit compared > > to disabling commit graphs altogether with `--missing`, but still ensure > > that repository corruption was detected. Second, it would not regress > > performance for all preexisting users of `repo_parse_commit_gently()`. > > The above was what I meant to suggest when you demonstrated that the > code with additional check is still much more performant than > running without the commit-graph optimization, yet has observable > impact on performance for normal codepaths that do not need the > extra carefulness. > > But I wasn't sure if it is easy to plumb the "do we want to double > check? in other words, are we running something like --missing that > care the correctness a bit more than usual cases?" bit down from the > caller, because this check is so deep in the callchain. I wonder if we would want a "be extra careful" flag that is read from the environment? That is largely how GIT_REF_PARANOIA works, and then particular operations set it (though actually it is the default these days, so they no longer do so explicitly). I guess that is really like a global variable but even more gross. ;) But it is nice that it can cross process boundaries, because "how careful do we want to be" may be in the eye of the caller, especially for plumbing commands. E.g., even without --missing, you may want "rev-list" to be extra careful about checking the existence of commits. The caller in check_connected() could arguably turn that on by default (the way we used to turn on GIT_REF_PARANOIA for pruning repacks before it became the default). -Peff
Jeff King <peff@peff.net> writes: > I guess that is really like a global variable but even more gross. ;) Yeah, I tend to agree, but ... > But it is nice that it can cross process boundaries, because "how > careful do we want to be" may be in the eye of the caller, especially > for plumbing commands. E.g., even without --missing, you may want > "rev-list" to be extra careful about checking the existence of commits. ... these are all very good points. > The caller in check_connected() could arguably turn that on by default > (the way we used to turn on GIT_REF_PARANOIA for pruning repacks before > it became the default). True. Thanks.
On Fri, Oct 20, 2023 at 06:00:24AM -0400, Jeff King wrote: > On Thu, Oct 19, 2023 at 10:16:56AM -0700, Junio C Hamano wrote: > > > Patrick Steinhardt <ps@pks.im> writes: > > > > > There's another way to handle this, which is to conditionally enable the > > > object existence check. This would be less of a performance hit compared > > > to disabling commit graphs altogether with `--missing`, but still ensure > > > that repository corruption was detected. Second, it would not regress > > > performance for all preexisting users of `repo_parse_commit_gently()`. > > > > The above was what I meant to suggest when you demonstrated that the > > code with additional check is still much more performant than > > running without the commit-graph optimization, yet has observable > > impact on performance for normal codepaths that do not need the > > extra carefulness. > > > > But I wasn't sure if it is easy to plumb the "do we want to double > > check? in other words, are we running something like --missing that > > care the correctness a bit more than usual cases?" bit down from the > > caller, because this check is so deep in the callchain. > > I wonder if we would want a "be extra careful" flag that is read from > the environment? That is largely how GIT_REF_PARANOIA works, and then > particular operations set it (though actually it is the default these > days, so they no longer do so explicitly). > > I guess that is really like a global variable but even more gross. ;) > But it is nice that it can cross process boundaries, because "how > careful do we want to be" may be in the eye of the caller, especially > for plumbing commands. E.g., even without --missing, you may want > "rev-list" to be extra careful about checking the existence of commits. > The caller in check_connected() could arguably turn that on by default > (the way we used to turn on GIT_REF_PARANOIA for pruning repacks before > it became the default). Good point indeed, I also started to ask myself whether we'd want to have a config option. An environment variable `GIT_OBJECT_PARANOIA` would work equally well though and be less "official". What I like about this idea is that it would also allow us to easily unify the behaviour of `lookup_commit_in_graph()` and the new sanity check in `repo_parse_commit_gently()` that I'm introducing here. I'd personally think that the default for any such toggle should be `true`, also to keep the current behaviour of `lookup_commit_in_graph()`. But changing it would be easy enough. I'll send a v2 of this series with such a toggle. Patrick
diff --git a/commit.c b/commit.c index b3223478bc..109e9217e3 100644 --- a/commit.c +++ b/commit.c @@ -572,8 +572,13 @@ int repo_parse_commit_internal(struct repository *r, return -1; if (item->object.parsed) return 0; - if (use_commit_graph && parse_commit_in_graph(r, item)) + if (use_commit_graph && parse_commit_in_graph(r, item)) { + if (!has_object(r, &item->object.oid, 0)) + return quiet_on_missing ? -1 : + error(_("commit %s exists in commit-graph but not in the object database"), + oid_to_hex(&item->object.oid)); return 0; + } if (oid_object_info_extended(r, &item->object.oid, &oi, flags) < 0) return quiet_on_missing ? -1 : diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index ba65f17dd9..25f8e9e2d3 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -821,4 +821,27 @@ test_expect_success 'overflow during generation version upgrade' ' ) ' +test_expect_success 'commit exists in commit-graph but not in object database' ' + test_when_finished "rm -rf repo" && + git init repo && + ( + cd repo && + + test_commit A && + test_commit B && + test_commit C && + git commit-graph write --reachable && + + # Corrupt the repository by deleting the intermittent commit + # object. Commands should notice that this object is absent and + # thus that the repository is corrupt even if the commit graph + # exists. + oid=$(git rev-parse B) && + rm .git/objects/"$(test_oid_to_path "$oid")" && + + test_must_fail git rev-parse HEAD~2 2>error && + grep "error: commit $oid exists in commit-graph but not in the object database" error + ) +' + test_done
Commit graphs can become stale and contain references to commits that do not exist in the object database anymore. Theoretically, this can lead to a scenario where we are able to successfully look up any such commit via the commit graph even though such a lookup would fail if done via the object database directly. As the commit graph is mostly intended as a sort of cache to speed up parsing of commits we do not want to have diverging behaviour in a repository with and a repository without commit graphs, no matter whether they are stale or not. As commits are otherwise immutable, the only thing that we really need to care about is thus the presence or absence of a commit. To address potentially stale commit data that may exist in the graph, our `lookup_commit_in_graph()` function will check for the commit's existence in both the commit graph, but also in the object database. So even if we were able to look up the commit's data in the graph, we would still pretend as if the commit didn't exist if it is missing in the object database. We don't have the same safety net in `parse_commit_in_graph_one()` though. This function is mostly used internally in "commit-graph.c" itself to validate the commit graph, and this usage is fine. We do expose its functionality via `parse_commit_in_graph()` though, which gets called by `repo_parse_commit_internal()`, and that function is in turn used in many places in our codebase. For all I can see this function is never used to directly turn an object ID into a commit object without additional safety checks before or after this lookup. What it is being used for though is to walk history via the parent chain of commits. So when commits in the parent chain of a graph walk are missing it is possible that we wouldn't notice if that missing commit was part of the commit graph. Thus, a query like `git rev-parse HEAD~2` can succeed even if the intermittent commit is missing. It's unclear whether there are additional ways in which such stale commit graphs can lead to problems. In any case, it feels like this is a bigger bug waiting to happen when we gain additional direct or indirect callers of `repo_parse_commit_internal()`. So let's fix the inconsistent behaviour by checking for object existence via the object database, as well. Signed-off-by: Patrick Steinhardt <ps@pks.im> --- commit.c | 7 ++++++- t/t5318-commit-graph.sh | 23 +++++++++++++++++++++++ 2 files changed, 29 insertions(+), 1 deletion(-)