Message ID | 20231009105528.17777-1-karthik.188@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | rev-list: add support for commits in `--missing` | expand |
Karthik Nayak <karthik.188@gmail.com> writes: > This series is an alternate to the patch series I had posted earlier: > https://lore.kernel.org/git/20230908174208.249184-1-karthik.188@gmail.com/. > In that patch, we introduced an option `--ignore-missing-links` which > was added to expose the `ignore_missing_links` bit to the user. The > issue in that patch was that, the option `--ignore-missing-links` didn't > play well the pre-existing `--missing` option. This series avoids that > route and just extends the `--missing` option for commits to solve the > same problem. Thanks for exploring the problem space in the other direction. I haven't really thought things through yet, but it looks to me that the updated behaviour of "--missing" is more in line with what the option would have wanted to be from day one. > Karthik Nayak (3): > revision: rename bit to `do_not_die_on_missing_objects` > rev-list: move `show_commit()` to the bottom > rev-list: add commit object support in `--missing` option > > builtin/reflog.c | 2 +- > builtin/rev-list.c | 93 +++++++++++++++++++------------------ > list-objects.c | 2 +- > object.h | 2 +- > revision.c | 9 +++- > revision.h | 20 ++++---- > t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++ > 7 files changed, 145 insertions(+), 57 deletions(-) > create mode 100755 t/t6022-rev-list-missing.sh
On Mon, Oct 09, 2023 at 12:55:25PM +0200, Karthik Nayak wrote: > The `--missing` option in git-rev-list(1) was introduced intitally > to deal with missing blobs in the context of promissory notes. > Eventually the option was extended to also support tree objects in > 7c0fe330d5 (rev-list: handle missing tree objects properly,2018-10-05). > > This patch series extends the `--missing` option to also add support for > commit objects. We do this by introducing a new flag `MISSING` which is > added whenever we encounter a missing commit during traversal. Then in > `builtin/rev-list` we check for this flag and take the appropriate > action based on the `--missing=*` option used. > > This series is an alternate to the patch series I had posted earlier: > https://lore.kernel.org/git/20230908174208.249184-1-karthik.188@gmail.com/. > In that patch, we introduced an option `--ignore-missing-links` which > was added to expose the `ignore_missing_links` bit to the user. The > issue in that patch was that, the option `--ignore-missing-links` didn't > play well the pre-existing `--missing` option. This series avoids that > route and just extends the `--missing` option for commits to solve the > same problem. I had already reviewed the patches internally at GitLab, so for what it's worth please feel free to add my Reviewed-by. Patrick > Karthik Nayak (3): > revision: rename bit to `do_not_die_on_missing_objects` > rev-list: move `show_commit()` to the bottom > rev-list: add commit object support in `--missing` option > > builtin/reflog.c | 2 +- > builtin/rev-list.c | 93 +++++++++++++++++++------------------ > list-objects.c | 2 +- > object.h | 2 +- > revision.c | 9 +++- > revision.h | 20 ++++---- > t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++ > 7 files changed, 145 insertions(+), 57 deletions(-) > create mode 100755 t/t6022-rev-list-missing.sh > > -- > 2.41.0 >
Patrick Steinhardt <ps@pks.im> writes: > I had already reviewed the patches internally at GitLab, so for what > it's worth please feel free to add my Reviewed-by. Great. It seems that 'seen' with this series fails to pass the tests, though. https://github.com/git/git/actions/runs/6462854176/job/17545104753 Thanks. > > Patrick > >> Karthik Nayak (3): >> revision: rename bit to `do_not_die_on_missing_objects` >> rev-list: move `show_commit()` to the bottom >> rev-list: add commit object support in `--missing` option >> >> builtin/reflog.c | 2 +- >> builtin/rev-list.c | 93 +++++++++++++++++++------------------ >> list-objects.c | 2 +- >> object.h | 2 +- >> revision.c | 9 +++- >> revision.h | 20 ++++---- >> t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++ >> 7 files changed, 145 insertions(+), 57 deletions(-) >> create mode 100755 t/t6022-rev-list-missing.sh >> >> -- >> 2.41.0 >>
On Tue, Oct 10, 2023 at 7:09 PM Junio C Hamano <gitster@pobox.com> wrote: > > Patrick Steinhardt <ps@pks.im> writes: > > > I had already reviewed the patches internally at GitLab, so for what > > it's worth please feel free to add my Reviewed-by. > > Great. It seems that 'seen' with this series fails to pass the > tests, though. > > https://github.com/git/git/actions/runs/6462854176/job/17545104753 Seems like this is because of commit-graph being enabled, I think the best thing to do here would be to disable the commit graph of these tests. This should do: diff --git a/t/t6022-rev-list-missing.sh b/t/t6022-rev-list-missing.sh index bbff66e4fc..39a8402682 100755 --- a/t/t6022-rev-list-missing.sh +++ b/t/t6022-rev-list-missing.sh @@ -5,6 +5,11 @@ test_description='handling of missing objects in rev-list' TEST_PASSES_SANITIZE_LEAK=true . ./test-lib.sh +# Disable writing the commit graph as the tests depend on making particular +# commits hidden, the graph is created before that and rev-list would default +# to using the commit graph in such instances. +GIT_TEST_COMMIT_GRAPH=0 + # We setup the repository with two commits, this way HEAD is always # available and we can hide commit 1. test_expect_success 'create repository and alternate directory' ' Thanks for reporting. I'll wait for a day or two (for reviews) and will add this to the second version of this series.
Karthik Nayak <karthik.188@gmail.com> writes: > Seems like this is because of commit-graph being enabled, I think > the best thing to do here would be to disable the commit graph of > these tests. If the CI uncovered that the new code is broken and does not work with commit-graph, wouldn't the above a totally wrong approach to correct it? If the updated logic cannot work correctly when commit-graph covers the history you intentionally break, shouldn't the code, when the feature that is incompatible with commit-graph is triggered, disable the commit-graph? I am assuming that the new feature is meant to be used to recover from a corrupt repository, and if it does not work well when commit-graph knows (now stale after repository corruption) more about the objects that are corrupt in the object store, we do want to disable commit-graph. After all, commit-graph is a secondary information that is supposed to be recoverable from the primary data that is what is in the object store. Disabling commit-graph in the test means you are telling the end-users "do not use commit-graph if you want to use this feature", which sounds like a wrong thing to do.
On Wed, Oct 11, 2023 at 6:54 PM Junio C Hamano <gitster@pobox.com> wrote: > > Karthik Nayak <karthik.188@gmail.com> writes: > > > Seems like this is because of commit-graph being enabled, I think > > the best thing to do here would be to disable the commit graph of > > these tests. > > If the CI uncovered that the new code is broken and does not work > with commit-graph, wouldn't the above a totally wrong approach to > correct it? If the updated logic cannot work correctly when > commit-graph covers the history you intentionally break, shouldn't > the code, when the feature that is incompatible with commit-graph is > triggered, disable the commit-graph? I am assuming that the new > feature is meant to be used to recover from a corrupt repository, > and if it does not work well when commit-graph knows (now stale > after repository corruption) more about the objects that are corrupt > in the object store, we do want to disable commit-graph. After all, > commit-graph is a secondary information that is supposed to be > recoverable from the primary data that is what is in the object > store. Disabling commit-graph in the test means you are telling the > end-users "do not use commit-graph if you want to use this feature", > which sounds like a wrong thing to do. I agree with what you're saying. Disabling writing the commit-graph for only the test doesn't serve the real usage. To ensure that the feature should work with corrupt repositories with stale commit-graph, I'm thinking of disabling the commit-graph whenever the `--missing` option is used. The commit graph already provides an API for this, so it should be simple to do. Thanks!
On Thu, Oct 12, 2023 at 12:44:27PM +0200, Karthik Nayak wrote: > On Wed, Oct 11, 2023 at 6:54 PM Junio C Hamano <gitster@pobox.com> wrote: > > > > Karthik Nayak <karthik.188@gmail.com> writes: > > > > > Seems like this is because of commit-graph being enabled, I think > > > the best thing to do here would be to disable the commit graph of > > > these tests. > > > > If the CI uncovered that the new code is broken and does not work > > with commit-graph, wouldn't the above a totally wrong approach to > > correct it? If the updated logic cannot work correctly when > > commit-graph covers the history you intentionally break, shouldn't > > the code, when the feature that is incompatible with commit-graph is > > triggered, disable the commit-graph? I am assuming that the new > > feature is meant to be used to recover from a corrupt repository, > > and if it does not work well when commit-graph knows (now stale > > after repository corruption) more about the objects that are corrupt > > in the object store, we do want to disable commit-graph. After all, > > commit-graph is a secondary information that is supposed to be > > recoverable from the primary data that is what is in the object > > store. Disabling commit-graph in the test means you are telling the > > end-users "do not use commit-graph if you want to use this feature", > > which sounds like a wrong thing to do. > > I agree with what you're saying. Disabling writing the commit-graph for > only the test doesn't serve the real usage. To ensure that the feature should > work with corrupt repositories with stale commit-graph, I'm thinking of > disabling the commit-graph whenever the `--missing` option is used. The > commit graph already provides an API for this, so it should be simple to do. > > Thanks! Wouldn't this have the potential to significantly regress performance for all those preexisting users of the `--missing` option? The commit graph is quite an important optimization nowadays, and especially in commands where we potentially walk a lot of commits (like we may do here) it can result in queries that are orders of magnitudes faster. If we were introducing a new option then I think this would be an acceptable tradeoff as we could still fix the performance issue in another iteration. But we don't and instead aim to change the meaning of `--missing` which may already have users out there. Seen in that light it does not seem sensible to me to just disable the graph for them. Did you figure out what exactly the root cause of this functional regression is? If so, can we address that root cause instead? Patrick
On Thu, Oct 12, 2023 at 1:04 PM Patrick Steinhardt <ps@pks.im> wrote: > Wouldn't this have the potential to significantly regress performance > for all those preexisting users of the `--missing` option? The commit > graph is quite an important optimization nowadays, and especially in > commands where we potentially walk a lot of commits (like we may do > here) it can result in queries that are orders of magnitudes faster. > > If we were introducing a new option then I think this would be an > acceptable tradeoff as we could still fix the performance issue in > another iteration. But we don't and instead aim to change the meaning of > `--missing` which may already have users out there. Seen in that light > it does not seem sensible to me to just disable the graph for them. > Agreed, that there is a rather huge performance drop doing this. So either: 1. Add commit support for `--missing` and disable commit-graph. There is huge performance drop here for commits. 2. Add commit support for `--missing` but disabling commit-graph is provided via an additional `--disable-commit-graph` option. This expects the user to manually disable commit-graph for `--missing` to work correctly with commits. I was also thinking about this, but people who are using `--missing` till date, also use it with `--objects`. Using `--objects` is already slow and doesn't benefit from the commit-graph. So I don't know if there is much of a performance hit. Some benchmarks on the Git repo: With commit graphs: $ hyperfine --warmup 2 "git rev-list --objects --missing=print HEAD > /dev/null" Benchmark 1: git rev-list --objects --missing=print HEAD > /dev/null Time (mean ± σ): 1.336 s ± 0.013 s [User: 1.278 s, System: 0.054 s] Range (min … max): 1.323 s … 1.365 s 10 runs $ hyperfine --warmup 2 "git rev-list --missing=print HEAD > /dev/null" Benchmark 1: git rev-list --missing=print HEAD > /dev/null Time (mean ± σ): 29.6 ms ± 1.5 ms [User: 20.2 ms, System: 9.1 ms] Range (min … max): 27.5 ms … 35.1 ms 92 runs Without commit graphs: $ hyperfine --warmup 2 "git rev-list --objects --missing=print HEAD > /dev/null" Benchmark 1: git rev-list --objects --missing=print HEAD > /dev/null Time (mean ± σ): 1.735 s ± 0.022 s [User: 1.672 s, System: 0.057 s] Range (min … max): 1.703 s … 1.765 s 10 runs $ hyperfine --warmup 2 "git rev-list --missing=print HEAD > /dev/null" Benchmark 1: git rev-list --missing=print HEAD > /dev/null Time (mean ± σ): 426.6 ms ± 10.2 ms [User: 410.1 ms, System: 15.1 ms] Range (min … max): 414.9 ms … 441.3 ms 10 runs This shows that while there is definitely a performance loss when disabling commit-graphs. This loss is much lesser when used alongside the `--objects` flag. Overall, I lean towards the option "1", but okay with either. > Did you figure out what exactly the root cause of this functional > regression is? If so, can we address that root cause instead? Are you referring to `--missing` not working with commits? Looking at the history of `--missing` shows: - caf3827e2f: This commits introduces the `--missing` option, it basically states that for the upcoming partial clone mechanism, we want to add support for identifying missing objects. - 7c0fe330d5: This commit adds support for trees in `--missing`. It also goes on to state that `--missing` assumed only blobs could be missing. Both of these show that `--missing` started off support for only blobs and eventually added support to trees (similar to how we're adding support for commits here). So I guess the intention was to never work with commits. But the documentation and the option seem generic enough. Considering that trees was an extension, commits should be treated the same way.
Patrick Steinhardt <ps@pks.im> writes: > Wouldn't this have the potential to significantly regress performance > for all those preexisting users of the `--missing` option? The commit > graph is quite an important optimization nowadays, and especially in > commands where we potentially walk a lot of commits (like we may do > here) it can result in queries that are orders of magnitudes faster. The test fails only when GIT_TEST_COMMIT_GRAPH is on, which updates the commit-graph every time a commit is made via "git commit" or "git merge". I'd suggest stepping back and think a bit. My assumption has been that the failing test emulates this scenario that can happen in real life: * The user creates a new commit. * A commit graph is written (not as part of GIT_TEST_COMMIT_GRAPH that is not realistic, but as part of "maintenance"). * The repository loses some objects due to corruption. * Now, "--missing=print" is invoked so that the user can view what are missing. Or "--missing=allow-primisor" to ensure that the repository does not have missing objects other than the ones that the promisor will give us if we asked again. * But because the connectivity of these objects appear in the commit graph file, we fail to notice that these objects are missing, producing wrong results. If we disabled commit-graph while traversal (an earlier writing of it was perfectly OK), then "rev-list --missing" would have noticed and reported what the user wanted to know. In other words, the "optimization" you value is working to quickly produce a wrong result. Is it "significantly regress"ing if we disabled it to obtain the correct result? My assumption also has been that there is no point in running "rev-list --missing" if we know there is no repository corruption, and those who run "rev-list --missing" wants to know if the objects are really available, i.e. even if commit-graph that is out of sync with reality says it exists, if it is not in the object store, they would want to know that. If you can show me that it is not the case, then I may be pursuaded why producing a result that is out of sync with reality _quickly_, instead of taking time to produce a result that matches reality, is a worthy "optimization" to keep. Thanks.
Karthik Nayak <karthik.188@gmail.com> writes: > ... To ensure that the feature should > work with corrupt repositories with stale commit-graph, I'm thinking of > disabling the commit-graph whenever the `--missing` option is used. Just to avoid giving anybody a wrong guideline, for most of the other features, we should assume that the commit-graph accurately represents the reality in the object database and take advantage of it, leaving it to "git fsck" to ensure that the commit-graph is healthy. But "rev-list --missing", especially when used with actions other than allow-promisor or allow-any, is a feature that is only useful when one wants to really notice objects that are available in the object store, and it would be a bug for it to pretend as if objects the commit-graph has heard of exists without checking.
On Thu, Oct 12, 2023 at 09:17:09AM -0700, Junio C Hamano wrote: > Patrick Steinhardt <ps@pks.im> writes: > > > Wouldn't this have the potential to significantly regress performance > > for all those preexisting users of the `--missing` option? The commit > > graph is quite an important optimization nowadays, and especially in > > commands where we potentially walk a lot of commits (like we may do > > here) it can result in queries that are orders of magnitudes faster. > > The test fails only when GIT_TEST_COMMIT_GRAPH is on, which updates > the commit-graph every time a commit is made via "git commit" or > "git merge". > > I'd suggest stepping back and think a bit. > > My assumption has been that the failing test emulates this scenario > that can happen in real life: > > * The user creates a new commit. > > * A commit graph is written (not as part of GIT_TEST_COMMIT_GRAPH > that is not realistic, but as part of "maintenance"). > > * The repository loses some objects due to corruption. > > * Now, "--missing=print" is invoked so that the user can view what > are missing. Or "--missing=allow-primisor" to ensure that the > repository does not have missing objects other than the ones that > the promisor will give us if we asked again. > > * But because the connectivity of these objects appear in the > commit graph file, we fail to notice that these objects are > missing, producing wrong results. If we disabled commit-graph > while traversal (an earlier writing of it was perfectly OK), then > "rev-list --missing" would have noticed and reported what the > user wanted to know. > > In other words, the "optimization" you value is working to quickly > produce a wrong result. Is it "significantly regress"ing if we > disabled it to obtain the correct result? It depends, in my opinion. If: - Wrong results caused by the commit graph are only introduced with this patch series due to the changed behaviour of `--missing`. - We disable commit graphs proactively only because of the changed behaviour of `--missing`. Then yes, it does feel wrong to me to disable commit graphs and regress performance for usecases that perviously worked both correct and fast. > My assumption also has been that there is no point in running > "rev-list --missing" if we know there is no repository corruption, > and those who run "rev-list --missing" wants to know if the objects > are really available, i.e. even if commit-graph that is out of sync > with reality says it exists, if it is not in the object store, they > would want to know that. > > If you can show me that it is not the case, then I may be pursuaded > why producing a result that is out of sync with reality _quickly_, > instead of taking time to produce a result that matches reality, is > a worthy "optimization" to keep. Note that I'm not saying that it's fine to return wrong results -- this is of course a bug that needs to be addressed somehow. After all, things working correctly should always trump things working fast. But until now it felt more like we were going into the direction of disabling commit graphs without checking whether there is an alternative solution that allows us to get the best of both worlds, correctness and performance. So what I'm looking for in this thread is a reason why we _can't_ have that, or at least can't have it without unreasonable amounts of work. We have helpers like `lookup_commit_in_graph()` that are designed to detect stale commit graphs by double-checking whether a commit that has been looked up via the commit graph actually exists in the repository. So I'm wondering whether this could help us address the issue. If there is a good reason why all of that is not possible then I'm happy to carve in. Patrick
On Fri, Oct 13, 2023 at 07:53:36AM +0200, Patrick Steinhardt wrote: > On Thu, Oct 12, 2023 at 09:17:09AM -0700, Junio C Hamano wrote: > > Patrick Steinhardt <ps@pks.im> writes: [snip] > > My assumption also has been that there is no point in running > > "rev-list --missing" if we know there is no repository corruption, > > and those who run "rev-list --missing" wants to know if the objects > > are really available, i.e. even if commit-graph that is out of sync > > with reality says it exists, if it is not in the object store, they > > would want to know that. > > > > If you can show me that it is not the case, then I may be pursuaded > > why producing a result that is out of sync with reality _quickly_, > > instead of taking time to produce a result that matches reality, is > > a worthy "optimization" to keep. > > Note that I'm not saying that it's fine to return wrong results -- this > is of course a bug that needs to be addressed somehow. After all, things > working correctly should always trump things working fast. But until now > it felt more like we were going into the direction of disabling commit > graphs without checking whether there is an alternative solution that > allows us to get the best of both worlds, correctness and performance. > > So what I'm looking for in this thread is a reason why we _can't_ have > that, or at least can't have it without unreasonable amounts of work. We > have helpers like `lookup_commit_in_graph()` that are designed to detect > stale commit graphs by double-checking whether a commit that has been > looked up via the commit graph actually exists in the repository. So I'm > wondering whether this could help us address the issue. > > If there is a good reason why all of that is not possible then I'm happy > to carve in. I've had a quick look at this problem so that I can solidify my own train of thought a bit. The issue is `repo_parse_commit_internal()`, which calls `parse_commit_in_graph()` without verifying that the object actually exists in the object database. It's the only callsite of that function outside of "commit-graph.c", as all other external callers would call `lookup_commit_in_graph()` which _does_ perform the object existence check. So I think that the proper way to address the regression would be a patch similar to the following: diff --git a/commit.c b/commit.c index b3223478bc..109e9217e3 100644 --- a/commit.c +++ b/commit.c @@ -572,8 +572,13 @@ int repo_parse_commit_internal(struct repository *r, return -1; if (item->object.parsed) return 0; - if (use_commit_graph && parse_commit_in_graph(r, item)) + if (use_commit_graph && parse_commit_in_graph(r, item)) { + if (!has_object(r, &item->object.oid, 0)) + return quiet_on_missing ? -1 : + error(_("commit %s exists in commit-graph but not in the object database"), + oid_to_hex(&item->object.oid)); return 0; + } if (oid_object_info_extended(r, &item->object.oid, &oi, flags) < 0) return quiet_on_missing ? -1 : I wouldn't be surprised if there are other edge cases where this can lead to buggy behaviour. Also, this issue may not necessarily stem from repository corruption. It could for example happen that commits are getting garbage collected without the commit graph having been updated, whatever the reason may be for this. In that case we would happily continue to return these commits from the commit graph even though the underlying object has since been deleted. The repository itself is not corrupt though, we merely look at an out-of-date commit graph. And for what it's worth, I think that we should always gracefully handle that case or otherwise the commit graph becomes less useful overall. I didn't dig much deeper yet. And while the above patch fixes some of the test failures, it doesn't fix them all. If we agree that this is the way to go then I'd be happy to turn this into a proper patch. Patrick
Patrick Steinhardt <ps@pks.im> writes: > actually exists in the object database. It's the only callsite of that > function outside of "commit-graph.c", as all other external callers > would call `lookup_commit_in_graph()` which _does_ perform the object > existence check. > > So I think that the proper way to address the regression would be a > patch similar to the following: > > diff --git a/commit.c b/commit.c > index b3223478bc..109e9217e3 100644 > --- a/commit.c > +++ b/commit.c > @@ -572,8 +572,13 @@ int repo_parse_commit_internal(struct repository *r, > return -1; > if (item->object.parsed) > return 0; > - if (use_commit_graph && parse_commit_in_graph(r, item)) > + if (use_commit_graph && parse_commit_in_graph(r, item)) { > + if (!has_object(r, &item->object.oid, 0)) > + return quiet_on_missing ? -1 : > + error(_("commit %s exists in commit-graph but not in the object database"), > + oid_to_hex(&item->object.oid)); > return 0; > + } It is (overly) conservative, which I generally should find pleasing, but as I said, for secondary information like commit-graph that is derived from the primary source only to precompute for optimization, our general attitude should be to trust it and let the optimization kick in, unless the operation being performed primarily cares about the case where the result from using and not using the secondary source differs. An obvious example of such an operation is "fsck", where we do care and want to notice when the underlying object graph and what commit-graph records contradict with each other. And my suggestion to disable commit-graph while running the "rev-list" command with the "--missing" option is because that usage would fall into the same category (please correct me if that is not the case) [*]. So for the purpose of "rev-list --missing", the above change does take us closer to the answer we would get from the primary source, but isn't it pessimising other users unnecessarily? Do we have a rough idea (if we have a benchmark that would be even better) how much this lack of has_object() check is contributing to the performance benefit of using commit-graph? Majority of callers of this code should not have to pay the additional overhead here, so unless it is small enough, this would be pessimising the generic code for everybody to please "rev-list --missing", which is why I am worried about going in this direction.