[0/3] rev-list: add support for commits in `--missing`

Message ID	20231009105528.17777-1-karthik.188@gmail.com (mailing list archive)
Headers	show Return-Path: <git-owner@vger.kernel.org> From: Karthik Nayak <karthik.188@gmail.com> To: git@vger.kernel.org Cc: Karthik Nayak <karthik.188@gmail.com> Subject: [PATCH 0/3] rev-list: add support for commits in `--missing` Date: Mon, 9 Oct 2023 12:55:25 +0200 Message-ID: <20231009105528.17777-1-karthik.188@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	rev-list: add support for commits in `--missing` \| expand [0/3] rev-list: add support for commits in `--missing` [1/3] revision: rename bit to `do_not_die_on_missing_objects` [2/3] rev-list: move `show_commit()` to the bottom [3/3] rev-list: add commit object support in `--missing` option

karthik nayak Oct. 9, 2023, 10:55 a.m. UTC

The `--missing` option in git-rev-list(1) was introduced intitally
to deal with missing blobs in the context of promissory notes.
Eventually the option was extended to also support tree objects in
7c0fe330d5 (rev-list: handle missing tree objects properly,2018-10-05).

This patch series extends the `--missing` option to also add support for
commit objects. We do this by introducing a new flag `MISSING` which is
added whenever we encounter a missing commit during traversal. Then in
`builtin/rev-list` we check for this flag and take the appropriate
action based on the `--missing=*` option used.

This series is an alternate to the patch series I had posted earlier:
https://lore.kernel.org/git/20230908174208.249184-1-karthik.188@gmail.com/.
In that patch, we introduced an option `--ignore-missing-links` which
was added to expose the `ignore_missing_links` bit to the user. The
issue in that patch was that, the option `--ignore-missing-links` didn't
play well the pre-existing `--missing` option. This series avoids that
route and just extends the `--missing` option for commits to solve the
same problem.

Karthik Nayak (3):
  revision: rename bit to `do_not_die_on_missing_objects`
  rev-list: move `show_commit()` to the bottom
  rev-list: add commit object support in `--missing` option

 builtin/reflog.c            |  2 +-
 builtin/rev-list.c          | 93 +++++++++++++++++++------------------
 list-objects.c              |  2 +-
 object.h                    |  2 +-
 revision.c                  |  9 +++-
 revision.h                  | 20 ++++----
 t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++
 7 files changed, 145 insertions(+), 57 deletions(-)
 create mode 100755 t/t6022-rev-list-missing.sh

Junio C Hamano Oct. 9, 2023, 10:02 p.m. UTC | #1

Karthik Nayak <karthik.188@gmail.com> writes:

> This series is an alternate to the patch series I had posted earlier:
> https://lore.kernel.org/git/20230908174208.249184-1-karthik.188@gmail.com/.
> In that patch, we introduced an option `--ignore-missing-links` which
> was added to expose the `ignore_missing_links` bit to the user. The
> issue in that patch was that, the option `--ignore-missing-links` didn't
> play well the pre-existing `--missing` option. This series avoids that
> route and just extends the `--missing` option for commits to solve the
> same problem.

Thanks for exploring the problem space in the other direction.

I haven't really thought things through yet, but it looks to me that
the updated behaviour of "--missing" is more in line with what the
option would have wanted to be from day one.

> Karthik Nayak (3):
>   revision: rename bit to `do_not_die_on_missing_objects`
>   rev-list: move `show_commit()` to the bottom
>   rev-list: add commit object support in `--missing` option
>
>  builtin/reflog.c            |  2 +-
>  builtin/rev-list.c          | 93 +++++++++++++++++++------------------
>  list-objects.c              |  2 +-
>  object.h                    |  2 +-
>  revision.c                  |  9 +++-
>  revision.h                  | 20 ++++----
>  t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++
>  7 files changed, 145 insertions(+), 57 deletions(-)
>  create mode 100755 t/t6022-rev-list-missing.sh

Patrick Steinhardt Oct. 10, 2023, 6:19 a.m. UTC | #2

On Mon, Oct 09, 2023 at 12:55:25PM +0200, Karthik Nayak wrote:
> The `--missing` option in git-rev-list(1) was introduced intitally
> to deal with missing blobs in the context of promissory notes.
> Eventually the option was extended to also support tree objects in
> 7c0fe330d5 (rev-list: handle missing tree objects properly,2018-10-05).
> 
> This patch series extends the `--missing` option to also add support for
> commit objects. We do this by introducing a new flag `MISSING` which is
> added whenever we encounter a missing commit during traversal. Then in
> `builtin/rev-list` we check for this flag and take the appropriate
> action based on the `--missing=*` option used.
> 
> This series is an alternate to the patch series I had posted earlier:
> https://lore.kernel.org/git/20230908174208.249184-1-karthik.188@gmail.com/.
> In that patch, we introduced an option `--ignore-missing-links` which
> was added to expose the `ignore_missing_links` bit to the user. The
> issue in that patch was that, the option `--ignore-missing-links` didn't
> play well the pre-existing `--missing` option. This series avoids that
> route and just extends the `--missing` option for commits to solve the
> same problem.

I had already reviewed the patches internally at GitLab, so for what
it's worth please feel free to add my Reviewed-by.

Patrick

> Karthik Nayak (3):
>   revision: rename bit to `do_not_die_on_missing_objects`
>   rev-list: move `show_commit()` to the bottom
>   rev-list: add commit object support in `--missing` option
> 
>  builtin/reflog.c            |  2 +-
>  builtin/rev-list.c          | 93 +++++++++++++++++++------------------
>  list-objects.c              |  2 +-
>  object.h                    |  2 +-
>  revision.c                  |  9 +++-
>  revision.h                  | 20 ++++----
>  t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++
>  7 files changed, 145 insertions(+), 57 deletions(-)
>  create mode 100755 t/t6022-rev-list-missing.sh
> 
> -- 
> 2.41.0
>

Junio C Hamano Oct. 10, 2023, 5:09 p.m. UTC | #3

Patrick Steinhardt <ps@pks.im> writes:

> I had already reviewed the patches internally at GitLab, so for what
> it's worth please feel free to add my Reviewed-by.

Great.  It seems that 'seen' with this series fails to pass the
tests, though.

https://github.com/git/git/actions/runs/6462854176/job/17545104753

Thanks.

>
> Patrick
>
>> Karthik Nayak (3):
>>   revision: rename bit to `do_not_die_on_missing_objects`
>>   rev-list: move `show_commit()` to the bottom
>>   rev-list: add commit object support in `--missing` option
>> 
>>  builtin/reflog.c            |  2 +-
>>  builtin/rev-list.c          | 93 +++++++++++++++++++------------------
>>  list-objects.c              |  2 +-
>>  object.h                    |  2 +-
>>  revision.c                  |  9 +++-
>>  revision.h                  | 20 ++++----
>>  t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++
>>  7 files changed, 145 insertions(+), 57 deletions(-)
>>  create mode 100755 t/t6022-rev-list-missing.sh
>> 
>> -- 
>> 2.41.0
>>

karthik nayak Oct. 11, 2023, 10:37 a.m. UTC | #4

On Tue, Oct 10, 2023 at 7:09 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Patrick Steinhardt <ps@pks.im> writes:
>
> > I had already reviewed the patches internally at GitLab, so for what
> > it's worth please feel free to add my Reviewed-by.
>
> Great.  It seems that 'seen' with this series fails to pass the
> tests, though.
>
> https://github.com/git/git/actions/runs/6462854176/job/17545104753

Seems like this is because of commit-graph being enabled, I think the
best thing to do here
would be to disable the commit graph of these tests.

This should do:

    diff --git a/t/t6022-rev-list-missing.sh b/t/t6022-rev-list-missing.sh
    index bbff66e4fc..39a8402682 100755
    --- a/t/t6022-rev-list-missing.sh
    +++ b/t/t6022-rev-list-missing.sh
    @@ -5,6 +5,11 @@ test_description='handling of missing objects in rev-list'
     TEST_PASSES_SANITIZE_LEAK=true
     . ./test-lib.sh

    +# Disable writing the commit graph as the tests depend on making particular
    +# commits hidden, the graph is created before that and rev-list
would default
    +# to using the commit graph in such instances.
    +GIT_TEST_COMMIT_GRAPH=0
    +
     # We setup the repository with two commits, this way HEAD is always
     # available and we can hide commit 1.
     test_expect_success 'create repository and alternate directory' '

Thanks for reporting. I'll wait for a day or two (for reviews) and
will add this to the second version
of this series.

Junio C Hamano Oct. 11, 2023, 4:54 p.m. UTC | #5

Karthik Nayak <karthik.188@gmail.com> writes:

> Seems like this is because of commit-graph being enabled, I think
> the best thing to do here would be to disable the commit graph of
> these tests.

If the CI uncovered that the new code is broken and does not work
with commit-graph, wouldn't the above a totally wrong approach to
correct it?  If the updated logic cannot work correctly when
commit-graph covers the history you intentionally break, shouldn't
the code, when the feature that is incompatible with commit-graph is
triggered, disable the commit-graph?  I am assuming that the new
feature is meant to be used to recover from a corrupt repository,
and if it does not work well when commit-graph knows (now stale
after repository corruption) more about the objects that are corrupt
in the object store, we do want to disable commit-graph.  After all,
commit-graph is a secondary information that is supposed to be
recoverable from the primary data that is what is in the object
store.  Disabling commit-graph in the test means you are telling the
end-users "do not use commit-graph if you want to use this feature",
which sounds like a wrong thing to do.

karthik nayak Oct. 12, 2023, 10:44 a.m. UTC | #6

On Wed, Oct 11, 2023 at 6:54 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Karthik Nayak <karthik.188@gmail.com> writes:
>
> > Seems like this is because of commit-graph being enabled, I think
> > the best thing to do here would be to disable the commit graph of
> > these tests.
>
> If the CI uncovered that the new code is broken and does not work
> with commit-graph, wouldn't the above a totally wrong approach to
> correct it?  If the updated logic cannot work correctly when
> commit-graph covers the history you intentionally break, shouldn't
> the code, when the feature that is incompatible with commit-graph is
> triggered, disable the commit-graph?  I am assuming that the new
> feature is meant to be used to recover from a corrupt repository,
> and if it does not work well when commit-graph knows (now stale
> after repository corruption) more about the objects that are corrupt
> in the object store, we do want to disable commit-graph.  After all,
> commit-graph is a secondary information that is supposed to be
> recoverable from the primary data that is what is in the object
> store.  Disabling commit-graph in the test means you are telling the
> end-users "do not use commit-graph if you want to use this feature",
> which sounds like a wrong thing to do.

I agree with what you're saying. Disabling writing the commit-graph for
only the test doesn't serve the real usage. To ensure that the feature should
work with corrupt repositories with stale commit-graph, I'm thinking of
disabling the commit-graph whenever the `--missing` option is used. The
commit graph already provides an API for this, so it should be simple to do.

Thanks!

Patrick Steinhardt Oct. 12, 2023, 11:04 a.m. UTC | #7

On Thu, Oct 12, 2023 at 12:44:27PM +0200, Karthik Nayak wrote:
> On Wed, Oct 11, 2023 at 6:54 PM Junio C Hamano <gitster@pobox.com> wrote:
> >
> > Karthik Nayak <karthik.188@gmail.com> writes:
> >
> > > Seems like this is because of commit-graph being enabled, I think
> > > the best thing to do here would be to disable the commit graph of
> > > these tests.
> >
> > If the CI uncovered that the new code is broken and does not work
> > with commit-graph, wouldn't the above a totally wrong approach to
> > correct it?  If the updated logic cannot work correctly when
> > commit-graph covers the history you intentionally break, shouldn't
> > the code, when the feature that is incompatible with commit-graph is
> > triggered, disable the commit-graph?  I am assuming that the new
> > feature is meant to be used to recover from a corrupt repository,
> > and if it does not work well when commit-graph knows (now stale
> > after repository corruption) more about the objects that are corrupt
> > in the object store, we do want to disable commit-graph.  After all,
> > commit-graph is a secondary information that is supposed to be
> > recoverable from the primary data that is what is in the object
> > store.  Disabling commit-graph in the test means you are telling the
> > end-users "do not use commit-graph if you want to use this feature",
> > which sounds like a wrong thing to do.
> 
> I agree with what you're saying. Disabling writing the commit-graph for
> only the test doesn't serve the real usage. To ensure that the feature should
> work with corrupt repositories with stale commit-graph, I'm thinking of
> disabling the commit-graph whenever the `--missing` option is used. The
> commit graph already provides an API for this, so it should be simple to do.
> 
> Thanks!

Wouldn't this have the potential to significantly regress performance
for all those preexisting users of the `--missing` option? The commit
graph is quite an important optimization nowadays, and especially in
commands where we potentially walk a lot of commits (like we may do
here) it can result in queries that are orders of magnitudes faster.

If we were introducing a new option then I think this would be an
acceptable tradeoff as we could still fix the performance issue in
another iteration. But we don't and instead aim to change the meaning of
`--missing` which may already have users out there. Seen in that light
it does not seem sensible to me to just disable the graph for them.

Did you figure out what exactly the root cause of this functional
regression is? If so, can we address that root cause instead?

Patrick

karthik nayak Oct. 12, 2023, 1:23 p.m. UTC | #8

On Thu, Oct 12, 2023 at 1:04 PM Patrick Steinhardt <ps@pks.im> wrote:
> Wouldn't this have the potential to significantly regress performance
> for all those preexisting users of the `--missing` option? The commit
> graph is quite an important optimization nowadays, and especially in
> commands where we potentially walk a lot of commits (like we may do
> here) it can result in queries that are orders of magnitudes faster.
>
> If we were introducing a new option then I think this would be an
> acceptable tradeoff as we could still fix the performance issue in
> another iteration. But we don't and instead aim to change the meaning of
> `--missing` which may already have users out there. Seen in that light
> it does not seem sensible to me to just disable the graph for them.
>

Agreed, that there is a rather huge performance drop doing this. So either:
1. Add commit support for `--missing` and disable commit-graph. There is huge
performance drop here for commits.
2. Add commit support for `--missing` but disabling commit-graph is provided via
an additional `--disable-commit-graph` option. This expects the user to manually
disable commit-graph for `--missing` to work correctly with commits.

I was also thinking about this, but people who are using `--missing`
till date, also use
it with `--objects`. Using `--objects` is already slow and doesn't
benefit from the
commit-graph. So I don't know if there is much of a performance hit.

Some benchmarks on the Git repo:

With commit graphs:

$ hyperfine --warmup 2 "git rev-list --objects --missing=print HEAD > /dev/null"
Benchmark 1: git rev-list --objects --missing=print HEAD > /dev/null
  Time (mean ± σ):      1.336 s ±  0.013 s    [User: 1.278 s, System: 0.054 s]
  Range (min … max):    1.323 s …  1.365 s    10 runs

$ hyperfine --warmup 2 "git rev-list --missing=print HEAD > /dev/null"
Benchmark 1: git rev-list --missing=print HEAD > /dev/null
  Time (mean ± σ):      29.6 ms ±   1.5 ms    [User: 20.2 ms, System: 9.1 ms]
  Range (min … max):    27.5 ms …  35.1 ms    92 runs

Without commit graphs:

$ hyperfine --warmup 2 "git rev-list --objects --missing=print HEAD > /dev/null"
Benchmark 1: git rev-list --objects --missing=print HEAD > /dev/null
  Time (mean ± σ):      1.735 s ±  0.022 s    [User: 1.672 s, System: 0.057 s]
  Range (min … max):    1.703 s …  1.765 s    10 runs

$ hyperfine --warmup 2 "git rev-list --missing=print HEAD > /dev/null"
Benchmark 1: git rev-list --missing=print HEAD > /dev/null
  Time (mean ± σ):     426.6 ms ±  10.2 ms    [User: 410.1 ms, System: 15.1 ms]
  Range (min … max):   414.9 ms … 441.3 ms    10 runs

This shows that while there is definitely a performance loss when
disabling commit-graphs.
This loss is much lesser when used alongside the `--objects` flag.

Overall, I lean towards the option "1", but okay with either.

> Did you figure out what exactly the root cause of this functional
> regression is? If so, can we address that root cause instead?

Are you referring to `--missing` not working with commits? Looking at
the history
of `--missing` shows:

- caf3827e2f: This commits introduces the `--missing` option, it
basically states that
for the upcoming partial clone mechanism, we want to add support for identifying
missing objects.
- 7c0fe330d5: This commit adds support for trees in `--missing`. It
also goes on to
state that `--missing` assumed only blobs could be missing.

Both of these show that `--missing` started off support for only blobs
and eventually
added support to trees (similar to how we're adding support for
commits here). So
I guess the intention was to never work with commits. But the documentation and
the option seem generic enough. Considering that trees was an extension, commits
should be treated the same way.

Junio C Hamano Oct. 12, 2023, 4:17 p.m. UTC | #9

Patrick Steinhardt <ps@pks.im> writes:

> Wouldn't this have the potential to significantly regress performance
> for all those preexisting users of the `--missing` option? The commit
> graph is quite an important optimization nowadays, and especially in
> commands where we potentially walk a lot of commits (like we may do
> here) it can result in queries that are orders of magnitudes faster.

The test fails only when GIT_TEST_COMMIT_GRAPH is on, which updates
the commit-graph every time a commit is made via "git commit" or
"git merge".

I'd suggest stepping back and think a bit.

My assumption has been that the failing test emulates this scenario
that can happen in real life:

 * The user creates a new commit.

 * A commit graph is written (not as part of GIT_TEST_COMMIT_GRAPH
   that is not realistic, but as part of "maintenance").

 * The repository loses some objects due to corruption.

 * Now, "--missing=print" is invoked so that the user can view what
   are missing.  Or "--missing=allow-primisor" to ensure that the
   repository does not have missing objects other than the ones that
   the promisor will give us if we asked again.

 * But because the connectivity of these objects appear in the
   commit graph file, we fail to notice that these objects are
   missing, producing wrong results.  If we disabled commit-graph
   while traversal (an earlier writing of it was perfectly OK), then
   "rev-list --missing" would have noticed and reported what the
   user wanted to know.

In other words, the "optimization" you value is working to quickly
produce a wrong result.  Is it "significantly regress"ing if we
disabled it to obtain the correct result?

My assumption also has been that there is no point in running
"rev-list --missing" if we know there is no repository corruption,
and those who run "rev-list --missing" wants to know if the objects
are really available, i.e. even if commit-graph that is out of sync
with reality says it exists, if it is not in the object store, they
would want to know that.

If you can show me that it is not the case, then I may be pursuaded
why producing a result that is out of sync with reality _quickly_,
instead of taking time to produce a result that matches reality, is
a worthy "optimization" to keep.

Thanks.

Junio C Hamano Oct. 12, 2023, 4:26 p.m. UTC | #10

Karthik Nayak <karthik.188@gmail.com> writes:

> ... To ensure that the feature should
> work with corrupt repositories with stale commit-graph, I'm thinking of
> disabling the commit-graph whenever the `--missing` option is used.

Just to avoid giving anybody a wrong guideline, for most of the
other features, we should assume that the commit-graph accurately
represents the reality in the object database and take advantage of
it, leaving it to "git fsck" to ensure that the commit-graph is
healthy.

But "rev-list --missing", especially when used with actions other
than allow-promisor or allow-any, is a feature that is only useful
when one wants to really notice objects that are available in the
object store, and it would be a bug for it to pretend as if objects
the commit-graph has heard of exists without checking.

Patrick Steinhardt Oct. 13, 2023, 5:53 a.m. UTC | #11

On Thu, Oct 12, 2023 at 09:17:09AM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > Wouldn't this have the potential to significantly regress performance
> > for all those preexisting users of the `--missing` option? The commit
> > graph is quite an important optimization nowadays, and especially in
> > commands where we potentially walk a lot of commits (like we may do
> > here) it can result in queries that are orders of magnitudes faster.
> 
> The test fails only when GIT_TEST_COMMIT_GRAPH is on, which updates
> the commit-graph every time a commit is made via "git commit" or
> "git merge".
> 
> I'd suggest stepping back and think a bit.
> 
> My assumption has been that the failing test emulates this scenario
> that can happen in real life:
> 
>  * The user creates a new commit.
> 
>  * A commit graph is written (not as part of GIT_TEST_COMMIT_GRAPH
>    that is not realistic, but as part of "maintenance").
> 
>  * The repository loses some objects due to corruption.
> 
>  * Now, "--missing=print" is invoked so that the user can view what
>    are missing.  Or "--missing=allow-primisor" to ensure that the
>    repository does not have missing objects other than the ones that
>    the promisor will give us if we asked again.
> 
>  * But because the connectivity of these objects appear in the
>    commit graph file, we fail to notice that these objects are
>    missing, producing wrong results.  If we disabled commit-graph
>    while traversal (an earlier writing of it was perfectly OK), then
>    "rev-list --missing" would have noticed and reported what the
>    user wanted to know.
> 
> In other words, the "optimization" you value is working to quickly
> produce a wrong result.  Is it "significantly regress"ing if we
> disabled it to obtain the correct result?

It depends, in my opinion. If:

    - Wrong results caused by the commit graph are only introduced with
      this patch series due to the changed behaviour of `--missing`.

    - We disable commit graphs proactively only because of the changed
      behaviour of `--missing`.

Then yes, it does feel wrong to me to disable commit graphs and regress
performance for usecases that perviously worked both correct and fast.

> My assumption also has been that there is no point in running
> "rev-list --missing" if we know there is no repository corruption,
> and those who run "rev-list --missing" wants to know if the objects
> are really available, i.e. even if commit-graph that is out of sync
> with reality says it exists, if it is not in the object store, they
> would want to know that.
> 
> If you can show me that it is not the case, then I may be pursuaded
> why producing a result that is out of sync with reality _quickly_,
> instead of taking time to produce a result that matches reality, is
> a worthy "optimization" to keep.

Note that I'm not saying that it's fine to return wrong results -- this
is of course a bug that needs to be addressed somehow. After all, things
working correctly should always trump things working fast. But until now
it felt more like we were going into the direction of disabling commit
graphs without checking whether there is an alternative solution that
allows us to get the best of both worlds, correctness and performance.

So what I'm looking for in this thread is a reason why we _can't_ have
that, or at least can't have it without unreasonable amounts of work. We
have helpers like `lookup_commit_in_graph()` that are designed to detect
stale commit graphs by double-checking whether a commit that has been
looked up via the commit graph actually exists in the repository. So I'm
wondering whether this could help us address the issue.

If there is a good reason why all of that is not possible then I'm happy
to carve in.

Patrick

Patrick Steinhardt Oct. 13, 2023, 8:38 a.m. UTC | #12

On Fri, Oct 13, 2023 at 07:53:36AM +0200, Patrick Steinhardt wrote:
> On Thu, Oct 12, 2023 at 09:17:09AM -0700, Junio C Hamano wrote:
> > Patrick Steinhardt <ps@pks.im> writes:
[snip]
> > My assumption also has been that there is no point in running
> > "rev-list --missing" if we know there is no repository corruption,
> > and those who run "rev-list --missing" wants to know if the objects
> > are really available, i.e. even if commit-graph that is out of sync
> > with reality says it exists, if it is not in the object store, they
> > would want to know that.
> > 
> > If you can show me that it is not the case, then I may be pursuaded
> > why producing a result that is out of sync with reality _quickly_,
> > instead of taking time to produce a result that matches reality, is
> > a worthy "optimization" to keep.
> 
> Note that I'm not saying that it's fine to return wrong results -- this
> is of course a bug that needs to be addressed somehow. After all, things
> working correctly should always trump things working fast. But until now
> it felt more like we were going into the direction of disabling commit
> graphs without checking whether there is an alternative solution that
> allows us to get the best of both worlds, correctness and performance.
> 
> So what I'm looking for in this thread is a reason why we _can't_ have
> that, or at least can't have it without unreasonable amounts of work. We
> have helpers like `lookup_commit_in_graph()` that are designed to detect
> stale commit graphs by double-checking whether a commit that has been
> looked up via the commit graph actually exists in the repository. So I'm
> wondering whether this could help us address the issue.
> 
> If there is a good reason why all of that is not possible then I'm happy
> to carve in.

I've had a quick look at this problem so that I can solidify my own
train of thought a bit. The issue is `repo_parse_commit_internal()`,
which calls `parse_commit_in_graph()` without verifying that the object
actually exists in the object database. It's the only callsite of that
function outside of "commit-graph.c", as all other external callers
would call `lookup_commit_in_graph()` which _does_ perform the object
existence check.

So I think that the proper way to address the regression would be a
patch similar to the following:

diff --git a/commit.c b/commit.c
index b3223478bc..109e9217e3 100644
--- a/commit.c
+++ b/commit.c
@@ -572,8 +572,13 @@ int repo_parse_commit_internal(struct repository *r,
 		return -1;
 	if (item->object.parsed)
 		return 0;
-	if (use_commit_graph && parse_commit_in_graph(r, item))
+	if (use_commit_graph && parse_commit_in_graph(r, item)) {
+		if (!has_object(r, &item->object.oid, 0))
+			return quiet_on_missing ? -1 :
+				error(_("commit %s exists in commit-graph but not in the object database"),
+				      oid_to_hex(&item->object.oid));
 		return 0;
+	}

 	if (oid_object_info_extended(r, &item->object.oid, &oi, flags) < 0)
 		return quiet_on_missing ? -1 :

I wouldn't be surprised if there are other edge cases where this can
lead to buggy behaviour.

Also, this issue may not necessarily stem from repository corruption. It
could for example happen that commits are getting garbage collected
without the commit graph having been updated, whatever the reason may be
for this. In that case we would happily continue to return these commits
from the commit graph even though the underlying object has since been
deleted. The repository itself is not corrupt though, we merely look at
an out-of-date commit graph. And for what it's worth, I think that we
should always gracefully handle that case or otherwise the commit graph
becomes less useful overall.

I didn't dig much deeper yet. And while the above patch fixes some of
the test failures, it doesn't fix them all. If we agree that this is the
way to go then I'd be happy to turn this into a proper patch.

Patrick

Junio C Hamano Oct. 13, 2023, 5:07 p.m. UTC | #13

Patrick Steinhardt <ps@pks.im> writes:

> actually exists in the object database. It's the only callsite of that
> function outside of "commit-graph.c", as all other external callers
> would call `lookup_commit_in_graph()` which _does_ perform the object
> existence check.
>
> So I think that the proper way to address the regression would be a
> patch similar to the following:
>
> diff --git a/commit.c b/commit.c
> index b3223478bc..109e9217e3 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -572,8 +572,13 @@ int repo_parse_commit_internal(struct repository *r,
>  		return -1;
>  	if (item->object.parsed)
>  		return 0;
> -	if (use_commit_graph && parse_commit_in_graph(r, item))
> +	if (use_commit_graph && parse_commit_in_graph(r, item)) {
> +		if (!has_object(r, &item->object.oid, 0))
> +			return quiet_on_missing ? -1 :
> +				error(_("commit %s exists in commit-graph but not in the object database"),
> +				      oid_to_hex(&item->object.oid));
>  		return 0;
> +	}

It is (overly) conservative, which I generally should find pleasing,
but as I said, for secondary information like commit-graph that is
derived from the primary source only to precompute for optimization,
our general attitude should be to trust it and let the optimization
kick in, unless the operation being performed primarily cares about
the case where the result from using and not using the secondary
source differs.  An obvious example of such an operation is "fsck",
where we do care and want to notice when the underlying object graph
and what commit-graph records contradict with each other.  And my
suggestion to disable commit-graph while running the "rev-list"
command with the "--missing" option is because that usage would fall
into the same category (please correct me if that is not the case) [*].

So for the purpose of "rev-list --missing", the above change does
take us closer to the answer we would get from the primary source,
but isn't it pessimising other users unnecessarily?  Do we have a
rough idea (if we have a benchmark that would be even better) how
much this lack of has_object() check is contributing to the
performance benefit of using commit-graph?  Majority of callers of
this code should not have to pay the additional overhead here, so
unless it is small enough, this would be pessimising the generic
code for everybody to please "rev-list --missing", which is why I am
worried about going in this direction.

[0/3] rev-list: add support for commits in `--missing`

Message

Comments