
[RFC] promisor-remote: always JIT fetch with --refetch

Message ID: 20241003223546.1935471-1-emilyshaffer@google.com (mailing list archive)
State: New
Series: [RFC] promisor-remote: always JIT fetch with --refetch

Commit Message

Emily Shaffer Oct. 3, 2024, 10:35 p.m. UTC
By the time we decide we need to do a partial clone fetch, we already
know the object is missing, even if the_repository->parsed_objects
thinks it exists. But --refetch bypasses the local object check, so we
can guarantee that a JIT fetch will fix incorrect local caching.

This manifested at $DAYJOB in a repo with the following features:
 * blob-filtered partial clone enabled
 * commit graph enabled
 * ref Foo pointing to commit object 6aaaca
 * object 6aaaca missing[a]

With these prerequisites, we noticed that `git fetch` in the repo
produced an infinite loop:
1. `git fetch` tries to fetch, but thinks it has all objects, so it
   noops.
2. At the end of cmd_fetch(), we try to write_commit_graph_reachable().
3. write_commit_graph_reachable() does a reachability walk, including
   starting from Foo.
4. The reachability walk tries to peel Foo, and notices it's missing
   6aaaca.
5. The partial clone machinery asks for a per-object JIT fetch of
   6aaaca.
6. `git fetch` (child process) is asked to fetch 6aaaca.
7. We put together the_repository->parsed_objects, adding all commit IDs
   reachable from local refs to it
   (fetch-pack.c:mark_complete_and_common_refs(), trace region
   mark_complete_local_refs). We see Foo, so we add 6aaaca to
   the_repository->parsed_objects and mark it as COMPLETE.
8. cmd_fetch notices that the object ID it was asked for is already
   known, so does not fetch anything new.
9. GOTO 2.
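The loop above can be sketched as a toy model. This is an illustration only, not git's actual data structures: `refs`, `object_store`, and the `parsed_objects` dict below are stand-ins mirroring the step-by-step description, and the real logic lives in fetch-pack.c and commit-graph.c.

```python
# Toy model of the lazy-fetch loop described above (steps 1-9).
# Hypothetical stand-ins, not git internals.

MISSING = "6aaaca"          # commit object absent from the object store
refs = {"Foo": MISSING}     # ref Foo points at the missing commit
object_store = set()        # 6aaaca is NOT here

def fetch(wanted_oids):
    # Step 7: every ref tip is assumed present and marked COMPLETE.
    parsed_objects = {oid: "COMPLETE" for oid in refs.values()}
    # Step 8: a wanted oid that is marked COMPLETE is skipped, so
    # nothing is actually transferred.
    to_fetch = [oid for oid in wanted_oids if oid not in parsed_objects]
    object_store.update(to_fetch)
    return to_fetch

def write_commit_graph_reachable(depth=0):
    # Steps 3-6: the reachability walk notices the missing object and
    # asks for a just-in-time fetch, which loops back here (step 9).
    if depth > 3:
        return "infinite loop"
    for oid in refs.values():
        if oid not in object_store:
            fetch([oid])                                    # no-ops
            return write_commit_graph_reachable(depth + 1)  # GOTO 2
    return "graph written"

print(write_commit_graph_reachable())  # "infinite loop"
```

The depth cap exists only so the toy terminates; the real process recurses until interrupted.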

The culprit is that we're assuming all local refs already must have
objects in place. Using --refetch means we ignore that assumption during
JIT fetch.

Signed-off-by: Emily Shaffer <emilyshaffer@google.com>

---

There are a few alternative approaches for this issue that I talked
about with some folks at $DAYJOB:

i. Just disabling the commit graph rewrite lets this eventually fall
into a path where the fetch actually succeeds. I didn't like this
solution - it's just whack-a-mole - so I didn't look too hard into why
it succeeds that way. It *could* make sense to disable the commit graph
rewrite when we do a JIT fetch with a blob or tree filter - but if we
later want to implement a commit filter (something we've talked about
at Google), then I'd worry about this situation coming up again.

ii. We could decide not to mark local refs (and commits reachable from
them) as COMPLETE in the_repository->parsed_objects. I didn't try this
solution out, and I'm not sure what the performance implications are,
but Jonathan Tan likes this solution, so I may try it out and see what
breaks shortly.

iii. We could do all the JIT fetches with --refetch. In my opinion, this
is the safest/most self-healing solution; the JIT fetch only happens
when we really know we're missing the object, so it doesn't make sense
for that fetch to be canceled by any cache. As far as I can tell it has
no performance implications (except that I think we still build the
parsed_objects hash even though we are going to ignore it, but we were
already doing that anyway). Of course, that's what this patch does.

iv. We could do nothing; when cmd_fetch gets a fetch-by-object-id but
decides there is nothing more to do, it could terminate with an error.
That should stop the infinite recursion, and the error could suggest
that the user run `git fsck` and discover what the problem is.
Depending on the remediation we suggest, though, I think a direct fetch
to fix this particular loop would not work.

I'm curious to hear thoughts from people who are more expert than me on
partial clone and fetching in general, though.

This change is also still in RFC, for two reasons:

First, it's intermittently failing tests for me locally, in weirdly
flaky ways:

- t0410-partial-clone.sh fails when I run it from prove, but passes when
  I run it manually, every time.
- t5601-clone.sh and t5505-remote.sh fail nonsensically on `rm -rf` that
  should succeed (and does succeed if I stop the test with test_pause),
  which makes me think there's something else borked in my setup, but
  I'm not sure what.
- t5616-partial-clone.sh actually does fail in a way that I could see
  having to do with this change (since I guess we might download more
  packs than usual), but I was so confused by the other two errors I
  haven't looked closely yet.

And second, I haven't yet written tests that verify the breakage and
that this change fixes it.

I'm going to work on both those things in the background, but I wanted
to get the description and RFC out early so that folks could take a look
and we could decide which approach is best.

Thanks,
 - Emily

a: That commit object went missing as a byproduct of this partial clone
   gc issue that Calvin, Jonathan, Han Young, and others have been
   investigating:
   https://lore.kernel.org/git/20241001191811.1934900-1-calvinwan@google.com/
---
 promisor-remote.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Junio C Hamano Oct. 6, 2024, 10:43 p.m. UTC | #1
Emily Shaffer <emilyshaffer@google.com> writes:

> By the time we decide we need to do a partial clone fetch, we already
> know the object is missing, even if the_repository->parsed_objects
> thinks it exists. But --refetch bypasses the local object check, so we
> can guarantee that a JIT fetch will fix incorrect local caching.
> ...
> The culprit is that we're assuming all local refs already must have
> objects in place. Using --refetch means we ignore that assumption during
> JIT fetch.

Hmph.  The whole lazy fetch business looks more and more broken X-<.
There is a comment in the refetch code path that tells us to "perform
a full refetch ignoring existing objects", but if an object truly
exists, there should be no need to refetch, and it starts to smell
more like "ignoring somebody who gives us incorrect information that
these objects exist".

But a ref that points at a missing commit is "somebody giving us
false information", and an option to ignore such misinformation would
be a perfect tool to sweep such a breakage under the rug.

But is this sufficient?  Looking at how check_exist_and_connected()
does its work, I am not sure how it would cope with a case where the
object pointed at by a ref does happen to exist, but a commit referred
to by that object is missing, as it only checks the existence of the
tips.
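A toy sketch of the tip-only check being discussed here (plain Python with illustrative names; the real function is fetch-pack.c:check_exist_and_connected(), and this models only the idea, not git's code):

```python
# Toy model: a connectivity check that only looks at ref tips will
# miss a gap deeper in history. Parent links stand in for the commit
# graph; all names here are hypothetical.

parents = {"tip": "mid", "mid": "root", "root": None}
object_store = {"tip", "root"}      # "mid" is missing

def tips_exist(refs):
    # What a tip-only existence check sees: every ref tip is present.
    return all(tip in object_store for tip in refs)

def fully_connected(refs):
    # What a full reachability walk would see: "mid" is missing.
    for tip in refs:
        commit = tip
        while commit is not None:
            if commit not in object_store:
                return False
            commit = parents[commit]
    return True

print(tips_exist(["tip"]), fully_connected(["tip"]))  # True False
```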

> diff --git a/promisor-remote.c b/promisor-remote.c
> index 9345ae3db2..cf00e31d3b 100644
> --- a/promisor-remote.c
> +++ b/promisor-remote.c
> @@ -43,7 +43,7 @@ static int fetch_objects(struct repository *repo,
>  	strvec_pushl(&child.args, "-c", "fetch.negotiationAlgorithm=noop",
>  		     "fetch", remote_name, "--no-tags",
>  		     "--no-write-fetch-head", "--recurse-submodules=no",
> -		     "--filter=blob:none", "--stdin", NULL);
> +		     "--filter=blob:none", "--refetch", "--stdin", NULL);
>  	if (!git_config_get_bool("promisor.quiet", &quiet) && quiet)
>  		strvec_push(&child.args, "--quiet");
>  	if (start_command(&child))

The documentation for "git fetch --refetch" says that this grabs
everything as if we are making a fresh clone, ignoring everything we
already have.  Which makes the change in this patch prohibitively
expensive for asking each single object lazily from the promisor
remote, but is that really the case?  If there is a reasonable
safety that prevents us from doing something silly like transferring
one clone worth of data for every single object we lazily fetch,
perhaps this would be a workable solution (but if that is the case,
perhaps "git fetch --refetch" documentation needs to be rephrased,
to avoid such an impression).

Thanks.
Robert Coup Oct. 7, 2024, 12:21 a.m. UTC | #2
Hi Emily,

I was the one who originally implemented --refetch in [1][2]

[1] https://lore.kernel.org/git/pull.1138.v4.git.1648476131.gitgitgadget@gmail.com/
[2] https://github.com/gitgitgadget/git/pull/1138

On Sun, 6 Oct 2024 at 23:43, Junio C Hamano <gitster@pobox.com> wrote:
>
> Hmph.  The whole lazy fetch business looks more and more broken X-<.
> There is a comment in the refetch code path that tells us to "perform
> a full refetch ignoring existing objects", but if an object truly
> exists, there should be no need to refetch, and it starts to smell
> more like "ignoring somebody who gives us an incorrect information
> that these objects exist".

Basically --refetch was originally designed to send no 'have's during a fetch,
the original motivation being changing a partial clone filter and fetching
all the newly-applicable trees & blobs in a single transfer.

> The documentation for "git fetch --refetch" says that this grabs
> everything as if we are making a fresh clone, ignoring everything we
> already have.  Which makes the change in this patch prohibitively
> expensive for asking each single object lazily from the promisor
> remote, but is that really the case?

From a very quick re-review, it is correct that this is expensive:
refetch sends no 'have's, so if you pass a single commit oid it'll
fetch all ancestors and all the dependent trees & blobs, duplicating
what's in the object store and relying on a repack to clean up. If a
commit is missing, that's one way to fix it, but it's a pretty nuclear
option: it feels like your option iv (terminating with an error)
leading to fsck invoking/suggesting --refetch might avoid
unintentionally recloning the entire repo.
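The cost difference can be illustrated with a toy negotiation model (sets standing in for object graphs; this is not the real pack negotiation code, and the numbers are made up for illustration):

```python
# Toy model of fetch negotiation. The server sends the wanted objects'
# reachable set minus whatever the client's 'have's cover; --refetch
# sends no 'have's, so the transfer covers the full reachable set.

server_objects = set(range(1000))   # everything reachable on the server
client_objects = set(range(999))    # client has all but one object

def negotiate(wants, haves):
    # Stand-in for ancestry: everything "older" than the newest want.
    reachable = {o for o in server_objects if o <= max(wants)}
    return reachable - haves

normal = negotiate(wants={999}, haves=client_objects)
refetch = negotiate(wants={999}, haves=set())   # --refetch: no haves

print(len(normal), len(refetch))  # 1 1000
```

With 'have's the transfer is just the one missing object; without them it is one clone's worth of data, which is the "nuclear option" cost described above.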

In my original RFC [3], Jonathan Tan suggested that --refetch could be useful
to repair missing objects like this, but it was out of scope for me at the time.
But maybe there's a way to improve it for this sort of case?

[3] https://lore.kernel.org/git/20220202185957.1928631-1-jonathantanmy@google.com/

> Emily Shaffer <emilyshaffer@google.com> writes:
>
> This manifested at $DAYJOB in a repo with the following features:
> * blob-filtered partial clone enabled
> * commit graph enabled
> * ref Foo pointing to commit object 6aaaca
> * object 6aaaca missing[a]

I presume there wasn't an obvious/related cause for commit 6aaaca to go missing
in the first place?

Thanks,

Rob :)
Junio C Hamano Oct. 7, 2024, 12:37 a.m. UTC | #3
Robert Coup <robert.coup@koordinates.com> writes:

> Basically --refetch was originally designed to send no 'have's during a fetch,
> the original motivation being changing a partial clone filter and fetching
> all the newly-applicable trees & blobs in a single transfer.
> ...
> If a commit is missing that's one way to fix
> it, but it's a pretty nuclear option: feels like your option iv (terminate with
> an error) leading to fsck invoking/suggesting --refetch might avoid
> unintentionally recloning the entire repo.
> ...
> In my original RFC [3], Jonathan Tan suggested that --refetch could be useful
> to repair missing objects like this, but it was out of scope for me at the time.
> But maybe there's a way to improve it for this sort of case?
>
> [3] https://lore.kernel.org/git/20220202185957.1928631-1-jonathantanmy@google.com/

Thanks for your comments on the original story behind that option.

> I presume there wasn't an obvious/related cause for commit 6aaaca to go missing
> in the first place?

Emily had this after the three-dash line


a: That commit object went missing as a byproduct of this partial clone
   gc issue that Calvin, Jonathan, Han Young, and others have been
   investigating:
   https://lore.kernel.org/git/20241001191811.1934900-1-calvinwan@google.com/

IOW, I think how the lossage was caused is well understood by now.

Thanks.
Emily Shaffer Oct. 11, 2024, 4:40 p.m. UTC | #4
Sorry for the slow response/page-context-back-in. I'll be working on
this today and try to send a different approach, but after that point,
I'm not sure when the next time I'll get a chance to work on it may
be. If I don't come up with something suitable today, it's likely that
Jonathan Tan will take over the effort from me, but I'm not sure
around when he'll be able to prioritize it.

On Sun, Oct 6, 2024 at 3:43 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Emily Shaffer <emilyshaffer@google.com> writes:
>
> > By the time we decide we need to do a partial clone fetch, we already
> > know the object is missing, even if the_repository->parsed_objects
> > thinks it exists. But --refetch bypasses the local object check, so we
> > can guarantee that a JIT fetch will fix incorrect local caching.
> > ...
> > The culprit is that we're assuming all local refs already must have
> > objects in place. Using --refetch means we ignore that assumption during
> > JIT fetch.
>
> Hmph.  The whole lazy fetch business looks more and more broken X-<.

By "lazy fetch", are you referring to the partial clone fetch, or are
you referring to the mark_complete stuff? (I know we have been having
lots of issues with the partial clone fetch at Google in the last
month or so, so excuse me for disambiguating :) )

> There is a comment in the refetch code path that tells us to "perform
> a full refetch ignoring existing objects", but if an object truly
> exists, there should be no need to refetch, and it starts to smell
> more like "ignoring somebody who gives us an incorrect information
> that these objects exist".
>
> But a ref that points at a missing commit is "somebody giving a
> false information" and an option to ignore such misinformation would
> be a perfect tool fit to sweep such a breakage under the rug.
>
> But is this sufficient?  Looking at how check_exist_and_connected()
> does its work, I am not sure how it would cope with a case where an
> object that is pointed by a ref does happen to exist, but the commit
> that is referred to by the commit is missing, as it only checks the
> existence of the tips.

Is that so? mark_complete_and_common claims that it recurses through
all parents of all local refs and marks them existing, too. Looks like
it does that in fetch-pack.c:mark_recent_complete_commits(), only up
to a certain date cutoff, and doesn't do that at all if there's no
cutoff provided. I don't think I see anywhere else that it's recursing
over parents, so I'm not sure why the comment says that. In fact, I
sort of wonder if the comment is wrong; it was introduced in this[1]
series much later than this code block has existed. But then, nobody
questioned it during the series, so I could also be misreading the
code :)

>
> > diff --git a/promisor-remote.c b/promisor-remote.c
> > index 9345ae3db2..cf00e31d3b 100644
> > --- a/promisor-remote.c
> > +++ b/promisor-remote.c
> > @@ -43,7 +43,7 @@ static int fetch_objects(struct repository *repo,
> >       strvec_pushl(&child.args, "-c", "fetch.negotiationAlgorithm=noop",
> >                    "fetch", remote_name, "--no-tags",
> >                    "--no-write-fetch-head", "--recurse-submodules=no",
> > -                  "--filter=blob:none", "--stdin", NULL);
> > +                  "--filter=blob:none", "--refetch", "--stdin", NULL);
> >       if (!git_config_get_bool("promisor.quiet", &quiet) && quiet)
> >               strvec_push(&child.args, "--quiet");
> >       if (start_command(&child))
>
> The documentation for "git fetch --refetch" says that this grabs
> everything as if we are making a fresh clone, ignoring everything we
> already have.  Which makes the change in this patch prohibitively
> expensive for asking each single object lazily from the promisor
> remote, but is that really the case?  If there is a reasonable
> safety that prevents us from doing something silly like transferring
> one clone worth of data for every single object we lazily fetch,
> perhaps this would be a workable solution (but if that is the case,
> perhaps "git fetch --refetch" documentation needs to be rephrased,
> to avoid such an impression).

Yeah, this is on me for not reading the entire documentation, just
noticing in code that it disabled this COMPLETE cache thingie. You're
right that it would be too expensive to use this way. As I said at the
top, I'll try to send one of the other alternative approaches today.

 - Emily

1: https://lore.kernel.org/git/pull.451.git.1572981981.gitgitgadget@gmail.com/
Junio C Hamano Oct. 11, 2024, 5:54 p.m. UTC | #5
Emily Shaffer <nasamuffin@google.com> writes:

> By "lazy fetch", are you referring to the partial clone fetch, or are

Anything we do to compensate for the fact that that initial clone or
fetch can be told to deliberately omit objects, in the hope that we
can grab missing objects on demand.

> Yeah, this is on me for not reading the entire documentation, just
> noticing in code that it disabled this COMPLETE cache thingie. You're
> right that it would be too expensive to use this way. As I said at the
> top, I'll try to send one of the other alternative approaches today.

I thought you were already on your vacation ;-).  I'll go offline
end of this week and won't be back until the end of the month.

Patch

diff --git a/promisor-remote.c b/promisor-remote.c
index 9345ae3db2..cf00e31d3b 100644
--- a/promisor-remote.c
+++ b/promisor-remote.c
@@ -43,7 +43,7 @@  static int fetch_objects(struct repository *repo,
 	strvec_pushl(&child.args, "-c", "fetch.negotiationAlgorithm=noop",
 		     "fetch", remote_name, "--no-tags",
 		     "--no-write-fetch-head", "--recurse-submodules=no",
-		     "--filter=blob:none", "--stdin", NULL);
+		     "--filter=blob:none", "--refetch", "--stdin", NULL);
 	if (!git_config_get_bool("promisor.quiet", &quiet) && quiet)
 		strvec_push(&child.args, "--quiet");
 	if (start_command(&child))